In scientific studies, the need often arises to test whether the underlying distributions of two or more populations differ from each other on the basis of independent samples. Test methods are available for testing against specific alternative hypotheses, with or without parametric assumptions. To test against a completely general alternative hypothesis, a commonly used tool is the two-sample Kolmogorov-Smirnov test. Quadratic empirical distribution function (EDF) based tests, including the Anderson-Darling test and the Cramer-von Mises test, have been developed and shown to be more powerful than the Kolmogorov-Smirnov test.

A closely related nonparametric testing problem is the one-sample problem, which tests whether a set of observations is drawn from a given probability distribution. One-sample Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises tests are available for this purpose. Another classical method for the one-sample testing problem is Pearson's chi-squared test. The chi-squared test loses power if observations are split into too many intervals, a typical dilemma encountered by discretization approaches. For the two-sample testing problem, Miller and Siegmund (Ref. 1) studied the maximally selected chi-square statistic, which compares two samples by selecting an optimal cut point on the range of the observed values.

The K-sample testing problem can also be viewed as a test of dependence between a continuous random variable and a categorical one. Recently, many methods have been developed to capture complicated dependence structures between pairs of random variables. The statistical power of different methods in detecting associations between a pair of continuous random variables has previously been studied through extensive simulations with various functional relationships and noise levels. This article describes a dynamic discretization approach based on the likelihood-ratio testing framework with regularization.
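As a point of reference, the classical EDF-based tests mentioned above are available in SciPy. The sketch below is illustrative only: the sample sizes, distributions, and random seed are assumptions chosen for the example, not values from this article.

```python
# Illustrative comparison of two samples with classical EDF-based tests.
# Data are simulated: x ~ N(0, 1), y ~ N(0.5, 1) (a location shift).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(0.5, 1.0, size=200)

# Two-sample Kolmogorov-Smirnov test: supremum distance between the two EDFs.
ks_stat, ks_p = stats.ks_2samp(x, y)

# k-sample Anderson-Darling test (a quadratic EDF statistic); here k = 2.
ad = stats.anderson_ksamp([x, y])

# Two-sample Cramer-von Mises test (another quadratic EDF statistic).
cvm = stats.cramervonmises_2samp(x, y)

print("KS:", ks_stat, ks_p)
print("AD:", ad.statistic)
print("CvM:", cvm.statistic, cvm.pvalue)
```

With a clear location shift such as this one, all three tests typically reject the null hypothesis that the two samples come from the same distribution.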
For the two-sample test, the approach can be viewed as a generalization of the maximally selected chi-square statistic of Ref. 1 that allows for multiple cut points. To prevent over-slicing, the proposed K-sample test statistic regularizes mutual information with a penalty term on the number of slices and maximizes over all possible discretization schemes of the underlying continuous random variable. An efficient dynamic programming algorithm, called dynamic slicing, is proposed to determine the optimal slicing scheme.
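The penalized-maximization idea can be sketched with a simple dynamic program. Everything below is an assumption-laden illustration, not the authors' implementation: the per-cut penalty `lam`, the likelihood-ratio slice score, and the O(n²) recursion are stand-ins for the paper's actual statistic and algorithm.

```python
# Sketch of dynamic slicing: maximize a likelihood-ratio (mutual-information)
# score over contiguous slicings of the pooled sorted sample, paying a
# penalty `lam` for each additional cut. Hypothetical implementation.
import math

def dynamic_slicing(labels, k, lam=1.0):
    """labels: group label (0..k-1) of each pooled observation, listed in
    sorted order of the continuous values. Returns (score, cut positions)."""
    n = len(labels)
    totals = [labels.count(g) for g in range(k)]  # marginal group counts

    def slice_score(counts, size):
        # Log-likelihood-ratio contribution of one slice vs. the marginal.
        s = 0.0
        for g in range(k):
            if counts[g] > 0:
                s += counts[g] * math.log(counts[g] * n / (size * totals[g]))
        return s

    # best[j]: optimal penalized score for the first j pooled observations.
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        counts = [0] * k
        for i in range(j - 1, -1, -1):  # last slice covers positions i..j-1
            counts[labels[i]] += 1
            penalty = lam if i > 0 else 0.0  # pay only for cuts, not slices
            cand = best[i] + slice_score(counts, j - i) - penalty
            if cand > best[j]:
                best[j], back[j] = cand, i

    # Recover the optimal cut positions by backtracking.
    cuts, j = [], n
    while j > 0:
        j = back[j]
        if j > 0:
            cuts.append(j)
    return best[n], sorted(cuts)
```

With a single slice (no cuts) the score is zero, so the statistic is nonnegative; for two perfectly separated groups the optimal slicing places one cut exactly at the boundary between them.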