Regression problems with extremely large numbers of observations are now commonplace. Such voluminous data require modeling approaches that differ from those used in classical analysis. Some newly developed approaches are theoretically feasible but lack computational ease. This article considers the problem of verifying a pre-specified parametric model for massive data sets by means of scalable nonparametric tests.

Assume a set of independent observations drawn from a population in which the unknown regression function, m(x), is assumed to be smooth. To justify the use of a parametric model, a specification test on the functional form of the regression is needed. Given a parametric family of known real functions g(x, θ), θ ∈ Θ, the null and alternative hypotheses are

H0: m(x) = g(x, θ) for some θ ∈ Θ, versus H1: m(x) ≠ g(x, θ) for all θ ∈ Θ.

The problem is to assess the validity of a given model for observed data. A massive data source is usually not unique, so it is necessary to verify the correctness of parametric models for data sets from different sources. If the null hypothesis is accepted, a modeling method can then be obtained through an aggregation mechanism.

When the volume of data involved is extremely large, computation may be a problem even with current high-speed parallel processing. The article proposes simple strategies for constructing test statistics that avoid detailed computations for model checking. This is made possible by partitioning the massive data set of size N into K subsets of equal sample size, where K varies with N. A test statistic is computed on each subset, and the results are aggregated by averaging. This makes the computation far easier.
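The partition-and-average strategy above can be sketched in code. This is a minimal illustration, not the article's actual procedure: the linear family g(x, θ) = θ0 + θ1·x and the toy lack-of-fit statistic (the mean product of neighboring residuals after sorting by x, which should be near zero under H0 and positive under systematic misspecification) are illustrative assumptions.

```python
import numpy as np

def block_statistic(x, y):
    """Toy lack-of-fit statistic for one subset.

    Fits the assumed parametric family (here g(x, theta) = theta0 + theta1*x)
    and measures whether neighboring residuals, ordered by x, are positively
    correlated -- a sign of structure the parametric model missed."""
    theta = np.polyfit(x, y, deg=1)          # parametric fit on this block
    resid = y - np.polyval(theta, x)
    order = np.argsort(x)
    r = resid[order]
    return np.mean(r[:-1] * r[1:])           # ~0 under H0, >0 under misfit

def divide_and_conquer_test(x, y, K):
    """Partition the N observations into K subsets of equal sample size,
    compute the statistic on each subset, and average the results."""
    n = len(x) // K                          # equal block size; remainder dropped
    stats = [block_statistic(x[i * n:(i + 1) * n], y[i * n:(i + 1) * n])
             for i in range(K)]
    return np.mean(stats)

rng = np.random.default_rng(0)
N = 10_000
x = rng.uniform(0, 1, N)
y_null = 1.0 + 2.0 * x + rng.normal(0, 0.1, N)               # H0 holds
y_alt = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, N)   # H0 fails

print(divide_and_conquer_test(x, y_null, K=20))  # near zero under H0
print(divide_and_conquer_test(x, y_alt, K=20))   # clearly positive under H1
```

Each block's statistic is computed independently, so the K computations parallelize trivially, and no single pass over all N observations is ever needed.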