Abstract
This paper presents experimental results, on both real and artificial data, for combining unsupervised learning algorithms via stacking. Specifically, stacking is used to form a linear combination of finite mixture model and kernel density estimators for non-parametric multivariate density estimation. The method outperforms other strategies such as choosing the single best model based on cross-validation, combining with uniform weights, and even using the single best model chosen by "cheating" and examining the test set. We also investigate (1) how the utility of stacking changes when one of the models being combined is the model that generated the data, (2) how the stacking coefficients of the models compare to the relative frequencies with which cross-validation chooses among the models, (3) visualization of combined "effective" kernels, and (4) the sensitivity of stacking to overfitting as model complexity increases.
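The core procedure the abstract describes can be sketched in a few lines: fit each component density estimator on cross-validation training folds, score the held-out points under each, and then choose non-negative combination weights that maximize the held-out log-likelihood. The sketch below is a minimal 1-D illustration under assumed simplifications (a single parametric Gaussian standing in for the mixture model, a fixed-bandwidth kernel estimator, and a coarse grid search over the weight in place of the paper's EM-style weight fitting); the bandwidth `h=0.5` and fold count are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)  # toy 1-D data set

def gaussian_pdf(x_eval, mu, sigma):
    # parametric Gaussian density (stand-in for a fitted mixture model)
    return np.exp(-0.5 * ((x_eval - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kde_pdf(x_eval, x_train, h):
    # fixed-bandwidth Gaussian kernel density estimate
    d = (x_eval[:, None] - x_train[None, :]) / h
    return np.exp(-0.5 * d ** 2).sum(axis=1) / (len(x_train) * h * np.sqrt(2 * np.pi))

# v-fold cross-validation: out-of-fold density of each point under each model
K = 5
folds = np.array_split(rng.permutation(len(x)), K)
p = np.zeros((len(x), 2))  # columns: parametric Gaussian, kernel estimate
for idx in folds:
    train = np.setdiff1d(np.arange(len(x)), idx)
    xt = x[train]
    p[idx, 0] = gaussian_pdf(x[idx], xt.mean(), xt.std())
    p[idx, 1] = kde_pdf(x[idx], xt, h=0.5)

# stacking step: pick simplex weights maximizing held-out log-likelihood
# (a grid search here; the paper fits the weights more carefully)
grid = np.linspace(0.0, 1.0, 101)
scores = [np.log(w * p[:, 0] + (1 - w) * p[:, 1]).sum() for w in grid]
w_star = grid[int(np.argmax(scores))]

def stacked(x_eval):
    # final combined estimator: refit both models on all data, mix with w_star
    return (w_star * gaussian_pdf(x_eval, x.mean(), x.std())
            + (1 - w_star) * kde_pdf(x_eval, x, 0.5))
```

Because each component is itself a density and the weights lie on the simplex, the stacked combination integrates to one and is a valid density estimator by construction.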
Smyth, P., Wolpert, D. Linearly Combining Density Estimators via Stacking. Machine Learning 36, 59–83 (1999). https://doi.org/10.1023/A:1007511322260