Artificial Intelligence in Medicine
Volume 45, Issue 2, Pages 151-162, February 2009

Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors

  • Oleg Okun

      Affiliations

    • University of Oulu, Department of Electrical and Information Engineering, P.O. Box 4500, Oulu 90014, Finland
    • Corresponding author. Tel.: +358 8 5532898; fax: +358 8 5532612.
  • Helen Priisalu

      Affiliations

    • Tallinn University of Technology, Institute of Cybernetics, Akadeemia Tee 21, Tallinn 12618, Estonia

Received 12 November 2007, Revised 5 August 2008, Accepted 6 August 2008.

PII: S0933-3657(08)00111-5

doi: 10.1016/j.artmed.2008.08.004
