
Guyon I. and Elisseeff A. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research. 3: 1157-1182.

has been cited by the following article:

Article

Features Selection in Statistical Classification of High Dimensional Image Derived Maize (Zea Mays L.) Phenomic Data

1Department of Physical Sciences, Chuka University, P.O Box 109-60400, Chuka, Kenya

2Department of Plant Sciences, Chuka University, P.O Box 109-60400, Chuka, Kenya


American Journal of Applied Mathematics and Statistics. 2022, Vol. 10 No. 2, 44-51
DOI: 10.12691/ajams-10-2-2
Copyright © 2022 Science and Education Publishing

Cite this paper:
Peter Gachoki, Moses Muraya, Gladys Njoroge. Features Selection in Statistical Classification of High Dimensional Image Derived Maize (Zea Mays L.) Phenomic Data. American Journal of Applied Mathematics and Statistics. 2022; 10(2):44-51. doi: 10.12691/ajams-10-2-2.

Correspondence to: Peter Gachoki, Department of Physical Sciences, Chuka University, P.O Box 109-60400, Chuka, Kenya. Email: pkgachoki@gmail

Abstract

Phenotyping has advanced with the application of high-throughput phenotyping techniques such as automated imaging. This has led to the derivation of large quantities of high-dimensional phenotypic data that could not have been obtained in a single run using manual phenotyping. Hence, there is a need for the parallel development of statistical techniques that can appropriately handle such large and/or high-dimensional data sets. Moreover, there is a need for statistical criteria for selecting the image-derived phenotypic features that serve as the best predictors in modelling plant growth. Information on such criteria is limited. The objective of this study is to apply feature importance, feature selection with Shapley values, and LASSO regression to find the subset of features with the highest predictive power for subsequent use in modelling maize plant growth from high-dimensional image-derived phenotypic data. The study compared the statistical power of these feature selection methods by fitting an XGBoost model using the best features from each selection method. The image-derived phenomic data were obtained from the Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Germany. Data analysis was performed using the R statistical software. The data were imputed using the k-nearest neighbours technique. Feature selection was performed using feature importance, Shapley values and LASSO regression. Shapley values selected 25 phenotypic features, feature importance selected 31 features, and LASSO regression selected 12 features. Of the three techniques, the feature importance criterion emerged as the best feature selection technique, followed by Shapley values and LASSO regression, respectively. The study demonstrated the potential of using feature importance as a selection technique for reducing the number of input variables in high-dimensional growth data sets.
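
To illustrate why LASSO regression returns a smaller feature subset than the other two criteria, the following is a minimal sketch of LASSO feature selection by cyclic coordinate descent. It is written in Python for illustration only; the study itself used R, and its code is not given here. The toy data, the penalty value of 0.5, and the function names `soft_threshold` and `lasso_coordinate_descent` are all assumptions made for this example, not taken from the paper.

```python
from itertools import product


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100, tol=1e-10):
    """Minimise (1/2n) * ||y - Xb||^2 + penalty * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - Xb while all coefficients are zero
    for _ in range(n_sweeps):
        largest_change = 0.0
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of feature j with the residual that excludes its own fit.
            rho = sum(v * (r + v * coef[j]) for v, r in zip(column, residual)) / n
            updated = soft_threshold(rho, penalty) / scale
            delta = updated - coef[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(column, residual)]
                coef[j] = updated
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return coef


# Hypothetical toy design: 8 observations, 4 centred, mutually orthogonal features.
X = [[a, b, c, a * b] for a, b, c in product((-1.0, 1.0), repeat=3)]
# Response driven strongly by features 0 and 2, weakly by 1, and not at all by 3.
y = [3.0 * row[0] + 0.1 * row[1] - 2.0 * row[2] for row in X]

coef = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print(selected)  # [0, 2]
```

In this sketch the L1 penalty sets the coefficient of the weak feature (index 1) and the irrelevant feature (index 3) exactly to zero, so only features 0 and 2 are retained. A larger penalty would remove more features; in practice the penalty is usually chosen by cross-validation.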

Keywords