American Journal of Public Health Research
ISSN (Print): 2327-669X; ISSN (Online): 2327-6703; Website: https://www.sciepub.com/journal/ajphr
American Journal of Public Health Research. 2025, 13(3), 90-102
DOI: 10.12691/ajphr-13-3-1
Open Access Article

Using Statistical Machine Learning to Find Complex Interactions and Important CVD Risk Factors When Predicting General Health in Adults

Peter D. Hart¹,²

¹Health Promotion Research, Havre, Montana, USA

²Kinesmetrics Lab, Tallahassee, Florida, USA

Pub. Date: May 07, 2025

Cite this paper:
Peter D. Hart. Using Statistical Machine Learning to Find Complex Interactions and Important CVD Risk Factors When Predicting General Health in Adults. American Journal of Public Health Research. 2025; 13(3):90-102. doi: 10.12691/ajphr-13-3-1

Abstract

Background: Cardiovascular disease (CVD) is the leading cause of premature mortality among U.S. adults. Many risk factors for CVD are established and widely used in health promotion and preventive medicine. However, the extent to which the major CVD risk factors interact in relation to health outcomes is less well understood. The purpose of this study was to use statistical machine learning to identify complex interactions and important variables when predicting general health (GH) from CVD risk factors.

Methods: The analysis plan included five objectives. First, a decision (regression) tree was built on training data and fine-tuned using validation data. Second, ordinary least squares (OLS) regression was used to confirm the terminal splits provided by the decision tree algorithm. Third, new test data were used to evaluate generalization of the decision tree branches. Fourth, a random forest was run and examined for consistency with decision tree fit performance using training, validation, and test data. Fifth, CVD risk factor variable importance was assessed, along with a sensitivity analysis to examine stability in rankings. The 2017-2018 NHANES (N = 3,487) was used for training and validation and the 2015-2016 NHANES (N = 3,897) for testing. A residualized self-assessed GH T-score, with age, race/ethnicity, sex, and income removed, served as the outcome variable (also known as the target). Eight CVD risk factors inspired by Life’s Essential 8 (LE8) were used as predictors (also known as features or inputs): healthy eating index (HEI; 0-100), moderate-to-vigorous physical activity (MVPA; min/week), nicotine exposure (NE; non-smoker, quit smoker, other nicotine device user, smoker), sleep time (ST; hr/day), body mass index (BMI; kg/m²), non-high-density lipoprotein cholesterol (NHDL; mg/dL), glycohemoglobin (A1C; %), and mean arterial pressure (MAP; mmHg). SAS HPSPLIT and HPFOREST were the primary reporting procedures. The variable importance sensitivity analysis was performed using R (train and randomForest) and Python (DecisionTreeRegressor and RandomForestRegressor).

Results: The decision tree built with training data and 10-fold cross-validation resulted in a 16-leaf tree with a depth of 6 nodes (ASE Training = 86.3, ASE Validation = 91.1, Δ = 5.6%). BMI split first, with A1C splitting next for high BMI (BMI ≥ 30.1) and MVPA splitting next for low BMI (BMI < 30.1). OLS regression confirmed (ps < .05) all terminal splits in the training data. The greatest GH (Mean = 56.2) was observed in those with low BMI (BMI < 30.1), high MVPA (MVPA ≥ 137.2 min/week), low NE (non-smoker, quit smoker, other nicotine device user), low A1C (A1C < 6.2%), and high HEI (HEI ≥ 37.8). The lowest GH (Mean = 43.8) was observed in those with high BMI (BMI ≥ 30.1) and high A1C (A1C ≥ 6.8%). The decision tree generalized well (ASE Test = 93.1, Δ = 8.0%), with OLS regression confirming (ps < .05) the majority of terminal splits. Decision tree variable importance rankings were consistent with the random forest (r Spearman = .83, p = .011) and robust in the sensitivity analysis (avg r Spearman = .84, p = .009, ICC(3,6) = 0.97).

Conclusion: This study demonstrated a novel use of machine learning that complements conventional statistical analyses. Decision trees, along with random forests, can identify highly complex patterns in data and the variables that contribute most to group separation on an outcome variable. BMI, MVPA, A1C, and NE are likely the more important predictors of GH in this population.
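The Results describe threshold splits such as BMI ≥ 30.1 and report fit as average squared error (ASE). As a rough illustration of how a regression tree can arrive at a single split of that kind, the minimal sketch below searches one feature for the threshold that minimizes ASE. It is not the study's SAS, R, or Python code. The function names (`ase`, `best_split`, `relative_gap`) and the data are hypothetical, and the definition of Δ as the percent increase in ASE over training is an assumption, since the abstract does not state it. Under that assumption the training-to-validation figures reproduce the reported 5.6%.

```python
from statistics import fmean


def ase(y_true, y_pred):
    """Average squared error: mean of the squared residuals."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def best_split(x, y):
    """Search one feature for the threshold that minimizes ASE when each
    side of the split is predicted by its own mean (a one-split tree)."""
    pairs = sorted(zip(x, y))
    targets = [v for _, v in pairs]
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # tied feature values cannot be separated
        left, right = targets[:i], targets[i:]
        preds = [fmean(left)] * len(left) + [fmean(right)] * len(right)
        err = ase(targets, preds)
        if best is None or err < best[1]:
            best = ((lo + hi) / 2, err)  # midpoint between neighbors
    return best


def relative_gap(ase_train, ase_other):
    """Percent increase in ASE relative to training (assumed definition)."""
    return 100 * (ase_other - ase_train) / ase_train


# Synthetic, made-up data: GH drops by about 6 points once BMI reaches 30.
bmi = list(range(18, 44, 2))
gh = [(52 if b < 30 else 46) + 0.5 * (i % 3 - 1) for i, b in enumerate(bmi)]

root_ase = ase(gh, [fmean(gh)] * len(gh))
threshold, split_ase = best_split(bmi, gh)
print(f"best BMI threshold: {threshold}")
print(f"ASE before split: {root_ase:.2f}, after split: {split_ase:.2f}")
print(f"train-to-validation gap: {relative_gap(86.3, 91.1):.1f}%")
```

In a full tree the same search is repeated over every feature and then recursively within each branch. Library implementations such as the Python DecisionTreeRegressor named in the Methods handle that, along with stopping and pruning rules.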

Keywords: Data science; Machine learning; Cardiovascular disease (CVD); Population health

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
