Two Nearest Means Method: Regression through Searching in the Data

Author(s): Farrokh Alemi, Madhukar Reddy Vongala, Sri Surya Krishna Rama Taraka Naren Durbha, Manaf Zargoush

Background: Historically, regression equations have been fit by minimizing the sum of squared residuals.

Objective: We present an alternative approach that fits regression equations by searching for specific cases in the database. Case-based reasoning predicts outcomes by matching a new case to training cases, without modeling the relationship between features and outcome. This study compares the accuracy of the two nearest means (2NM) method, a search-based, case-based reasoning approach, with that of regression, a feature-based reasoning approach.

Data Sources: The accuracy of the two methods was examined in predicting mortality of 296,051 residents in Veterans Health Affairs nursing homes. Data were collected from 1/1/2000 to 9/10/2012 and were randomly divided into training (90%) and validation (10%) samples.
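
A random 90/10 division of this kind can be sketched as follows. This is only an illustration: the function name `split_train_validation`, the `seed` argument, and the use of a shuffled list are assumptions, not details reported by the study.

```python
import random


def split_train_validation(records, train_fraction=0.9, seed=0):
    """Randomly divide records into training and validation samples.

    Hypothetical helper: the seed and the shuffling strategy are
    illustrative assumptions, not the procedure used in the study.
    """
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```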

Study Design: Cohort observational study.

Data Collection/Extraction Methods: The 2NM algorithm proceeds in four steps. First, the data are transformed so that all features are monotonically related to the outcome. Second, all means that violate the monotone order are set aside, to be processed as exceptions to the general algorithm. Third, to predict a new case, the means in the training set are divided into “excessive” and “partial” means, based on how they match the new case. Fourth, the outcome for the new case is predicted as the average of two means: the excessive mean with the minimum outcome and the partial mean with the maximum outcome. To evaluate the method, we compared the accuracy of linear logistic regression and the proposed procedure in predicting mortality from age, gender, and 10 disabilities in daily living.
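
The third and fourth steps can be sketched as below, under assumptions that the abstract does not spell out. Here a "mean" is taken to be the average outcome of all training cases sharing one combination of feature values; a mean is treated as "excessive" if every one of its features is at least as large as the new case's, and as "partial" if every one is at most as large. Features are assumed to be already coded so that larger values mean higher outcome, and the exception handling of the second step is omitted. All function names are hypothetical.

```python
from collections import defaultdict


def stratum_means(rows):
    """Average outcome for each distinct feature combination.

    rows: iterable of (features, outcome) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for features, outcome in rows:
        entry = totals[tuple(features)]
        entry[0] += outcome
        entry[1] += 1
    return {f: s / n for f, (s, n) in totals.items()}


def dominates(a, b):
    """True if every feature in a is >= the matching feature in b."""
    return all(x >= y for x, y in zip(a, b))


def predict_2nm(means, case):
    """Average of the lowest excessive mean and the highest partial mean.

    Assumed reading of the abstract: excessive means bound the outcome
    from above and partial means bound it from below. An exact match
    counts as both. Returns None when either group is empty.
    """
    case = tuple(case)
    excessive = [m for f, m in means.items() if dominates(f, case)]
    partial = [m for f, m in means.items() if dominates(case, f)]
    if not excessive or not partial:
        return None
    return (min(excessive) + max(partial)) / 2
```

With this reading, a new case that exactly matches a training combination whose mean respects the monotone order receives that combination's mean, while an unseen combination is placed midway between its tightest upper and lower neighbors.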

Principal Findings: In the cases set aside for validation, the 2NM had a McFadden pseudo R-squared of 0.51. The linear logistic regression, trained on the same training sample and evaluated on the same validation cases, had a McFadden pseudo R-squared of 0.09. The 2NM was significantly more accurate (alpha < 0.001) than linear logistic regression. We also describe a procedure for constructing a non-linear regression that achieves the same level of accuracy as the 2NM.
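
For reference, McFadden's pseudo R-squared is one minus the ratio of the model's log-likelihood to that of an intercept-only (null) model. The sketch below computes it for binary outcomes and predicted probabilities. It assumes the null model predicts the observed event rate of the same sample being scored, and it clips probabilities to avoid taking the logarithm of zero; neither choice is stated in the abstract, and the function names are hypothetical.

```python
import math


def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of outcomes y under probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)  # clip to avoid log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total


def mcfadden_r2(y, p):
    """McFadden pseudo R-squared: 1 - LL(model) / LL(null).

    Assumption: the null model predicts the event rate observed in y.
    """
    base_rate = sum(y) / len(y)
    ll_model = log_likelihood(y, p)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1 - ll_model / ll_null
```

A model that predicts only the base rate scores 0, and one that assigns nearly all probability to the observed outcomes approaches 1.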

Conclusions: The 2NM, a case-based reasoning method, captured nonlinear interactions in the data.

© 2016-2024, Copyrights Fortune Journals. All Rights Reserved