Robustness concerns in high-dimensional data analyses and potential solutions

Document Type

Book Chapter

Publication Title

Big Data Analytics in Chemoinformatics and Bioinformatics: with Applications to Computer-Aided Drug Design, Cancer Biology, Emerging Pathogens and Computational Toxicology

Abstract

This chapter provides an overview of parametric statistical procedures for high-dimensional data, focusing primarily on robustness against data contamination. In particular, we consider the problem of simultaneous variable selection and parameter estimation in high-dimensional regression set-ups. We discuss how, for this purpose, standard methods based on squared-error or likelihood-based loss functions are extremely nonrobust in the presence of outliers or other contamination in the sample data. This motivates the need for appropriate robust statistical methodologies for deriving stable and correct inferences from noisy high-dimensional data. We provide a brief review of the existing methods for robust and sparse estimation under high-dimensional linear and generalized linear models (GLMs), including the class of penalized M-estimators and the minimum penalized density power divergence estimators (MPDPDEs). In this context, we further derive the oracle consistency and asymptotic normality of MPDPDEs under ultrahigh-dimensional GLMs, results that were not previously available in the literature. Finally, the application and usefulness of these robust high-dimensional procedures are illustrated via a real-life problem: identifying important structural and chemical descriptors of amines that explain their mutagenic activity on the Salmonella typhimurium strain TA98.

First Page

37

Last Page

60

DOI

10.1016/B978-0-323-85713-0.00032-3

Publication Date

1-1-2022
