Date of Submission


Date of Award


Institute Name (Publisher)

Indian Statistical Institute

Document Type

Doctoral Thesis

Degree Name

Doctor of Philosophy

Subject Name



Theoretical Statistics and Mathematics Unit (TSMU-Kolkata)


Ghosh, Anil Kumar (TSMU-Kolkata; ISI)

Abstract (Summary of the Work)

The advancement of data acquisition technologies and computing resources has greatly facilitated the analysis of massive data sets in various fields of science. Researchers from different disciplines rigorously investigate these data sets to extract useful information for new scientific discoveries. Many of these data sets contain a large number of features but a small number of observations. For instance, in the fields of chemometrics (see e.g., Schoonover et al. (2003)), medical image analysis (see e.g., Yushkevich et al. (2001)) and microarray gene expression data analysis (see e.g., Eisen and Brown (1999), Alter et al. (2000)), we often deal with data of dimension higher than several thousand but sample sizes of the order of a few hundred or even less. Such high dimension, low sample size (HDLSS) data present a substantial challenge to the statistics community. Many well-known classical multivariate methods cannot be used in such situations. For example, because of the singularity of the estimated pooled dispersion matrix, the classical Hotelling's T² statistic (see e.g., Anderson (2003)) cannot be used for the two-sample test when the dimension of the data exceeds the combined sample size. Over the last few years, researchers have become increasingly interested in developing statistical methods that are applicable to HDLSS data. In this thesis, we develop some nonparametric methods that can be used for high-dimensional two-sample problems involving two independent samples as well as those involving matched pair data.

In a two-sample testing problem, one usually tests the equality of two d-dimensional probability distributions F and G based on two sets of independent observations x1, x2, ..., xn1 from F and y1, y2, ..., yn2 from G.
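The singularity problem mentioned above is easy to see numerically. The sketch below (an illustration, not part of the thesis) draws two Gaussian samples with dimension d larger than the combined sample size and checks that the pooled covariance estimate, which Hotelling's T² statistic must invert, has rank at most n1 + n2 − 2 and is therefore singular:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 50, 10, 10  # dimension exceeds combined sample size (HDLSS)

X = rng.standard_normal((n1, d))  # sample from F
Y = rng.standard_normal((n2, d))  # sample from G

# Pooled sample dispersion matrix: the estimate Hotelling's T^2 inverts.
Sx = np.cov(X, rowvar=False)
Sy = np.cov(Y, rowvar=False)
S = ((n1 - 1) * Sx + (n2 - 1) * Sy) / (n1 + n2 - 2)

# Rank is bounded by n1 + n2 - 2 = 18 < d = 50, so S is not invertible.
rank = np.linalg.matrix_rank(S)
print(rank, d)
```

Since the rank of S cannot exceed the number of centered observations, the matrix is singular whenever d > n1 + n2 − 2, and the T² statistic is simply undefined.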
This problem is well investigated in the literature, and several parametric and nonparametric tests are available for it. Parametric methods assume a common parametric form for F and G, and test the equality of the parameter values (which could be scalar or finite-dimensional vector valued) in the two distributions. For instance, if F and G are assumed to be normal (Gaussian) with a common but unknown dispersion, one uses Fisher's t statistic (when d = 1) or Hotelling's T² statistic (when d > 1) to test the equality of their locations (see e.g., Mardia et al. (1979); Anderson (2003)). Though these tests have several optimality properties for data having normal distributions, they are not robust against outliers and can mislead our inference if the underlying distributions are far from normal. Since the performance of parametric methods largely depends on the validity of the underlying model assumptions, nonparametric methods are often preferred because of their flexibility and robustness.

In the univariate setup, rank-based nonparametric tests like the Wilcoxon-Mann-Whitney test, the Kolmogorov-Smirnov maximum deviation test and the Wald-Wolfowitz run test (see e.g., Hollander and Wolfe (1999); Gibbons and Chakraborti (2003)) are often used. These tests are distribution-free, and they outperform Fisher's t test for a wide variety of non-Gaussian distributions.
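Two of the univariate rank-based tests named above are available off the shelf in SciPy; the sketch below (an illustration under assumed Gaussian samples with a location shift, not code from the thesis) applies the Wilcoxon-Mann-Whitney test and the Kolmogorov-Smirnov two-sample test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.standard_normal(30)        # sample from F
y = rng.standard_normal(30) + 1.0  # sample from G: same shape, shifted location

# Wilcoxon-Mann-Whitney rank-sum test (two-sided by default).
u_stat, u_p = stats.mannwhitneyu(x, y)

# Kolmogorov-Smirnov maximum-deviation two-sample test.
ks_stat, ks_p = stats.ks_2samp(x, y)

print(u_p, ks_p)
```

Because these statistics depend on the observations only through their ranks (or the empirical distribution functions), the null distributions are the same whatever the common continuous distribution of the two samples, which is what makes the tests distribution-free.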


ProQuest Collection ID

Control Number


Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.


Included in

Mathematics Commons