Generalisation in Natural Language Clustering Through Non-parametric Statistical Approach

Article Type

Research Article

Publication Title

SN Computer Science

Abstract

Robust statistical methodologies are imperative for effectively analysing data and quantifying specific phenomena, particularly when attempting to comprehend intricate events. The present research endeavours to introduce and assess potent non-parametric statistical approaches that are compatible with heterogeneous data structures. The primary focus lies in their application within the domains of language clustering and natural language processing. A central objective is to refine our understanding of language clustering and its potential implications, including the emergence of linguistic regions known as sprachbunds. To achieve this, the study delves into diverse non-parametric facets of linguistic data processing and exploration. Building upon the foundation established by previous work (Chattopadhyay et al. in International conference on soft computing and its engineering applications, Springer, Cham, 2022), this study extends its scope by proposing a novel framework for structuring language families. This is accomplished through the incorporation of typological and areal characteristics, enriching the accuracy and depth of language classification. The utilisation of non-parametric techniques takes centre stage throughout this process. Notably, multidimensional scaling (MDS) is harnessed to transform resulting data into a Cartesian framework, enabling the deployment of data-depth-based methods for reliable outlier identification. This proves invaluable for effectively categorising a wide array of languages situated on the fringes of existing classifications. Furthermore, it opens avenues for reevaluating established language categorisation schemes in light of newfound insights.
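The abstract's pipeline (distances between languages, MDS into Cartesian coordinates, depth-based outlier detection) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy feature vectors, the function names `classical_mds` and `spatial_depth`, the choice of spatial depth as the depth function, and the embedding dimension `k=3` are all assumptions made for illustration; the abstract does not say which depth function or dimension the study uses.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Embed an n x n distance matrix into k Cartesian coordinates
    using classical (Torgerson) multidimensional scaling."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    # Clip tiny negative eigenvalues caused by floating-point error.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def spatial_depth(points):
    """Spatial depth of each point relative to the whole sample.
    Values near 0 indicate points far from the bulk of the data.
    The zero self-difference is included in the average."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]  # diffs[i, j] = x_i - x_j
    norms = np.linalg.norm(diffs, axis=2, keepdims=True)
    units = np.divide(diffs, norms, out=np.zeros_like(diffs), where=norms > 0)
    return 1.0 - np.linalg.norm(units.mean(axis=1), axis=1)


# Hypothetical feature vectors: five "languages" close together, one far away.
toy = np.array([
    [1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0],
    [8.0, 8.0, 8.0],
])

# Pairwise Euclidean distances stand in for a linguistic dissimilarity measure.
dist = np.linalg.norm(toy[:, None, :] - toy[None, :, :], axis=2)

coords = classical_mds(dist, k=3)    # Cartesian embedding of the distances
depth = spatial_depth(coords)        # low depth = candidate outlier
outlier = int(np.argmin(depth))      # index of the least central point
```

In this toy setting the distant point receives the lowest depth and is the natural outlier candidate. With real typological or areal data, the dissimilarity measure, the embedding dimension, and the depth threshold would all need to be chosen and justified separately.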

DOI

10.1007/s42979-023-02389-6

Publication Date

1-1-2024
