Author (Researcher Name)

Date of Submission

7-14-2025

Date of Award

2-4-2026

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Doctoral Thesis

Degree Name

Doctor of Philosophy

Subject Name

Statistics

Department

Theoretical Statistics and Mathematics Unit (TSMU-Kolkata)

Supervisor

Das, Swagatam & Chaudhuri, Probal

Abstract (Summary of the Work)

Generative modeling focuses on producing new data samples that closely resemble those drawn from an original, unknown distribution. Although long familiar in statistical estimation theory, the approach has gained substantial traction in recent years, driven by groundbreaking results in areas such as image synthesis, natural language generation, and network modeling. The complexity of modern data domains, and the adaptations that suitable models must consequently undergo, present new challenges. These advances raise several fundamental questions, the first of which is: when do generative models accurately approximate the true data distribution? One may also ask: how well do these models perform under contaminated data? This work explores these questions through the lens of generative modeling frameworks that, by design, involve distinct data spaces.

We focus on two major classes of such models that blend optimal transport and representation learning in their objectives: Wasserstein autoencoders (WAEs) and cycle-consistent cross-domain translators. On its way to reconstructing the input, a WAE learns a latent code, which in turn aids the simulation of new pseudo-random replicates. By statistically characterizing the latent distribution and the dimension-reducing transforms involved, we present a detailed error analysis underlying WAEs. From a non-parametric density estimation perspective, we establish deterministic bounds on the latent and reconstruction errors that adapt to the intrinsic dimension of the input data. We also study the extent of distortion that WAE-generated samples suffer when learned from contaminated data. Key takeaways for practitioners include specific architectural suggestions that foster near-perfect sampling. The framework developed thus far extends naturally to unpaired cycle-consistent cross-domain models. We show that the sufficient conditions for successful data translation under Sobolev- and Hölder-smooth distributions resemble those for WAEs. Our analysis also yields error upper bounds attributable to ill-posed transformations and validates the choice of divergences used in the objectives.
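For context, the WAE objective referred to above minimizes a reconstruction cost while matching the aggregate latent distribution to a prior; a standard form (with encoder $Q(Z \mid X)$, decoder $G$, cost $c$, latent prior $P_Z$, latent divergence $\mathcal{D}_Z$, and penalty weight $\lambda > 0$) is:

```latex
\min_{Q(Z \mid X)} \;
\mathbb{E}_{X \sim P_X}\, \mathbb{E}_{Z \sim Q(Z \mid X)}
\big[ c\big(X, G(Z)\big) \big]
\;+\; \lambda \, \mathcal{D}_Z\big(Q_Z, P_Z\big),
```

where $Q_Z$ denotes the aggregated posterior obtained by marginalizing $Q(Z \mid X)$ over $P_X$.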

Finally, in search of a consolidated solution to the robustification problem, we present parallel formulations based on the Gromov-Wasserstein (GW) distance. Owing to the equivalence between Gromov-Monge samplers built on GW and cross-domain translation models, including WAEs and GW-based autoencoders (GWAE), this addresses the second question. We study the robust recovery guarantees, concentration, and tractable computational properties of the newly introduced distance measures under diverse contamination models.
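For reference, the GW distance mentioned above compares metric measure spaces $(\mathcal{X}, d_{\mathcal{X}}, \mu)$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \nu)$ without requiring a shared ambient space; its order-$p$ form is:

```latex
GW_p(\mu, \nu)
= \left( \inf_{\pi \in \Pi(\mu, \nu)}
\iint \big| d_{\mathcal{X}}(x, x') - d_{\mathcal{Y}}(y, y') \big|^p
\, \mathrm{d}\pi(x, y)\, \mathrm{d}\pi(x', y') \right)^{1/p},
```

where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$. It thus measures how well pairwise distances, rather than points themselves, can be matched across the two domains.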

Control Number

TH675

DOI

https://dspace.isical.ac.in/items/b0ccfba8-8b1b-4152-b317-0b3dccadf677

DSpace Identifier

http://hdl.handle.net/10263/7646
