On Zero-Shot Recognition of Unseen State-Object Composition
Date of Submission
1-1-2025
Date of Award
4-11-2025
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Doctoral Thesis
Degree Name
Doctor of Philosophy
Subject Name
Computer Science
Department
Electronics and Communication Sciences Unit (ECSU-Kolkata)
Supervisor
Mukherjee, Dipti Prasad (ECSU-ISI Kolkata)
Abstract (Summary of the Work)
Compositional Zero-Shot Learning (CZSL) attempts to recognise images of new (unseen) compositions of states and objects when images of only a subset of state-object compositions are available as training data. Thus a CZSL model should recognise a young dog when it has seen images of the state-object compositions young bear, old bear and old dog. Solving the CZSL problem poses multiple challenges. It is difficult to disentangle the visual features of the object dog and its state young from the compositional image young dog. The visual features of a state are also observed to vary widely across compositions. For example, the state sliced has different visual features in the compositions sliced apple and sliced tomato. In the second chapter of the thesis, we attempt to disentangle the visual features of state and object using a two-stage sequential recognition approach. In the next chapter of the thesis, we work on the open-world CZSL problem, where no prior information about the feasibility of a state-object composition is available. We use a Graph Convolutional Network-based architecture along with a frequency-based feasibility prediction approach for the open-world CZSL problem. Another challenge in CZSL lies in the fact that the extent of association between the features of a state and an object varies significantly across different images of the same composition. For example, in different images of peeled orange, the oranges may be peeled to different extents. Thus the visual features of images of peeled orange may vary. In the fourth chapter, a novel Knowledge-guided Transformer Network is proposed to better process the partial association between the visual features of state and object. In the fifth chapter, we attempt the partially supervised CZSL (pCZSL) problem, where for each state-object compositional image, either the state or the object annotation is available.
For this setting, we propose a novel vision transformer-based architecture with a Locality Preserving Neighbourhood Aggregation approach. Effective identification of the discriminative features of state and object often depends on the scale of the object in the image. For example, in images of the two compositions young bear and old bear, the identification of the states young and old may depend on recognising the scale (or size) of the object bear in the image. In the sixth chapter, we leverage a Vision Language Model (VLM) to estimate the scale-aware features in CZSL. Extensive experiments on the C-GQA, MIT-States and UT-Zappos50k datasets demonstrate the effectiveness of the approaches in this thesis when compared to the state-of-the-art in the closed-world CZSL, open-world CZSL and pCZSL settings. As concluding remarks, we discuss the future scope of research in CZSL.
Control Number
ISI-Lib-TH637
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
DSpace Identifier
http://hdl.handle.net/10263/7550
Recommended Citation
Panda, Aditya, "On Zero-Shot Recognition of Unseen State-Object Composition" (2025). Doctoral Theses. 615.
https://digitalcommons.isical.ac.in/doctoral-theses/615
Comments
157p.