On Zero-Shot Recognition of Unseen State-Object Composition

Date of Submission

1-1-2025

Date of Award

4-11-2025

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Doctoral Thesis

Degree Name

Doctor of Philosophy

Subject Name

Computer Science

Department

Electronics and Communication Sciences Unit (ECSU-Kolkata)

Supervisor

Mukherjee, Dipti Prasad (ECSU-ISI Kolkata)

Abstract (Summary of the Work)

Compositional Zero-Shot Learning (CZSL) attempts to recognise images of new (unseen) compositions of states and objects when images of only a subset of state-object compositions are available as training data. Thus a CZSL model should recognise a young dog when it has seen images of the state-object compositions young bear, old bear and old dog. Solving the CZSL problem poses multiple challenges. It is difficult to disentangle the visual features of the object dog and its state young from the compositional image young dog. The visual features of a state are also observed to vary widely across compositions. For example, the state sliced has different visual features in the compositions sliced apple and sliced tomato. In the second chapter of the thesis, we attempt to disentangle the visual features of state and object using a two-stage sequential recognition approach. In the third chapter, we work on the open-world CZSL problem, where no prior information about the feasibility of a state-object composition is available. We use a Graph Convolutional Network based architecture along with a frequency-based feasibility prediction approach for the open-world CZSL problem. Another challenge in CZSL is that the extent of association between the features of a state and an object varies significantly across different images of the same composition. For example, in different images of peeled orange, the oranges may be peeled to different extents, so the visual features of images of peeled orange may vary. In the fourth chapter, a novel Knowledge-guided Transformer Network is proposed to better process the partial association between the visual features of state and object. In the fifth chapter, we attempt the partially supervised CZSL (pCZSL) problem, where for each state-object compositional image, either the state or the object annotation is available. For this problem, we propose a novel vision transformer based architecture with a Locality Preserving Neighbourhood Aggregation approach. Effective identification of the discriminative features of state and object often depends on the scale of the object in the image. For example, in images of the two compositions young bear and old bear, identifying the states young and old may depend on recognising the scale (or size) of the object bear in the image. In the sixth chapter, we leverage a Vision Language Model (VLM) to estimate scale-aware features in CZSL. Extensive experiments on the C-GQA, MIT-States and UT-Zappos50k datasets demonstrate the effectiveness of the approaches in this thesis when compared to the state-of-the-art in the closed-world CZSL, open-world CZSL and pCZSL settings. As concluding remarks, we discuss the future scope of research in CZSL.

Comments

157p.

Control Number

ISI-Lib-TH637

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

DSpace Identifier

http://hdl.handle.net/10263/7550
