Author (Researcher Name)

Date of Submission

6-11-2026

Date of Award

6-19-2026

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science

Department

Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)

Supervisor

Bhattacharya, Ujjwal

Abstract (Summary of the Work)

In this work I build a system that recognizes isolated American Sign Language (ASL) words, and I use it to ask one fairly direct question: when training data is scarce, is it better to look at the video pixels or at the geometry of the signer’s body? To find out, I train two very different models on exactly the same clips. The first is appearance-based. Every frame goes through standard preprocessing and a ResNet50 backbone pre-trained on ImageNet, which turns it into a 2048-dimensional feature vector, and a bidirectional LSTM then reads that sequence over time. The second model never sees a pixel. It works only on MediaPipe keypoints, the tracked coordinates of the body and the two hands, and feeds them to a Transformer encoder. Both models therefore have to learn the same two things, the shape of the hands in each frame and the way those shapes move across frames, and both are trained, validated and tested under one identical protocol. What I care about throughout is a recognizer that is accurate but still light enough to be useful in practice, so that it could eventually make communication a little easier between people who sign and people who do not.
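
To make the difference between the two input streams concrete, the following is a minimal shape-bookkeeping sketch rather than the dissertation's code. The 2048-dimensional ResNet50 feature comes from the abstract; the landmark counts (33 for the body, 21 per hand) are the standard MediaPipe Pose and Hands counts. Keeping three coordinates (x, y, z) per landmark, the 32-frame clip length, and the names `SequenceShape`, `appearance_input` and `keypoint_input` are assumptions made here for illustration only.

```python
# Shape bookkeeping only; no model is built or trained here.
from dataclasses import dataclass

RESNET50_FEATURE_DIM = 2048  # per-frame feature size stated in the abstract
POSE_LANDMARKS = 33          # MediaPipe Pose landmark count
HAND_LANDMARKS = 21          # MediaPipe Hands landmark count, per hand
COORDS_PER_LANDMARK = 3      # assumption: x, y, z are kept for each landmark


@dataclass(frozen=True)
class SequenceShape:
    """Shape of one clip as a (frames, features) sequence."""
    frames: int
    features: int

    def values(self) -> int:
        """Total number of scalars the sequence model receives for the clip."""
        return self.frames * self.features


def appearance_input(frames: int) -> SequenceShape:
    """Sequence of ResNet50 features that a bidirectional LSTM would read."""
    return SequenceShape(frames, RESNET50_FEATURE_DIM)


def keypoint_input(frames: int) -> SequenceShape:
    """Sequence of body and two-hand keypoints for a Transformer encoder."""
    landmarks = POSE_LANDMARKS + 2 * HAND_LANDMARKS
    return SequenceShape(frames, landmarks * COORDS_PER_LANDMARK)


num_frames = 32  # hypothetical clip length, not taken from the dissertation
rgb = appearance_input(num_frames)
kp = keypoint_input(num_frames)
print("appearance stream:", rgb, "->", rgb.values(), "values per clip")
print("keypoint stream:  ", kp, "->", kp.values(), "values per clip")
```

Under these assumptions the keypoint stream carries 225 numbers per frame against 2048 for the appearance stream, roughly a ninefold reduction, which is one reason a geometry-only model may be the lighter of the two.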

Control Number

CS2426

DOI

https://dspace.isical.ac.in/items/f6544d28-6b9d-48ef-a19e-ab4d50891463

DSpace Identifier

http://hdl.handle.net/10263/7738
