View Count Prediction of a Video Through Deep Neural Network Based Analysis of Subjective Video Attributes.

Date of Submission

December 2020

Date of Award

Winter 12-12-2021

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Electronics and Communication Sciences Unit (ECSU-Kolkata)


Mukherjee, Dipti Prasad (ECSU-Kolkata; ISI)

Abstract (Summary of the Work)

This research work aims to design a method for predicting the view count of a video using deep neural network based analysis of subjective video attributes. With more and more companies turning to online video content influencers to capture the millennial audience, getting people to watch videos on online platforms is becoming increasingly lucrative. We address the problem of predicting how many views a video will attract by building a model of our own. Our model takes four subjective video attributes as input and predicts the probable view count of the video as output. The attributes are the thumbnail image, the title caption, the audio associated with the video, and the video itself. We preprocess each attribute separately to obtain its feature vector. Our model contains four branches, one per attribute. We pass the feature vector of each component to its respective branch to capture the salient features of the thumbnail image using a pre-trained CNN architecture, AlexNet; the sentiment feature of the title caption using Sentiment Intensity Analyzer; the temporal feature of the audio waveform using an LSTM; and both the temporal and salient features of the video using a Convolutional LSTM. Since on most online platforms a user clicks a video based on its title and thumbnail, the model first generates a click affinity feature representing the user's affinity to click the video. After clicking, the user decides whether to watch the video based on the audio and the video itself, so the view count is predicted from the click affinity feature together with the temporal feature of the audio waveform and the spatio-temporal feature of the video, using a regressor network called the viral video-prediction network. A loss function designed from these regression values is used to train the last two stages of the pipeline.
We obtain a test accuracy as high as 95.89%.
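The two-stage fusion described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the class name `TwoStageFusion`, the layer sizes, the random weights, and the squared-error loss are all assumptions made for illustration, and the four input vectors stand in for features that the AlexNet, Sentiment Intensity Analyzer, LSTM, and Convolutional LSTM branches would already have produced.

```python
import random


def make_linear(n_in, n_out, rng):
    """Create a dense layer (weights, bias) with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply a dense layer to the input vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


class TwoStageFusion:
    """Hypothetical stand-in for the last two stages of the pipeline.

    Stage 1 fuses thumbnail and title features into a click affinity feature.
    Stage 2 regresses a view-count value from click affinity, audio, and video.
    All dimensions are illustrative assumptions, not values from the source.
    """

    def __init__(self, thumb_dim, title_dim, audio_dim, video_dim,
                 affinity_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.affinity_layer = make_linear(thumb_dim + title_dim, affinity_dim, rng)
        self.hidden_layer = make_linear(affinity_dim + audio_dim + video_dim,
                                        hidden_dim, rng)
        self.output_layer = make_linear(hidden_dim, 1, rng)

    def click_affinity(self, thumb_feat, title_feat):
        # Stage 1: concatenate the two "click-time" cues and project them.
        return relu(linear(thumb_feat + title_feat, self.affinity_layer))

    def predict(self, thumb_feat, title_feat, audio_feat, video_feat):
        # Stage 2: concatenate click affinity with the "watch-time" cues.
        affinity = self.click_affinity(thumb_feat, title_feat)
        fused = affinity + audio_feat + video_feat
        hidden = relu(linear(fused, self.hidden_layer))
        return linear(hidden, self.output_layer)[0]


def squared_error(prediction, target):
    """Assumed regression loss; the source does not specify its loss function."""
    return (prediction - target) ** 2


if __name__ == "__main__":
    model = TwoStageFusion(thumb_dim=6, title_dim=4, audio_dim=5, video_dim=7)
    thumb = [0.2] * 6             # placeholder for CNN thumbnail features
    title = [0.0, 0.5, 0.5, 0.6]  # placeholder for title sentiment scores
    audio = [0.1] * 5             # placeholder for audio LSTM features
    video = [0.3] * 7             # placeholder for ConvLSTM video features
    prediction = model.predict(thumb, title, audio, video)
    print(prediction, squared_error(prediction, 1.0))
```

The sketch only covers the forward pass and the fusion order. In a trained model the weights of both stages would be fitted by minimizing the loss over labelled videos, as the abstract indicates for the last two stages of the pipeline.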


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


This document is currently not available here.