Understanding the Performance of Expanded Queries.

Date of Submission

December 2013

Date of Award

Winter 12-12-2014

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)


Mitra, Mandar (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

Information Retrieval hardly needs any introduction today. Surveys show that about 85% of internet users use popular interactive search engines to satisfy their information needs. Such is the impact of information retrieval, particularly search engines, on our daily lives that the word "google" has been added to the Oxford English Dictionary as a verb, whereby "Google it" now means "search it"!

1.1 Brief Introduction To Information Retrieval

Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources [2]. Searches can be based on metadata or on full-text (or other content-based) indexing.

Definition 1.1. Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) [12].

1.1.1 Document, Collection And Query

A document is a file containing significant text content. It has some minimal structure, e.g. title, author, date, and subject. Examples of documents are web pages, emails, books, news, stories, scholarly papers, text messages, MS Word documents, MS PowerPoint documents, PDF documents, forum postings, and blogs.

A set of similar documents is called a collection. Generally, all activities of an IR system are performed on a collection of documents with a pre-defined structure or format (e.g. plain text, PDF, MS Word).

An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval, a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevance.

Vector Space Model

In the vector space model, documents and queries are represented as vectors. This model is very commonly used.

d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best-known schemes is tf-idf weighting [3]. TF means term frequency, which is the number of times a term occurs in a document. IDF means inverse document frequency, which is the logarithm of the ratio of the collection size to the number of documents containing that term.

The definition of a term depends on the application. Typically, terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).

Vector operations can be used to compare documents with queries.
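As a rough illustration of the tf-idf weighting and vector comparison described in the abstract, the following minimal Python sketch builds sparse term-weight vectors and ranks documents against a query. It is not taken from the dissertation. TF and IDF follow the definitions given above (raw term count, and the logarithm of collection size over document frequency). The whitespace tokenizer, the choice of cosine similarity as the comparison operation, the function names, and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, with each word as a term.
    return text.lower().split()


def build_idf(docs_tokens):
    # IDF(term) = log(collection size / number of documents containing the term).
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector: weight = raw term frequency * IDF.
    # Terms that never occur in the collection have no IDF and are skipped.
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(u, v):
    # Assumption: cosine similarity is used as the vector comparison.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Returns (score, document index) pairs, best match first;
    # ties are broken by document index.
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(q_vec, tfidf_vector(tokens, idf)), i)
              for i, tokens in enumerate(docs_tokens)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical three-document collection for illustration only.
DOCS = [
    "query expansion improves retrieval",
    "vector space model for retrieval",
    "cooking pasta at home",
]
RANKING = rank("query expansion", DOCS)
```

In this toy collection only the first document shares terms with the query, so it is ranked first and the other two receive a score of zero. Note that under this definition of IDF, a term occurring in every document gets weight zero, since log(1) = 0.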


ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843310

Control Number


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.



This document is currently not available here.