Using Negative Information for Text Retrieval.
Date of Submission
December 2009
Date of Award
Winter 12-12-2010
Institute Name (Publisher)
Indian Statistical Institute
Document Type
Master's Dissertation
Degree Name
Master of Technology
Subject Name
Computer Science
Department
Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)
Supervisor
Mitra, Mandar (CVPR-Kolkata; ISI)
Abstract (Summary of the Work)
Information retrieval
Information retrieval is a wide, often loosely defined term; unfortunately, the word "information" can be very misleading. Nevertheless, "information retrieval" has become accepted as a description of the kind of work published by Cleverdon, Salton, Sparck Jones, Lancaster and others. An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely provides information on the existence (or non-existence) and whereabouts of documents relating to his request.
1.1.1 Definition
The discipline of information retrieval is almost as old as the computer itself. An old, if not the oldest, definition of information retrieval is the following by Mooers (1950) [2] (cited from Savino and Sebastiani, 1998 [3]):
"Information retrieval is the name of the process or method whereby a prospective user of information is able to convert his need for information into an actual list of citations to documents in storage containing information useful to him."
In modern-day terminology, an information retrieval system is a software program that stores and manages information on documents. The system assists users in finding the information they need.
1.1.2 How an IR system works
The information retrieval process consists of several steps. The basic steps of information retrieval are as follows.
1.1.2.1 Stop word removal
Stop words are words with little meaning that are removed from the index and the query. Words might carry little meaning from a frequency (or information-theoretic) point of view, or alternatively from a linguistic point of view. Words that occur in many of the documents in the collection carry little meaning from a frequency point of view. Words that have a grammatical function but do not convey any information about the subject matter of the document might be removed whether their frequency in the collection is high or low. In fact, they should especially be removed if their frequency is low, because these words affect document scores the most. Removing stop words for linguistic reasons can be done by using a stop list that enumerates all words with little meaning, such as "the", "it" and "a". Stop lists are used in many systems, but their lengths vary considerably. For instance, the Smart stop list contains 571 words [5], whereas the Okapi system uses a moderate stop list of about 220 words (Robertson and Walker) [6].
1.1.2.2 Stemming
A stemmer applies morphological rules of thumb to normalize words. The stemmers commonly used are those by Lovins [7] and Porter [8]. A stemmer can produce undesirable effects, for it may conflate two words with very different meanings to the same stem. For example, "operate", "operating" and "operations" are all stemmed to "oper". As a result, the query "operating systems" can fetch documents related to "operations research".
1.1.2.3 Phrase extraction
During indexing and automatic query formulation, multiple words may be treated as one processing token. The meaning of a phrase might be quite different from what its constituent words independently suggest.
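The stop word removal and stemming steps described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the stop list `STOP_WORDS` is a hypothetical miniature list (far shorter than the Smart or Okapi lists), and `toy_stem` is a simplified suffix stripper standing in for a real stemmer such as Lovins or Porter. It is only meant to show how unrelated words can be conflated to one stem.

```python
import re

# Hypothetical miniature stop list (assumption); real lists are far longer.
STOP_WORDS = {"the", "it", "a", "an", "of", "and", "to", "in", "is"}

# Toy suffix rules, checked in order (longest first).
# This is a stand-in for a real stemmer, not the Lovins or Porter algorithm.
SUFFIXES = ("ations", "ation", "ating", "ates", "ate", "ing", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, remove stop words, then stem the remaining tokens."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


query_terms = index_terms("the operating systems")
doc_terms = index_terms("An introduction to operations research")

# Terms shared by the query and the document after normalization.
shared = set(query_terms) & set(doc_terms)
print(query_terms, doc_terms, shared)
```

In this sketch the query and the document share a term only because "operating" and "operations" are both reduced to the same stem, which is the undesirable conflation noted in the stemming section.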
Control Number
ISI-DISS-2009-246
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
DOI
http://dspace.isical.ac.in:8080/jspui/handle/10263/6403
Recommended Citation
Banik, Aritra, "Using Negative Information for Text Retrieval." (2010). Master’s Dissertations. 46.
https://digitalcommons.isical.ac.in/masters-dissertations/46
Comments
ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843059