Using Negative Information for Text Retrieval.

Date of Submission

December 2009

Date of Award

Winter 12-12-2010

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science

Department

Computer Vision and Pattern Recognition Unit (CVPR-Kolkata)

Supervisor

Mitra, Mandar (CVPR-Kolkata; ISI)

Abstract (Summary of the Work)

Information retrieval

Information retrieval is a wide, often loosely defined term. Unfortunately, the word information can be very misleading. Nevertheless, information retrieval has become accepted as a description of the kind of work published by Cleverdon, Salton, Sparck Jones, Lancaster and others. An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely provides information on the existence (or non-existence) and whereabouts of documents relating to his request.

1.1.1 Definition

The discipline of information retrieval is almost as old as the computer itself. An old, if not the oldest, definition of information retrieval is the following by Mooers (1950) [2] (as cited in Savino and Sebastiani, 1998 [3]):

"Information retrieval is the name of the process or method whereby a prospective user of information is able to convert his need for information into an actual list of citations to documents in storage containing information useful to him."

In modern terminology, an information retrieval system is a software program that stores and manages information on documents. The system assists users in finding the information they need.

1.1.2 How an IR system works

The information retrieval process consists of several steps. The basic steps of information retrieval are as follows.

1.1.2.1 Stop word removal

Stop words are words with little meaning that are removed from the index and the query. Words might carry little meaning from a frequency (or information-theoretic) point of view, or alternatively from a linguistic point of view. Words that occur in many of the documents in the collection carry little meaning from a frequency point of view. Words that have a grammatical function but do not convey any information about the subject matter of the document might be removed whether their frequency in the collection is high or low. In fact, they should especially be removed if their frequency is low, because these words affect document scores the most. Removing stop words for linguistic reasons can be done by using a stop list that enumerates all words with little meaning, such as "the", "it" and "a". Stop lists are used in many systems, but their lengths vary considerably. For instance, the Smart stop list contains 571 words [5], whereas the Okapi system uses a moderate stop list of about 220 words (Robertson and Walker) [6].

1.1.2.2 Stemming

A stemmer applies morphological rules of thumb to normalize words. The stemmers commonly used are those by Lovins [7] and Porter [8]. A stemmer can produce undesirable effects, for it may conflate two words with very different meanings to the same stem. For example, "operate", "operating" and "operations" are all stemmed to "oper". As a result, the query "operating systems" can fetch documents related to "operations research".

1.1.2.3 Phrase extraction

During indexing and automatic query formulation, multiple words may be treated as one processing token. The meaning of a phrase might be quite different from what the two words independently suggest.
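
The stop word removal and stemming steps described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the tiny stop list, the whitespace tokenizer, the 0.8 document-ratio threshold and the toy suffix stripper (a stand-in for a real stemmer such as Lovins or Porter) are all assumptions made for illustration, as are the names STOP_LIST, frequent_terms, toy_stem and index_terms.

```python
from collections import Counter

# Hypothetical, tiny stop list for illustration only; real lists
# (e.g. Smart or Okapi) contain hundreds of words.
STOP_LIST = {"the", "it", "a", "of", "and", "is", "to"}

# Toy suffix rules, longest first. This is NOT the Lovins or Porter
# algorithm, only a stand-in that shows how conflation happens.
SUFFIXES = ("ations", "ating", "ate")


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Linguistic stop word removal: drop every token found in the stop list."""
    return [t for t in tokens if t not in stop_list]


def frequent_terms(documents, max_doc_ratio=0.8):
    """Frequency view: terms occurring in more than max_doc_ratio of documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    limit = max_doc_ratio * len(documents)
    return {term for term, df in doc_freq.items() if df > limit}


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, remove stop words, then stem what remains."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(index_terms("operating systems"))    # ['oper', 'systems']
print(index_terms("operations research"))  # ['oper', 'research']
```

Both queries yield the stem "oper", so under this sketch a search for one could match documents about the other, which is the undesirable conflation noted in the stemming section.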

Comments

ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843059

Control Number

ISI-DISS-2009-246

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

DOI

http://dspace.isical.ac.in:8080/jspui/handle/10263/6403

