To clean or not to clean: Document preprocessing and reproducibility

Article Type

Research Article

Publication Title

Journal of Data and Information Quality


Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25 or language modeling with Dirichlet smoothing, but can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increases. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.
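The abstract does not specify how the authors filtered markup; as a hypothetical illustration of the kind of preprocessing step under discussion, the sketch below strips HTML tags and attributes from a document before indexing, keeping only text content. It uses Python's standard-library `html.parser`; the function name `strip_markup` and the sample document are assumptions, not the paper's actual pipeline.

```python
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Collects only text content, discarding tags, attributes, and hyperlink URLs."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, so markup never reaches self.parts.
        self.parts.append(data)

    def text(self):
        # Join fragments and normalize runs of whitespace.
        return " ".join(" ".join(self.parts).split())


def strip_markup(doc):
    """Return the plain-text content of an HTML document (hypothetical filter)."""
    stripper = MarkupStripper()
    stripper.feed(doc)
    return stripper.text()


doc = ('<html><head><title>Example</title></head>'
      '<body><a href="http://example.com">link text</a> body words</body></html>')
print(strip_markup(doc))  # prints "Example link text body words"
```

Note that the anchor text ("link text") survives while the URL in the `href` attribute is dropped; whether anchor text, titles, or URLs should be indexed is exactly the kind of preprocessing decision whose effect the paper measures.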


