Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction
Information Processing and Management
The study of query performance prediction (QPP) in information retrieval (IR) aims to predict retrieval effectiveness. The specificity of the underlying information need of a query often determines how effectively can a search engine retrieve relevant documents at top ranks. The presence of ambiguous terms makes a query less specific to the sought information need, which in turn may degrade IR effectiveness. In this paper, we propose a novel word embedding based pre-retrieval feature which measures the ambiguity of each query term by estimating how many 'senses’ each word is associated with. Assuming each sense roughly corresponds to a Gaussian mixture component, our proposed generative model first estimates a Gaussian mixture model (GMM) from the word vectors that are most similar to the given query terms. We then use the posterior probabilities of generating the query terms themselves from this estimated GMM in order to quantify the ambiguity of the query. Previous studies have shown that post-retrieval QPP approaches often outperform pre-retrieval ones because they use additional information from the top ranked documents. To achieve the best of both worlds, we formalize a linear combination of our proposed GMM based pre-retrieval predictor with NQC, a state-of-the-art post-retrieval QPP. Our experiments on the TREC benchmark news and web collections demonstrate that our proposed hybrid QPP approach (in linear combination with NQC) significantly outperforms a range of other existing pre-retrieval approaches in combination with NQC used as baselines.
Roy, Dwaipayan; Ganguly, Debasis; Mitra, Mandar; and Jones, Gareth J.F., "Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction" (2019). Journal Articles. 859.