Fairly Retrieving Documents of All Lengths: A study of Document Length Normalization using the Language Modeling approach

Normalizing document length is widely recognized as an important factor in tuning retrieval systems. Previous studies have shown that tuning the retrieval model so that the lengths of retrieved documents are similar to the lengths of relevant documents results in substantially better performance. However, the goal of Document Length Normalization is to "fairly" retrieve documents of all lengths. In this paper, we examine this proposition against the previous findings in the context of the Language Modeling approach for ad hoc information retrieval, and study the impact of the smoothing method and parameter setting on the length of documents retrieved. Our study reveals that fairly retrieving documents results in mediocre-performing parameter estimates, while using the relevant documents delivers excellent estimates. While this reconfirms previous findings, we find that the discrepancy appears to stem from the fact that relevant documents are drawn from a biased sample, the set of assessed documents.

keywords: Information Retrieval, Document Length Normalization, Parameter Tuning, Probabilistic Language Models