I ran into an interesting blog post and related paper not long ago that touch on an issue that comes up in the NLP community from time to time: at what point do standardized test collections start to cause more harm than good? In general we've come to value shared test sets, because they allow us to compare our methods to others using the same data. However, this can go badly wrong, and it appears that this might be the case with ad-hoc retrieval, one of the classic problems in Information Retrieval. The authors argue that there has been no real progress on this problem in more than a decade, despite a large number of papers reporting exciting new results. The problem, of course, isn't really the data; it's how the data is being used.
http://blog.codalism.com/?p=1029

And the related paper...

Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998
http://ww2.cs.mu.oz.au/~wew/papers/amwz09_cikm.pdf

Offered as food for thought.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse