Introduction. Evaluation is central to designing, developing and maintaining effective information retrieval (search) systems, as it measures how successfully a system meets its goal of helping users fulfil their information needs. But what does it mean to be successful? Success might refer to whether an information retrieval system retrieves relevant (rather than non-relevant) documents; how quickly results are returned; how well the system supports users' interactions; whether users are satisfied with the results; how easily users can use the system; whether the system helps users carry out their tasks and fulfil their information needs; whether the system has an impact on its wider environment; how reliable the system is; and so on. Evaluation of information retrieval systems has been actively researched for over 50 years and remains an area of discussion and controversy.
Test collections. In this paper we discuss system-oriented evaluation, which focuses on measuring system effectiveness: how well an information retrieval system separates relevant from non-relevant documents for a given user query. We discuss the construction and use of standardised benchmarks, known as test collections, for evaluating information retrieval systems.
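To make the notion of system effectiveness concrete, the sketch below scores one system's ranked output against a test collection's relevance judgements (qrels) using two standard metrics, precision at k and average precision. The query ID, document IDs and judgements are illustrative, not drawn from any real collection.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of precision at the ranks where relevant documents appear,
    normalised by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical qrels (judged-relevant documents per query) and one
# system's ranked results for a single query.
qrels = {"q1": {"d2", "d5", "d9"}}
run = {"q1": ["d2", "d7", "d5", "d1", "d9"]}

p5 = precision_at_k(run["q1"], qrels["q1"], 5)   # 3 of top 5 relevant -> 0.6
ap = average_precision(run["q1"], qrels["q1"])   # (1/1 + 2/3 + 3/5) / 3
```

Averaging such per-query scores over the full query set of a test collection yields the summary figures (e.g. mean average precision) typically used to compare systems.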
Research directions. The paper also describes current and future research directions for test collection-based evaluation, including efficient gathering of relevance assessments, the relationship between system effectiveness and user utility, and evaluation across user sessions.
Conclusions. This paper has described test collections, which have been widely used in information retrieval evaluation and provide a standard approach for measuring system effectiveness.