Journal article

Examining additivity and weak baselines


We present a study of which baseline to use when testing a new retrieval technique. In contrast to past work, we show that measuring a statistically significant improvement over a weak baseline is not a good predictor of whether a similar improvement will be measured on a strong baseline. Sometimes strong baselines are made worse when a new technique is applied. We investigate whether conducting comparisons against a range of weaker baselines can increase confidence that an observed effect will also show improvements on a stronger baseline. Our results indicate that this is not the case -- at best, testing against a range of baselines means that an experimenter can be more confident that the new technique is unlikely to significantly harm a strong baseline. Examining recent past work, we present evidence that the information retrieval (IR) community continues to test against weak baselines. This is unfortunate as, in light of our experiments, we conclude that the only way to be confident that a new technique is a contribution is to compare it against nothing less than the state of the art.

Publication Details