Evaluation is instrumental to developing and managing effective information retrieval systems. Crowdsourcing has proven a viable way to obtain the relevance judgments this process requires. However, the limits of crowdsourcing for evaluation are less well understood, particularly for domain-specific search. The authors compare relevance assessments gathered through crowdsourcing with those from a domain expert to evaluate different search engines in a large government archive. Although crowdsourced judgments rank the tested search engines in the same order as expert judgments, crowd workers appear unable to distinguish between different levels of highly accurate search results the way expert assessors can.