Text retrieval engines, popularly known as search engines, return a list of documents (the hitlist) for a query. Typically there are some good documents in the list and some bad ones. The quality of a search engine is measured in terms of the proportion of good hits in the list, the positions of good hits relative to bad ones, and the proportion of good documents missing from the list. Ideally, a search engine must return all the good documents and only the good documents. Such an engine has very good quality and is said to have high precision, recall, and utility. Real search engines are only able to return some of the good documents in the collection along with some bad ones.
This paper explains quality metrics such as precision and recall. It also describes the TREC quality benchmark and explains how to interpret TREC results.
A structured query, by contrast, exactly defines the set of rows to be retrieved. Every row the database returns for a structured query is correct; there is never a bad hit for a structured query.
Why do text queries return bad hits? Why isn't every document in the hitlist exactly what the user was looking for? Do all search engines have incorrect algorithms? Not at all. The problem lies in the very nature of natural languages. Text queries do not define a hitlist exactly, except in trivial cases. They do not provide a complete specification of the user's intention. Every query has an implied intention that cannot be described fully in the query and hence is not available to the retrieval engine. For example, if a user is looking for books by Stephen King, a structured query such as
SELECT title FROM book_table WHERE author = 'Stephen King';
will not return any bad hits (unless there are multiple authors with that name and the user was only interested in one of them). On the other hand, the text query
SELECT title FROM book_review WHERE
CONTAINS(review_text, 'Stephen King') > 0;
is very likely to return titles of books that were not authored by Stephen King but whose reviews mention the name for some reason. From the user's point of view, such documents are not useful. They are bad hits, or nonrelevant to the user's intended query. In the absence of structured data that puts authors' names, and only authors' names, in a separate column, the user has no choice but to issue such a query. It is impossible for the user to know all the possible patterns of words in which 'Stephen King' could appear in a book review and thus to write an exact query that enables a search engine to return only relevant documents.
From the engine's point of view, the hitlist is correct because each hit contains the string 'Stephen King'. But from the user's point of view, not all of the hits are relevant, and given unstructured data, there is in general no way to write a better query that returns only relevant documents. Adding a description of the user's intention and using it as a free text query only makes things worse in many cases: introducing more words and more free text into a query invariably introduces more bad hits.
Relevance is somewhat subjective. So is any interpretation of a natural language text. This does not mean that it is impossible or futile to measure the quality of text retrieval engines [Cooper, 1997]. Relevance judgments of hitlists made by users can be used to compare different retrieval engines. They can also be used to measure the quality of a particular engine within the limits of statistical validation techniques [Harter, 1996; Jones & Willett, 1997; Voorhees & Harman, 1996].
Measures of quality are based on the underlying idea of cost and benefit of a retrieved document to the user. A good hit has a certain benefit to the user. A nonrelevant hit has a certain cost to the user: time spent reading it, cost of getting the document translated, edited, and so on.
The cost or benefit of a hit also depends on its position in the hitlist. The top few hits are especially important since users are very likely to read (or otherwise process) them. A relevant document at the top of the hitlist has a higher benefit than one down the list. A nonrelevant document at the top has a higher cost. Many nonrelevant hits at the top may altogether discourage the user from going down the list to find relevant documents. Most search engines attempt to order the hits in a hitlist in decreasing order of relevance to the query (as determined by the engine). There is also a cost associated with not retrieving a relevant document. The user may miss important information contained in such documents.
The utility of a hitlist is a measure of the overall benefit or cost of the hitlist to the user. A general definition of utility is:
utility = c1 * N1 - (c2 * N2 + c3 * N3)

where:

c1 is the benefit of retrieving a relevant document,
N1 is the number of relevant documents in the hitlist,
c2 is the cost of a nonrelevant document in the hitlist,
N2 is the number of nonrelevant documents in the hitlist,
c3 is the cost of not retrieving a relevant document, and
N3 is the number of relevant documents in the collection that are not retrieved (i.e., are missing from the hitlist).
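The definition above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the paper: it assumes a hitlist is represented as a list of 1/0 relevance flags, and the function and parameter names are my own.

```python
def utility(hitlist_rel, total_relevant, c1=1.0, c2=1.0, c3=0.0):
    """General utility of a hitlist, per the formula above.

    hitlist_rel    -- list of 1/0 flags, one per hit, 1 if relevant
    total_relevant -- number of relevant documents in the collection
    c1, c2, c3     -- benefit/cost constants (defaults are arbitrary)
    """
    n1 = sum(hitlist_rel)              # relevant documents retrieved (N1)
    n2 = len(hitlist_rel) - n1         # nonrelevant documents retrieved (N2)
    n3 = total_relevant - n1           # relevant documents missed (N3)
    return c1 * n1 - (c2 * n2 + c3 * n3)

# A hitlist with 3 relevant and 2 nonrelevant hits, out of 5 relevant
# documents in the collection, with all three constants set to 1:
print(utility([1, 1, 0, 1, 0], total_relevant=5, c1=1, c2=1, c3=1))  # 3 - (2 + 2) = -1
```

Note that an empty hitlist scores 0 when c3 = 0, which illustrates the "retrieve nothing" problem discussed next.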
There are several problems with such a general measure of utility. It has an unbounded range, including a negative range that depends on the size of the document collection and the hitlist, which makes it difficult to compare utility across queries. It assumes that costs and benefits are the same for every position in the hitlist: in the above definition, the cost of a bad hit at the top of the list is the same as that of hit number 1000, which is not a realistic measure of user costs and benefits. It is also easy for an engine to achieve higher utility by retrieving fewer hits. For example, an engine that returns no hits at all for a query may have a higher utility measure (= 0 if c3 = 0) than one that retrieves some hits but ends up with a negative utility number. Finally, utility numbers are very sensitive to the relative values of c1, c2, and c3.
Recall and precision are more useful measures that have a fixed range and are easy to compare across queries and engines. Precision is a measure of the usefulness of a hitlist; recall is a measure of the completeness of the hitlist.
Recall is defined as:
recall = # relevant hits in hitlist / # relevant documents in the collection
Recall is a measure of how well the engine performs in finding relevant documents. Recall is 100% when every relevant document is retrieved. In theory, it is easy to achieve good recall: simply return every document in the collection for every query! Therefore, recall by itself is not a good measure of the quality of a search engine.
Precision is defined as:
precision = # relevant hits in hitlist / # hits in hitlist
Precision is a measure of how well the engine performs in not returning nonrelevant documents. Precision is 100% when every document returned to the user is relevant to the query. There is no easy way to achieve 100% precision other than in the trivial case where no document is ever returned for any query!
Both precision and recall have a fixed range: 0.0 to 1.0 (or 0% to 100%).
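The two definitions translate directly into code. As before, this is a sketch using 1/0 relevance flags; the names are illustrative.

```python
def recall(hitlist_rel, total_relevant):
    # fraction of the collection's relevant documents that were retrieved
    return sum(hitlist_rel) / total_relevant

def precision(hitlist_rel):
    # fraction of the hitlist that is relevant
    return sum(hitlist_rel) / len(hitlist_rel)

hits = [1, 0, 1, 1, 0]                 # 3 relevant hits out of 5 returned
print(precision(hits))                 # 0.6
print(recall(hits, total_relevant=6))  # 0.5
```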
A retrieval engine must have a high recall to be admissible in most applications. The key in building better retrieval engines is to increase precision without sacrificing recall. Most search engines on the Web, for example, give a reasonably good recall but poor precision, i.e., they find some relevant documents but they also return too many nonrelevant hits that users do not want to read.
Recall and precision are measures for the entire hitlist. They do not account for the quality of the ranking of hits in the hitlist. Users want the retrieved documents to be ranked according to their relevance to the query instead of just being returned as a set: the most relevant hits must be among the top few documents returned for a query. Relevance ranking can be measured by computing precision at different cut-off points. For example, if the top 10 documents are all relevant to the query and the next ten are all nonrelevant, we have 100% precision at a cut-off of 10 documents but 50% precision at a cut-off of 20 documents. Relevance ranking in this hitlist is very good since all relevant documents are above all the nonrelevant ones. Sometimes the term recall at n is used informally to refer to the actual number of relevant documents up to that point in the hitlist, i.e., recall at n is the same as n * (precision at n).
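Precision at a cut-off is just precision restricted to the top n hits. A minimal sketch, again over 1/0 relevance flags, reproducing the 10-relevant-then-10-nonrelevant example above:

```python
def precision_at(hitlist_rel, n):
    # precision computed over only the top n hits
    return sum(hitlist_rel[:n]) / n

hits = [1] * 10 + [0] * 10       # top 10 relevant, next 10 nonrelevant
print(precision_at(hits, 10))    # 1.0
print(precision_at(hits, 20))    # 0.5

# "recall at n" in the informal sense above:
print(10 * precision_at(hits, 10))  # 10 relevant documents seen at cut-off 10
```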
Precision at n does not measure recall. A new measure called average precision, which combines precision, relevance ranking, and overall recall, is defined as follows:
Let n be the number of hits in the hitlist;
let h[i] be the ith hit in the hitlist;
let rel[i] be 1 if h[i] is relevant and 0 otherwise;
let R be the total number of relevant documents in the collection for the query.
Then precision at hit j is:
precision[j] = (sum over k = 1..j of rel[k]) / j

average precision = (sum over j = 1..n of precision[j] * rel[j]) / R
That is, average precision is the sum of the precision at each relevant hit in the hitlist divided by the total number of relevant documents in the collection.
Average precision is an ideal measure of the quality of retrieval engines. To get an average precision of 1.0, the engine must retrieve all relevant documents (i.e., recall = 1.0) and rank them perfectly (i.e., precision at R = 1.0). Notice that bad hits do not add anything to the numerator, but they reduce the precision at all points below them and thereby reduce the contribution of each relevant document in the hitlist that is below the bad one. However, average precision is not a general measure of utility since it does not assign any cost to bad hits that appear below all relevant hits in the hitlist.
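The definition is easy to compute in one pass over the hitlist. This sketch follows the same 1/0-flag convention as the earlier examples; the second call shows how a single bad hit at the top lowers every later contribution.

```python
def average_precision(hitlist_rel, total_relevant):
    """Sum of the precision at each relevant hit, divided by R."""
    score = 0.0
    relevant_seen = 0
    for j, rel in enumerate(hitlist_rel, start=1):
        if rel:
            relevant_seen += 1
            score += relevant_seen / j   # precision at hit j
    return score / total_relevant

# Perfect ranking of all 3 relevant documents gives 1.0:
print(average_precision([1, 1, 1, 0], total_relevant=3))   # 1.0

# A bad hit at the top lowers the precision at every relevant hit below it:
print(average_precision([0, 1, 1, 1], total_relevant=3))   # (1/2 + 2/3 + 3/4) / 3
```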
A common way to depict the degradation of precision at n as one traverses the hitlist is to plot interpolated precision numbers against percentage recall. A percentage recall of, say, 50% is the position in the hitlist at which 50% of the relevant documents in the collection (i.e., 0.5 * R) have been retrieved. It is a measure of the number of hits one has to read before seeing a certain percentage of the relevant documents.
Precision-recall graphs have a classical concave shape. Figure 1 shows a typical graph. The graph shows the trade-off between precision and recall. Trying to increase recall typically introduces more bad hits into the hitlist, thereby reducing precision (i.e., moving to the right along the curve). Trying to increase precision typically reduces recall by removing some good hits from the hitlist (i.e., moving left along the curve). An ideal goal for a retrieval engine is to increase both precision and recall by making improvements to the engine, i.e., the entire curve must move up and out to the right so that both recall and precision are higher at every point along the curve. Average precision is indeed the area under the precision-recall graph. By moving the curve up and to the right, the area under the graph increases, thereby increasing the average precision of the engine.
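The interpolated numbers behind such a graph can be computed as sketched below. This assumes the standard interpolation rule used in TREC reporting, where the interpolated precision at recall level r is the maximum precision observed at recall r or any higher recall; the representation and names are illustrative.

```python
def interpolated_precision(hitlist_rel, total_relevant, recall_points):
    """Interpolated precision at each requested recall level."""
    # precision and recall after each hit in the list
    prec, rec = [], []
    relevant_seen = 0
    for j, rel in enumerate(hitlist_rel, start=1):
        relevant_seen += rel
        prec.append(relevant_seen / j)
        rec.append(relevant_seen / total_relevant)
    # interpolate: best precision at this recall level or beyond
    out = []
    for r in recall_points:
        candidates = [p for p, rr in zip(prec, rec) if rr >= r]
        out.append(max(candidates) if candidates else 0.0)
    return out

hits = [1, 0, 1, 0, 0, 1]   # 3 relevant documents, all eventually retrieved
print(interpolated_precision(hits, 3, [0.0, 0.5, 1.0]))  # [1.0, 2/3, 0.5]
```

Plotting these values against the recall points yields the concave curve described above.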
A few other measures are explained here for the sake of completeness. R-precision is the precision at R, where R is the number of relevant documents in the collection for the query. An R-precision of 1.0 is equivalent to perfect relevance ranking and perfect recall. However, an R-precision value below 1.0, as is typical, does not indicate the actual value of recall (since some of the relevant documents may be present in the hitlist beyond point R).
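R-precision is just precision at the cut-off R. A one-line sketch, using the same conventions as the earlier examples:

```python
def r_precision(hitlist_rel, total_relevant):
    # precision at cut-off R, where R = # relevant documents in the collection
    return sum(hitlist_rel[:total_relevant]) / total_relevant

# R = 3, and only 2 of the top 3 hits are relevant:
print(r_precision([1, 1, 0, 1, 0], total_relevant=3))   # 2/3
```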
Initial precision is the precision at recall 0% in the interpolated precision-recall graph. It is an indication of relevance ranking of the top few hits. Similarly, one can define a final precision that is the precision at 100% recall. Final precision indicates how far down one needs to go in the hitlist to find all relevant documents (if indeed recall is 100%).
Fallout is a measure of how quickly precision drops as recall is increased. Fallout is defined as
fallout = # nonrelevant hits in hitlist / # nonrelevant documents in the collection
i.e., the proportion of the collection's nonrelevant documents that was retrieved. Recall-fallout curves are sometimes used. However, they are more difficult to interpret from a user's point of view, since fallout depends on the size of the document collection, which does not matter to the user as much as the size of the hitlist.
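Fallout follows the same pattern as the other measures; note that its denominator is a property of the collection, not the hitlist. A sketch with illustrative names:

```python
def fallout(hitlist_rel, total_nonrelevant):
    # fraction of the collection's nonrelevant documents that were retrieved
    nonrelevant_hits = len(hitlist_rel) - sum(hitlist_rel)
    return nonrelevant_hits / total_nonrelevant

# 2 bad hits retrieved, out of 100 nonrelevant documents in the collection:
print(fallout([1, 0, 1, 1, 0], total_nonrelevant=100))   # 0.02
```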
The above measures assume that a hitlist ordered by relevance is available. In classification applications, such as routing and filtering, binary decisions are made for each document as it enters the system making it impossible to order documents by relevance rank. Hence different measures of normalized utility are used to evaluate routing and filtering engines [Hull, 1998; Lewis, 1995]. Utility measures are still under debate in the information retrieval community and there is no consensus on an ideal measure of utility.
TREC is an annual conference for academic and industrial text retrieval systems conducted by the National Institute of Standards and Technology (NIST). There are several tracks in TREC that cater to special interests and applications. The following is a brief description of the "main" track, which is called the ad hoc retrieval track.
TREC participants receive a 2 GB document collection with about half a million documents (528,155 documents). This includes news articles from the LA Times and the Financial Times, documents from the Foreign Broadcast Information Service, and the Federal Register. At a later date, participants receive 50 topics. Each topic describes a query in English. A topic has a title, a brief description, and a narrative that explains the intention of the topic. For example, Topic 436 used in TREC8 reads:
<num> Number: 436
<title> railway accidents
<desc> Description:
What are the causes of railway accidents throughout the world?
<narr> Narrative:
A relevant document provides data on railway accidents of any sort (i.e., locomotive, trolley, streetcar) where either the railroad system or the vehicle or pedestrian involved caused the accident. Documents that discuss railroading in general, new rail lines, new technology for safety, and safety and accident prevention are not relevant, unless an actual accident is described.
Participants have two months to submit results for the 50 topics. The result for each topic is a hitlist of the top 1000 hits for that topic in decreasing order of relevance. NIST evaluates the results and publishes precision and recall numbers for each participant, averaged over all 50 topics. Submissions are compared with each other primarily using the mean average precision over all 50 topics. NIST also publishes the median recall and average precision numbers for each topic. Shown below are official NIST results for Oracle's submission to TREC8:
Total number of documents over all queries
Retrieved: 50000
Relevant: 4728
Rel_ret: 3384
Interpolated Recall - Precision Averages:
at 0.00 0.9279
at 0.10 0.7747
at 0.20 0.6610
at 0.30 0.5448
at 0.40 0.4542
at 0.50 0.3882
at 0.60 0.3168
at 0.70 0.2565
at 0.80 0.2067
at 0.90 0.1332
at 1.00 0.0791
Average precision (non-interpolated) over all rel docs
Precision:
At 5 docs: 0.7840
At 10 docs: 0.7220
At 15 docs: 0.6587
At 20 docs: 0.5990
At 30 docs: 0.5367
At 100 docs: 0.3440
At 200 docs: 0.2282
At 500 docs: 0.1206
At 1000 docs: 0.0677
R-Precision (precision after R (= num_rel for a query) docs retrieved):
It may be noted that the overall recall from the above data is 3384/4728 = 71.57%.
Recall numbers in TREC are not absolute recall numbers. Since the document collection is quite large, it is not possible to read the entire collection to make relevance judgments for all documents. NIST makes relevance judgments for only the top 100 hits in one submission by each participating group. The resulting answer key (called qrels) is thus incomplete and recall numbers are computed using this answer key. However, statistical studies have shown that increasing the coverage of the answer key does not alter the relative measures of submissions significantly [Voorhees & Harman, 1996].
TREC results must be used in accordance with NIST guidelines. They may not be used in advertisements and press releases. Cross-system comparisons with other named systems may not be made for individual results. Any claims made using TREC results must be "accurate, the evaluation measures used to substantiate these claims must be stated, and a reference must be made to the full conference proceedings... Informal, qualitative comparisons must be clearly stated to be such and thus open to statistical reassessment."
[Cooper, 1997] On Selecting a Measure of Retrieval Effectiveness, Cooper, W. S., in Readings in Information Retrieval, Jones, K. S. and Willett, P. (ed.), Morgan Kaufmann, 1997.
[Frakes & Baeza-Yates, 1992] Information Retrieval: Data Structures & Algorithms, Frakes, W. B. and Baeza-Yates, R. (ed.), Prentice Hall, 1992.
[Grossman & Frieder, 1998] Information Retrieval: Algorithms and Heuristics, Grossman, D. A. and Frieder, O., Kluwer Academic Publishers, 1998.
[Harman, 1993] Overview of the Second Text REtrieval Conference (TREC-2), Harman, D., 1993.
[Harter, 1996] Variations in relevance assessments and the measurement of retrieval effectiveness, Harter, S. P., Journal of the American Society for Information Science, 47(1):37-49, 1996.
[Hull, 1998] The TREC-7 Filtering Track: Description and Analysis, Hull, D. A., 1998.
[Jones & Willett, 1997] Readings in Information Retrieval, Jones, K. S. and Willett, P. (ed.), Morgan Kaufmann, 1997 (see Chapter 4: Evaluation).
[Lewis, 1995] Evaluating and Optimizing Autonomous Text Classification Systems, Lewis, D. D., in Proc. 18th ACM/SIGIR Conference, pages 246-254, 1995.
[Salton & McGill, 1983] Introduction to Modern Information Retrieval, Salton, G. and McGill, M., McGraw-Hill, 1983.
[Voorhees & Harman, 1996] Overview of the Fifth Text REtrieval Conference (TREC-5), Voorhees, E. M. and Harman, D., 1996.