Text Retrieval Quality: A Primer
Kavi Mahesh
Oracle Corporation
Text retrieval engines, popularly known as search engines, return a list of documents
(the hitlist) for a query. Typically there are some good documents in the list and
some bad ones. The quality of a search engine is measured in terms of the proportion of
good hits in the list, the positions of good hits relative to bad ones, and the proportion
of good documents missing from the list. Ideally, a search engine must return all the good
documents and only the good documents. Such an engine has very good quality and is said to
have high precision, recall, and utility. Real search engines are
only able to return some of the good documents in the collection along with some bad ones.
This paper explains quality metrics such as precision and recall. It also describes the
TREC quality benchmark and explains how to interpret TREC results.
Good and Bad: Correctness vs. Relevance
Every structured query exactly defines a set of rows to be retrieved. Every row
retrieved by the database for a structured query is always correct. There is never a bad
hit for a structured query.
Why do text queries return bad hits? Why isn't every document in the hitlist exactly
what the user was looking for? Do all search engines have incorrect algorithms? Not at
all. The problem lies in the very nature of natural languages. Text queries do not define
a hitlist exactly, except in trivial cases. They do not provide a complete specification
of the user's intention. Every query has an implied intention that cannot be described
fully in the query and hence is not available to the retrieval engine. For example, if a
user is looking for books by Stephen King, a structured query such as
SELECT title FROM book_table WHERE author = `Stephen King';
will not return any bad hits (unless there are multiple authors with that name and the
user was only interested in one of them). On the other hand, the text query
SELECT title FROM book_review WHERE
CONTAINS(review_text, `Stephen King')>0;
is very likely to return titles of books not authored by Stephen King, but had the name
mentioned in the review for some reason. From the user's point of view, such documents are
not useful. They are bad hits, or, nonrelevant to the user's intended query. In the
absence of structured data that puts authors' names and only authors' names in a separate
column, the user has no choice but to issue such a query. It is impossible for the user to
know all the possible patterns of words in which `Stephen King' could appear in a book
review to write an exact query that enables a search engine to return only relevant
documents.
From the engine's point of view, the hitlist is correct because each hit contains the
string `Stephen King'. But from the user's point of view, not all of them are relevant and
given unstructured data, there is no way, in general, to write a better query that returns
only relevant documents. Adding a description of the user's intention and using it as a
free text query only makes things worse in many cases. Introducing more words and more
free text into a query invariably introduces more bad hits.
Relevance is somewhat subjective. So is any interpretation of a natural language text.
This does not mean that it is impossible or futile to measure the quality of text
retrieval engines [Cooper, 1997]. Relevance judgments of hitlists made by users can be
used to compare different retrieval engines. They can also be used to measure the quality
of a particular engine within the limits of statistical validation techniques [Harter,
1996; Jones & Willett, 1997; Voorhees & Harman, 1996].
Utility and Relevance Ranking
Measures of quality are based on the underlying idea of cost and benefit of a retrieved
document to the user. A good hit has a certain benefit to the user. A nonrelevant hit has
a certain cost to the user: time spent reading it, cost of getting the document
translated, edited, and so on.
The cost or benefit of a hit also depends on its position in the hitlist. The top few
hits are especially important since users are very likely to read (or otherwise process)
them. A relevant document at the top of the hitlist has a higher benefit than one down the
list. A nonrelevant document at the top has a higher cost. Many nonrelevant hits at the
top may altogether discourage the user from going down the list to find relevant
documents. Most search engines attempt to order the hits in a hitlist in decreasing order
of relevance to the query (as determined by the engine). There is also a cost associated
with not retrieving a relevant document. The user may miss important information contained
in such documents.
The utility of a hitlist is a measure of the overall benefit or cost of the
hitlist to the user. A general definition of utility is:
utility = c1 * N1 - (c2 * N2 + c3 * N3)
where
c1 is the benefit of retrieving a relevant document,
N1 is the number of relevant documents in the hitlist,
c2 is the cost of a nonrelevant document in the hitlist,
N2 is the number of nonrelevant documents in the hitlist,
c3 is the cost of not retrieving a relevant document, and
N3 is the number of relevant documents in the collection that are not retrieved (i.e., are
missing from the hitlist)
There are several problems with such a general measure of utility. It has an unbounded
range, including a negative range that depends on the size of the document collection and
hitlist. This makes it difficult to compare utility across queries. It assumes that costs
and benefits are the same for every position in hitlist. In the above definition of
utility, the cost of a bad hit at the top of the list is the same as that of hit number
1000. This is not a realistic measure of user costs and benefits. It is easy for an engine
to achieve higher utility by retrieving less hits. For example, an engine that returns no
hits at all for a query may have a higher utility measure (= 0 if c3 = 0) than one that
retrieves some but ends up with a negative utility number. Utility numbers are also very
sensitive to the relative values of c1, c2, and c3.
Recall and Precision
Recall and precision are more useful measures that have a fixed range and are easy to
compare across queries and engines. Precision is a measure of the usefulness of a hitlist;
recall is a measure of the completeness of the hitlist.
Recall is defined as:
recall = # relevant hits in hitlist / # relevant documents in the collection
Recall is a measure of how well the engine performs in finding relevant documents.
Recall is 100% when every relevant document is retrieved. In theory, it is easy to achieve
good recall: simply return every document in the collection for every query! Therefore,
recall by itself is not a good measure of the quality of a search engine.
Precision is defined as:
precision = # relevant hits in hitlist / # hits in hitlist
Precision is a measure of how well the engine performs in not returning nonrelevant
documents. Precision is 100% when every document returned to the user is relevant to the
query. There is no easy way to achieve 100% precision other than in the trivial case where
no document is ever returned for any query!
Both precision and recall have a fixed range: 0.0 to 1.0 (or 0% to 100%).
A retrieval engine must have a high recall to be admissible in most applications. The
key in building better retrieval engines is to increase precision without sacrificing
recall. Most search engines on the Web, for example, give a reasonably good recall but
poor precision, i.e., they find some relevant documents but they also return too many
nonrelevant hits that users do not want to read.
Precision at n
Recall and precision are measures for the entire hitlist. They do not account for the
quality of ranking the hits in the hitlist. Users want the retrieved documents to be
ranked according to their relevance to the query instead of just being returned as a set.
The most relevant hits must be in the top few documents returned for a query. Relevance
ranking can be measured by computing precision at different cut-off points. For example,
if the top 10 documents are all relevant to the query and the next ten are all
nonrelevant, we have 100% precision at a cut off of 10 documents but a 50% precision at a
cut off of 20 documents. Relevance ranking in this hitlist is very good since all relevant
documents are above all the nonrelevant ones. Sometimes the term recall at n is
used informally to refer to the actual number of relevant documents up to that point in
the hitlist, i.e., recall at n is the same as
(n * precision at n).
Average Precision
Precision at n does not measure recall. A new measure called average
precision that combines precision, relevance ranking, and overall recall is defined as
follows:
Let n be the number of hits in the hitlist;
let h[i] be the ith hit in the hitlist;
let rel[i] be 1 if h[i] is relevant and 0 otherwise;
let R be the total number of relevant documents in the collection for the query.
Then precision at hit j is:
precision[j] = k
= 1..j rel[k] / j
Average precision = j = 1..n (precision[j] * rel[j]) / R
That is, average precision is the sum of the precision at each relevant hit in the
hitlist divided by the total number of relevant documents in the collection.
Average precision is an ideal measure of the quality of retrieval engines. To get an
average precision of 1.0, the engine must retrieve all relevant documents (i.e., recall =
1.0) and rank them perfectly (i.e., precision at R = 1.0). Notice that bad hits do not add
anything to the numerator, but they reduce the precision at all points below and thereby
reduce the contribution of each relevant document in the hitlist that is below the bad
one. However, average precision is not a general measure of utility since it does not
account any cost for bad hits that appear below all relevant hits in the hitlist.
Precision-Recall Graphs
A common way to depict the degradation of precision at n as one traverses the
hitlist is to plot interpolated precision numbers against percentage recall. A percentage
recall of say 50% is the position in the hitlist at which 50% of the relevant documents in
the collection (i.e., 0.5 * R) have been retrieved. It is a measure of the number of hits
you have to read before you have seen a certain percentage of relevant documents.
Precision-recall graphs have a classical concave shape. Figure 1 shows a typical graph.
The graph shows the trade-off between precision and recall. Trying to increase recall
typically introduces more bad hits into the hitlist, thereby reducing precision (i.e.,
moving to the right along the curve). Trying to increase precision typically reduces
recall by removing some good hits from the hitlist (i.e., moving left along the curve). An
ideal goal for a retrieval engine is to increase both precision and recall by making
improvements to the engine, i.e., the entire curve must move up and out to the right so
that both recall and precision are higher at every point along the curve. Average
precision is indeed the area under the precision-recall graph. By moving the curve up and
to the right, the area under the graph increases, thereby increasing the average precision
of the engine.
Other Measures: R-Precision, Initial Precision, and Fallout
A few other measures are explained here for the sake of completeness. R-precision
is the precision at R where R is the number of relevant documents in the collection for
the query. An R-precision of 1.0 is equivalent to perfect relevance ranking and perfect
recall. However, a typical value of R-precision which is far below 1.0, does not indicate
the actual value of recall (since some of the relevant documents may be present in the
hitlist beyond point R).
Initial precision is the precision at recall 0% in the interpolated
precision-recall graph. It is an indication of relevance ranking of the top few hits.
Similarly, one can define a final precision that is the precision at 100% recall.
Final precision indicates how far down one needs to go in the hitlist to find all relevant
documents (if indeed recall is 100%).
Fallout is a measure of how quickly precision drops as recall is increased.
Fallout is defined as
Fallout = # nonrelevant hits in hitlist / # nonrelevant hits in the collection
i.e., the portion of bad documents that was retrieved. Recall-fallout curves are
sometimes used. However, they are more difficult to interpret from a user's point of view
since fallout depends on the size of the document collection which does not matter to the
user as much as the size of the hitlist.
Utility Measures for Routing and Filtering
The above measures assume that a hitlist ordered by relevance is available. In
classification applications, such as routing and filtering, binary decisions are made for
each document as it enters the system making it impossible to order documents by relevance
rank. Hence different measures of normalized utility are used to evaluate routing and
filtering engines [Hull, 1998; Lewis, 1995]. Utility measures are still under debate in
the information retrieval community and there is no consensus on an ideal measure of
utility.
Quality Benchmark: Text REtrieval Conference (TREC)
TREC is an annual conference for academic and industrial text retrieval systems
conducted by the National Institute of Standards and
Technology. (There are several tracks in TREC that cater to special interests and
applications. The following is a brief description of the "main" tracks which is
called the ad hoc retrieval track.
TREC participants receive a 2 GB document collection with about half a million
documents (528, 155 documents). This includes news articles from LA Times and Financial
Times, documents from the Foreign Broadcast Information Service, and the Federal Register.
At a later date, participants receive 50 topics. Each topic describes a query in English.
A topic has a title, a brief description, and a narrative that explains the intention of
the topic. For example, Topic 436 used in TREC8 reads:
<top>
<num> Number: 436
<title> railway accidents
<desc> Description:
What are the causes of railway accidents throughout the world?
<narr> Narrative:
A relevant document provides data on railway accidents of any sort (i.e., locomotive,
trolley, streetcar) where either the railroad system or the vehicle or pedestrian involved
caused the accident. Documents that discuss railroading in general, new rail lines, new
technology for safety, and safety and accident prevention are not relevant, unless an
actual accident is described.
</top>
Participants have two months to submit results for the 50 topics. The result for each
topic is a hitlist of the top 1000 hits for that topic in decreasing order of relevance.
NIST evaluates the results and publishes precision and recall numbers for each
participant, averaged over all 50 topics. Submissions are compared with each other
primarily using the mean average precision over all 50 topics. NIST also publishes the
median recall and average precision numbers for each topic. Shown below are official NIST
results for Oracle's submission to Trec8 (ad hoc):
Total number of documents over all queries
Retrieved: 50000
Relevant: 4728
Rel_ret: 3384
Interpolated Recall - Precision Averages:
at 0.00 0.9279
at 0.10 0.7747
at 0.20 0.6610
at 0.30 0.5448
at 0.40 0.4542
at 0.50 0.3882
at 0.60 0.3168
at 0.70 0.2565
at 0.80 0.2067
at 0.90 0.1332
at 1.00 0.0791
Average precision (non-interpolated) over all rel docs
0.4130
Precision:
At 5 docs: 0.7840
At 10 docs: 0.7220
At 15 docs: 0.6587
At 20 docs: 0.5990
At 30 docs: 0.5367
At 100 docs: 0.3440
At 200 docs: 0.2282
At 500 docs: 0.1206
At 1000 docs: 0.0677
R-Precision (precision after R (= num_rel for a query) docs retrieved):
Exact: 0.4357
It may be noted that the overall recall from the above data is 3384/4728 = 71.57%.
Recall numbers in TREC are not absolute recall numbers. Since the document collection
is quite large, it is not possible to read the entire collection to make relevance
judgments for all documents. NIST makes relevance judgments for only the top 100 hits in
one submission by each participating group. The resulting answer key (called qrels)
is thus incomplete and recall numbers are computed using this answer key. However,
statistical studies have shown that increasing the coverage of the answer key does not
alter the relative measures of submissions significantly [Voorhees & Harman, 1996].
TREC results must be used in accordance with NIST guidelines. They may not be used in
advertisements and press releases. Cross-system comparisons with other named systems may
not be made for individual results. Any claims made using TREC results must be
"accurate, the evaluation measures used to substantiate these claims must be stated,
and a reference must be made to the full conference proceedings... Informal, qualitative
comparisons must be clearly stated to be such and thus open to statistical
reassessment."
Further Reading
[Cooper, 1997] On Selecting a Measure of Retrieval Effectiveness, Cooper, W. S., in Readings
in Information Retrieval, Jones, K. S. and Willett, P., (ed.), Morgan Kaufmann, 1997,
[Frakes & Baeza-Yates, 1992] Information Retrieval: Data Structures &
Algorithms, Frakes, W. B., and Baeza-Yates, R. (ed.), Prentice Hall, 1992.
[Grossman & Frieder, 1998] Information Retrieval: Algorithms and Heuristics,
Grossman, D. A. and Frieder, O., Kluwer Academic Publishers, 1998.
[Harman, 1993] Overview
of the Second Text REtrieval Conference (TREC-2), Harman, D, 1993.
[Harter, 1996] Variations in relevance assessments and the measurement of retrieval
effectiveness, Harter, S. P., Journal of the American Society for Information Science,
47(1):37-49, 1996.
[Hull, 1998] The TREC-7
Filtering Track: Description and Analysis, Hull, D. A., 1998.
[Jones & Willett, 1997] Readings in Information Retrieval, Jones, K. S. and
Willett, P., (ed.), Morgan Kaufmann, 1997, (See Chapter 4: Evaluation).
[Lewis, 1995] Evaluating and Optimizing Autonomous Text Classification Systems. Lewis,
D. D. In Proc. 18th ACM/SIGIR Conference, pages 246-254, 1995.
[Salton & McGill, 1983] Introduction to Modern Information Retrieval, Salton, G.
and McGill, M. McGraw-Hill, 1983.
TREC overview
TREC publications
[Voorhees & Harman, 1996] Overview of the Fifth Text
REtrieval Conference (TREC-5), Voorhees, E. M. and Harman, D., 1996,
|