> From dominich@... Thu Nov 15 05:54:36 2001
> Thoughts Fragments to Testing in Web IR
>
>
> I have been following with great interest the "e-discussion" on testing,
> experimentation, and spam in IR, which has been unfolding recently (and
> especially since the last ACM SIGIR Conference in New Orleans this year).
> As I have been performing testing in Interaction IR (both using standard
> test collections and in vivo in Web IR) in recent years, I find this
> discussion useful and exciting, and worth continuing.
Here's a proposal that I've had floating around for a while,
waiting to see whether there are the resources to do it. It looks
like there are; it's now a question of priority (can't do too many
new things at once in a track!). It specifically addresses the issue
of a new paradigm for testing in a dynamic environment, basically
by throwing lots of queries at it.
--------------------------
The proposal should be pretty orthogonal to most of Dave's recent
list of suggestions. The form of this message is:
  Short form of proposal.
  Background and justification.
  More details of proposal.
Short Form: I propose the web track participants turn in the top
1000 documents for each of 2000 topics. 50 of those topics will
be standard topics, formed as they were this year. The other
1950 topics will be short queries (2-5 words) for which one
relevant document is known (known to the evaluators, not the
participants). Evaluation will be standard TREC evaluation on the
50 topics, and a comparison of the ranks at which the single
relevant document was retrieved for the other 1950.
----
Background and Justification: Like many of you, I've been trying
to come up with ways of practically evaluating retrieval in a web
environment with dynamic collections and dynamic documents. It's
obvious that our standard test collection approach isn't well
suited. My view of one possibly good methodology for evaluation:
Construct a large set of unbiased query-document (QD) pairs, where
the document (web page) is relevant to the query. The base comparative
evaluation of two systems on a single query would be Boolean:
does system 1 retrieve the relevant document at a better rank
than system 2? There are any number of aggregate and absolute
evaluation measures that can be computed (e.g. mean reciprocal
rank), but the basic notion of goodness remains the comparative
ranks of the known relevant document.
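
A minimal sketch of the per-query comparison and of mean reciprocal
rank, for illustration only. The function names are hypothetical, ranks
are assumed to be 1-based, and None is assumed to mean the known
document did not appear in the submitted list.

```python
from typing import Optional, Sequence


def reciprocal_rank(rank: Optional[int]) -> float:
    """Return 1/rank for a 1-based rank, or 0.0 if the document was not retrieved."""
    return 0.0 if rank is None else 1.0 / rank


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average the reciprocal rank of the known relevant document over all queries."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)


def compare(rank_1: Optional[int], rank_2: Optional[int]) -> int:
    """Comparison on a single query.

    Returns 1 if system 1 ranks the known document better (lower rank),
    -1 if system 2 does, and 0 on a tie (including both missing it).
    """
    a = float("inf") if rank_1 is None else rank_1
    b = float("inf") if rank_2 is None else rank_2
    return (a < b) - (a > b)
```

For example, mean_reciprocal_rank([1, 2, None, 4]) is 0.4375, and
compare(3, 10) is 1 because system 1 placed the document higher.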
This approach accommodates dynamic collections: it doesn't care if
other relevant documents are added or deleted - recall and
precision are not issues. If the known relevant document is
deleted, then neither system will retrieve it, and the system
comparison will just be based on one fewer query. The set of
query-document pairs will gradually become less powerful in
distinguishing systems as the known documents become fewer.
The approach can handle evaluating dynamic documents (we'll
ignore the fact that current systems can't index or retrieve them.)
We obviously need a large set of QD pairs. Single document
evaluation on a query is much less powerful in distinguishing
systems than full relevance judging. The dynamic nature of web
collections means the set of QD pairs will become smaller.
The requirement that the QD pairs be unbiased towards any system
is a major stumbling block. I think we can approximate
unbiasedness, but first we need to show the evaluation approach
works at all. Thus this year's proposal.
----
Details of proposal: NIST will construct its set of 50 topics as
normal (using actual user needs as the queries or seeds). In
addition, assessors will look at random pages of the collection
and give a 2-5 word query for which that page would be relevant.
(Obviously not all pages would be able to have a reasonable query
constructed for them. I'm hopeful that maybe 1 in 3-4 would.)
1950 of these queries would be constructed. The two sets would
be randomly mixed; participants would not know which of the 2000
topics would be subject to relevance judging.
Participants would turn in the top 1000 documents for each topic.
We would furnish a version of "check_input" that would do
compression of results for them. I figure with a 16 million page
collection, the full results for 2000 queries should be about 6
MBytes.
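
One way to arrive at a figure of that size (an assumption; the
arithmetic is not spelled out above) is to store each retrieved
document as a fixed-width identifier just wide enough to address the
collection, with the rank implied by position. A rough sketch with
hypothetical names:

```python
def estimated_result_bytes(collection_size: int,
                           n_queries: int,
                           docs_per_query: int) -> int:
    """Estimate the size of a full submission under a fixed-width encoding.

    Assumes each document is stored as the smallest whole number of bytes
    that can index any page in the collection, and nothing else is stored.
    """
    # Bits needed to distinguish collection_size documents (ids 0..size-1).
    bits_per_id = max(1, (collection_size - 1).bit_length())
    # Round up to whole bytes.
    bytes_per_id = (bits_per_id + 7) // 8
    return n_queries * docs_per_query * bytes_per_id


# 16 million pages need 24 bits, i.e. 3 bytes per document id.
size = estimated_result_bytes(16_000_000, 2000, 1000)
print(size / 1_000_000, "MBytes")
```

Under these assumptions, 2000 queries times 1000 documents times 3
bytes comes to 6,000,000 bytes, which matches the estimate above.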
Assessors would judge the 50 normal topics; they would not have
to do any judging for the other 1950 topics. The primary evaluation
of the 1950 would be simply the number of other systems that a
given system beat on a query, averaged over all queries (or some
equivalent of this.) Other measures would be calculated (I can
think of a number of them!), and compared against both the
comparative evaluation, and the normal TREC evaluation of the 50
topics.
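
The primary measure could be sketched as follows. This is only an
illustration: the data layout (a dict from a system name to its list of
per-query ranks) and the function name are hypothetical, None stands
for "known document not retrieved", and dropping queries where no
system found the document is an assumption in line with the
"one fewer query" remark earlier.

```python
from typing import Dict, List, Optional


def mean_systems_beaten(ranks: Dict[str, List[Optional[int]]]) -> Dict[str, float]:
    """For each system, the number of other systems it beat, averaged over queries.

    A system beats another on a query if it retrieved the known document
    at a strictly better (lower) rank, or retrieved it when the other did not.
    """
    systems = list(ranks)
    n_queries = len(ranks[systems[0]])
    totals = {s: 0 for s in systems}
    used = 0  # queries where at least one system retrieved the document
    for q in range(n_queries):
        col = {s: ranks[s][q] for s in systems}
        if all(r is None for r in col.values()):
            continue  # no system found it; the query carries no information
        used += 1
        for s in systems:
            mine = col[s]
            if mine is None:
                continue  # a system that missed the document beats nobody
            totals[s] += sum(
                1 for t in systems
                if t != s and (col[t] is None or mine < col[t])
            )
    if used == 0:
        return {s: 0.0 for s in systems}
    return {s: totals[s] / used for s in systems}


example = {"A": [1, 5, None], "B": [2, None, None], "C": [3, 4, None]}
print(mean_systems_beaten(example))
```

In the example, the third query is dropped because no system retrieved
the known document, so each average is over two queries.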
We should get all sorts of information out of this. The fact that we
can compare normal TREC evaluation with the new evaluation
methodology is a major plus. Just having 2000 sets of retrieved
documents is nice (all sorts of things can be done). Hopefully,
2000 topics is many more than we need to accurately be able to
compare systems; we'll be able to figure out some error rates for
using fewer topics. If the approach proves viable, there's all
kinds of joint-participant experiments that can be done outside
the provinces of TREC. QD pairs should be much easier to
construct than (almost) full relevance judgements a la TREC. But,
we need to show the approach is valid first.
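
One possible way to estimate an error rate for using fewer topics is to
draw random subsets of the queries and count how often the subset
disagrees with the full set about which of two systems is better. This
is a sketch of that idea, not a settled methodology; the function name,
the per-query scores (for instance reciprocal ranks) and the trial
count are all assumptions.

```python
import random
from typing import Sequence


def disagreement_rate(scores_a: Sequence[float],
                      scores_b: Sequence[float],
                      subset_size: int,
                      trials: int = 1000,
                      seed: int = 0) -> float:
    """Fraction of random query subsets that fail to reproduce the full-set winner.

    scores_a[i] and scores_b[i] are the two systems' scores on query i.
    A subset on which the systems tie counts as a disagreement, as does
    every subset when the full set itself is tied.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need a score for every query")
    n = len(scores_a)
    full_diff = sum(scores_a) - sum(scores_b)
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    disagreements = 0
    for _ in range(trials):
        idx = rng.sample(range(n), subset_size)  # sample without replacement
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff * full_diff <= 0:
            disagreements += 1
    return disagreements / trials
```

Running this for decreasing values of subset_size would give a rough
picture of how many of the 2000 topics are actually needed.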
Worries about task:
1. Can participants do so many topics? I think it should be
fine, much less of a challenge than the initial TRECs were, way
back in prehistory.
2. Can we get reasonable queries out of the assessors? I don't
know what guidelines we should furnish them. But with 2000
queries, we can accommodate a fair bit of noise. Perhaps we can
subset the queries after TREC and come up with a smaller set of
reasonable queries.
3. Will the queries be representative of user queries? Probably
not, though the assessors have seen plenty of user queries by now.
But I can't think of another way of getting lots of unbiased relevant
documents. For this first test, I think it's more important to
have an unbiased set than to have representative queries (and
having representative queries is always questionable in TREC
anyways).
4. Can NIST afford the assessor time? At least the assessor time
is in the spring, making up the queries.
---
That's enough for now. Is it worth pursuing this? Note that
I'm not going to be a co-ordinator for this, given my history!!
I'm willing to do things like write the needed version of
check_input, but others will have to do the "advertising" and
urging folks on, and that means these others have to be convinced
this is a good idea! Can it happen?
ChrisB
> I would like to add some thought fragments to this, which may not relate
> directly to specific tests, but rather to the philosophy of scientific
> testing and experimentation.
>
> It is well known and widely accepted that one of the major requirements in
> scientific experimentation and testing is repeatability, i.e., an
> experiment should be repeatable, and if repeated under the same (main)
> conditions the results obtained should be the same. (Classical IR testing
> based on test collections satisfies these conditions.) But with the Web, a
> problem may arise at this point: the same Web search engine (which is, at
> least in principle or hopefully, the same for a longer period of time) can
> be used repeatedly, BUT the content of the Web is changing continuously.
> As a consequence, the results (hits returned for a query) may be very
> different (in number, order and content) from those obtained during an
> experiment performed previously (a week, a month, a year, etc. earlier).
> For example, suppose the authors of a research paper report on experiments
> carried out to test some aspect of the effectiveness (e.g., precision,
> spam, etc.) of some Web search engine. If the reviewers of their paper
> wish to repeat the experiment (typically months or perhaps even a year
> later, in order to check the results, or out of curiosity, etc.), will
> they get the same answers? Quite likely they won't... So should they
> conclude that the experiment was not scientific or badly planned or ...?
>
> I think they shouldn't necessarily.
>
> A typical (well planned and executed) in vivo test of a Web search engine
> implies that the selection of human subjects, search engine, evaluation
> methodology, queries, relevance categories, computation of measures, etc.
> is carefully done, analysed and discussed, and that conclusions are drawn.
> The results are typically in terms of, e.g., average or first-n
> precision, etc. In other words, the results are statistical in nature:
> they have a probabilistic interpretation, and are a measure of the
> expected behaviour exhibited by the system under flexible conditions
> (unlike in classical test collections, which are static).
>
> So one might think that the concept of scientific experimentation does not
> apply to Web IR testing, and hence that the latter may not have any
> scientific value. (Unfortunately, I know several highly ranked people in
> the 'classical' sciences who really think like that!) In my view, our line
> of thinking should be the following: the Web does exist, hence I am not
> sure that the classical criterion should tell us what should exist (i.e.,
> what should be science or scientific); rather, we should perhaps
> reconsider the concept of scientific testing in such a way that it remains
> valid! We could do this in several ways. For example, we could
> reformulate or generalise the notion of scientific experimentation and
> repeatability, and give it a probabilistic interpretation (somewhat
> similar to what Zadeh did to classical sets with fuzzy sets, or Bolyai did
> to Euclidean geometry with his non-Euclidean geometry, etc.):
>
>
> Probabilistic interpretation of repeatability in scientific
> experimentation: if the experiment is repeated under the same statistical
> conditions, then the expected results are, with high probability, similar.
> This generalised formulation becomes the classical one if the "high
> probability" is equal to one, and the "statistical conditions" mean the
> same conditions. In the case of Web IR testing, the "statistical
> conditions" mean a selection and group of people to carry out the
> searches, and that the Web has not changed dramatically in size or
> content; the "high probability" means that the results should be the same
> on average.
>
> Thus, I think, with the principle of probabilistic generalisation of
> repeatability we can make Web IR testing
> compatible with classical experimentation (perhaps mainly in the eyes of
> classical sciences).
>
> Very best wishes,
> sandor
> _________________________________
> Dr Sandor Dominich
> http://www.dcs.vein.hu
> University of Veszprem, Hungary
> ________________________________