Wednesday, December 21, 2011

The MediaEval Model for Benchmarking Evaluation

Currently, I'm working to put together the survey for MediaEval 2012. The survey will be used to decide which tasks run in 2012, and it will also gather information that we will use to further refine the MediaEval model: the set of values and organizational principles that we use to run the benchmark.

At the workshop, someone came up to me and mentioned that he had made use of the model in a different setting. 'I hope you don't mind,' he said. Mind? No, quite to the contrary. We hope that other benchmarking initiatives pick up the MediaEval model, or elements of it, and put them to use.

I have resolved to be more articulate about exactly what the MediaEval model is. There's no secret sauce or anything -- it's just a set of points that are important to us and that we try to observe as a community.

The MediaEval Model
The MediaEval model for benchmarking evaluation is an evolution of the classic model for an information retrieval benchmarking initiative (used by TREC, CLEF, and TRECVid). It runs on a yearly cycle and culminates in a workshop where participants gather to discuss their results and plan future work.

MediaEval attempts to push beyond existing practice by maximizing community involvement in the benchmark. Practically, we do this by emphasizing the following points:
  1. Tasks are chosen using a survey, which gathers the opinion of the largest possible number of community members and potential MediaEval participants for the upcoming year.
  2. Tasks follow the same overall schedule, but individual tasks are otherwise largely autonomous and are managed by their Task Organizers.
  3. The Task Organizers are encouraged to submit runs to their own tasks, but these runs do not count in the official ranking.
  4. The Task Organizers are supported by a group of five core participants who pledge to complete the task "come hell or high water".
  5. Each task has an official quantitative evaluation metric, which is used to rank the algorithms of the participating teams. The task also, however, promotes qualitative measures of algorithm goodness: the extent to which an algorithm embodies a creative and promising extension of the state of the art. These qualitative measures are recognized informally by awarding a set of prizes.
In interview footage from the MediaEval 2011 workshop, I discuss the challenge of forging consensus within the community.

One of the important parts of consensus building is collecting detailed, high-coverage information at the beginning of the year about what everyone in the community (and also potential members of the community) thinks. So here I am, working through not only the task proposals, but also other forms of feedback we've gotten from last year (questionnaires, emails), to make sure that we get the appropriate questions onto the survey.

It always takes so much longer than I predict -- but it's definitely worth the effort.