
Tuesday, April 30, 2013

The Five Runs Rule: Less is More in Multimedia Benchmarking

MediaEval is a multimedia benchmarking initiative that offers tasks in the area of multimedia access and retrieval that have a human or a social aspect to them. Teams sign up to participate, carry out the tasks, submit results, and then present their findings at the yearly workshop.

I get a lot of questions about something in MediaEval that is called the "five-runs rule". This blogpost is dedicated to explaining what it is, where it came from, and why we continue to respect it from year to year.

Participating teams in MediaEval develop solutions (i.e., algorithms and systems) that address MediaEval tasks. The results that they submit to a MediaEval task are the output generated by these solutions when they are applied to the task test data set. For example, this output might be a set of predicted genre labels for a set of test videos. A complete set of output generated by a particular solution is called a "run". You can think of a run as the results generated by an experimental condition. The five-runs rule states that any given team can submit at most five sets of results to a MediaEval task in any given year.
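Concretely, a run can be thought of as a mapping from test-item IDs to predictions, which is then scored against held-out ground truth. The sketch below is illustrative only: the video IDs, genre labels, and accuracy metric are invented for the example and are not MediaEval's actual submission format or official metric.

```python
# Illustrative sketch: a "run" as a set of predicted genre labels for
# test videos, scored against ground truth with simple accuracy.
# The IDs, labels, and metric here are hypothetical.

def score_run(run, ground_truth):
    """Return the fraction of test items whose predicted label matches the truth."""
    correct = sum(1 for vid, label in run.items()
                  if ground_truth.get(vid) == label)
    return correct / len(ground_truth)

ground_truth = {"vid001": "comedy", "vid002": "news", "vid003": "sports"}
run_1 = {"vid001": "comedy", "vid002": "news", "vid003": "comedy"}

print(score_run(run_1, ground_truth))  # 2 of 3 predictions correct
```

Under the five-runs rule, a team would submit at most five such mappings, each corresponding to one experimental condition it considers most promising.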

The simple answer to why MediaEval tasks respect the five-runs rule is "They did it in CLEF". CLEF is the Cross Language Evaluation Forum, now the Conference and Labs of the Evaluation Forum (cf. http://www.clef-initiative.eu/). MediaEval began in 2008 as a track of CLEF called VideoCLEF; at that time we adopted the five-runs rule, and we have been using it ever since.

The more interesting answer to why MediaEval tasks respect the five-runs rule is "Because it makes us better by forcing us to make choices". Basically, the five-runs rule forces participants, while developing their task solutions, to think very carefully about what the best possible approach to the problem would be and focus their effort there. The rule encourages them to use a development set (if the task provides one) to inform the design of their approach and select their parameters.

The five-runs rule discourages teams from "trying everything" and submitting a large number of runs, as if evaluation were a lottery. Not that we don't enjoy playing the lottery every once in a while; however, if we choose our best solutions rather than submitting them all, we help avoid over-fitting the new technologies we develop to a particular data set and a particular evaluation metric. Also, if we think carefully about why we choose a certain approach when developing a solution, we gain better insight into why the solution worked or failed to work, which gives us a clearer picture of what we need to try next.

The practical advantage of the five-runs rule is that it allows the MediaEval Task Organizers to more easily distill a "main message" from all the runs that are submitted to a task in a given year: the participants have already provided a filter by submitting only the techniques that they find most interesting or predict will yield the best performance.

The five-runs rule also keeps Task Organizers from demanding too much of the participants. Many tasks distinguish between General Runs ("anything goes") and Required Runs that impose specific conditions (such as "exploit metadata" or "speech only"). The purpose of having these different types of runs is to ensure that a minimum number of runs submitted to the task investigate specific opportunities and challenges presented by the data set and the task. For example, in the Placing Task, not a lot of information is to be gained from comparing metadata-only runs directly with pixel-only runs. Instead, a minimum number of teams have to look at both types of approaches in order for us to learn something about how the different modalities contribute to solving the overall task. Obviously, if there are too many required runs, participating teams will be constrained in the dimensions along which they can innovate, and that would hold us back.

Another practical advantage of the five-runs rule has emerged in recent years as tasks, led by Search and Hyperlinking and also Visual Privacy, have started carrying out post hoc analysis. Here, in order to deal with the volume of runs that need to be reviewed by human judges (even if we exploit crowdsourcing approaches), it is very helpful to have a small, focused set of results.

Many tasks will release the ground truth for their test sets at the workshop, so teams that have generated many runs can still evaluate them. We encourage the practice of extending the two-page MediaEval working notes paper into a full paper for submission to another venue, in particular international conferences and journals; in order to do this, it is necessary to have the ground truth. Some tasks do not release the ground truth for their test sets because the test set of one year becomes the development set of the next year, and we try to keep the playing field as level as possible for new teams joining the benchmark (who may not have participated in previous years' editions of a task).

In the end, people generally find that when they are writing up their results in their two-page working notes paper, and trying to get to the bottom of what worked well and why in their failure analysis, they are quite happy to be dealing with no more than five runs.

Wednesday, December 21, 2011

The MediaEval Model for Benchmarking Evaluation

Currently, I'm working to put together the survey for MediaEval 2012. This survey will be used to decide on the tasks that will run in 2012 and also will help to gather information that we will use to further refine the MediaEval model: the set of values and organizational principles that we use to run the benchmark.

At the workshop, someone came up to me and mentioned that he had made use of the model in a different setting, 'I hope you don't mind', he said. Mind? No. Quite to the contrary. We hope that other benchmarking initiatives pick up the MediaEval model and elements of it and put them to use.

I have resolved to be more articulate about exactly what the MediaEval model is. There's no secret sauce or anything -- it's just a set of points that are important to us and that we try to observe as a community.

The MediaEval Model
The MediaEval model for benchmarking evaluation is an evolution of the classic model for an information retrieval benchmarking initiative (used by TREC, CLEF, TRECVid). It runs on a yearly cycle and culminates with a workshop where participants gather to discuss their results and plan future work.

MediaEval attempts to push beyond existing practice by maximizing community involvement in the benchmark. Practically, we do this by emphasizing the following points:
  1. Tasks are chosen using a survey, which gathers the opinion of the largest possible number of community members and potential MediaEval participants for the upcoming year.
  2. Tasks follow the same overall schedule, but individual tasks are otherwise very autonomous and are managed by the Task Organizers.
  3. The Task Organizers are encouraged to submit runs to their own tasks, but these runs do not count in the official ranking.
  4. The Task Organizers are supported by a group of five core participants who pledge to complete the task "come hell or high water".
  5. Each task has an official quantitative evaluation metric, which is used to rank the algorithms of the participating teams. The task also, however, promotes qualitative measures of algorithm goodness: i.e., the extent to which an algorithm embodies a creative and promising extension of the state of the art. These qualitative measures are recognized informally by awarding a set of prizes.
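Point 5 above notes that the official metric induces a ranking over participating teams' submissions. As a minimal illustration, the sketch below ranks hypothetical teams by the best of their (at most five) runs; the team names, scores, and the convention of ranking by best run are all invented for the example, not MediaEval's actual procedure.

```python
# Hypothetical per-run scores under a task's official metric (higher is better).
# Each team submits up to five runs.
runs = {
    "team_a": [0.61, 0.58, 0.64],
    "team_b": [0.70, 0.55],
    "team_c": [0.62, 0.66, 0.59, 0.63, 0.60],
}

# Rank teams by the score of their best-performing run.
ranking = sorted(runs, key=lambda team: max(runs[team]), reverse=True)
print(ranking)  # ['team_b', 'team_c', 'team_a']
```

The qualitative dimension described in point 5 (creativity, promise as an extension of the state of the art) is deliberately not captured by such a ranking, which is why it is recognized separately through prizes.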
In interview footage from the MediaEval 2011 workshop, I discuss the challenge of forging consensus within the community.



One of the important parts of consensus building is collecting detailed and high-coverage information at the beginning of the year about what everyone in the community (and also potential members of the community) thinks. And so I am working here, going through not only the task proposals, but also other forms of feedback we've received from last year (questionnaires, emails), in order to make sure that we get the appropriate questions on the survey.

It always takes so much longer than I predict -- but it's definitely worth the effort.