Tuesday, April 30, 2013

The Five Runs Rule: Less is More in Multimedia Benchmarking

MediaEval is a multimedia benchmarking initiative that offers tasks in the area of multimedia access and retrieval that have a human or a social aspect to them. Teams sign up to participate, carry out the tasks, submit results, and then present their findings at the yearly workshop.

I get a lot of questions about something in MediaEval that is called the "five-runs rule". This blogpost is dedicated to explaining what it is, where it came from, and why we continue to respect it from year to year.

Participating teams in MediaEval develop solutions (i.e., algorithms and systems) that address MediaEval tasks. The results that they submit to a MediaEval task are the output generated by these solutions when they are applied to the task test data set. For example, this output might be a set of predicted genre labels for a set of test videos. A complete set of output generated by a particular solution is called a "run". You can think of a run as the results generated by an experimental condition. The five-runs rule states that any given team can submit at most five sets of results to a MediaEval task in any given year.
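To make the idea of a run a bit more concrete, here is a minimal sketch in Python of what a single run might look like for a hypothetical genre-labeling task: one predicted label per test video, written out as a simple file. The file name, layout, and field names are purely illustrative; each MediaEval task specifies its own submission format.

# Illustrative sketch only: each task defines its own run format.
# A "run" is one solution's complete output over the test set, here
# one predicted genre label per test video.
import csv

# Hypothetical predictions produced by one experimental condition (one run).
predictions = {
    "video_0001": "documentary",
    "video_0002": "comedy",
    "video_0003": "news",
}

# Write the run as a tab-separated file: one test item per line.
with open("myteam_run1.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for video_id, genre in sorted(predictions.items()):
        writer.writerow([video_id, genre])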

The simple answer to why MediaEval tasks respect the five-runs rule is "They did it in CLEF". CLEF is the Cross Language Evaluation Forum, now the Conference and Labs of the Evaluation Forum, cf. http://www.clef-initiative.eu/. MediaEval began as a track of CLEF in 2008 called VideoCLEF, and at that time we adopted the five-runs rule and have been using it ever since.

The more interesting answer to why MediaEval tasks respect the five-runs rule is "Because it makes us better by forcing us to make choices". Basically, the five-runs rule forces participants, during the process of developing their task solutions, to think very carefully about what the best possible approach to the problem would be and to focus their effort there. The rule encourages them to use a development set (if the task provides one) in order to inform the optimal design of their approach and to select their parameters.
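As a rough illustration of that workflow, the sketch below shows how a team might rank a grid of candidate configurations on a development set and keep only the five most promising ones as their submitted runs. The configuration grid, the scoring function, and the feature names are hypothetical placeholders, not part of any official MediaEval procedure.

# Hypothetical sketch of dev-set-driven run selection; the grid and the
# scoring function stand in for a team's real system and metric.
from itertools import product
import random

random.seed(42)

def score_on_dev_set(config):
    # Placeholder: in practice, apply the system with this configuration to
    # the development set and compute the task's official evaluation metric.
    return random.random()

# Candidate configurations a team might otherwise be tempted to submit exhaustively.
grid = [
    {"features": features, "classifier": classifier}
    for features, classifier in product(
        ["metadata", "speech-transcripts", "visual"], ["svm", "knn"]
    )
]

# Rank every configuration on the development set and keep only the top five
# as the runs actually submitted to the task.
ranked = sorted(grid, key=score_on_dev_set, reverse=True)
chosen_runs = ranked[:5]

for i, config in enumerate(chosen_runs, start=1):
    print(f"run{i}: {config}")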

The five-runs rule discourages teams from "trying everything" and submitting a large number of runs, as if evaluation were a lottery. Not that we don't like playing the lottery every once in a while; however, if we choose our best solutions, rather than submitting them all, we help to avoid over-fitting the new technologies that we develop to a particular data set and a particular evaluation metric. Also, if we think carefully about why we choose a certain approach when developing a solution, we will have better insight into why the solution worked or failed to work...which gives us a clearer picture of what we need to try next.

A practical advantage of the five-runs rule is that it allows the MediaEval Task Organizers to more easily distill a "main message" from all the runs that are submitted to a task in a given year: the participants have already provided a filter by submitting only the techniques that they find most interesting or predict will yield the best performance.

The five-runs rule also keeps Task Organizers from demanding too much of the participants. Many tasks discriminate between General Runs ("anything goes") and Required Runs that impose specific conditions (such as "exploit metadata" or "speech only"). The purpose of having these different types of runs is to make sure that a minimum number of runs submitted to the task investigate specific opportunities and challenges presented by the data set and the task. For example, in the Placing Task, not a lot of information is to be gained from comparing metadata-only runs directly with pixel-only runs. Instead, a minimum number of teams have to look at both types of approaches in order for us to learn something about how the different modalities contribute to solving the overall task. Obviously, if there are too many required runs, participating teams will be constrained in the dimensions along which they can innovate, and that would hold us back.
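Purely as an illustration of how the two run types might be laid out, here is a hypothetical plan for one team's five runs in a Placing-style task; the condition labels are invented for this example and do not reproduce any task's actual requirements.

# Hypothetical run plan for a task that distinguishes required and general runs;
# the conditions are made up for illustration.
run_plan = [
    {"run": "run1", "type": "required", "condition": "metadata only"},
    {"run": "run2", "type": "required", "condition": "visual (pixel) features only"},
    {"run": "run3", "type": "general", "condition": "metadata + visual fusion"},
    {"run": "run4", "type": "general", "condition": "fusion + external resources"},
    {"run": "run5", "type": "general", "condition": "best overall configuration"},
]

for entry in run_plan:
    print(f"{entry['run']} ({entry['type']}): {entry['condition']}")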

Another practical advantage of the five-runs rule has arisen in recent years, as tasks (led by Search and Hyperlinking and also Visual Privacy) have started carrying out post hoc analysis. Here, in order to deal with the volume of runs that need to be reviewed by human judges (even if we exploit crowdsourcing approaches), it is very helpful to have a small, focused set of results.

Many tasks will release the ground truth for their test sets at the workshop, so teams that have generated many runs can still evaluate them. We encourage the practice of extending the two-page MediaEval working notes paper into a full paper for submission to another venue, in particular international conferences and journals; in order to do this, it is necessary to have the ground truth. Some tasks do not release the ground truth for their test sets because the test set of one year becomes the development set of the next year, and we try to keep the playing field as level as possible for new teams that are joining the benchmark (and may not have participated in previous years' editions of a task).

In the end, people generally find that when they are writing up their results in their two-page working notes paper, and trying to get to the bottom of what worked well and why in their failure analysis, they are quite happy to be dealing with no more than five runs.