Saturday, January 24, 2015

The making of a community survey: Contrastive conditions and critical mass in benchmarking evaluation

Each year, the MediaEval Multimedia Benchmark offers a set of challenges to the research community involving interesting new problems in multimedia. Each challenge is a task consisting of a problem description, a data set, and an evaluation metric.

The tasks are organized independently, each by a separate group of task organizers, and each focuses on developing solutions to a very different problem. However, they are held together by the common theme of MediaEval: social and human aspects of multimedia. A task has a human aspect if the variation in people's interpretations of multimedia content, including dependencies on context and intent, is treated not as variability that must be controlled, but rather as part of the underlying problem to be solved. A task has a social aspect if it develops technology that supports people in developing and communicating knowledge and understanding using multimedia content.

In addition to the human and social aspects, MediaEval tasks are united by the common goal of moving forward the state of the art in multimedia research. To this end, they strive to achieve both qualitative and quantitative insight into the algorithms that are designed by participating teams to address the challenges. We can call qualitative insight "what works" and quantitative insight "how well it works".

How well an algorithm works must necessarily be measured against something. Most obviously, an algorithm works well if the people who actually have the problem that lies at the root of the task agree that the algorithm solves it. These people are referred to as the "problem holders" or "stakeholders"; they are usually a company or, very often, a set of end users of the multimedia technology. In evaluation campaigns such as MediaEval, the formulation of the problem is represented by the data set and the problem definition. The stakeholders' opinion of what constitutes a solution is represented by the ground truth (i.e., the reference labels for the data set) and the evaluation metric.
In a living-labs setup for algorithm evaluation, both the data set and the ground truth are streams, and they move closer to actually instantiating the problem rather than merely representing it. In either case, however, we are always aiming directly at understanding whether one algorithm can indeed be considered to give better performance than another, i.e., whether it advances the state of the art.
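In essence, a run is scored by comparing its output against the reference labels using the task's evaluation metric. A minimal sketch of that idea, using simple accuracy as a stand-in for whatever metric a given task actually defines (the labels and function names here are illustrative, not actual MediaEval code):

```python
def accuracy(predictions, ground_truth):
    """Score a run against the reference labels.

    Accuracy is used here only as a stand-in for a task-specific
    evaluation metric (e.g., MAP or F-score in a real task).
    """
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Hypothetical run output and reference labels for a labeling task.
run = ["music", "speech", "music", "noise"]
reference = ["music", "speech", "noise", "noise"]
print(accuracy(run, reference))  # 3 of 4 labels agree -> 0.75
```

The key point is that both inputs to the metric come from the evaluation campaign: the run from the participating team, the reference labels from the stakeholders' view of what counts as a solution.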

In order to be fairly and meaningfully compared, two algorithms must represent "contrastive conditions". This means that there is one, constrained respect in which they differ from each other. If there are two or more major differences between two algorithms, then it is unclear why one performs better than the other. In real life, we might not care why, and simply choose the better-performing algorithm. However, if we take the time to investigate contrastive conditions, then we can isolate "what works" from "what doesn't work" and ultimately answer questions like "Have I gotten close to the ceiling of the best possible performance that can be achieved on this challenge?" and "Which types of solutions are just not worth pursuing further?". Such questions also have a key contribution to make for algorithms used in operational settings.
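The idea can be illustrated with a toy experiment: two runs form a contrastive condition only if their configurations differ in exactly one respect, so that any difference in score can be attributed to that one component. The configurations and scores below are purely hypothetical:

```python
from itertools import combinations

# Hypothetical runs: each described by its configuration and its score
# on the task's evaluation metric. Names and values are illustrative.
runs = {
    "run_A": {"features": "visual", "classifier": "svm", "score": 0.61},
    "run_B": {"features": "visual+text", "classifier": "svm", "score": 0.68},
    "run_C": {"features": "visual+text", "classifier": "nn", "score": 0.70},
}

def contrastive(cfg1, cfg2):
    """Two runs are contrastive if their configurations differ in
    exactly one respect (the score itself is not a configuration)."""
    keys = [k for k in cfg1 if k != "score"]
    return sum(cfg1[k] != cfg2[k] for k in keys) == 1

for (name1, cfg1), (name2, cfg2) in combinations(runs.items(), 2):
    if contrastive(cfg1, cfg2):
        # run_A vs run_B isolates the feature choice;
        # run_B vs run_C isolates the classifier choice.
        print(f"{name1} vs {name2}: difference in score is attributable")
```

Here run_A vs run_C is not contrastive: both the features and the classifier differ, so the 0.09 gap in score cannot be attributed to either one alone.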

Each year, MediaEval publishes a survey with a long list of questions to be answered by the community. The survey is key in ensuring that the work of the teams participating in the challenges gives rise to contrastive conditions:
  • The benchmark organizers can determine whether or not a minimum number of people in the research community are interested in the task and would like to participate.
  • The task organizers can make contact with "core participants", teams that declare their intention to participate in the task, including submitting runs and writing the working notes paper, "no matter what". Core teams allow us to ensure that there is a critical mass for any given task, and a higher chance of contrastive conditions.
  • The task organizers can determine which "required runs" people might be interested in, and adapt the design of the task accordingly. A "required run" is an algorithm that uses certain sources of data, but that might differ in its underlying mechanisms. By deciding on required runs, the community also decides for which aspects of the task it is important to be able to investigate contrastive conditions.

The MediaEval survey is notoriously difficult to prepare. Each year, a large number of different tasks are proposed, and each task has its own particular questions. 

The descriptions of the tasks are quite challenging to write. MediaEval tasks are planned with a low entry threshold. This means that new groups are able to step into a task and very easily come up to speed. In other words, the newbie teams participating in MediaEval have a fair chance with respect to teams that have participated in past years. The task descriptions must include the technical depth necessary to elicit detailed information from potential participants, but they cannot be formulated in the task-specific "jargon" or shorthand that MediaEval participants use among themselves.

Also, the survey must be set up in a way that lets people quickly answer a great number of questions across all tasks. Although in the end teams participate in only one, or perhaps two, tasks, the design of the tasks is made better if people with a general interest in, and knowledge of, multimedia research can give their opinion and feedback on as many tasks as possible.

The MediaEval 2015 survey is about to appear. At the moment, we are at 121 questions and counting. It would take a lot less time just to make a top-down decision on which tasks to run, and how to design these tasks. However, over the years we have learned how critical the survey is: the survey input allows MediaEval tasks each year to maximize the amount of insight gained per effort invested. 

We very much appreciate everyone who participates in the survey, and helps to build a highly effective benchmark, and a productive benchmarking community.