Sunday, February 7, 2016

MediaEval 2015: Insights from last year's experiences in multimedia benchmarking



This blogpost is a list of bullet points concerning MediaEval 2015. It represents the "meta-themes" of MediaEval that I perceived to be the strongest during the MediaEval 2015 season, which culminated with the MediaEval 2015 Workshop in Wurzen, Germany (14-15 September 2015). I'm putting them here so that we can look back later and see how they are developing.
  1. How not to re-invent the wheel? Providing task participants with reading lists of related work and with baseline implementations helps ensure that it is as easy as possible for them to develop algorithms that extend the state of the art.
  2. Reproducibility and replication: How can we encourage participants to share information about their approaches so that their results can be reproduced or replicated? How can we emphasize the importance of reproduction and replication and at the same time push for innovation and forward movement in the state of the art (and avoid re-inventing the wheel, as just mentioned)? One answer that arose this year was to reinforce student participation. Students should feel welcome at the workshop, even if they “just” reproduced an existing workflow.
  3. Development of evaluation metrics for new tasks: Innovating a new task may involve developing a new evaluation metric. All tasks face the challenge of ensuring that they are using an evaluation metric that faithfully reflects usefulness to users within an evaluation scenario (see the sketch after this list for one standard example).
  4. How to make optimal use of leaderboards in evaluation: Participants should be able to check on their progress over the course of the benchmark, and aspire to ever-greater heights. However, it is important that leaderboards not discourage participants from submitting final runs to the benchmark. It is possible that an innovative new approach does very badly on the leaderboard, but is still valuable.
  5. Understanding the relationship between the conceptual formulation of the task and the dataset that is chosen for use in the task: Are the two compatible? Are there assumptions that we are making about the dataset that do not hold? How can we keep task participants on track: solving the conceptual formulation of the task, and not leveraging some incidental aspect of the dataset?
  6. Disruption: Tasks are encouraged to innovate from year to year. However, 2015 was the first year that organizers started planning far ahead for “disruption” that would take the task to the next level in the next year.
  7. Using crowdsourcing for evaluation: How to make sure that everyone is aware of and applies best practices? How to ensure that the crowd is reflective of the type of users in the use scenario of the task?
  8. Engineering: Task organization involves an enormous amount of time and dedication to engineering work. We continuously seek ways to structure organizer teams and to recruit new organizers and task auxiliaries to make sure that no one feels that their scientific output suffers in a year in which they spend time handling the engineering aspects of MediaEval task organization.
  9. Defining tasks and writing task descriptions: We repeatedly see that the process of defining a new task and of writing task descriptions must involve a large number of people. If people with a lot of multimedia benchmarking experience contribute, they can help to make sure that the task definition is well grounded in the existing literature. If people with very little experience in multimedia benchmarking contribute, they can help to make sure that the task definition is understandable even to new participants. We try to write task descriptions such that a master's student planning to write a thesis on a multimedia-related topic would easily understand what is required for the task.
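
As a concrete illustration of point 3, the following is a minimal sketch (in Python) of one standard metric for ranked-list tasks, mean average precision. The data and function names are purely illustrative and are not taken from any particular MediaEval task; a task with a different use scenario may well need a different metric.

def average_precision(ranked_ids, relevant_ids):
    """Average precision of a single ranked result list."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, ground_truth):
    """Mean of the per-query average precision over all queries."""
    return sum(average_precision(runs[q], ground_truth[q])
               for q in ground_truth) / len(ground_truth)

# Toy run: two queries, each with a ranked result list and a relevance set.
runs = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
ground_truth = {"q1": ["a", "c"], "q2": ["e"]}
print(mean_average_precision(runs, ground_truth))  # (0.833 + 0.5) / 2, about 0.667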

In order to round this off to a nice "10" points, let me mention another issue that is constantly on my mind, namely, the way that the multimedia community treats the word "subjective".

"Subjective" is something that one feels oneself as a subject (and cannot be directly felt by another person---pain is the classic example). In MediaEval tasks, such as Violent Scene Detection, we would like to respect the fact that people are entitled to their own opinions about what constitutes a concept. Note that people can communicate very well concerning acts of violence, without all having an exactly identical idea of what constitutes "violence". Because the concept "works" in the face of the existence of person perspectives, we can consider the task "subjective". 

So often researchers reason in the sequence, "This task is subjective, therefore it is difficult for automatic multimedia analysis algorithms to address". That reasoning simply does not follow. Consider this example: Classifying a noise source as painful is the ultimate "subjective task". You as a subject are the only one who knows that you are in pain. However: Create a device that signals "pain" when noise levels reach 100 decibels, and you have a solution to the task. Easy as pie. "Subjective" tasks are not inherently difficult. 
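
To make the thought experiment concrete, the entire "solution" fits in a few lines of Python. The 100-decibel threshold is the one from the example above; the function name and the sample noise levels are simply made up for illustration.

def signals_pain(noise_level_db, threshold_db=100.0):
    """Return True when the measured noise level reaches the threshold."""
    return noise_level_db >= threshold_db

print(signals_pain(85.0))   # False: below the threshold
print(signals_pain(110.0))  # True: "painful" according to this rule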

Instead, whether a task is difficult to address with automatic methods depends on the stability of content-based features across different target labels: that is, on whether items that receive the same label also look similar in terms of the features an algorithm can extract.
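
One rough way to picture this, offered as a sketch under my own simplifying assumptions rather than as an established MediaEval procedure: for a single content-based feature, compare how much it varies across the labels with how much it varies within each label. The toy feature values and labels below are invented purely for illustration.

import statistics

def stability_ratio(feature_values, labels):
    """Between-label variance of a feature divided by its mean within-label
    variance. Higher values suggest the feature separates the labels cleanly."""
    by_label = {}
    for value, label in zip(feature_values, labels):
        by_label.setdefault(label, []).append(value)
    label_means = [statistics.mean(values) for values in by_label.values()]
    between = statistics.pvariance(label_means)
    within = statistics.mean(statistics.pvariance(values) for values in by_label.values())
    return between / within if within > 0 else float("inf")

# Toy data: one feature that tracks the label and one that does not.
labels = ["violent", "violent", "calm", "calm"]
print(stability_ratio([0.9, 0.8, 0.1, 0.2], labels))  # 49.0: separates the labels well
print(stability_ratio([0.4, 0.6, 0.5, 0.5], labels))  # 0.0: says nothing about the label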

The whole point of machine learning is to generalize not only across obvious cases, but also across cases in which no stability of features is apparent to a human observer. If we stuck to tasks that "looked" easy to a researcher browsing through the data, we might as well (exaggerating a bit for effect) handcraft rule-based recognizers. So my point 10 is to try to figure out a way to keep researchers from being scared off from a task just because it is "subjective", without giving the matter a second thought. Multimedia research needs to tackle "subjective" tasks in order to make sure that it remains relevant to the real-world needs of users. Once you understand subjectivity, you start to realize that it is actually all over the place.

In 2014, we noticed that the discussion of such themes was becoming more systematic, and that members of the MediaEval community were interested in having a venue in which they could publish their thoughts. For this reason, in 2015, we added a MediaEval Letters section to the MediaEval Working Notes Proceedings, dedicated to short considerations of themes related to the MediaEval workshop. The Letter format allows researchers to publish their thoughts while they are still developing, even before they are mature enough to appear in a mainstream venue.

The concept of MediaEval Letters was described in the following paper, in the 2015 MediaEval Working Notes Proceedings:

Larson, M., Jones, G.J.F., Ionescu, B., Soleymani, M., and Gravier, G. Recording and Analyzing Benchmarking Results: The Aims of the MediaEval Working Notes Papers. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015. CEUR-WS.org, online: http://ceur-ws.org/Vol-1436/Paper90.pdf


Look for MediaEval Letters to be continued in 2016.