Saturday, January 24, 2015

The making of a community survey: Contrastive conditions and critical mass in benchmarking evaluation

Each year, the MediaEval Multimedia Benchmark offers a set of challenges to the research community involving interesting new problems in multimedia. Each challenge is a task consisting of a problem description, a data set, and an evaluation metric.

The tasks are organized independently, each by a separate group of task organizers, and each focuses on developing solutions to a very different problem. However, they are held together by the common theme of MediaEval: the social and human aspects of multimedia. A task has a human aspect if the variation in people's interpretations of multimedia content, including its dependence on context and intent, is not treated as variability that must be controlled, but rather as part of the underlying problem to be solved. A task has a social aspect if it develops technology that supports people in developing and communicating knowledge and understanding using multimedia content.

In addition to the human and social aspects, MediaEval tasks are united by the common goal of moving forward the state of the art in multimedia research. To this end, they strive to achieve both qualitative and quantitative insight into the algorithms that are designed by participating teams to address the challenges. We can call qualitative insight "what works" and quantitative insight "how well it works".

How well an algorithm works must necessarily be measured against something. Most obviously, an algorithm works well if the people who actually have the problem that lies at the root of the task agree that the algorithm solves the problem. These people are referred to as the "problem holders" or "stakeholders"; they are usually a company or, very often, a set of end users of the multimedia technology. In evaluation campaigns such as MediaEval, the formulation of the problem is represented by the data set and the problem definition. The stakeholders' opinion of what constitutes a solution is represented by the ground truth (i.e., the reference labels for the data set) and the evaluation metric.
In a living-labs setup for algorithm evaluation, both the data set and the ground truth are streams, and move closer to actually instantiating the problem rather than representing it. However, the aim is always directed at understanding whether one algorithm can indeed be considered to give better performance than another, i.e., to advance the state of the art.

In order to be fairly and meaningfully compared, two algorithms must represent "contrastive conditions". This means that there is one constrained respect in which they differ from each other. If there are two or more major differences between two algorithms, then it is unclear why one performs better than the other. In real life, we might not care why, and simply choose the better-performing algorithm. However, if we take the time to investigate contrastive conditions, then we can isolate "what works" from "what doesn't work" and ultimately answer questions like "Have I gotten close to the ceiling of the best possible performance that can be achieved on this challenge?" and "Which types of solutions are just not worth pursuing further?". Answering such questions also makes a key contribution to algorithms used in operational settings.
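To make the idea concrete, here is a minimal sketch of a contrastive-condition comparison. The queries, ranked lists, and the "visual features" factor are all toy examples invented for illustration; the point is only that the two runs share the same ground truth and the same evaluation metric, and differ in exactly one respect.

```python
# Minimal sketch (toy data): two runs evaluated against the same ground truth
# with the same metric, differing in a single (hypothetical) factor, namely
# whether visual features were used in addition to text.

def precision_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of the top-k ranked items that appear in the ground truth."""
    top_k = ranked_ids[:k]
    return sum(1 for item in top_k if item in relevant_ids) / k

# Toy ground truth: the relevant items per query (stands in for the reference labels).
ground_truth = {
    "q1": {"v2", "v5"},
    "q2": {"v1"},
}

# Two runs that differ only in the single varying factor.
run_text_only = {
    "q1": ["v3", "v2", "v7"],
    "q2": ["v4", "v1", "v6"],
}
run_text_plus_visual = {
    "q1": ["v2", "v5", "v3"],
    "q2": ["v1", "v4", "v6"],
}

for name, run in [("text only", run_text_only),
                  ("text + visual", run_text_plus_visual)]:
    scores = [precision_at_k(run[q], ground_truth[q]) for q in ground_truth]
    print(f"{name}: mean P@3 = {sum(scores) / len(scores):.2f}")
```

Because only one factor changes between the two runs, any difference in the scores can reasonably be attributed to that factor.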

Each year, MediaEval publishes a survey with a long list of questions to be answered by the community. The MediaEval survey is key in ensuring that the work of the teams participating in the challenges gives rise to contrastive conditions. Specifically, the survey serves several purposes:
  • The benchmark organizers can determine whether or not a minimum number of people in the research community are interested in the task and would like to participate.
  • The task organizers can make contact with "core participants", teams that declare their intention to participate in the task, including submitting runs and writing the working notes paper, "no matter what". Core teams allow us to ensure that there is a critical mass for any given task, and a higher chance of contrastive conditions.
  • The task organizers can determine which "required runs" people might be interested in, and adapt the design of the task accordingly. A "required run" is an algorithm that uses certain sources of data, but that might differ in its underlying mechanisms (see the sketch following this list). By deciding on required runs, the community also decides for which aspects of the task it is important to be able to investigate contrastive conditions.
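As a purely hypothetical illustration (the run names, data sources, and checking helper below are invented for this sketch), required runs can be thought of as specifications of which data sources each run may draw on, leaving the underlying mechanism up to each team:

```python
# Hypothetical required-run specifications: each run is constrained by the
# data sources it may use, not by the algorithm that consumes them.
REQUIRED_RUNS = {
    "run1_metadata_only": {"metadata"},
    "run2_visual_only": {"visual"},
    "run3_everything": {"metadata", "visual", "external"},
}

def check_run(run_name, sources_used):
    """Verify that a submitted run respects its required-run specification."""
    allowed = REQUIRED_RUNS[run_name]
    extra = set(sources_used) - allowed
    if extra:
        raise ValueError(f"{run_name} uses disallowed sources: {sorted(extra)}")
    return True

print(check_run("run1_metadata_only", ["metadata"]))        # True: within spec
# check_run("run1_metadata_only", ["metadata", "visual"])   # would raise ValueError
```

Comparing a team's submissions across such specifications then yields contrastive conditions on the data sources themselves.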


The MediaEval survey is notoriously difficult to prepare. Each year, a large number of different tasks are proposed, and each task has its own particular questions. 

The descriptions of the tasks are quite challenging to write. MediaEval tasks are planned with a low entry threshold. This means that new groups are able to step into a task and very easily come up to speed. In other words, the newbie teams participating in MediaEval have a fair chance with respect to teams that have participated in past years. The task descriptions must include the technical depth necessary to elicit detailed information from potential participants, but they cannot be formulated in the task-specific "jargon" or shorthand that MediaEval participants use among themselves.

Also, the survey must be set up in a way that lets people quickly answer a great number of questions across all tasks. Although in the end teams participate in only one, or perhaps two, tasks, the design of the tasks is made better if people with a general interest in, and knowledge of, multimedia research can give their opinion and feedback on as many tasks as possible.

The MediaEval 2015 survey is about to appear. At the moment, we are at 121 questions and counting. It would take a lot less time just to make a top-down decision on which tasks to run, and how to design these tasks. However, over the years we have learned how critical the survey is: the survey input allows MediaEval tasks each year to maximize the amount of insight gained per effort invested. 

We very much appreciate everyone who participates in the survey and helps to build a highly effective benchmark and a productive benchmarking community.

Thursday, January 1, 2015

Pick-me-up pixels: Reflections on "happy" in the new year


Yes, there's the holiday season, but a lot needs to happen during that time to make sure that 2015 goes smoothly. 

Late at night, recently, I was grinding my teeth about late reviews. I was worried about being late with reviews for other people, and about other people being late with reviews for me. In general, I was feeling like we were all getting behind before we had even started the new year.

A colleague who knew I was fretting sent me an encouraging email with some beautiful snow pictures. My favorite one is this one.

The moment I clicked open the .jpgs of the pictures attached to the email, magical scenes from far away shifted my mind into a state of wonder, and then joy.

I noticed that, if I stop to reflect on how it feels, the effect of an image, a bunch of numbers representing five million pixels, is physically tangible. The experience of looking at a picture like this one delivers the same pick-me-up as a cold lemonade in the hot summer, a stunning cityscape lit up at night, the sound of waves washing over rocks, or a purring cat in my lap on a long evening.

Goodness knows I have spent enough time reading, writing, and reviewing papers about the affective impact of multimedia, and how it can be predicted by crunching pixels. But now, looking at the photos that my colleague sent, it struck me how real that impact is. As multimedia researchers, we may not be medical doctors, but we do have the responsibility of developing technologies with the power to make people feel better.

It also hit home that the impact goes beyond the pixels. A good part of the effect is knowing that someone realized I was glum, and also the thought that ultimately I might have a chance to visit the place where the picture was taken.

It's interesting that the picture came via email. The effect of social multimedia doesn't require a social networking platform. Given a camera, and a display device, people will exchange pictures. The existence of Facebook helps, but is not necessary...and by similar reasoning the practice of sharing pictures will survive social networks in the form that we know them today.

I imagine that the two people in this picture have also just taken a picture of the snowy trees in the lamplight and are pausing to examine it together on a mobile device. Their exchange of thoughts might lead them to discover that they are connected by their reactions to the beauty of the experience. 

Making images together leads us to share thoughts about our ways of seeing things that we might otherwise be tempted to disregard as irrelevant or not worth further time. Whether we are moved by our similarities, or take delight in unexpected differences in our perspectives, it is a connection that might have been missed without the mediation of a moment of collaborative picture making.

Ideally, the impact of social pixels would be a positive one without exception. The couple in the picture has captured not only pixels, but also the memory of a moment that they will be able to relive long after the snow has melted.

But we can't know for sure. The moment may be so precious that it is overwhelming to look back on it. Emotional overload is clearly a danger in the case of heartbreak, but even if our couple is destined to live happily ever after, nostalgia can be a burden.

If no single moment is overpowering, a mass of memories might still be unbearable. The image might represent one of so many moments that reliving each of them would be an exhausting and numbing experience.

Ultimately, as users who produce and consume multimedia content, we need systems that allow us to save and find the right content, and also the right amount of content. 

We need to be open to the possibility that maybe these systems are not intelligent systems, but rather utterly simple-minded and transparent systems that just happen to be incredibly good at supporting us, as "human users", in saving and finding the right multimedia for each other.

My pick-me-up moment caused by the snow images passes quickly, and the more usual train of thought clicks in again:

I start wondering what those things in the middle of the path are. Are they air ducts? Are they bee hives? How can I find out? Will I notice them if I get there? Are my Photoshop skills good enough to get rid of them? Would this make a better picture?

And then, I am struck by the thought that my mood was lifted by the pictures, but it would really be lifted if the people I am counting on would finish their reviews! Which means I should also get back to mine.

The ability of multimedia to relax and revive our inner being is subtle and fleeting. Blink and you could miss it. We feel it, but we take it for granted, and our minds quickly move to other things. We forget its role in maintaining our inner balance, and our balance with the world and each other. Without this delicate equilibrium of our affective states, we would derail... produce no more papers, invent no more cool systems.

And so for 2015, I will continue to devote effort to understanding what people see in pictures, but I aspire to also remember the power that shared pixels have to lift our spirits.