Friday, December 16, 2016

Advanced Bullshit Detection for your protection: Wise words on reading the news

There's been lots of news on the news in the news recently. Just like we try to take care of our bodies by eating healthy foods, we must take care of our societies by carefully consuming quality news. The importance of a high-quality, balanced media diet is independent of your political convictions. 

But how do we keep our news reading habits healthy? What do we do? This morning I came across a great video explaining four steps that everyone can take in order to achieve a healthier news diet. The video is an interview with Curd Knüpfer of the Freie Universität Berlin, published by Die Zeit together with this article. Since the video is in German, I provide my own translation of the four steps here. I tried to make the translation as accessible as possible.

I call it "Advanced Bullshit Detection": these four steps are what we need to be spending our time doing in order to protect ourselves while reading the news.

As a media consumer, you have to develop an attitude towards yourself so that you see yourself as someone who chooses news cautiously. There are guidelines that you can follow, and that you can make into habits:

1. Ask yourself questions about your own emotional reaction
When I see an article or a news item that makes me particularly angry, one to which I have a very strong emotional reaction, one that makes me nervous or fearful, then I should stop and think. I should stop and ask "Wait a minute, what is this information actually saying, and why is it having this effect on me?"

2. Check the quality standards
Develop an eye for what good, meaningful journalism is, and how news articles are created. This means that one can look at any form of journalistic reporting (it doesn't matter where it is from or its political direction). You can say "It's better if an article on a particular topic cites more than one source" or if more than one source is cited, "It is better that more than one perspective is represented."

3. Pay attention to the sources of the information
Of course you need to pay attention to the sources. Can I, for example, trust someone who works for the Freie Universität Berlin? And, if I think that I can't: Why can't I? It also works the other way around. You can say, "Hey, there's someone who knows this topic relatively well!" It doesn't mean that you take everything they say at face value, but at least you can trace back where the person is from and who is paying them, etc.

4. Balance your media diet
Balancing your media diet is a luxury that we have because we live in a world in which media is digital. It is relatively easy for us to access a large number of different sources of news, and we should also take advantage of the diversity available to us.

Thank you, Curd Knüpfer, for these wise words (I did my best not to lose anything in translation).

Sunday, November 13, 2016

Big Data as Fast Data

Last Thursday I was at a "sounding board" meeting of the Big Data Commission of the Royal Netherlands Academy of Arts and Sciences. This post highlights some points that I have continued to reflect upon since the meeting. 

According to Wikipedia, "Big Data" are data sets that are too large and too complex for traditional data processing systems to work with them. Interestingly, the people who characterize "Big Data" in terms of volume, variety, and velocity often underemphasize, as the Wikipedia definition does, the aspect of velocity. Here, I argue it is important not to forget that Big Data is also Fast Data.

Fast Streams and Big Challenges
Because I work in the area of recommender systems, I quite naturally conceptualize problems in terms of a data stream rather than a data set. The task a stream-based recommender system addresses is the following: there is a stream of incoming events and the goal is to make predictions on the future of the stream. There are two issues that differentiate stream-based views of data from set-based views.

First: the temporal ordering in the stream means that ordinary cross-validation cannot be applied. A form of A/B testing must be used in order to evaluate the quality of predictions. Online A/B testing has implications for the replicability of experiments.
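To make this first point concrete, here is a minimal sketch (with an entirely hypothetical event stream) of what respecting temporal ordering looks like: unlike a shuffled cross-validation fold, a time-based split never lets future events leak into the training data.

```python
from datetime import datetime, timedelta

# Hypothetical stream of (user, item, timestamp) interaction events.
start = datetime(2016, 1, 1)
events = [("user%d" % (i % 5), "item%d" % (i % 7), start + timedelta(hours=i))
          for i in range(100)]

def temporal_split(stream, train_fraction=0.8):
    """Split a stream on time: train on the past, test on the future.

    A shuffled cross-validation fold would let future events leak into
    the training set, which is exactly what a stream-based recommender
    must never see."""
    ordered = sorted(stream, key=lambda event: event[2])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

train, test = temporal_split(events)
# Every training event strictly precedes every test event.
assert max(e[2] for e in train) < min(e[2] for e in test)
```

This only illustrates the offline side of the problem; as noted above, evaluating the actual quality of predictions on a live stream still requires a form of A/B testing.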

Second: at any given moment, you are making two intertwining predictions. One is the prediction of the future of the stream. The other is how much, if any, of the past is actually relevant in predicting the future. There are two reasons why the information in the past stream may not be relevant to the future: external and internal factors.

External factors are challenging because you may not know they are happening. A colleague doing medical research recently told me that when deductibles go up people delay going to the doctor, and suddenly the patients that are visiting the doctor have different problems, simply because they delayed their visit. Confounding variables of course exist for set-based data. However, if you are testing stream-based prediction online, you can't simply turn back the clock and start investigating confounding variables: it's already water under the bridge. As much as you may be recording, you cannot replay all of reality as it happened.

Internal factors are even tougher. Events occurring in the data stream influence the stream itself. A common example is the process by which a video goes viral on the Web. In this case, we have a stream of events consisting of viewers watching the video. Because people like to watch videos that are popular (or are simply curious about what everyone else is watching) events in the past actually serve to create the future, yielding an exponentially growing number of views. These factors can be understood as feedback loops. Another important issue, which occurs in recommender systems, is that the future of the stream is influenced by the predictions that you make. In a recommender system, these predictions are shown to users in the form of recommended items, and the users create new events by interacting with these items. The medical researcher is stuck with this effect: she cannot decide not to cure patients, just because it will create a sudden shift in the statistical properties of her data stream.
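A toy simulation makes the feedback-loop point above tangible. The model and all its parameters (base rate, feedback strength) are invented for illustration: when past views raise the rate of new views, the stream grows roughly exponentially; without feedback, it grows only linearly.

```python
import random

def simulate_views(steps, base_rate=1.0, feedback=0.05, seed=42):
    """Toy model of an internal feedback loop: at each step, the expected
    number of new views grows with the number of past views, because
    popularity itself attracts viewers."""
    rng = random.Random(seed)
    views = 1
    history = [views]
    for _ in range(steps):
        expected_new = base_rate + feedback * views  # the past creates the future
        new = int(expected_new)
        if rng.random() < expected_new - new:  # handle the fractional part
            new += 1
        views += new
        history.append(views)
    return history

with_feedback = simulate_views(100, feedback=0.05)
without_feedback = simulate_views(100, feedback=0.0)
# With feedback the stream dwarfs its feedback-free counterpart.
assert with_feedback[-1] > 5 * without_feedback[-1]
```

The statistical properties of the "with feedback" stream change as it unfolds, which is exactly why a predictor trained on its early history can be misled about its future.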

Time to Learn to Do it Right
In short, you are trying to predict and also to predict whether you can predict. We still call it "Big Data", but clearly we are at a place where the assumption that data greed pays off ("There's no data like more data") breaks down. Instead, we start to consider the price of Big Data failure ("The bigger they are, the harder they fall").

In a recent article Trump's Win Isn't the Death of Data---It Was Flawed All Along, Wired concluded that "...the data used to predict the outcome of one of the most important events in recent history was flawed." But if you think about it: of all the preposterous statements made during the campaign, no one proposed that the actual election be cancelled since Big Data could predict its outcome. There are purposes that Big Data can fulfill, and purposes for which it is not appropriate.

The Law of Large Numbers forms the basis for reliably repeatable predictions. For this reason, it is clear that Big Data is not dead. The situation is perhaps exactly the opposite: Big Data has just been born. We have reasons to believe in its enormous usefulness, but ultimately its usefulness will depend on the availability of people with the background to support it.
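The Law of Large Numbers can be demonstrated in a few lines of simulation (a sketch with invented sample sizes): as samples grow, estimates of an underlying quantity become reliably repeatable, with the typical error shrinking on the order of 1/sqrt(n).

```python
import random

def sample_mean(n, seed=0):
    """Mean of n simulated fair coin flips (1 = heads, 0 = tails)."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n

def avg_abs_error(n, trials=20):
    """Average distance of the sample mean from the true value 0.5,
    over several independently seeded runs."""
    return sum(abs(sample_mean(n, seed=s) - 0.5) for s in range(trials)) / trials

# Larger samples yield reliably smaller errors: the basis of
# repeatable prediction.
assert avg_abs_error(100) > avg_abs_error(10_000) > avg_abs_error(100_000)
```

This reliability is exactly what Big Data can offer; the question the rest of this post pursues is when it makes sense to rely on it.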

There is a division between people with a classic statistics and machine learning background who know how to predict (who may even have the technical expertise to do it at large scale) and people who, on top of a classical background, have the skills to approach the question of when does it even make sense to be predicting. Only the latter are qualified to pursue big data.

The difference is perhaps a bit like the difference between snorkeling and scuba diving. Both are forms of underwater exploration, and many people don't necessarily realize that there is a difference. However, if you can snorkel, you are still a long way from being able to scuba dive. For scuba diving, you need additional training, and more equipment, and a firm grasp of principles that are not necessarily intuitive, such as the physiological effects of depth and the wisdom of redundancy. There is a lot to be achieved on a scuba dive that can't be accomplished by mere snorkeling; but the diver needs resources to invest, and above all needs to have the time to learn to do it right.

No Fast Track to Big Data
These considerations lead to the realization that although Big Data streams may in and of themselves change incredibly quickly, the overall process of making Big Data useful is, in fact, very slow. Working in Big Data requires an enormous amount of training going beyond a traditional data processing background.

Gaining the expertise needed for Big Data also requires understanding of domains that lie outside of traditional math and computer science fields. All work in Big Data areas must start from a solid ethical and legal foundation. Civil engineers are in some cases able to lift a building to add a foundation. With Big Data, this possibility is excluded.

To illustrate this point, it is worth returning to consider the idea of replacing the election with a group of data scientists carrying out Big Data calculations. It is perhaps an extreme example, but it is one that makes clear that ethical and legal considerations must come before Big Data. The election must remain the election because on its own a Big Data calculation has no way of achieving the social trust necessary to ensure continuity of the government. For this we need a cohesive society and we need the law. Unless Big Data starts from ethics and from legal considerations, we risk spending time and effort developing a large number of algorithms that are solving the wrong problems.

Training data scientists while ignoring the ethical and legal implications of Big Data is a short cut that is tempting in the short run, but can do nothing but harm us in the long run.

Big Data as Slow Science
The amount of time and effort needed to make Big Data work might lead us to expect that Big Data should yield some sort of Big Bang, a real scientific revolution. In fact, however, it is the principles of the same old scientific method, centuries old, that we return to in order to define Big Data experiments. In short, Big Data is a natural development of existing practices. Some have even argued that data-driven science pre-dated the digital age, e.g., this essay entitled Is the Fourth Paradigm Really New?

However, it would also be wrong to characterize Big Data as business as usual. A more apt characterization is as follows: Before the Big Data age, scientific research proceeded along the conventional path: researchers would formulate their hypothesis, design their experiment, and then, as the final step, collect the data. Now, the path starts with the data, which inspires and informs the hypothesis. The experimental design must compensate for the fact that the data was "found" rather than collected.

Given this state of affairs, it is easy to slip into the impression that Big Data is "fast" in the sense that it speeds up the process of scientific discovery. After all, the data collection process, which in the past could take years, can be carried out quickly. If the workflow is implemented, a new hypothesis could be investigated in a matter of hours. However, it is important to consider how the speed of the experiment itself influences the way in which we formulate hypotheses. Because there is little cost to running an experiment, there is little incentive to put a great deal of careful thought and consideration into which hypotheses we are testing.

A good hypothesis is one that is motivated by a sense of scientific curiosity and/or societal need, and that has been informed by ample amounts of real-world experience. If there is negligible additional cost to running an additional experiment, we need to find our motivation for formulating good hypotheses elsewhere. The price of thoughtlessly investigating hypotheses merely because they can be formulated given a specific data collection is high. Quick and dirty experiments lead to mistaking spurious correlations for effects, and yield insights that fall short of generalizing to meaningful real-world phenomena, let alone use cases.

In sum, we should remember the "V" of velocity. Big Data is not just data sets, it's also data streams, which makes Big Data also Fast Data. Taking a look at data streams makes it easier to see the ways in which Big Data can go wrong, and why it requires special training, tools, and techniques.

Volume, variety, and velocity have been extended by some to include other "Vs" such as Veracity and Value. Here, I would like to propose "Vigilance". For Big Data to be successful we need to slow down: train people with a broad range of expertise, connect people to work in multi-skilled teams, and give them the time and the resources needed in order to do Big Data right. In the end, the value of Big Data is the new insights that it reveals, and not the speed at which it reveals them.

Tuesday, October 18, 2016

The Societal Impact of Multimedia Research: ACM MM 2016 Brave New Ideas Track (Papers and Slides)

At ACM Multimedia 2016, the Brave New Ideas Track was devoted to the theme "Societal Impact of Multimedia Research". In the Call for Papers, we challenged authors to be brave in pursuing topics that have a direct impact on people's lives. 

What is brave about multimedia research with societal impact? The answer is simple: it takes much more time. To pursue work with direct societal impact it is necessary to work together with other disciplines, create new data resources, and develop new evaluation methodologies that demonstrate success with respect to socially relevant criteria. 

It also takes time for researchers to reach the insight that important new scientific questions arise from directly attempting to create solutions for societal problems. For example, concerns about privacy are currently motivating researchers to turn away from pursuit of "data greedy" algorithms to study how to get more out of less data.

We recommend reading the papers of the track to understand other interesting scientific problems that have been opened up by researchers who have the courage to work in these high societal-impact areas:

Mengfan Tang, Siripen Pongpaichet, and Ramesh Jain. 2016. Research Challenges in Developing Multimedia Systems for Managing Emergency Situations. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 938-947. [ACM DL link][slides]

Andrea Castelletti, Roman Fedorov, Piero Fraternali, and Matteo Giuliani. 2016. Multimedia on the Mountaintop: Using Public Snow Images to Improve Water Systems Operation. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 948-957. [ACM DL link][paper][slides]

Alexis Joly, Hervé Goëau, Julien Champ, Samuel Dufour-Kowalski, Henning Müller, and Pierre Bonnet. 2016. Crowdsourcing Biodiversity Monitoring: How Sharing your Photo Stream can Sustain our Planet. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 958-967. [ACM DL link][paper][slides] See also: the Pl@ntNet App.

Michael Riegler, Mathias Lux, Carsten Gridwodz, Concetto Spampinato, Thomas de Lange, Sigrun L. Eskeland, Konstantin Pogorelov, Wallapak Tavanapong, Peter T. Schmidt, Cathal Gurrin, Dag Johansen, Håvard Johansen, and Pål Halvorsen. 2016. Multimedia and Medicine: Teammates for Better Disease Detection and Survival. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 968-977. [ACM DL link][paper][slides]

Contact: ACM MM 2016 BNI Chairs: Martha Larson (TU Delft and Radboud University Nijmegen) and Hari Sundaram (University of Illinois)

Tuesday, August 16, 2016

Music as Technology: Sad song of missed opportunities for music to do what music does well

Last week my colleagues Andrew Demetriou and Cynthia Liem presented a paper at ISMIR 2016 (the 17th International Society for Music Information Retrieval Conference) entitled "Go with the Flow: When Listeners Use Music as Technology" [1]. The idea of this paper is that listeners use music as a tool that is directed to accomplishing a task.

In the paper, we point to the phenomenon of people using music to put themselves into a flow state: "listeners make a conscious decision to expose themselves to the experience of music to alter their internal state in order to achieve a goal that they have set for themselves." With this paper, we want to encourage the development of music information retrieval technology, including recommender systems, that supports listeners in finding the music that they need in order to support their goals.

It is a chicken and egg problem. Unless systems are there that support users in finding music that allows them to reach their goals, it is hard to study the phenomenon at large scale. Unless we understand the phenomenon, it is hard to develop these systems. The first leap remains one of, well, basically, faith: faith that, given the evidence that we already have on hand, we should push for music information retrieval that recognizes the positive potential of music for allowing people to best use their brains.

We try to keep the focus on the positive potential, but the dark underside of the situation is what happens if our world continues on, with the mainstream being unaware of the effect of music on the brain. The wrong music can prevent certain brain states, as easily as the right music can promote them. When music is not in the control of an individual (such as in a restaurant, cafe, or public place) serious thought is needed about what music is playing and how to play it. Otherwise the music is putting people in a brain state that is neither productive, nor even pleasant.

Stop the noise by Surko
In Chicago, I met a man cleaning tables in a restaurant.

The chain had obviously put a lot of time and money into the decor.

The music was a loud mix of alternating genres. Understanding people speaking was a strain.

When I asked the man about it he got emotional.

"I'm teaching myself guitar," he said. "There are a couple of good country pieces mixed in, but I know them by heart. The whole thing plays over and over again."

"I could tell them what they really should be playing!"

We look to a future in which the people making the music decisions about places like this restaurant care about their sound atmosphere as much as they care about the visual impression of their decor. The music should not only be geared to customers who stop by for lunch, but also to employees, who spend their days listening to it. The experience (and arguably also mental health) of people spending time in the restaurant could be enhanced if the music released them from the grind of repetition. Music may not always allow people to achieve flow state, but it can support them in enjoying being where they are, as much as possible.

Music decision makers should sit up and realize that music matters: music doesn't happen "out there" somewhere, but rather happens inside every person who is listening to it. The leap of faith is not a large one, it just requires listening to people who know that music is important, and asking them how to make things better.

[1] Demetriou, A., Larson, M., and Liem, C. Go with the Flow: When Listeners Use Music as Technology, ISMIR 2016.

Thursday, July 28, 2016

How to use the word "subjective" in multimedia content analysis

Multimedia content analysis is devoted to the automatic processing of video, image, audio, and text content with the purpose of describing it, or otherwise associating it with information that will make it findable, and also useful, to users. Previously, I have urged multimedia content analysis researchers to avoid the word “subjective” and instead formulate their insights in terms of inter-annotator agreement with respect to the data that they are using and the protocol that they give to the annotators who are providing the target labels. Since we don’t seem to be inclined to stop using the word “subjective” soon, it makes sense to formulate some guidelines on how to use it "safely".
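For readers who would like to follow the inter-annotator agreement route, here is a minimal sketch of Cohen's kappa for two annotators; the emotion labels are entirely hypothetical and stand in for whatever target labels your annotators produce.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: observed agreement corrected
    for the agreement expected by chance, given each annotator's own
    label distribution."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical emotion labels from two annotators for ten music clips.
a = ["happy", "sad", "happy", "calm", "sad", "happy", "calm", "sad", "happy", "calm"]
b = ["happy", "sad", "happy", "calm", "happy", "happy", "calm", "sad", "sad", "calm"]
print(round(cohens_kappa(a, b), 2))  # → 0.7
```

Reporting a number like this, together with the annotation protocol, says something verifiable about the data; declaring the phenomenon "subjective" does not.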

Best practice for the use of the word “subjective”: When the word "subjective" is used, it should first be defined.

The word "subjective" has different definitions. It’s not particularly productive to fix any one way of using it as “the only right way”. Instead, when using the word "subjective" you should simply declare which definition you are using, and you will avoid a lot of unproductive confusion. You do not want to risk that you use “subjective” in one sense, and your reader/listener interprets it in another sense.


We can gain further understanding of why it is important to "define well before use" by examining the dictionary entry for “subjective” provided by Merriam-Webster. Here, you can see the many meanings that “subjective” can take on. I haven’t observed any issues caused by definitions 1 or 2. Multimedia content analysis research is generally not interested in these definitions. Where we get into trouble is with 3-5, so I will focus on these.

Let’s start with definition 4c: “arising out of or identified by means of one's perception of one's own states and processes”. This definition of subjective is related to the conceptualization of a situation as being exclusively determined by the point of view of the “subject”, i.e., the person who is undergoing the experience of perceiving something.

Such a conceptualization, in the case of certain situations, is standard, and when we communicate with each other, we don’t even think about the fact that we assume it.  Let’s take a closer look at how this conceptualization works. When we use language, we rely on an unspoken agreement that certain phenomena (for example, the emotion that music evokes in a person) are subjective. Specifically, the agreement means that the way in which we understand the world gives all listeners the power to determine what they feel when listening to music (i.e., induced emotion) for themselves.

Simply stated: if someone says, “This music makes me so happy”, it is nonsensical for me to assert, “No, it doesn’t”. I might say this to tease someone, but it is clear that I am not using language in a standard way. An emotion felt while listening to music can only be asserted by the subject, and I, who am not in the subject’s mind, do not have the power to originate a meaningful statement on the matter. It is not a trivial point: Without this shared understanding, the convention/assumption of subjectivity behind "This music makes me happy", the function of language would break down and we would have failed to communicate.

Here’s where things can go wrong for a researcher working in the area of multimedia content analysis. Imagine you are collecting multimedia content labels from a group of annotators who are judging the content, and, at the end of the experiment, you declare, “The results show that the phenomenon we are studying is subjective”. Readers who are using definition 4c of subjective will find this conclusion invalid. The reason is that under this definition, “subjective” is something that is established ahead of time by convention: it cannot be determined experimentally. (Full disclosure: for me this is the preferred definition of "subjective", because it is the most literal interpretation. The word "subjective" contains the word "subject". I also prefer it since it ensures the sanctity of the private world of the individual, and the right of the individual to an independent voice.)

Moving to 4b: “arising from conditions within the brain or sense organs and not directly caused by external stimuli”. This definition is not so interesting for multimedia researchers: we study multimedia content, which is an external stimulus.

Now, we go on to definition 4a (1): “peculiar to a particular individual: personal”. This definition of subjective is related to the idea that each individual has their own unique view. (Merriam-Webster's definition 1 of "peculiar" is "characteristic of only one person, group, or thing".) Under this definition, if something is "subjective", it means that everyone disagrees with everyone else. This definition is also not so interesting for multimedia researchers: if everyone has their own completely different interpretation, then we are lost: we cannot hope to build algorithms that generalize over the different meanings that people find in multimedia. Until the field of multimedia starts working extensively on systems used only by a single person, this definition of subjective is probably not one that will be used often.

Note that the field of recommender systems strives to develop personalized algorithms, and user evaluation methodologies that assess whether personal predictions are successful. However, even recommender systems rely on the fact that people are similar to each other. In a world populated exclusively with utterly unique individuals, collaborative filtering algorithms would necessarily fail.

More helpful is definition 4a (2): “modified or affected by personal views, experience, or background”. This definition of "subjective" is often implicitly assumed in multimedia content analysis. People’s interpretations are affected by what they know, the opinions they hold, and the life experience that they have had. These factors can lead to there being a multitude of different interpretations that apply to certain multimedia content. However, in contrast to the situation above with definition 4a (1), we are not assuming that everyone has their own “peculiar” interpretations. It makes sense for us to try to create systems that generalize or predict meaning, only in the case that we are not dealing with exclusively unique interpretations.

We can see 4a (2) as closely related to 3b: “relating to or being experience or knowledge as conditioned by personal mental characteristics or states”

With both of these definitions, 4a (2) and 3b, we can reasonably have hope that we can find islands of consistency in the perceptions of users of multimedia (and in the labels of our annotators). Within these islands we can make stable inferences that will be useful to users.

Let’s check again if, under these definitions, you can make a statement in your paper, “The results show that the phenomenon we are studying is subjective”. This time you can. But in order to do so, you need to have an experiment that shows that the background of the users is what is causing your classifier not to give you stable predictions. Otherwise, it might be the case that your classifier just has not been well designed or trained.

You also need to provide evidence that the protocol that your annotators are using to make judgements is not unduly steering people to diverse interpretations. Your protocol should put people reasonably on the same page, and then ask them for judgements, at all times being careful not to ask "leading" questions, cf. [1, 2]. For some research work, you might not be using a protocol. Many tasks involve "found" labels such as tags. In this case, you need to state the assumptions that you are making concerning the original labeling context, including the reasons for which the labels were assigned.

With any definition of subjective, it is important to strictly avoid arguing along these lines: “This phenomenon is subjective, and therefore it is not important and we should not be studying it.” 

Scientifically, there is no a priori reason to prioritize the “objective” over the “subjective” if we use definitions 4a (2) and 3b. It is true that we tend to study phenomena with high inter-annotator agreement since these are easier to get a handle on. However, at the same time we remain aware that this tendency steers us dangerously close to the famous story of Nasreddin Hodja, who looks for his ring outside, since it is too dark inside where he lost it. In short, define “subjective”, but never use it as an excuse for failure or avoidance.

To drive that particular point home: The message is "Keep up your guard". Your problem should arise from the needs of users. Practically speaking, the problem you choose will be influenced by your ability to access the resources needed to study it, including carrying out a well designed, conclusive experiment. It will not, however, be influenced by your personal decision that something is "subjective".

Next, we turn to definition 3a: “characteristic of or belonging to reality as perceived rather than as independent of mind.” Using this definition is dangerous. It forces you to take a position on the difference between effects that are real, and effects that are imagined. As scientists, we determine this difference experimentally. We do not presume it. Unless we are undertaking experiments directed at determining this difference, it makes sense to steer clear of this definition.

Finally, definition 5: “lacking in reality or substance”. The same comment applies as in the case of definition 3a. We cannot a priori say whether patterns that can be found in multimedia content lack reality or substance. If we don’t find evidence for the reality of some phenomenon in our data, it simply means that there is no evidence for its reality in our data. Lack of observation does not disprove existence. We must guard ourselves against jumping to conclusions. Again, this is a definition to be avoided, unless you are actually directly investigating the nature of reality.

As researchers in the area of multimedia content analysis, we must carefully keep ourselves from creating our own realities: the reality we assume must be the reality (possibly multiple realities) of the users that we serve, all of them. The fact that we do not necessarily understand this reality fully, or have the type of information or data that would capture it in its complexity, richness, and continuous, rapid evolution, is a challenge that we face. This challenge is inherent to the types of algorithms and technologies that we design and develop.

[1] Larson, M., Melenhorst, M., Menéndez, M. and Peng Xu. Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content. In: Ionescu, B. et al. Fusion in Computer Vision – Understanding Complex Visual Content, Springer, pp. 229-269, 2014.

[2] M. Riegler, V. R. Gaddam, M. Larson, R. Eg, P. Halvorsen and C. Griwodz, "Crowdsourcing as self-fulfilling prophecy: Influence of discarding workers in subjective assessment tasks," 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), Bucharest, 2016, pp. 1-6.

Saturday, June 18, 2016

Multimedia analysis out of the box: new applications and domains

Flickr: Tom Magliery
This blogpost summarizes the panel on the third day of the 14th International Workshop on Content-based Multimedia Indexing CBMI 2016. It includes both statements by the panelists and comments coming from the audience. I was the panel moderator, and was also taking notes as people were speaking (any error in reproducing what people said here is strictly my own).

The panel was structured into three rounds roughly related to the past, present, and future of multimedia analysis research. Each round had an “opener” that the panelists were asked to respond to, and then continued in free form, with the audience also contributing.

First round: The panelists were asked to discuss, “A past vision (that you have had during the last 20 years) for a multimedia analysis application that came to be.”

The early work of GroupLens started a user revolution. It was great to have recommender systems break onto the scene. Their introduction shifted the focus of the community of researchers, also those studying information/multimedia access, from pure computation to involving users. This shift was possible because computers could collect user interactions, providing researchers with large sets of interactions to work with. Recommender systems introduced the key idea that users can benefit from other users, and this idea has come into its own.

Historically, multimedia indexing started with spoken content indexing. (This statement carried the “footnote” that the panelists and the panel moderator all have a speech background.) In the past years, we have seen the maturation of speech and language technology. Now we are on the brink of systems that index all spoken information in multimedia. (But let’s keep breathing in the meantime.)

The panel noticed that it is easier to name past visions that still have not completely come to be. Examples were:

First person video: In the late 1990s, video life logging started. The goal was to summarize daily life, and to aid memory and remembering. Privacy is a real stumbling block for this vision. However, now we are seeing first-person cameras like the GoPro: so perhaps video life logging is here, but it is not exactly what we thought it was going to be.

Users: Ten years ago we were developing algorithms for applications, but there was a sense that they would never be put to use. The field of multimedia analysis is now more user centered, although not yet completely so: we are on our way. Sometimes it’s not gaining 5% MAP that makes a product usable. Instead, we need to think along different lines.

Education: The panel was in agreement that we have yet to see multimedia reach its potential as a tool for education. This could and should be the century of education!

In the early 1990’s multimedia retrieval and spoken content retrieval were intended to support education. Today, we see that education is still mainly about books. MOOCs and online learning resources are growing in popularity, but we are still waiting for multimedia indexing to really contribute to education at large scales.

We used to have the vision that kids should be able to play with information and to communicate with each other as part of studying and learning: These types of applications were fun. What happened to this kind of work? It is a shame that this hasn’t really been put to mainstream use: Is this the responsibility of the multimedia people?

Well, yes. We are all teachers in a way: Why don’t we eat our own dogfood? Looking at this conference, our presentations are all text-heavy sets of PowerPoint slides!

Why is the willingness of teachers and journalists to use multimedia tools so low? Do we need to wait until everyone in the world becomes tech friendly to have our research put to use?

Maybe we just don’t have the tools necessary to allow multimedia indexing to come into its own in support of education. We need the tools in order to engage teachers.

We don’t have the time to do education-related research. You can’t just do a 10-minute experiment with data from 30 people: people are complex; kids are complex! We haven’t been willing to take the time to work with teachers: we haven’t had funding for a 5-10 year sustained effort in this area. But it’s a worthwhile goal.

We need to understand the nature of education. There is a relationship between student and teacher: it is a human relationship. A machine might not be able to motivate the student.

This observation about student behavior stands in contrast to the success of video games in motivating kids. Games appear to motivate kids more than their parents are able to do. However, today’s games are too simplistic to be an education tool. They don’t reflect real breadth.

Final note of the first round: It seems that multimedia analysis researchers don’t talk about “killer applications” anymore. The way we see our success is more diffuse, and maybe that is also OK.

Second round. Panel members were asked to discuss “A current (widely-held) vision for a multimedia analysis application that is doomed.”

Our panelists jumped on the opportunity to be controversial.

Is lifelogging doomed?
Multimedia researchers of course love the huge amounts of data that life logging delivers. But do people really want their lives to be logged? Why would I want all of those pictures? Are we just recording without a real application?

When we are healthy and in good shape we have perhaps no reason to record our lives. But when we become older, or find ourselves in a situation where we need to manage an illness, things change. In this case, lifelogging applications are tremendously interesting. For elderly people living alone, it can be a real help, although it does not replace human company.

Why don’t we see this technology being widely used? The problem is not the market. The problem is that we are not marketing or business people: we need someone else to put this technology on the market. This process for doing so is a mess! We develop nice applications, but we need to move on, and the business development never gets done.

Is virtual reality doomed?
We are not in a virtual space having a virtual conference. We are here. Virtual meeting rooms have not come to be and video conferencing fatigue is real. Virtual reality works great in games. Perhaps also in demonstrating things. But in general, augmented reality appears to be the more promising path.

Is multimedia analysis of broadcast television doomed? 
Analysis of news, sports, movies, in fact, any produced content is over. If someone can produce the content, they can also dedicate the effort to annotate it.

A less extreme version of that position is probably, however, more appropriate. When we carry out multimedia research, often produced content is the only content we have. Not every content producer has the resources to create annotations. Finally (as noted by the moderator), some types of annotations are against the business interests of the people producing multimedia content: Do film producers really want audiences to have a fine-grained breakdown of the violence in a film?

The panel agreed that analysis of produced content is very important for knowledge extraction and summaries of large, heterogeneous collections. You can extract knowledge and facts: for example, the president needs a 20-minute summary.

Professionals or specific applications often need detailed summaries: there would be value in summarizing, for example, the soccer moves of a certain player for practice or for strategy purposes.

Personal content often needs summarization: parents like highlights of school games or performances that feature their own children.

Are standards doomed? 
Standards make sense for compression and communication, but standards have been pushed too far. Many researchers identify with this situation: you barely know what you’re doing, and you make a standard for it. However, the activity that takes place around the production of standards gives rise to new ideas. The fact that descriptors were encoded in MPEG-7 gave rise to a lot of further work on descriptors.

Perhaps a more direct way of achieving the same effects is via reference implementations and toolkits. OpenCV is effectively, although not formally, a standard. These kinds of efforts are very important.

Third round: Panel members were asked for “A future vision for a multimedia analysis application that we should strive for.”

The opening comment was interesting and unexpected: As an early-career researcher in multimedia, one is drawn to problems that one likes, and that attract and hold one’s attention. However, as a late-career researcher, one looks back and starts to regret not having considered the contribution that one’s career was making to society.

Multimedia for medicine: Young multimedia researchers should consider “joining the doctors”: the field of medicine needs us.

Human rights: Another area with enormous potential social impact is multimedia for human rights. We need algorithms that will allow us to find evidence of violations: examples are the analysis of aerial photos to search for hidden destruction, and the reconstruction of events using social media.

We need (footnote by moderator) technology that is able to verify the extent to which multimedia reflects the reality that it claims to capture: and, in particular, identify multimedia created with the intent to deceive.

Low quality content is key: Interestingly, some of the most highly socially relevant applications for multimedia involve processing some of the worst images. Multimedia researchers need to be brave enough to venture into areas where content is poor quality, difficult to obtain, and (footnote by moderator) where evaluation of success is highly challenging.

User intent: Multimedia information retrieval has recently experienced the “intent revolution”: the change from focusing on the nature of the items that users are trying to find, to the tasks that users are trying to achieve. Supporting people in their daily lives is not as obviously socially relevant as education, medical, or human rights applications. However, it has an important contribution to make.

Affective computing: We look forward to multimedia systems that support us in the emotional aspects of communicating with multimedia: sharing and mutual remembering. Humans are social creatures (isolation causes us to suffer). Shared experiences allow us to build relationships, share values, and keep the connections needed for social and psychological well-being. Regretfully, current research on affect and sentiment simplifies the emotional aspects of multimedia to the extent that it may be “trivial”. We need to work towards understanding both multimedia and the mind: a key question is: What pieces need to come together in order for someone to experience the reproduction of a memory or an experience?

Hardware and energy consumption: We should not forget that multimedia analysis is possible because of the devices that capture, store and process multimedia. We are ever dependent on hardware. Processing of multimedia costs energy: and future work should also keep energy efficiency in mind.

Closing comments:
When we study multimedia, we study communicating with multimedia. Moving forward it is important to keep the human in human communication.

Is there an end to multimedia? Can we foresee that it might be replaced by something completely different?

We see multimedia as an “everlasting field” encompassing applications that have not yet been invented. However, we should continue to call it “multimedia”, because continuity of what we call it will allow us to build on the past.

Currently, we see more and more other communities doing multimedia: examples are the computer vision community and the speech and language processing community. Having a distinct identity will allow the other fields to avoid reinventing the wheel.

We saw during the first round of the panel that looking back over the past 20 years, we did not do so well in formulating predictions which came true: the technologies that we anticipated have not achieved mainstream uptake (with a few notable exceptions). It’s not dramatic to be wrong in our predictions. However: it is important that we learn from our mistakes.

In general, we do not expect all early-career multimedia researchers to connect to socially relevant applications by “joining the doctors”. But it is good to have a larger vision. When you are writing a paper, embed your ideas within an overall picture of their potential. Embrace the larger meaning of your work and imbue multimedia research with a sense of mission.

A big thank you to our panelists and to the members of the audience who contributed to the discussion.

Guillaume Gravier, IRISA, France
Alexander Hauptmann, Carnegie Mellon University, USA
Bernard Merialdo, EURECOM, France

Audience contributors:
Jenny Benois Pineau, University of Bordeaux, France
Bogdan Ionescu, University Politehnica of Bucharest, Romania
Georges Quénot, LIG, France
Stéphane Marchand Maillet, University of Geneva, Switzerland
Mathias Lux, Klagenfurt University, Austria

Thursday, April 21, 2016

Horizons: Multimedia Technologies that Protect Privacy

The Survey on Future Media for the new H2020 Work Programme gave me 500 characters each to answer a series of critical questions. I’m listing questions and my answers below. I'm taking this as my chance to pull out all the stops: extreme caution meets idealism. Did I use my characters wisely?

Describe which area the new research and innovation work programme of H2020 should look at when addressing the future of Media.

Non-Obvious Relationship Awareness (NORA) is a set of data mining techniques that finds relationships between people and events in data that no one would expect to exist. European citizens sharing images or videos online have no way of knowing what sorts of information they are revealing about themselves. We need innovative research on media processing techniques that protect people's privacy by warning them when they are sharing such information, and that obfuscate media to make it safe for sharing.

What difference would projects in the area you propose make for Europe's society and citizens?

Projects in this area would contribute to safeguarding the fundamental right of European citizens to privacy and protection of personal data. Today, privacy protection focuses on protecting "obvious" personal information. This protection means nothing when personal information is obtainable in "non-obvious" form. European citizens need tools to understand the dangers of sharing media in cyberspace, and tools that can support them in making informed decisions and protecting themselves.

What are the main technological and Media ecosystem related breakthroughs to achieve the foreseen scenario?

The Media ecosystem in question is the whole of cyberspace. The breakthrough that we need is techniques to predict the impact of data that we have not yet seen entering the system. We need techniques that are able to obfuscate images and videos in ways that defeat sophisticated machine learning algorithms, such as deep learning techniques. These technologies must be designed from the beginning in a way that is understandable and acceptable to the general population: protection only works if it is used.

What kind of technology(ies) will be involved?

The technologies involved are image, text, audio, and video processing algorithms. These algorithms will re-synthesize users' multimedia content so that it still fulfills its intended function, but with a reduced risk of leaking private information. The technology must go beyond big data to be aware of hypothetical future data. As yet unheard of: technology capable of protecting users' privacy against the inference of non-obvious relationships must be understandable by the people it is intended to serve.

Describe your vision on the future of Media in 5 years' time?

People will begin to worry about large companies claiming to own (and attempting to sell back to them) digital versions of their past selves, forgotten on distant servers. The realization will grow that it is not enough to have a device that takes amazing images and videos; you also need a device that allows you to save and enjoy those images in years to come. An understanding will emerge that a rich digital media chronicle of one's own life contributes to health, happiness and wellbeing.

Describe your vision on the future of Media in 10 years' time?

Social images circling the globe will give people unprecedented insight into the human condition. People living in both developed and developing countries will rebel at anyone in the human race living under conditions of constant fear and threat of constant hunger. The world will change. If protecting privacy means that people need to stop sharing images and videos altogether, the opportunity to fulfill this idealistic vision is missed. The future of Media is bright, but only if it can be kept safe.

At the end of the day, multimedia is about making the world healthy, happy, and complete. At the end of this exercise I have concluded that the horizon stretches even further than 2020.

Sunday, April 3, 2016

Starting to RUN

Thank you for the email, tweets and texts about my new appointment at Radboud University Nijmegen. I'm happy that other people realize what a special day it was for me, and share my excitement about new opportunities and new challenges. I appreciate the warm reception at Radboud University. The "Welcome!" was unmistakeable: actually written on my whiteboard, when I walked into my office in the Center for Language Studies for the first time.

My appointment is as "Professor of Multimedia Information Technology" at the Faculty of Science, Institute for Computing and Information Sciences (iCIS). It involves a double affiliation (50/50) between iCIS and the Faculty of Arts, Centre for Language Studies (CLS). In this way, it brings together my background (pre-1990 in Math and EE; 1990-2000 in Formal Linguistics; and since 2000 in Computer Science, i.e., audio-visual search engines). It is a natural extension of this background that I will be working to bridge the research occurring on information access between the two faculties.

A press release about my appointment appeared on 31 March on the Radboud University homepage. I was very happy about the publicity for the MediaEval Multimedia Evaluation Benchmark. MediaEval is an initiative aimed at driving the development of new multimedia access technologies by offering shared tasks to the community. Instead of being centrally organized, it is grassroots in nature. My role is that of the bass player in a band, who helps to link the different parts together and keeps the music moving forward on tempo. The success of the benchmark comes from the dedication and efforts of the task organizers and the participants. (MediaEval is offering a great lineup of tasks in 2016, and signup is now open on the MediaEval 2016 website. The MediaEval 2016 workshop will be held 20-21 October 2016, right after ACM Multimedia 2016 in Amsterdam.)

Starting January 2017, Radboud University will be my main university (4 days per week), but I will maintain an affiliation with Delft University of Technology (1 day a week).

Currently, my main affiliation remains the Multimedia Computing Group at Delft University of Technology. However, I am at Radboud University Nijmegen for two days a week to get started at CLS. My first act is to teach Intelligent Information Tools, a course for first- and second-year undergraduate students in Communication and Information Science. The students learn about the nature of information, the structure of the internet, how search, recommendation, and other information tools work, and also how to think critically about these tools.

At TU Delft I continue teaching and pursuing my research. The main focus of my research at this time is recommender systems, within the context of the EC FP7 project CrowdRec "Fusion of active information for next generation recommender systems". It is a privilege to serve the CrowdRec consortium as the scientific coordinator. Current highlights are: the NewsREEL news recommendation challenge at CLEF 2016, the ACM RecSys 2016 job recommendation challenge, and the Workshop on Deep Learning for Recommender Systems, also at ACM RecSys 2016. I look forward to a successful conclusion of the project in September 2016, and also to future collaborations.

Seven years ago, nearly to the day, I wrote the first post on this blog. I had read an article advising "kill your blog" as an answer to blog posts getting lost in a sea of mainstream information. My post pointed out that it is strange to suggest that bloggers must change, without mentioning the role or responsibility of search engines.

Now, I am more convinced than ever of the value of information within small circles. Search needs to support exploitation of that value. The readership of this blog is intended to be future versions of myself, and also a limited number of people interested in a deep dive into reflections on various search-related topics. As I move to a new university, and the number of people I teach or collaborate with grows, I would like to remember that. I'll probably have less time to write blog posts, but I have decided to wait a few more years before moving away from occasional blogging.

Creating information is a way in which we help ourselves think. Intense conversations also refine thought. But the model of everyone talking to everyone about everything does not always make sense. Instead, we need room for reflection with a relatively small set of individuals. Search should support that.

What's blocking the road? Maybe we feel that small-scale search is a success because Google now displays calendar events in our search results. Maybe facing the personal is somehow more laborious or painful. In any case, we are currently far from understanding the aggregated impact of thousands of local dialogues, or from evaluating the success of small-scale search that helps us exchange ideas with our past selves and our closest colleagues. The future holds no lack of challenges.

Saturday, March 5, 2016

A Non Neural Network algorithm with "superhuman" ability to determine the location of almost any image

Martha Larson and Xinchao Li

We would like to complement the MIT Technology Review headline Google Unveils Neural Network with “Superhuman” Ability to Determine the Location of Almost Any Image with information about NNN (Non Neural Network) approaches with similar properties.

This blogpost provides a comparison between the DVEM (Distinctive Visual Element Matching) approach, introduced by our recent arXiv manuscript (currently under review): 

Xinchao Li, Martha A. Larson, Alan Hanjalic. Geo-distinctive Visual Element Matching for Location Estimation of Images (submitted 28 Jan 2016)

and the PlaNet approach, introduced by the arXiv manuscript covered in the MIT Technology Review article:

Tobias Weyand, Ilya Kostrikov, James Philbin. PlaNet—Photo Geolocation with Convolutional Neural Networks (submitted 17 Feb 2016)

We also include, at the end, a bit of history on the problem of automatically "determining the location of images",  which is also known as geo-location prediction, geo-location estimation as in [3], or, colloquially, "placing" after [4].

Our DVEM approach is a search-based approach to the prediction of the geo-location of an image. Search-based approaches consider the target image (the image whose geo-coordinates are to be predicted) as a query. They then carry out content-based image search (i.e., query-by-image) on a large training set of images labeled with geo-coordinates (referred to as the "background collection"). Finally, they process the search results in order to make a prediction of the geo-coordinates of the target image. The most basic algorithm, Visual Nearest Neighbor (VisNN), simply adopts the geo-coordinates of the image at the top of the search results list as the geo-coordinates of the target image.  Our DVEM algorithm uses local image features for retrieval, and then creates geo-clusters in the list of image search results. It adopts the top ranked cluster, using a method that we previously introduced [5, 6]. The special magic of our DVEM approach is the way that it reranks the clusters in the results list: it validates the visual match at the cluster level (rather than at the level of an individual image) using a geometric verification technique for object/scene matching we previously proposed in [7], and it leverages the occurrence of visual elements that are discriminative for specific locations.
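The search-based pipeline described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the DVEM implementation: the function and parameter names are our own, clusters here are fixed latitude/longitude grid cells ranked by summed visual similarity, and the real method re-ranks clusters using geometric verification and geo-distinctive visual elements.

```python
import math
from collections import defaultdict

def geo_cluster_predict(ranked_results, cell_size_deg=0.01):
    """Toy search-based geo-prediction.

    ranked_results: image search results for the target image, given as
    (latitude, longitude, similarity_score) tuples drawn from the
    geo-labeled background collection.
    Returns (lat, lon): the score-weighted centroid of the
    highest-scoring geo-cluster.
    """
    # Group retrieved images into coarse lat/lon grid cells.
    clusters = defaultdict(list)
    for lat, lon, score in ranked_results:
        cell = (math.floor(lat / cell_size_deg),
                math.floor(lon / cell_size_deg))
        clusters[cell].append((lat, lon, score))

    # Rank clusters by the total similarity of their member images.
    best = max(clusters.values(), key=lambda c: sum(s for _, _, s in c))

    # Predict the score-weighted centroid of the winning cluster.
    total = sum(s for _, _, s in best)
    pred_lat = sum(la * s for la, _, s in best) / total
    pred_lon = sum(lo * s for _, lo, s in best) / total
    return pred_lat, pred_lon
```

In this toy setting, two moderately similar results from the same neighborhood can outvote a single, more similar result from elsewhere, which is exactly the intuition behind moving from VisNN to cluster-based prediction.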

The PlaNet approach divides the surface of the globe into cells with an algorithm that adapts to the number of images in its training set that are labeled with geo-coordinates for that location, i.e., a location that has more photos will be divided into finer cells. Each cell is considered a class, and is used to train a CNN classifier.
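The adaptive subdivision can be illustrated with a toy quadtree over latitude and longitude. This is only a sketch of the adapt-to-density principle, with hypothetical names; the actual PlaNet partitioning is built on a hierarchical cell decomposition of the sphere, and the resulting cells become the classes of the CNN classifier.

```python
def adaptive_cells(photos, max_photos_per_cell,
                   bounds=(-90.0, 90.0, -180.0, 180.0)):
    """Recursively split a region into four quadrants until no cell holds
    more than max_photos_per_cell geo-tagged photos, so that dense areas
    end up with finer cells.

    photos: list of (lat, lon) tuples.
    Returns a list of ((lat_lo, lat_hi, lon_lo, lon_hi), photo_count).
    """
    lat_lo, lat_hi, lon_lo, lon_hi = bounds
    inside = [(la, lo) for la, lo in photos
              if lat_lo <= la < lat_hi and lon_lo <= lo < lon_hi]
    # Stop when the cell is sparse enough (or implausibly tiny).
    if len(inside) <= max_photos_per_cell or (lat_hi - lat_lo) < 1e-3:
        return [(bounds, len(inside))] if inside else []
    lat_mid = (lat_lo + lat_hi) / 2
    lon_mid = (lon_lo + lon_hi) / 2
    cells = []
    for la0, la1 in ((lat_lo, lat_mid), (lat_mid, lat_hi)):
        for lo0, lo1 in ((lon_lo, lon_mid), (lon_mid, lon_hi)):
            cells += adaptive_cells(inside, max_photos_per_cell,
                                    (la0, la1, lo0, lo1))
    return cells
```

Running this on a photo set with a dense cluster in one city and a lone photo elsewhere yields many small cells around the cluster and one large cell around the isolated photo, which is the behavior the classifier relies on.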

Further comparison of the way the algorithms were trained and tested in the two papers:

Training set size. DVEM: 5M images train, 2K validation. PlaNet: 91M train, 34M validation.
Training set selection. DVEM: CC Flickr images with geo-locations (MediaEval 2015 Placing Task). PlaNet: Web images with Exif geolocations.
Training time. DVEM: 1 hour on 1,500 cores for 5M photos, for indexing and feature extraction. PlaNet: 2.5 months on 200 CPU cores.
Test set size. DVEM: ca. 1M images. PlaNet: 2.3M images.
Test set selection. DVEM: CC Flickr images (MediaEval 2015). PlaNet: Flickr images with 1-5 tags.
Train/test de-duplication. DVEM: train/test sets mutually exclusive wrt uploading user. PlaNet: CNN trained on near-duplicate images.
Data set availability. DVEM: via MM Commons on AWS. PlaNet: not specified.
Model size. DVEM: 100GB for 5M images. PlaNet: 377MB.
Baselines. DVEM: GVR [6], MediaEval 2015. PlaNet: IM2GPS [8].

From this table, we see that the training and test data for the algorithms are different, and for this reason we cannot compare the accuracy measured for the two approaches directly. However, the numbers at the 1 km level (i.e., street level) suggest that DVEM and PlaNet are playing in the same ballpark. PlaNet reports correct predictions for 3.6% of the images on its 2.3M-image test set, and 8.4% on the IM2GPS data set (237 images). Our DVEM approach achieves around 8% correct predictions on our 1M-image test set, and is surprisingly robust to the exact choice of parameters. DVEM gains 12% relative performance over VisNN, and 5% over our own previous GVR. Note that [6] provides evidence that GVR outperforms IM2GPS [8]. PlaNet also reports that it outperforms IM2GPS, but the numbers are not directly comparable because IM2GPS was trained with 14x less data.

The downside of search-based approaches is prediction time, as pointed out by the PlaNet authors in their discussion of IM2GPS. DVEM requires 88 hours on a Hadoop-based cluster containing 1,500 cores to make predictions for 1M images. For applications requiring offline prediction this may be fine; however, we assume that online geo-prediction is also important. We point out that with enough memory, or an efficient index compression method, we would not need Hadoop, and we would be able to do the prediction on a single core at about 2s per query. Further, the question of how runtime scales is closely related to the question of the number of images that are actually needed in the background collection. Our DVEM approach uses 18x less training data than the PlaNet algorithm: if we are indeed in the same ballpark, this result calls into question the assumption that prediction accuracy will not saturate after a certain number of training images.

We mention a couple of reasons why DVEM might ultimately turn out to outperform PlaNet. First, the PlaNet authors point out that the discretization hurts accuracy in some cases. DVEM, in contrast, creates candidate locations "on the fly". As such, DVEM has the ability to make a geo-prediction at an arbitrarily fine geo-resolution.

Second, the test set used to test DVEM is possibly more challenging than the PlaNet test set because it does not eliminate images without tags. We assume that the presence of a tag is at least a weak indicator of care on the part of the user. A careless user might also engage in careless photography, producing images that are low quality and/or are not framed to clearly depict their subject matter. A test set containing images taken by relatively more careful users could be expected to yield a higher accuracy.

Third, we assume that when near duplicates were eliminated from the PlaNet test/training set, that these were near duplicates from the same location. Eliminating images that are very close visual matches with other locations would, of course, artificially simplify the problem. However, it may also turn out that the elimination artificially makes the problem more difficult. In real life, a lot of people simply do take the same picture, for example, of the leaning tower of Pisa. A priori it is not clear how near duplicates should be eliminated to ensure the testing setup maximally resembles an operational setting.
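As an aside, one common family of techniques for detecting near duplicates is perceptual hashing. The PlaNet manuscript does not spell out its de-duplication procedure, so the following average-hash sketch (with our own illustrative names, operating on tiny grayscale arrays rather than real images) is only meant to show the flavor of such matching:

```python
def average_hash(pixels):
    """Average hash (aHash) of a small grayscale image.

    pixels: 2D list of intensity values (e.g., a downsampled 8x8 image,
    values 0-255). Returns a bit string: 1 where a pixel is brighter
    than the image mean, 0 otherwise.
    """
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return ''.join('1' if v > mean else '0' for v in flat)

def hamming(h1, h2):
    # Number of differing bits between two equal-length hash strings.
    return sum(a != b for a, b in zip(h1, h2))

def is_near_duplicate(p1, p2, threshold=5):
    # Two images whose hashes differ in few bits are likely near duplicates.
    return hamming(average_hash(p1), average_hash(p2)) <= threshold
```

Because the hash only records each pixel's relation to the image mean, a uniformly brightened copy of an image hashes identically, while an image with a different spatial pattern lands far away in Hamming distance.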

The PlaNet paper was a pleasure to read, the name "PlaNet" is truly cool, and we are enthused about the small size of the resulting model. We are interested by the fact that PlaNet produces a probability distribution over the whole world, although we also remark that DVEM is capable of producing top-N location predictions. We also liked the idea of exploiting sequence information, but think that considering temporal neighborhoods rather than temporal sequences might also be helpful. Extending DVEM with either temporal sequences or neighborhoods would be straightforward.

We hope that the PlaNet authors will run their approach using the MediaEval 2015 Placing Task data set so that we are able to directly compare the results. In any case, they will want to revisit their assertion that "...previous approaches only recognize landmarks or perform approximate matching using global image descriptors" in the light of the MediaEval 2015 Placing Task results, including our DVEM algorithm.

We would like to point out that work on algorithms able to predict the location of almost any image has been ongoing in full public visibility for a number of years. (Although given our field, we also enjoy the delicious jolt of a headline beginning "Google unveils...") The starting point can be seen as Mapping the World's Photos [9] in 2009. The MediaEval Multimedia Evaluation benchmark has been developing solutions to the problem since 2010, as chronicled in [10]. The most recent contribution was the MediaEval 2015 Placing task [11], cf. the contributions that use visual approaches to the task [12,13]. The MediaEval 2015 data set is part of the larger, publicly available YFCC100M data set, part of Multimedia Commons, and recently featured in Communications of the ACM [14]. MediaEval 2016 will offer a further edition of the Placing Task, which is open to participation for any research team who signs up.

We close by returning to comment on the importance of NNN (Non Neural Network) approaches. This example of the strength of DVEM vs. PlaNet demonstrates that there is reason for the research community to retain a balance in its engagement with NN and NNN approaches. One appealing aspect of NNN approaches, and in particular of search-based geo-location prediction, is the relative transparency of how the data is connected to the prediction. It may sound like science fiction from today's perspective, but one could imagine a future in which the person who took an image would receive a micro-fee every time their image was used to predict geo-location metadata for someone else. Such a system would encourage people to take images that are useful for geo-location, and move us forward as a whole.

We would like to thank the organizers of the MediaEval Placing task for making the data set available for our research. Also a big thanks to SURF SARA for the HPC infrastructure without which our work would not be possible.

[1] Xinchao Li, Martha A. Larson, Alan Hanjalic. Geo-distinctive Visual Element Matching for Location Estimation of Images. arXiv, submitted 28 Jan 2016.

[2] Tobias Weyand, Ilya Kostrikov, James Philbin. PlaNet—Photo Geolocation with Convolutional Neural Networks. arXiv, submitted 17 Feb 2016.
[3] Jaeyoung Choi and Gerald Friedland. 2015. Multimodal Location Estimation of Videos and Images. Springer Publishing Company, Springer.
[4] P. Serdyukov, V. Murdock, R. van Zwol. 2009. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), ACM, New York, pp. 484–491.
[5] Xinchao Li, Martha Larson, and Alan Hanjalic. 2013. Geo-visual ranking for location prediction of social images. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR '13). ACM, New York, NY, USA, 81-88.
[6] Xinchao Li, Martha Larson, and Alan Hanjalic. Global-Scale Location Prediction for Social Images Using Geo-Visual Ranking. IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 674-686, May 2015.
[7] Xinchao Li, Martha Larson, Alan Hanjalic. 2015. Pairwise Geometric Matching for Large-scale Object Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), pp. 5153-5161.
[8] J. Hays and A. A. Efros, "IM2GPS: estimating geographic information from a single image," Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, Anchorage, AK, 2008, pp. 1-8.
[9] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. 2009. Mapping the world's photos. In Proceedings of the 18th international conference on World wide web (WWW '09,) ACM, New York, 761-770.
[10] Martha Larson, Pascal Kelm, Adam Rae, Claudia Hauff, Bart Thomee, Michele Trevisiol, Jaeyoung Choi, Olivier Van Laere, Steven Schockaert, Gareth J.F. Jones, Pavel Serdyukov, Vanessa Murdock, Gerald Friedland. 2015. The Benchmark as a Research Catalyst: Charting the Progress of Geo-prediction for Social Multimedia. In [3].
[11] Jaeyoung Choi, Claudia Hauff, Olivier Van Laere, Bart Thomee. The Placing Task at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, online.
[12] Lin Tzy Li, Javier A.V. Muñoz, Jurandy Almeida, Rodrigo T. Calumby, Otávio A. B. Penatti, Ícaro C. Dourado, Keiller Nogueira, Pedro R. Mendes Júnior, Luís A. M. Pereira, Daniel C. G. Pedronette, Jefersson A. dos Santos, Marcos A. Gonçalves, Ricardo da S. Torres. RECOD @ Placing Task of MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, online.
[13] Giorgos Kordopatis-Zilos, Adrian Popescu, Symeon Papadopoulos, Yiannis Kompatsiaris. CERTH/CEA LIST at MediaEval Placing Task 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, online.
[14] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li. YFCC100M: The New Data in Multimedia Research. Communications of the ACM, Vol. 59 No. 2, Pages 64-73.

Sunday, February 7, 2016

MediaEval 2015: Insights from last year's experiences in multimedia benchmarking

This blog post is a list of bullet points concerning MediaEval 2015. It represents the "meta-themes" of MediaEval that I perceived to be the strongest during the MediaEval 2015 season, which culminated with the MediaEval 2015 Workshop in Wurzen, Germany (14-15 September 2015). I'm putting them here, so we can look back later and see how they are developing.
  1. How not to re-invent the wheel? Providing task participants with reading lists of related work and with baseline implementations helps ensure that it is as easy as possible for them to develop algorithms that extend the state of the art.
  2. Reproducibility and replication: How can we encourage participants to share information about their approaches so that their results can be reproduced or replicated? How can we emphasize the importance of reproduction and replication and at the same time push for innovation, and forward movement in the state of the art (and avoid re-inventing the wheel as just mentioned)? One answer that arose this year was to reinforce student participation. Students should feel welcome at the workshop, even if they “just” reproduced an existing workflow.
  3.  Development of evaluation metrics for new tasks: Innovating a new task may involve developing a new evaluation metric. All tasks face the challenge of ensuring that they are using an evaluation metric that faithfully reflects usefulness to users within an evaluation scenario.
  4. How to make optimal use of leaderboards in evaluation: Participants should be able to check on their progress over the course of the benchmark, and aspire to ever-greater heights. However, it is important that leaderboards not discourage participants from submitting final runs to the benchmark. It is possible that an innovative new approach does very badly on the leaderboard, but is still valuable.
  5. Understanding the relationship between the conceptual formulation of the task, and the dataset that is chosen for use in the task: Are the two compatible? Are there assumptions that we are making about the dataset that do not hold? How can we keep task participants on track: solving the conceptual formulation from the task, and not leveraging some incidental aspect of the dataset?
  6. Disruption: Tasks are encouraged to innovate from year to year. However, 2015 was the first year that organizers started planning far ahead for “disruption” that would take the task to the next level in the next year.
  7. Using crowdsourcing for evaluation: How to make sure that everyone is aware of and applies best practices? How to ensure that the crowd is reflective of the type of users in the use scenario of the task?
  8. Engineering: Task organization involves an enormous amount of time and dedication to engineering work. We continuously seek ways to structure organizer teams and to recruit new organizers and task auxiliaries to make sure that no one feels that their scientific output suffered in a year in which they spent time handling the engineering aspects of MediaEval task organization.
  9. Defining tasks and writing task descriptions: We repeatedly see that the process of defining a new task and of writing task descriptions must involve a large number of people. If people with a lot of multimedia benchmarking experience contribute, they can help to make sure that the task definition is well grounded in the existing literature. If people with very little experience in multimedia benchmarking contribute, they can help to make sure that the task definition is understandable even to new participants. We try to write task descriptions such that a master's student planning to write a thesis on a multimedia-related topic would easily understand what is required for the task.

In order to round this off to a nice "10" points let me mention another issue that is constantly on my mind, namely, the way that the multimedia community treats the word "subjective".

"Subjective" is something that one feels oneself as a subject (and cannot be directly felt by another person---pain is the classic example). In MediaEval tasks, such as Violent Scene Detection, we would like to respect the fact that people are entitled to their own opinions about what constitutes a concept. Note that people can communicate very well concerning acts of violence, without all having an exactly identical idea of what constitutes "violence". Because the concept "works" in the face of the existence of person perspectives, we can consider the task "subjective". 

So often researchers reason in the sequence, "This task is subjective, therefore it is difficult for automatic multimedia analysis algorithms to address". That reasoning simply does not follow. Consider this example: Classifying a noise source as painful is the ultimate "subjective task". You as a subject are the only one who knows that you are in pain. However: Create a device that signals "pain" when noise levels reach 100 decibels, and you have a solution to the task. Easy as pie. "Subjective" tasks are not inherently difficult. 
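The noise-pain device from the example above can be sketched in a few lines (my own illustrative sketch; the 100 decibel threshold comes from the example, while the function and variable names are hypothetical):

```python
# A "subjective" task solved by a trivial device: signal "pain"
# whenever the measured noise level reaches a fixed threshold.
PAIN_THRESHOLD_DB = 100.0

def signals_pain(noise_level_db: float) -> bool:
    """Return True when the noise level reaches the pain threshold."""
    return noise_level_db >= PAIN_THRESHOLD_DB

# A few sample readings in decibels:
readings = [65.0, 99.9, 100.0, 120.0]
print([signals_pain(r) for r in readings])  # [False, False, True, True]
```

The point is that nothing about the subjectivity of pain prevents this device from addressing the task: the mapping from the measurable signal to the label is trivial.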

Instead: whether a task is difficult to address with automatic methods depends on the stability of content-based features across different target labels. 
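This dependence on feature stability can be made concrete with a toy sketch (my own construction, not drawn from any MediaEval task): the same simple classifier succeeds when the labels are stable with respect to a content feature, and falls toward chance when the labels are assigned independently of that feature.

```python
import random

random.seed(0)

def nearest_centroid_accuracy(features, labels):
    """Fit a one-feature nearest-centroid classifier and report its
    accuracy on the same data (sufficient for illustration)."""
    pos = [f for f, l in zip(features, labels) if l]
    neg = [f for f, l in zip(features, labels) if not l]
    c_pos, c_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    preds = [abs(f - c_pos) < abs(f - c_neg) for f in features]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

features = [random.uniform(0, 1) for _ in range(200)]

# Case 1: labels are stable w.r.t. the feature (feature above 0.5 -> True).
stable_labels = [f > 0.5 for f in features]

# Case 2: labels are assigned independently of the feature.
unstable_labels = [random.random() > 0.5 for _ in range(200)]

acc_stable = nearest_centroid_accuracy(features, stable_labels)
acc_unstable = nearest_centroid_accuracy(features, unstable_labels)
print(acc_stable, acc_unstable)  # stable case is high; unstable case near 0.5
```

The classifier and the data are identical in both cases; only the stability of the feature-label relationship changes, and with it the difficulty of the task.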

The whole point of machine learning is to generalize not only across obvious cases, but also across cases in which no stability of features is apparent to a human observer. If we stuck to tasks that "looked" easy to a researcher browsing through the data (exaggerating a bit for effect), we might as well handcraft rule-based recognizers. So my point 10 is to try to figure out a way to keep researchers from being scared off from tasks just because they are "subjective", without giving the matter a second thought. Multimedia research needs to tackle "subjective" tasks in order to make sure that it remains relevant to the real-world needs of users---once you understand subjectivity, you start to realize that it is actually all over the place.

In 2014, we noticed that the discussion of such themes was becoming more systematic, and that members of the MediaEval community were interested in having a venue in which they could publish their thoughts. For this reason, in 2015, we added a MediaEval Letters section to the MediaEval Working Notes Proceedings dedicated to short considerations of themes related to the MediaEval workshop. The Letter format allows researchers to publish their thoughts while they are still developing, even before they are mature enough to appear in a mainstream venue.

The concept of MediaEval Letters was described in the following paper, in the 2015 MediaEval Working Notes Proceedings:

Larson, M., Jones, G.J.F., Ionescu, B., Soleymani, M., Gravier, G. Recording and Analyzing Benchmarking Results: The Aims of the MediaEval Working Notes Papers. Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, online.

Look for MediaEval Letters to be continued in 2016.