Monday, December 21, 2015

Features, machine learning, and explanations

Selected Topics in Multimedia Computing is a master-level seminar taught at TU Delft. This post is my answer to a question from one of this year's students, asked in the context of our discussion of the survey paper on emotion recognition for music that he wrote for the seminar. I wrote an extended answer, since questions related to this one come up often, and my answer may be helpful for other students as well.

Here is the text of the question from the student's email:


I also have a remark about something you said yesterday during our Skype meeting. You said that the premise of machine learning is that we don't exactly know why the features we use give an accurate result (for some definition of accuracy :)), hence that sentence about features not making a lot of sense could be nuanced, but does that mean it is accepted in the machine learning field that we don't explain why our methods work? I think admitting that some of the methods that have been developed simply cannot be explained, or at least won't be for some time, would undermine the point of the survey. Also, in that case, it would seem we should at least have a good explanation for why we are unable to explain why the unexplained methods work so well (cf. Deutsch, The beginning of infinity, particularly ch. 3).

Your survey tackles the question of missing explanations. In light of your topic, yes, I do agree that it would undermine your argument to hold up methods that cannot be explained as valuable for moving the field forward.

My comment deals specifically with this sentence: "For some of the features, it seems hard to imagine how they could have a significant influence on the classification, and the results achieved do not outrank those of other approaches by much."

I'll start off by saying that a great deal more important than this specific sentence is a larger point that you are making in the survey, namely that meaningful research requires paying attention to whether what you think you are doing and what you are actually doing are aligned. You point out the importance of the "horse" metaphor from:

Sturm, B. L., "A Simple Method to Determine if a Music Information Retrieval System is a “Horse”," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1636-1644, Oct. 2014.

I couldn't agree more on that.

But here, let's think about the specific sentence above. My point is that it would help the reader if you expressed what you are thinking here more fully. If you put two thoughts into one sentence like this, the reader will jump to the conclusion that one explains the other. You want to avoid assuming (or implying that you assume) that the disappointing results could have been anticipated by choosing features that a priori could be "imagined" to have significant influence.

(Note that there are interpretations of this sentence — i.e., if you read "and" as a logical conjunction — that do not imply this. As computer scientists, we are used to reading code, and these interpretations are, I have the impression, relatively more natural to us than to other audiences. So it is safer to assume that in general your readers will not spend a lot of time picking the correct interpretation of "and", and need more help from your side as the author :-))

As a recap: I said that machine learning wouldn't be so useful if humans could look at a problem and tell which features should be used in order to yield the best performance. I don't want to go as far as claiming that this is the premise; in fact, I rather hope I didn't actually use the word "premise" at all.

Machine learning starts with a feature engineering step where you apply an algorithm to select features that will be used, e.g., by your classifier. After this step, you can "see" which features were selected. So it's not the case that you have no idea why machine learning works.
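As a concrete illustration (my own sketch, not something from the student's survey or a specific MIR system), here is what such a feature selection step might look like with scikit-learn on synthetic data; after fitting, you can "see" which features were selected and how they scored.

    # Minimal sketch, assuming scikit-learn and synthetic data for illustration.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))                                 # 200 examples, 10 candidate features
    y = (X[:, 3] + 0.1 * rng.normal(size=200) > 0).astype(int)     # only feature 3 carries signal

    selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
    print("selected feature indices:", selector.get_support(indices=True))
    print("per-feature scores:", np.round(selector.scores_, 2))

Inspecting the selected indices and their scores is exactly the point made above: the selection is automatic, but it is not a black box.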

My point is that you need to be careful about limiting your input to feature selection a priori. If you assume you yourself can predict which features will work, you will miss something. When you use deep learning, you don't necessarily do feature selection, but you do have the possibility of inspecting the hidden layers of the neural network, and these can shed some light on why it works.

This is not to say that human intelligence should not be leveraged for feature engineering. Practically, you do need to make design decisions to limit the possible number of choices that you are considering. Well-motivated choices will land you with a system that is probably also "better", also along the lines of Deutsch's thinking (I say "probably" because I have not read the book that you cite in detail).

In any case: careful choices of features are necessary to prevent you from developing a classifier that works well on the data set that you are working on because there is an unknown "leak" between a feature and the ground truth, i.e., for some reason one (or more) of the features is correlated with the ground truth. If you have such a leak, your methods will not generalize, i.e., their performance will not transfer to the case of unseen data. The established method for preventing this problem is ensuring that you carry out your feature engineering step on separate data (i.e., separate from both your training and your test sets). A more radical approach that can help when you are operationalizing a system is to discard features that work "suspiciously" well. A good dose of common sense is very helpful, but note that you should not try to replace good methodology and feature engineering with human intelligence (which I mention for completeness, and not because I think you had any intention in this direction).
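A minimal sketch of the "separate data" idea, again my own illustration with scikit-learn and synthetic data rather than anything from the survey: the feature engineering step is fit only on its own split, and the training and test sets never influence which features are chosen.

    # Minimal sketch, assuming scikit-learn; three disjoint splits to avoid leakage.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=600, n_features=20, random_state=0)

    # One split for feature engineering, one for training, one for testing.
    X_fe, X_rest, y_fe, y_rest = train_test_split(X, y, test_size=0.66, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    selector = SelectKBest(score_func=f_classif, k=5).fit(X_fe, y_fe)   # fit on the FE split only
    clf = LogisticRegression(max_iter=1000).fit(selector.transform(X_train), y_train)
    print("held-out accuracy:", clf.score(selector.transform(X_test), y_test))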

It is worth pointing out that there are plenty of problems out there that can indeed be successfully addressed by a classifier that is based on rules that were defined by humans, "by hand". If you are trying to solve such a problem, you shouldn't opt for an approach that would require more data or computational resources, merely for the sake of using a completely data-driven algorithm. The issue is that it is not necessarily easy to know whether or not you are trying to solve such a problem.

In sum, I am trying to highlight a couple of points that we tend to forget when we are using machine learning techniques: You should resist the temptation to look at a problem and declare it unsolvable because you can't "see" any features that seem like they would work. A second, related temptation that you should resist is using sub-optimal features because you have made your own a priori assumptions about what the best features must be.

A few further words on my own perspective:

There are algorithms that are explicitly designed to make predictions and create human-interpretable explanations simultaneously. This is a very important goal for intelligent systems that are being used by users who don't have the technical training to understand what is going on "under the hood."

Personally, I hold the rather radical position that we should aspire to creating algorithms that are effective, yet so simple that they can be understood by anyone who uses their output. The classic online shopping recommender "People who bought this item also bought ..." is an example that hopefully convinces you such a goal is not completely impossible. A major hindrance is that we may need to sacrifice some of the fun and satisfaction we derive from cool math.
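To show how little machinery such a recommender needs, here is a sketch based on plain co-occurrence counting over hypothetical purchase baskets (the item names and data are invented for illustration, and real systems add normalization and scale).

    # Minimal sketch: "people who bought this item also bought ..." via co-occurrence counts.
    from collections import Counter, defaultdict
    from itertools import combinations

    baskets = [                                   # hypothetical purchase histories
        {"camera", "memory card", "tripod"},
        {"camera", "memory card"},
        {"camera", "lens"},
        {"tripod", "lens"},
    ]

    co_counts = defaultdict(Counter)              # item -> items bought together with it
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1

    # Recommend the items most often bought together with "camera".
    print(co_counts["camera"].most_common(2))

The output of such a system can be explained in one sentence to anyone who uses it, which is exactly the appeal.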

Stepping back yet further:

Underlying the entire problem is the danger that the data you use to train your learner has properties that you have overlooked or did not anticipate, and that your resulting algorithm gives you, well, let me just come out and call it "B.S.", without your realizing it. High-profile cases get caught (http://edition.cnn.com/2015/07/02/tech/google-image-recognition-gorillas-tag/). However, these cases should also prompt us to ask the question: Which machine learning flaws go completely unnoticed?

Before you start even thinking about features, you need to explain the problem that you are trying to address, and also explain how and why you chose the data that you will use to address it.

Luckily, it's these sorts of challenges that make our field so fascinating.

Friday, October 30, 2015

Compressing a complicated situation: Making videos that support search

Mohammad Soleymani asked me to make a video for ASM 2015, the Workshop on Affect and Sentiment in Multimedia, held at ACM Multimedia 2015 on 30 October in Brisbane, Australia. He wanted the workshop to present how disciplines outside of computer science view sentiment. Since my PhD is in linguistics (semantics/syntax interface), I was the natural choice for "The Linguist". The result of my efforts was a video entitled "A Linguist Talks about Sentiment".

In this post, I briefly discuss my thoughts upon making this video. I wanted to give viewers with no linguistics background the confidence they needed in order to attempt to understand semantics, as it is studied by linguists, and to leverage these insights in their work. Ultimately, such a "support for search" video has the goal of addressing people starting from the background that they already have, and giving them just enough knowledge to support them in searching for more information themselves.


Mohammad gave me four minutes of time for the video, and I pushed it beyond the limit: the final video runs over six minutes. I realized that what I needed to do was not to convey all possible relevant information, but rather to show where a complicated situation is hiding behind something that might seem simple at first glance. The effect of the video is to convince the viewer that it's worth searching for more information on linguistics, and to give just enough terminology to support that search.

My original motivation to make the video was a strong position that I hold: At one level, I agree that anyone can study anything that they want. However, without a precisely formulated, well-grounded definition of what we are studying, we are in danger of leaving vast areas of the problem unexplored, and of prematurely declaring the challenge a solved problem.

After making this video, I realized that one can consider "support for search" videos a new genre of videos. These videos allow people to get their foot in the door, and provide the basis for search. A good support-for-search video needs to address specific viewer segments "where they are", i.e., given their current background knowledge. It must simplify the material without putting people off on completely the wrong track. Finally, it must admit this simplification, so that viewers realize that there is more to the story.

When I re-watched my video after the workshop, I found a couple of places that made me cringe. Would a fellow linguist accept these simplifications, or did they distort too much? I make no mention of the large range of different psychological verbs, or the difference between syntactic and semantic roles. I put a lot of emphasis on the places where I myself have noticed that people fail to understand. On the whole, the goal is that the video allows the viewer to go from a state of being overwhelmed by a complex topic to having the handholds necessary in order to formulate information needs and support search.

Are "support for search" videos a new genre on the rise? If this "genre" is indeed a trend, it is a welcome one.

While browsing recently, I hit on a video entitled "Iraq Explained -- ISIS, Syria and War":


This video assumes a particular (small) amount of background knowledge on the part of the viewer, and focuses on introducing the major players and a broad sketch of the dynamics. There are points in the video where I find myself saying "Well, OK" (the choice of music gives the impression of things happening with a sense of purpose that I do not remember from the time). This video is clearly a "support for search" video, since it ends with the following information:

"We did our best to compress a complicated situation in a very short video, so please don’t be mad that we had to simplify things. There is sooo much stuff that we couldn’t talk about in the detail it deserved…But if you didn’t know much about the situation this video hopefully helps making researching easier."

On the whole, the video succeeds in the goal of giving viewers the information that they need to start sorting out a complicated situation. It gives them a picture of what they don't yet understand, so that they can then start looking for more information.

Tuesday, September 22, 2015

CrowdRec 2015 Workshop Panel Discussion: Crowdsourcing in Recommender Systems

The CrowdRec 2015 Workshop on Crowdsourcing and Human Computation for Recommender Systems was held this past Saturday at ACM RecSys 2015 in Vienna, Austria. The workshop ended with a fish bowl panel on the topic of the biggest challenge facing the successful use of crowdsourcing for recommender systems. I asked the panelists to take a position on the nature of this challenge: was it related to algorithms, engineering, or ethics? During the panel, the audience took crowdsourced notes about the panel on TitanPad.

After the workshop I received two comments that particularly stuck in my mind. One was that I should have told people that if they contributed to the TitanPad notes, I would write a blog post summarizing them. I was happy that at least someone thought that a blog post would be a good idea. (I hadn't considered that having one's thoughts summarized in my blog would be a motivational factor.) The other comment was that the panel question was not controversial enough to spark a good discussion.

In response to these two comments, here is a summary/interpretation of what was recorded in the crowdsourced notes about the panel.
  • The biggest challenge of crowdsourcing is to design a product in which crowdsourcing adds value for the user. Crowdsourcing should not be pursued unless it makes a clear contribution.
  • The way that the crowd uses a crowdsourcing platform, or a system that integrates crowdsourcing, is essential. Here, engagement of the crowd is key, so that they are "in tune" with the goals of the platform, and make a productive contribution.
  • The biggest challenge is the principle of KYC. Here, instead of Know Your Client, this is Know Your Crowd. There are many individual and cultural differences between crowdmembers that need to be taken into account.
  • The problem facing many systems is not the large amount of data, but that the data is unpredictably structured and inhomogeneous, making it difficult to ask the crowd to actually do something with it.
  • With human contributors in the crowd, people become afraid of collusion attacks that go against the original, or presumed, intent of a platform. A huge space for discussion (which was not pursued during the panel) opens up about who has the right to decide on the "right" and "wrong" ways to use a platform.
  • Crowdwork can be considered people paying with their time: We need to carefully think about what they receive in return.
  • With the exception of this last comment, it seemed that most people on the panel found it difficult to say something meaningful about ethics in the short time that was available for the discussion.
In general, we noticed that there are still multiple definitions of crowdsourcing at play in the community. In the introduction to the workshop, I pointed out that we are most interested in definitions of crowdsourcing where crowdwork occurs in response to a task that was explicitly formulated. In other words, collecting data that was created for another purpose, rather than in response to a task asker, is not crowdsourcing in the sense of CrowdRec. It's not uninteresting to consider recommender systems that leverage user comments collected from the web. However, we feel that such systems fall under the heading of "social" rather than "crowd", and we reserve a special space for "crowd" recommender systems, which involve active elicitation of information. It seems that it is difficult to be productively controversial if we need to delineate the topic at the start of every conversation.

At this point, it seems that we are seeing more recommender systems that involve taggers and curators. Amazon Mechanical Turk, of course, came into being as an in-house system to improve product recommendations, cf. Wikipedia. However, it seems that recommender systems that actively leverage the input of the crowd still need to come into their own.

See also:

Martha Larson, Domonkos Tikk, and Roberto Turrin. 2015. Overview of ACM RecSys CrowdRec 2015 Workshop: Crowdsourcing and Human Computation for Recommender Systems. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys '15). ACM, New York, NY, USA, 341-342.

Saturday, August 22, 2015

Choosing movies to watch on an airplane? Compensate for context

For some people, an airplane is the perfect place to catch up on their movies-to-watch list. For these people there is no difference between sitting on a plane and sitting on the couch in their living room.

If you are one of these people, you are lucky.

If not, then you may want to take a few moments to think about what kind of a movie you should be watching on an airplane.

These are our two main insights on how you should make this choice:
  • Watch a movie that uses a lot of closeups (or relatively little visual detail), is well lit, and moves relatively slowly so that you can enjoy it on a small screen.
  • Watch something that is going to engage you. Remember that the environment and the disruptions on an airplane might affect your ability to focus, and, in this way, disrupt your ability to experience an empathetic relationship with the characters. In other words, unless the plot and characters really draw you in, the movie might not "work" in the way it is intended.
For an accessible introduction to how movies manipulate your brain, see this article from the Wired Cinema Science series:
http://www.wired.com/2014/08/how-movies-manipulate-your-brain
At the perceptual level, your brain needs to be able to exercise its ability to "stitch things together to make sense". It's plausible that this "stitching" also has to be able to take place at an emotional level. Certain kinds of distractors can be expected to simply get in the way of that happening as effectively as it is meant to.

When we began to study what kinds of movies people watch on planes, we used these two insights as a point of departure. We started with these insights after having made some informal observations about the nature of distractors on an airplane, which are illustrated by this video.



At the end of the video, we formulate the following initial list of distractors, which impact what you might want to watch on an airplane.
  1. Engine noise
  2. Announcements
  3. Turbulence
  4. Small screen 
  5. Glare on screen
  6. Inflight service
  7. Fellow travelers
  8. Kid next to you
In short, when choosing a movie to view on an airplane, you should pick one that can "compensate for context", meaning that you can enjoy it despite the distractors inherent in the situation aboard an airplane.

We are looking to expand this list of distractors as our research moves forward.

Our ultimate goal (still a long way off) is to build a recommender system that can automatically "watch" movies for you ahead of time. The system would be able to suggest movies that you would like, but above and beyond that the suggested movies would be prescreened to be suitable for watching on an airplane. Such a system would help you to quickly decide what to turn on at 30,000 feet, without worrying that half way through you will realize that it might not have been a good choice.

Since we started this research, I have been paying more and more attention to the experiences that I have with movies on a plane. Here are two.

  • On a recent domestic flight: The woman next to me started to watch Penny Dreadful. She turned it off about ten minutes in. I then also tried to watch it. I really like the show, but it's meant to be dark, gory, and mysterious. These three qualities turned into poorly visible, disconcerting, and confusing at 10,000 feet. This is what I am trying to capture in the video above.
  • On a recent Transatlantic flight: The man sitting next to me turned on his monitor and started The Color Purple, as if it were on his watch list. It's probably on most people's movies-to-watch list, so this seems like a safe choice. However, I was trying to review a paper, and was subject to over two hours of unavoidable glimpses of violence on a screen a few feet from my own. It's the emotional impact of these scenes that makes it a great movie. By the same token, you might not want to be watching it on a plane, especially if you are not going to experience the entire emotional arc. (The movie should be watched with full focus, from beginning to end.)

Until now, the work on context-aware movie recommender systems that I have encountered has recommended movies for situations that are part of what is considered to be "normal" daily life, e.g., watching movies during the week vs. on the weekend, watching movies with your kids vs. with your spouse. We need more recommender system work that will allow us to get to movies that are suitable for less ideal, more unpleasant, perhaps less frequent situations. Why waste a good movie by watching it in the wrong context? And why suffer any more than necessary while on an airplane?

The work is being carried out within the context of the MediaEval Benchmarking Initiative for Multimedia Evaluation, see Context of Experience 2015. It owes a lot to the CrowdRec project, which pushes us to understand how we can make recommenders better by asking people explicitly to contribute their input.

Wednesday, July 29, 2015

Google Scholar: Sexist or simply statistical?

This is the first of what I intend to be a short series of posts related to my experience with Google Scholar, and to a phenomenon that I call "algorithmic nameism", an unfortunate danger of big data. Here, I describe what appears to be happening, on Google Scholar, to references to certain journal papers that I have co-authored, and briefly discuss why we should be concerned, not for me personally, but in general.


I am currently an Assistant Professor in computer science at Delft University of Technology, in Delft, the Netherlands, and at this moment also a visiting researcher at the International Computer Science Institute in Berkeley, California. Like many in my profession, I maintain a Google Scholar profile, and rely on the service as a way of communicating my publication activities to my peers, and also of keeping up with important developments in my field. Recently, I clicked the "View and Apply Updates" link (pictured above) in my Google Scholar profile, to see if Google Scholar had picked up on several recent publications, and was quite surprised by what I discovered.

For those perhaps not familiar with Google Scholar, a word of background information. On the update page for your profile, Google Scholar supplies a list of suggested edits to the publication references in your profile. As profile owner, you can then choose to accept or discard them individually. 

In the list, I was very surprised to find the following suggested edit for a recent publication on which I am co-author:

In short, Google Scholar is suggesting that my first name "Martha" be changed to "Matt". 

It is not an isolated case. Currently, in my suggested edits list, there are suggestions to change my name to "Matt" for a total of four papers that I have co-authored, all in IEEE Transactions on Multimedia. 

Part of my specialization within the field of computer science is information retrieval. For this reason, I have insight into the probable reasons for which this might be happening, even without direct knowledge of the Google Scholar edit suggestion algorithm. The short story is that Google Scholar appears to be using a big data analytics algorithm to predict errors, and suggest edits. But it is clearly a case in which "big data" is "wrong data". I plan to delve into more detail in a future post.

Here, I would just like to state why we should be so concerned:

Suggested edits on the "View and Apply Updates" page find their way into the Google Scholar ranking function, and affect whether or not certain publications are found when people search Google Scholar. I have not, to my knowledge, ever clicked the "Edit article" link that would accept the suggestion to change my name to Matt in the reference to one of my publications. However, the Google Scholar search algorithm has apparently already integrated information about "Matt Larson".

Currently, if you go to Google Scholar and query "predicting failing queries matt larson", my paper comes up as the number one top-ranked result.


However, if you query "predicting failing queries martha larson", this paper can only be found in the sixth position on the second page. (It is the bottom reference in this screenshot of the second page. I have put a red box around Page 2.)


Different people say different things about the importance of having a result on the first page of search results. However, you don't have to be literally of the first-page school (i.e., you don't have to believe "When looking something up on Google, if it's not on the first page of search results then it doesn't exist and my journey ends there.") to imagine that my paper would be more readily found if my name were Matt. (For brevity's sake, I will just zip past the irony of discussing what is basically a failed query for a paper which is actually about failing queries.)

I myself search for people (and their publications) on Google Scholar for a range of reasons. For example, last year I was appointed as an Associate Editor of IEEE Transactions on Multimedia. I search for people on Google Scholar in order to see if they have published the papers that would qualify them to review for the journal.

At the moment, my own IEEE Transactions papers seem to be pushed down in the ranking because Google Scholar is confused about whether my name should actually be "Matt". In general, however, Google Scholar does a good job. I don't research lung cancer (second result on Page 2, as shown above), but otherwise it can be seen from the results list above that Google Scholar generally "knows" that I am me. My profile page does not have any of the shortcomings of the search results that I am aware of.

I am someone with an established career, with tenure and relatively many publications. I have no problem weathering the Matt/Martha mix-up.

However: Imagine someone who was at the fragile beginning of her career!

Having IEEE Transactions publications appearing low in her results (compared to her equally well-published colleagues) could make the difference between being invited to review or not. Or, goodness forbid, a potential employer browses her publications to determine whether she is qualified for a job, and misses key publications.

I'll conclude with what is without doubt an unexpected statement: It would be somehow positive if the Matt/Martha mix-up were a case of sexism. If it were an "anti-woman" filter programmed into Google Scholar, the solution would be simple. The person/team responsible could be fired, and we could all get on with other things. However: With extremely high probability there are no explicitly "anti-woman" designs here. Although the example above looks for all the world like sexism, at its root it is most probably not. The odds are that the algorithm behind the suggestions to edit "Martha" to "Matt" has no knowledge of my gender whatsoever, and the discrimination is therefore not directly gender based.

The Matt/Martha mix-up is actually more worrisome if it is not sexism. The more likely case is that this is a new kind of "ism" that has the danger of going totally under the radar. It is not related specifically to gender, but rather to cases that are statistically unusual, given a particular data collection. It is the kind of big data mistake that can potentially disadvantage anyone if big data is applied in the wrong way.

Whether sexist or "simply statistical", we need to take it seriously.

An immediate action that we can take is to realize that we should not trust Google Scholar blindly.

Sunday, July 19, 2015

Teaching the First Steps in Data Science: Don't Simplify Out the Essentials

Teachers of Data Science are faced with the challenge of initiating students into a new way of thinking about the world. In my case, I teach multimedia analysis, which combines elements of speech and language technology, information retrieval and computer vision. Students of Data Science learn that data mining and analysis techniques can lead to knowledge and understanding that could not be gained from conventional observation, which is limited in its scope and ability to yield unanticipated insights.

When you stand in front of an audience that is being introduced to data science for the first time, it is very tempting to play the magician. You set up the expectations of what "should be" possible, and then blow them away with a cool algorithm that does the seemingly impossible. Your audience will go home and feel that they got a lot of bang for their buck---they have witnessed a rabbit being pulled from a hat.

However: will they be better data scientists as a result?

In fact, if you produce a rabbit from a hat, your audience has not been educated at all; they have been entertained. Worst case, they have been un-educated, since the success of the rabbit trick involves misdirection of attention away from the essentials.

My position is that when teaching the first steps in data science, it is important not to simplify out the essentials. Here, two points are key:

First, students must learn to judge the worth of algorithms in terms of the real-world applications that they enable. With this I do not mean to say that all science must be applied science. Rather, the point is that data science does not exist in a vacuum. The data originally came from somewhere; it is connected to something that happened in the real world. Ultimately, the analysis of the data scientist must be relevant to that "somewhere", be it a physical phenomenon or a group of people.

Second, students must learn the limitations of the algorithms. Understanding an algorithm means also understanding what it cannot be used for, where it necessarily breaks down.

At a magic show, it would be ridiculous if a magician announced that his magic trick is oriented towards the real-world application of creating a rabbit for rabbit soup. And no magician would display alternative hats from which no rabbit could possibly be pulled. And yet, as data science teachers, this is precisely what we need to do. It is essential that our students know exactly what an algorithm is attempting to accomplish, and the conditions that cause failure.

Yesterday was the final day of the Multimedia Information Retrieval Workshop at CCRMA at Stanford, and Steve Tjoa gave a live demo of a simple music identification algorithm. It struck me as a great example of how to teach data science. As workshop participants, we saw that the algorithm is tightly connected to reality (it was identifying excerpts of pieces that he played right there in the classroom on his violin), and his demo showed its limitations (it did not always work).

This exposition did not simplify out the essentials. Students experiencing such a live demo learn the algorithm, but they also learn how to use it and how to extend it.

We were blown away not so much by the cool algorithm, but by the fact that we really grasped what was going on.

Experiences like this are solid first steps for data science students, and will lead to great places.


Postscript:

That evening, one of my colleagues asked me if I still wrote on my blog. No, I said, I had a bit of writer's block. I had been trying to write a post on Jeff Dean's keynote at ACM RecSys 2014, "Large Scale Machine Learning for Predictive Tasks", and failing miserably. The keynote troubled me, and I was attempting to formulate a post that could constructively explain why. Ten months passed.

With the example of Steve's live demo, it became clear what my main problem with the keynote was. It contained nothing that I could demonstrate was literally wrong. It was simply a huge missed opportunity.

Since ACM RecSys is a recommender system conference, many people in the room were thinking about natural language processing and computer vision problems for the first time. The keynote did not connect its algorithms to the source of the data and possible applications. Afterwards, the audience was none the wiser concerning the limitations of the algorithms it discussed.

I suppose some would try to convince me that when listening to a keynote (as opposed to a lecture) I need to stop being a teacher, and go into magic-watching mode, meaning that I would suspend my disbelief. "That sort of makes sense, it looks pretty good," Dean said to wrap up his exposition of paragraph vectors of Wikipedia articles.

https://www.youtube.com/watch?v=Zuwf6WXgffQ&feature=youtu.be&t=4m20s

If you watch at the deep link, you see that he would like to convince us that we should be happy because the algorithm has landed music articles far away from computer science. 

In the end, I can only hope that the YouTube video of the keynote is no one's first steps in data science.

Independently of a particular application, landing music far away from computer science is also just not my kind of magic trick.

Friday, February 27, 2015

The Dress: View from the perspective of a multimedia researcher

For those of us who are active as researchers in the area of multimedia content analysis, the fury over the color of "The Dress" drives to the heart of our scientific interests. Multimedia content analysis is the science of automatically assigning tags and descriptions to images and videos based on techniques from signal processing, pattern recognition, and machine learning. For years, the dominant assumption has been that the important things that people see in images on the Internet can be characterized by unambiguous descriptions, and that technology should therefore attempt to also predict unambiguous labels for images and videos.

This assumption is convenient if you are automatically predicting a description for an image, because your technology only needs to generate a single description. Once you have predicted your description, you only need to ask one person whether or not your prediction is correct.

However, convenient is not always useful to users. As multimedia analysis gets more and more sophisticated, it's no longer necessary to go with convenient, and we can start to try to automatically describe images in the way that people see them (i.e., both the "white and gold" and the "blue and black" camps that characterize "The Dress" debate).

We recently wrote a chapter in a book on computer vision entitled "Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content" [1]. Wrapped up into a single sentence, our main point was: "It's complicated", and multimedia research must embrace that complexity head on. In the chapter, we make a plea for research on "multimedia descriptions involving a complex interpretation". Here's how we defined it:
Multimedia description involving a complex interpretation: A description of an image or a video that is acceptable given a particular point of view. The complex interpretation is often accompanied by an explanation of the point of view. It is possible to question the description by offering an alternative explanation. It does not make sense to reference a single, conventionally accepted external authority.
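One way to picture this definition (my own sketch, not from the chapter) is as a record that stores several acceptable descriptions, each paired with the point of view that explains it, instead of a single authoritative label. The example explanations below are invented for illustration.

    # Minimal sketch: a label that carries its point of view, so one image can hold
    # several acceptable, explained interpretations.
    from dataclasses import dataclass

    @dataclass
    class Interpretation:
        description: str      # e.g., "white and gold"
        explanation: str      # the point of view under which the description is acceptable

    # Both interpretations are kept; neither is resolved by an external authority.
    dress_image_labels = [
        Interpretation("white and gold", "viewer assumes the photo is washed out by bluish light"),
        Interpretation("blue and black", "viewer assumes the photo is overexposed under warm light"),
    ]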
Some people look at an image and see one thing, some people look at an image and see another thing, and that is normal and ok. In the case of "The Dress", the debate quickly moved from what people saw when they looked at the image, to what they saw when looking at the actual dress of which the image was taken. Looking at the image, and looking at the object can give people two different impressions. That also is normal and ok.

If you accept that the opinion about the color of "The Dress" is decided by referencing the actual real-world dress as an external authority, then you have indeed solved the problem. Wired, for example, does this (http://www.wired.com/2015/02/science-one-agrees-color-dress). The New York Times compares the image to other images of the real-world dress (http://www.nytimes.com/2015/02/28/business/a-simple-question-about-a-dress-and-the-world-weighs-in.html).

However, from the perspective of multimedia research, the interesting point about "The Dress" is that it is not a debate about a real-world dress, but rather about an image of that dress. It is not possible in every case in which we analyze the content of a photo to go find and inspect the real-world object. Most photos on the Internet are just photos in and of themselves, and we interpret them without direct knowledge of, or connection to, the real-world situation in which they were taken.

Our perceptual and cognitive lives as human beings are rich and interesting. We stretch ourselves, and grow in our intellectual and emotional capacity, when we discover that not everyone sees things from the same point of view. The lesson of "The Dress" for multimedia research is that we should embrace the ambiguity of images.

If we don't see how important ambiguity is to our relationship to the world and each other, we endanger the richness in our lives. Specifically, the danger is that new image search technologies, such as that used by Google, will start providing us with a single unique answer. The fury over "The Dress" illustrates that faithfulness to how people interpret images requires that there are two answers.

[1] Larson, M., Melenhorst, M., Menéndez, M., and Xu, P. Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content. In: Ionescu, B. et al. Fusion in Computer Vision – Understanding Complex Visual Content, Springer, pp. 229-269, 2014.

Saturday, January 24, 2015

The making of a community survey: Contrastive conditions and critical mass in benchmarking evaluation

Each year, the MediaEval Multimedia Benchmark offers a set of challenges to the research community involving interesting new problems in multimedia. Each challenge is a task consisting of a problem description, a data set, and an evaluation metric.

The tasks are each organized independently, each by a separate group of task organizers, and each focuses on developing solutions to a very different problem. However, they are held together by the common theme of MediaEval: social and human aspects of multimedia. A task has a human aspect if the variation in people's interpretations of multimedia content, including dependencies on context and intent, is not considered variability that must be controlled, but rather part of the underlying problem to be solved. A task has a social aspect if it develops technology that supports people in developing and communicating knowledge and understanding using multimedia content.

In addition to the human and social aspects, MediaEval tasks are united by the common goal of moving forward the state of the art in multimedia research. To this end, they strive to achieve both qualitative and quantitative insight into the algorithms that are designed by participating teams to address the challenges. We can call qualitative insight "what works" and quantitative insight "how well it works".

How well an algorithm works must necessarily be measured against something. Most obviously, an algorithm works well if the people who actually have the problem that lies at the root of the task agree that the algorithm solves the problem. These people are referred to as the "problem holders" or "stakeholders"; they are usually a company, or, very often, a set of end users of the multimedia technology. In evaluation campaigns such as MediaEval, the formulation of the problem is represented by the data set and the problem definition. The stakeholders' opinion of what constitutes a solution is represented by the ground truth (i.e., the reference labels for the data set) and the evaluation metric.
In a living-lab setup for algorithm evaluation, both the data set and the ground truth are streams, and move closer to actually instantiating the problem rather than representing it. However, we are always aiming directly at understanding whether one algorithm can indeed be considered to give better performance than another, i.e., whether it advances the state of the art.

In order to be fairly and meaningfully compared, two algorithms must represent "contrastive conditions". This means that there is one, constrained respect in which they differ from each other. If there are two or more major differences between two algorithms, then it's unclear why one performs better than the other. In real life, we might not care why, and simply choose the better-performing algorithm. However, if we take the time to investigate contrastive conditions, then we can isolate "what works" from "what doesn't work" and ultimately answer questions like "Have I gotten close to the ceiling of the best possible performance that can be achieved on this challenge?", and "Which types of solutions are just not worth pursuing further?". Such questions also have a key contribution to make for algorithms used in operational settings.
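To make the idea concrete, here is a minimal sketch (my own illustration with scikit-learn and synthetic data, not an actual MediaEval task) of two runs that form a contrastive condition: they differ in exactly one respect, the feature set, and are scored with the same metric on the same data.

    # Minimal sketch of a contrastive condition: identical pipeline and metric,
    # differing only in which feature groups are used.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=12, random_state=1)

    runs = {
        "feature group A only": X[:, :6],     # stand-in for, e.g., visual features
        "feature groups A + B": X,            # one extra feature group, everything else identical
    }
    for name, features in runs.items():
        score = cross_val_score(LogisticRegression(max_iter=1000), features, y, cv=5).mean()
        print(f"{name}: mean accuracy = {score:.3f}")

Because only one thing changes between the runs, any difference in the scores can be attributed to that one change.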

Each year, MediaEval publishes a survey with a long list of questions to be answered by the community. The MediaEval survey is key in ensuring that the work of the teams participating in the challenges gives rise to contrastive conditions.
  • The benchmark organizers can determine whether or not there is a minimum number of people in the research community interested in the task, who would like to participate.
  • The task organizers can make contact with "core participants", teams that declare their intention to participate in the task, including submitting runs and writing the working notes paper, "no matter what". Core teams allow us to ensure that there is a critical mass for any given task, and a higher chance of contrastive conditions.
  • The task organizers can determine which "required runs" people might be interested in, and adapt the design of the task accordingly. A "required run" is an algorithm that uses certain sources of data, but that might differ in its underlying mechanisms. By deciding on required runs, the community also decides on which aspects of the task it is important to be able to investigate contrastive conditions.


The MediaEval survey is notoriously difficult to prepare. Each year, a large number of different tasks are proposed, and each task has its own particular questions. 

The descriptions of the tasks are quite challenging to write. MediaEval tasks are planned with a low entry threshold. This means that new groups are able to step into a task and very easily come up to speed. In other words, the newbie teams participating in MediaEval have a fair chance with respect to teams that have participated in past years. The task descriptions must include the technical depth necessary to elicit detailed information from potential participants, but they cannot be formulated in the task-specific "jargon" or shorthand that MediaEval participants use among themselves.

Also, the survey must be set up in a way that people can quickly answer a great number of questions for all tasks. Although in the end teams participate in only one, or perhaps two, tasks, the design of the tasks is made better if people with a general interest in, and knowledge of, multimedia research can give their opinion and feedback on as many tasks as possible.

The MediaEval 2015 survey is about to appear. At the moment, we are at 121 questions and counting. It would take a lot less time just to make a top-down decision on which tasks to run, and how to design these tasks. However, over the years we have learned how critical the survey is: the survey input allows MediaEval tasks each year to maximize the amount of insight gained per effort invested. 

We very much appreciate everyone who participates in the survey, and helps to build a highly effective benchmark, and a productive benchmarking community.

Thursday, January 1, 2015

Pick-me-up pixels: Reflections on "happy" in the new year


Yes, there's the holiday season, but a lot needs to happen during that time to make sure that 2015 goes smoothly. 

Late at night, recently, I was grinding my teeth about late reviews. I was worried about me being late in reviewing for other people, and other people being late reviewing for me. In general, I was feeling like we were all getting behind before we had even started the new year.

A colleague who knew I was fretting sent me an encouraging email with some beautiful snow pictures. My favorite one is this one.

The moment I clicked open the .jpgs of the pictures attached in the email, magical scenes from far away shifted my mind into a state of wonder, and then joy. 

I noticed that if I stop to reflect on how it feels, the effect of an image, a bunch of numbers representing five million pixels, is physically tangible. The experience of looking at a picture like this one delivers the same pick-me-up as a cold lemonade in the hot summer, a stunning cityscape lit up at night, the sound of waves washing over rocks, or a purring cat in my lap on a long evening. 

Goodness knows I have spent enough time reading, writing, and reviewing papers about the affective impact of multimedia, and how it can be predicted by crunching pixels. But now, looking at the photos that my colleague sent, it struck me how real that impact is. As multimedia researchers, we may not be medical doctors, but we do have the responsibility of developing technologies with the power to make people feel better.

It also hit home that the impact goes beyond the pixels. A good part of the effect is knowing that someone realized I was glum, and also the thought that ultimately I might have a chance to visit the place where the picture was taken.

It's interesting that the picture came via email. The effect of social multimedia doesn't require a social networking platform. Given a camera, and a display device, people will exchange pictures. The existence of Facebook helps, but is not necessary...and by similar reasoning the practice of sharing pictures will survive social networks in the form that we know them today.

I imagine that the two people in this picture have also just taken a picture of the snowy trees in the lamplight and are pausing to examine it together on a mobile device. Their exchange of thoughts might lead them to discover that they are connected by their reactions to the beauty of the experience. 

Making images together leads us to share thoughts about our ways of seeing things that we might otherwise be tempted to disregard as irrelevant or not worth further time. Whether we are moved by our similarities, or take delight in unexpected differences in our perspectives, it is a connection that might have been missed without the mediation of a moment of collaborative picture making.

Ideally, the impact of social pixels would be a positive one without exception. The couple in the picture has captured not only pixels, but also the memory of a moment, that they will be able to relive long after the snow has melted.

But we can't know for sure. The moment may be so precious, that it is overwhelming to look back on it. Emotional overload is clearly a danger in the case of heartbreak, but even if our couple is destined to live happily-ever-after, nostalgia can be a burden. 

If no single moment is overpowering, a mass of memories might still be unbearable. The image might represent one of so many moments, that reliving them each would be an exhausting and numbing experience. 

Ultimately, as users who produce and consume multimedia content, we need systems that allow us to save and find the right content, and also the right amount of content. 

We need to be open to the possibility that maybe these systems are not intelligent systems, but rather utterly simple-minded and transparent systems that just happen to be incredibly good at supporting us, as "human users", in saving and finding the right multimedia for each other.

My pick-me-up moment caused by the snow images passes quickly, and the more usual train of thoughts clicks in again:

I start wondering what those things in the middle of the path are. Are they air ducts? Are they bee hives? How can I find out? Will I notice them if I get there? Are my Photoshop skills good enough to get rid of them? Would this make a better picture?

And then, I am struck by the thought that my mood was lifted by the pictures, but it would really be lifted if the people I am counting on would finish reviews! Which means I should also get back to mine.

The ability of multimedia to relax and revive our inner being is subtle and fleeting. Blink and you could miss it. We feel it, but we take it for granted, and our minds quickly move to other things. We forget its role in maintaining our inner balance, and our balance with the world and each other. Without this delicate equilibrium of our affective states, we would derail....produce no more papers, invent no more cool systems.

And so for 2015, I will continue to devote effort to understanding what people see in pictures, but I aspire to also remember the power that shared pixels have to lift our spirits.