Friday, February 27, 2015

The Dress: View from the perspective of a multimedia researcher

For those of us who are active as researchers in the area of multimedia content analysis, the furor over the color of "The Dress" goes to the heart of our scientific interests. Multimedia content analysis is the science of automatically assigning tags and descriptions to images and videos based on techniques from signal processing, pattern recognition, and machine learning. For years, the dominant assumption has been that the important things that people see in images on the Internet can be characterized by unambiguous descriptions, and that technology should therefore attempt to predict unambiguous labels for images and videos.

This assumption is convenient if you are automatically predicting a description for an image, because your technology only needs to generate a single description. Once you have predicted your description, you only need to ask one person whether or not your prediction is correct.

However, what is convenient is not always useful to users. As multimedia analysis gets more and more sophisticated, it is no longer necessary to settle for convenient, and we can start to try to automatically describe images in the way that people actually see them (i.e., covering both the "white and gold" and the "blue and black" camps that characterize "The Dress" debate).
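To make the contrast concrete, here is a minimal sketch of the difference between collapsing annotations into one label and keeping every interpretation that a meaningful share of viewers holds. It is an illustration only, not the method from our chapter: the function names, the `min_share` threshold, and the vote counts are all made up for this example.

```python
from collections import Counter


def majority_label(annotations):
    """The convenient approach: collapse all annotations into one label."""
    return Counter(annotations).most_common(1)[0][0]


def supported_labels(annotations, min_share=0.2):
    """Keep every label held by at least `min_share` of the annotators.

    `min_share` is a hypothetical threshold chosen for illustration.
    Returns a dict mapping each surviving label to its share of the votes.
    """
    if not annotations:
        return {}
    counts = Counter(annotations)
    total = len(annotations)
    return {
        label: count / total
        for label, count in counts.items()
        if count / total >= min_share
    }


# Invented votes, not real survey data.
votes = ["white and gold"] * 6 + ["blue and black"] * 4

single = majority_label(votes)    # one answer, the other camp disappears
several = supported_labels(votes)  # both camps survive, with their shares
```

With these invented votes, the majority approach reports only "white and gold", while the second function keeps both descriptions together with the share of people who gave each one.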

We recently wrote a chapter in a book on computer vision entitled "Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content" [1]. Wrapped up into a single sentence, our main point was: "It's complicated", and multimedia research must embrace that complexity head on. In the chapter, we plead for research on "multimedia descriptions involving a complex interpretation". Here's how we defined it:

Multimedia description involving a complex interpretation: A description of an image or a video that is acceptable given a particular point of view. The complex interpretation is often accompanied by an explanation of the point of view. It is possible to question the description by offering an alternative explanation. It does not make sense to reference a single, conventionally accepted external authority.

Some people look at an image and see one thing, some people look at an image and see another thing, and that is normal and ok. In the case of "The Dress", the debate quickly moved from what people saw when they looked at the image to what they saw when looking at the actual dress of which the image was taken. Looking at the image and looking at the object can give people two different impressions. That also is normal and ok.

If you accept that the color of "The Dress" is decided by referencing the actual real-world dress as an external authority, then you have indeed solved the problem. Wired, for example, does this. The New York Times compares the image to other images of the real-world dress.

However, from the perspective of multimedia research, the interesting point about "The Dress" is that it is not a debate about a real-world dress, but rather about an image of that dress. When we analyze the content of a photo, it is not always possible to go and find the real-world object and inspect it. Most photos on the Internet are just photos in and of themselves, and we interpret them without direct knowledge of or connection to the real-world situation in which they were taken.

Our perceptual and cognitive lives as human beings are rich and interesting. We stretch ourselves, and grow in our intellectual and emotional capacity, when we discover that not everyone sees things from the same point of view. The lesson of "The Dress" for multimedia research is that we should embrace the ambiguity of images.

If we don't see how important ambiguity is to our relationship to the world and each other, we endanger the richness in our lives. Specifically, the danger is that new image search technologies, such as those used by Google, will start providing us with a single unique answer. The furor over "The Dress" illustrates that faithfulness to how people interpret images sometimes requires more than one answer: here, two.

[1] Larson, M., Melenhorst, M., Menéndez, M. and Xu, P. Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content. In: Ionescu, B. et al. Fusion in Computer Vision – Understanding Complex Visual Content, Springer, pp. 229-269, 2014.