Back in June, I gave a talk at the Communication Science Department here at Radboud University Nijmegen. Today, I presented a version of that talk to my colleagues in the Language and Speech Technology Research Meeting. The abstract is below together with the slides, which are on SlideShare. During the discussion it became clear that many problems in natural language processing and information retrieval face the issue of human interpretation. It is important to find ways to move forward, even though it may not be possible to pack our challenges into neat classification or ranking problems with a single set of consensus ground truth labels. One way forward is to look to other disciplines for theories of how people understand and use media, and to let these inform what we design our systems to do and how we measure success.
Within computer science, "Multimedia" is a field of research that investigates how computers can support people in communication, information finding, and knowledge/opinion building. Multimedia content is defined broadly. It includes not only video, but also images accompanied by text and other information (for example, a geo-location). It can be professionally produced, or generated by users for online sharing. Computer scientists historically have a “love-hate” relationship with multimedia. They “love” it because of the richness of the data sources and the wealth of available data, which leads to interesting problems to tackle with machine learning. They “hate” it because multimedia is a diffuse and moving target: the interpretation of multimedia differs from person to person, and changes over time in the course of its use as a communication medium. This talk gives a view onto ongoing research in the area of multimedia information retrieval algorithms, which help people find multimedia. We look at a series of topics that reveal how pattern recognition, text processing, and crowdsourcing tools are used in multimedia research, and discuss both their limitations and their potential.
Wednesday, February 8, 2017
Friday, February 27, 2015
The Dress: View from the perspective of a multimedia researcher
For those of us who are active as researchers in the area of multimedia content analysis, the furor over the color of "The Dress" goes to the heart of our scientific interests. Multimedia content analysis is the science of automatically assigning tags and descriptions to images and videos based on techniques from signal processing, pattern recognition, and machine learning. For years, the dominant assumption has been that the important things that people see in images on the Internet can be characterized by unambiguous descriptions, and that technology should therefore also attempt to predict unambiguous labels for images and videos.
This assumption is convenient if you are automatically predicting a description for an image, because your technology only needs to generate a single description. Once you have predicted your description, you only need to ask one person whether or not your prediction is correct.
However, what is convenient is not always useful to users. As multimedia analysis gets more and more sophisticated, it's no longer necessary to go with convenient, and we can start to try to automatically describe images in the way that people see them (i.e., covering both the "white and gold" and the "blue and black" camps that characterize "The Dress" debate).
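To make the contrast concrete, here is a minimal sketch of the difference between collapsing people's descriptions of an image into one consensus label and keeping every interpretation that has real support. The function names, the 20% threshold, and the annotation counts are all made up for illustration; this is not the method from our chapter, just one simple way the idea could look in code.

```python
from collections import Counter


def majority_label(annotations):
    """The convenient approach: collapse all annotations into one label."""
    return Counter(annotations).most_common(1)[0][0]


def supported_interpretations(annotations, min_share=0.2):
    """Keep every label backed by at least `min_share` of the annotators.

    Returns a dict mapping each retained label to its share of the votes.
    The threshold is an arbitrary illustrative choice.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    return {
        label: count / total
        for label, count in counts.items()
        if count / total >= min_share
    }


# Hypothetical responses from ten people looking at the same image.
annotations = (
    ["white and gold"] * 6
    + ["blue and black"] * 3
    + ["blue and brown"] * 1
)

print(majority_label(annotations))
print(supported_interpretations(annotations))
```

With these invented numbers, the majority vote reports only "white and gold", while the second function keeps both large camps and drops the label that only one person gave.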
We recently wrote a chapter in a book on computer vision entitled "Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content" [1]. Wrapped up into a single sentence, our main point was: "It's complicated", and multimedia research must embrace that complexity head on. In the chapter, we make a plea for research on "multimedia descriptions involving a complex interpretation". Here's how we defined it:
Multimedia description involving a complex interpretation: A description of an image or a video that is acceptable given a particular point of view. The complex interpretation is often accompanied by an explanation of the point of view. It is possible to question the description by offering an alternative explanation. It does not make sense to reference a single, conventionally accepted external authority.
Some people look at an image and see one thing, some people look at an image and see another thing, and that is normal and ok. In the case of "The Dress", the debate quickly moved from what people saw when they looked at the image, to what they saw when looking at the actual dress of which the image was taken. Looking at the image, and looking at the object can give people two different impressions. That also is normal and ok.
If you accept that the question of the color of "The Dress" can be decided by treating the actual real-world dress as an external authority, then you have indeed solved the problem. Wired, for example, does this: http://www.wired.com/2015/02/science-one-agrees-color-dress The New York Times compares the image to other images of the real-world dress: http://www.nytimes.com/2015/02/28/business/a-simple-question-about-a-dress-and-the-world-weighs-in.html
However, from the perspective of multimedia research, the interesting point about "The Dress" is that it is not a debate about a real-world dress, but rather about an image of that dress. When we analyze the content of a photo, it is not always possible to go and find and inspect the real-world object. Most photos on the Internet are just photos in and of themselves, and we interpret them without direct knowledge of or connection to the real-world situation in which they were taken.
Our perceptual and cognitive lives as human beings are rich and interesting. We stretch ourselves, and grow in our intellectual and emotional capacity, when we discover that not everyone sees things from the same point of view. The lesson of "The Dress" for multimedia research is that we should embrace the ambiguity of images.
If we don't see how important ambiguity is to our relationship to the world and each other, we endanger the richness in our lives. Specifically, the danger is that new image search technologies, such as those used by Google, will start providing us with a single unique answer. The furor over "The Dress" illustrates that faithfulness to how people interpret images requires that there be two answers.
[1] Larson, M., Melenhorst, M., Menéndez, M. and Xu, P. Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content. In: Ionescu, B. et al. (eds.) Fusion in Computer Vision – Understanding Complex Visual Content, Springer, pp. 229-269, 2014.