Reflections on visual concepts in images (Halloween I)
LSCOM stands for "Large Scale Concept Ontology for Multimedia" and it is a list of concepts associated with multimedia, including images and videos. If you were to ask me where I stand with the LSCOM concept list, I am a 2753-Solid_Tangible_Thing kind of a multimedia researcher and not a 125-Airplane_Flying kind of a multimedia researcher.
Basically, what I mean is that I adhere to the perspective that in order to solve the general problem of multimedia information retrieval on the Web, we should make use of basic properties of objects depicted in images and video, rather than their specific identities. I have discussed the issue previously in a post on proto-semantics, dimensions of meaning that arise from human perceptions and interactions with the world. Proto-semantic dimensions are more fundamental than the words that we usually use to describe the world around us, and for that reason, they can be considered to be sub-lexical. For example, I am drinking coffee from a mug, but more fundamentally this is a small, corporeal object, or, if we pick something from LSCOM, a 1425-Concave_Tangible_Object. I return to the issue here, since I've been pondering it again on the occasion of Halloween.
It seems that the way that scientists approach the problem of visual indexing, i.e., automatically describing the visual content of images and videos, is always inextricably related to their backgrounds. I've worked in the area of multimedia retrieval for going on 12 years now, and in my experience two main backgrounds dominate the field: surveillance and cultural heritage. Let me say a few words about both.
Surveillance: The analysis of surveillance footage or images captured by security cameras is aimed at the task of automatically identifying threat levels. For surveillance tasks, one defines a closed set of objects and behaviors that constitute "business as usual", and anything outside of that range can be considered a threat and triggers an alarm calling for the intervention of human intelligence. Surveillance is a high-recall task, meaning that it is more important not to miss any events than to reduce the rate of false alarms. This background doesn't quite transfer to the general problem of multimedia retrieval on the Web.
We can't assume that Web multimedia will depict a closed class of objects. The cases that cannot be covered by a closed class are not infrequently occurring "threats", but rather entities drawn from the long tail, which, if we can indeed assume a finite inventory, will contain approximately half of the encountered entities. Further, Web multimedia retrieval is typically a precision-oriented problem, which means that reducing false alarms is relatively more important than exhaustive detection.
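To make the contrast between a high-recall task and a precision-oriented task concrete, here is a minimal sketch in Python. The image identifiers and the two hypothetical detectors ("cautious" and "eager") are invented purely for illustration; they do not come from any real system or dataset.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predictions.

    precision = share of predicted items that are actually relevant
    recall    = share of relevant items that were predicted
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: the items that really match.
relevant = {"img1", "img2", "img3", "img4"}

# A "cautious" detector returns few items, all of them correct.
cautious = {"img1", "img2"}

# An "eager" detector returns everything relevant, plus false alarms.
eager = {"img1", "img2", "img3", "img4", "img5", "img6", "img7", "img8"}

print(precision_recall(cautious, relevant))  # (1.0, 0.5)
print(precision_recall(eager, relevant))     # (0.5, 1.0)
```

In this toy setting, the eager detector is the one a surveillance application would prefer (nothing is missed, at the cost of false alarms), while the cautious detector is closer to what a Web search user wants to see on the first page of results.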
Cultural heritage: Iconographic classification of visual art involves a classification system such as Iconclass. The stated purpose of Iconclass is the description and retrieval of subjects represented in images. I rather suspect that before the very first paint had dried on the very first canvas, next to the artist was standing an art historian who started to create a classification system to categorize the painting. In other words, using classification systems for visual art is an old idea that has well-established conventions and has been honed over generations of use. Such a classification system necessarily views works of art as physical objects, and would have as its goal the task of organizing the storage facility of a museum or of helping to choose which works to hang together in an exhibition. The people who created it assumed that the number of dimensions of similarity between works of art was necessarily finite. Such an assumption makes sense, in light of a relatively small number of art historians working on a relatively small number of questions concerning art history and the iconography of art.
Enter, however, the Web. Images and video are not physical objects and we do not have to be able to list them all in a well-ordered list or even ever make the decision of "Do we hang this in the East Wing gallery or the West Wing gallery?" There are many more users than art historians, and suddenly it might actually be useful to admit the possibility that the number of ways to compare two images might in fact be infinite, rather than finite.
As for myself, I fall into neither the surveillance nor the cultural heritage category. I attribute this to what's probably a naive equation of surveillance with totalitarian states and also to having the yearly experience in grade school of being packed on a bus and shipped off for a day at the Art Institute of Chicago.
I guess the Art Institute of Chicago was supposed to have broadened the horizons of our young minds, but instead it sort of warped me in a way that makes it difficult to talk to me, if you are a cultural heritage person or an art historian. I was young enough that everything I drew sort of came out flattish, whether I intended it to look two-dimensional or not, when I was suddenly confronted with the likes of Mark Rothko. I think what happened is that someone in Chicago told me that Mark Rothko described his work as an “elimination of all obstacles between the painter and the idea, between the idea and the observer” (as quoted on this AIC webpage describing the Rothko painting above). At the time, I didn't particularly like Rothko, but the experience permanently hardened my mind to the idea that it made any sense whatsoever to describe visual art in terms of its depicted subject.
I think that Mark Rothko must fit into the Iconclass categorization "0 Abstract, Non-representational Art: 22C4 colours, pigments, and paint", which is unsatisfactory to me because it makes him seem like an afterthought. In Chicago, they apparently forgot to mention that he was reacting to what came before him. For me, I was already broken. A system that put Rothko on the outside rather than at its core could never be acceptable to me. From then until always: the main point of art is what we do with it: how we talk about it, how we stand before it and mull in the museum, which prints we buy in the shop and go home and hang on our walls and (as little as we like to admit it) how much we pay for it. A priori we don't know what draws us to art, so why should we make little lists of entities corresponding to its subjects?
The perspective I take may not ultimately prove more productive than either the surveillance perspective or the cultural heritage perspective. It is the linguistics perspective. My view is the following: the elements of meaning arising from human perception and interaction with the world that have been encoded into human language semantics are the elements that we should try to dig out of videos and images. They are the lowest common denominator of meaning that we can be sure will give us the ability to cover all human queries: the ones that we can anticipate and the ones that we cannot.
So should the image above be given the LSCOM category 2753-Solid_Tangible_Thing? Sure. It's an image of a painting. That's a tangible object. But let's also let the image be found by shape and color. And be found how I found it on the Internet: with the query "Rothko". And let it also be found when we search for formative experiences. And for Chicago...
I divide my time between Radboud University Nijmegen and Delft University of Technology in the Netherlands. My research focuses on multimedia retrieval techniques that exploit speech and language and focus on human interpretations of meaning. I am particularly interested in internet video, in networked communities, and crowdsourcing techniques. Lately, I've been noticing how difficult it is to imagine life without search.