Monday, October 31, 2011

Visual concepts and Wittgenstein's language games (Halloween II)

Wittgenstein conceives of human language as an activity consisting of language games that are related but different. One of these games is the game that we play when we read picture books to kids. We point at images and name them. The kids are then supposed to gradually acquire this pointing and naming behavior. We generally happily consider the children to be acquiring human language during these sessions. However, if we apply our Wittgenstein, what we are doing is teaching kids how to play the "naming game". We notice this because two minutes later the young child is furiously indicating that it doesn't want to do something, actively using the concept "no". The concept of "no" or "no, I don't want" (we recognize while delicately shoving small, flailing hands into sweater arms) is not depictable as a nameable entity in a picture book. We're still using language of some sort, but we've switched to another, possibly more important game.

As multimedia retrieval researchers we generally fall into the same trap when developing multimedia retrieval indexing systems. We get the systems to annotate depictable visual concepts and somehow forget that this is only one "language game" in the whole gamut of different games that humans play when they use language. The point is an important one. Visual content-based retrieval systems are in their infancy. We, as, well, a species, are currently negotiating a system of conventions, of game moves as it were, that determine how we interact with these systems.

The danger is: if we start out by making very narrow assumptions about what people could possibly be looking for when they look for images and video, the conventions of interacting with video search engines will become calcified into a very simplistic game. We'll be stuck in the picture book phase of multimedia retrieval childhood forever.

Actually, this Halloween I encountered a picture book that suggests that even picture books are trying to pop out of the "naming game". This one has a page with a picture of kids making jack-o-lanterns and an orange box asking the questions: "How many orange pumpkins can you count?" and "How many are jack-o-lanterns?"

Well, ahem. When does something stop being a pumpkin and become a jack-o-lantern? When you cut off the top? When you've fully emptied the inside? When you cut the first eye or when you have popped out the final piece around the teeth to complete the grin?

How about those jack-o-lanterns that have been drawn on the chalkboard? Are those jack-o-lanterns or are they pictures of jack-o-lanterns? And maybe a jack-o-lantern actually still counts as a pumpkin if it was made from a pumpkin in the first place?

In short, it is impossible to give a unique answer to the questions that this book is asking. We can either think that the people at Fisher-Price are corrupting our youth, or we can realize: kids don't need to have books that depict things that are uniquely identifiable. There is simply a huge ambiguity as to what exactly is a pumpkin and what is a jack-o-lantern. We can extend the "naming game" with this ambiguity and it is still truly a part of our human language. We don't need to (and generally do not) resolve ambiguity in order to use language effectively. The page of this book is not some sort of obscure philosophical exception: this is a situation that is frequent and highly characteristic of the situations we deal with on a daily basis.

Fisher-Price apparently now thinks that kids' books should no longer protect them against ambiguity in language. We shouldn't "baby" our multimedia systems either: rather, we should let them play as large and complex a language game as they can possibly handle, as large as is technically possible and as users find helpful and interesting.

The next post makes another related point about this picture book...

Reflections on visual concepts in images (Halloween I)



LSCOM stands for "Large Scale Concept Ontology for Multimedia" and it is a list of concepts associated with multimedia, including images and videos. If you were to ask me where I stand with the LSCOM concept list, I am a 2753-Solid_Tangible_Thing kind of a multimedia researcher and not a 125-Airplane_Flying kind of a multimedia researcher.

Basically, what I mean is that I adhere to the perspective that in order to solve the general problem of multimedia information retrieval on the Web, we should make use of basic properties of objects depicted in images and video, rather than their specific identities. I have discussed the issue previously in a post on proto-semantics, dimensions of meaning that arise from human perceptions and interactions with the world. Proto-semantic dimensions are more fundamental than the words that we usually use to describe the world around us, and for that reason, they can be considered to be sub-lexical. For example, I am drinking coffee from a mug, but more fundamentally this is a small, corporeal object, or, if we pick something from LSCOM, a 1425-Concave_Tangible_Object. I return to the issue here, since I've been pondering it again on the occasion of Halloween.

It seems that the way that scientists approach the problem of visual indexing, i.e., automatically describing the visual content of images and videos, is always inextricably related to their backgrounds. I've worked in the area of multimedia retrieval for going on 12 years now, and in my experience two main backgrounds dominate the field: surveillance and cultural heritage. Let me say a few words about both.

Surveillance: The analysis of surveillance footage or images captured by security cameras is aimed at the task of automatically identifying threat levels. For surveillance tasks, one defines a closed set of objects and behaviors that constitute "business as usual"; anything outside of that range can be considered a threat and triggers an alarm calling for the intervention of human intelligence. Surveillance is a high-recall task, meaning that it is more important not to miss any events than to reduce the rate of false alarms. This background doesn't quite transfer to the general problem of multimedia retrieval on the Web.

We can't assume that Web multimedia will depict a closed class of objects. The cases that cannot be covered by a closed class are not rarely occurring "threats", but rather entities drawn from the long tail, which, if we can indeed assume a finite inventory, will contain approximately half of the encountered entities. Further, Web multimedia retrieval is typically a precision-oriented problem, which means that reducing false alarms is relatively more important than exhaustive detection.
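To make the contrast between a recall-oriented and a precision-oriented task concrete, here is a minimal sketch in Python. The two sets of counts are invented purely for illustration (they are assumptions, not measurements from any real detector or search engine); only the standard definitions of precision and recall are taken as given.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of returned items (or raised alarms) that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of the truly relevant items (or real events) that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: tp = correct hits, fp = false alarms, fn = misses.
# A surveillance-style detector tolerates many false alarms to miss almost nothing.
surveillance = {"tp": 9, "fp": 41, "fn": 1}
# A Web-search-style system returns few results, but most of them are right.
web_search = {"tp": 8, "fp": 2, "fn": 32}

for name, c in [("surveillance", surveillance), ("web search", web_search)]:
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In the made-up surveillance setting almost every event is caught at the price of many false alarms, while in the made-up Web setting most returned results are correct even though many relevant items are never surfaced.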

Cultural heritage: Iconographic classification of visual art involves a classification system such as Iconclass. The stated purpose of Iconclass is the description and retrieval of subjects represented in images. I rather suspect that before the very first paint had dried on the very first canvas, next to the artist was standing an art historian who started to create a classification system to categorize the painting. In other words, using classification systems for visual art is an old idea that has well-established conventions and has been honed over generations of use. Such a classification system necessarily views works of art as physical objects, and would have as its goal the task of organizing the storage facility of a museum or of helping to choose which works to hang together in an exhibition. The people who created it assumed that the number of dimensions of similarity between works of art was necessarily finite. Such an assumption makes sense in light of a relatively small number of art historians working on a relatively small number of questions concerning art history and the iconography of art.

Enter, however, the Web. Images and video are not physical objects, and we do not have to be able to list them all in a well-ordered list or even ever make the decision of "Do we hang this in the East Wing gallery or the West Wing gallery?" There are many more users than art historians, and suddenly it might actually be useful to admit the possibility that the number of ways to compare two images might in fact be infinite, rather than finite.

As for myself, I fall into neither the surveillance nor the cultural heritage category. I attribute this to what's probably a naive equation of surveillance with totalitarian states and also to having the yearly experience in grade school of being packed on a bus and shipped off for a day at the Art Institute of Chicago.

I guess the Art Institute of Chicago was supposed to have broadened the horizons of our young minds, but instead it sort of warped me in a way that makes it difficult to talk to me, if you are a cultural heritage person or an art historian. I was young enough that everything I drew sort of came out flattish, whether I intended it to look two-dimensional or not, when I was suddenly confronted with the likes of Mark Rothko. I think what happened is that someone in Chicago told me that Mark Rothko described his work as an “elimination of all obstacles between the painter and the idea, between the idea and the observer” (as quoted on this AIC webpage describing the Rothko painting above). At the time, I didn't particularly like Rothko, but the experience permanently hardened my mind against the idea that it made any sense whatsoever to describe visual art in terms of its depicted subject.

I think that Mark Rothko must fit into the Iconclass categorization "0 Abstract, Non-representational Art: 22C4 colours, pigments, and paint", which is unsatisfactory to me because it makes him seem like an afterthought. In Chicago, they apparently forgot to mention that he was reacting to what came before him. As for me, I was already broken. A system that put Rothko on the outside rather than at its core could never have been acceptable to me. From then until always: the main point of art is what we do with it: how we talk about it, how we stand before it and mull in the museum, which prints we buy in the shop and go home and hang on our walls and (as little as we like to admit it) how much we pay for it. A priori we don't know what draws us to art, so why should we make little lists of entities corresponding to its subjects?

The perspective I take may not ultimately prove more productive than either the surveillance perspective or the cultural heritage perspective. It is the linguistics perspective. My view is the following: the elements of meaning arising from human perception and interaction with the world that have been encoded into human language semantics are the elements that we should try to dig out of videos and images. They are the lowest common denominator of meaning that we can be sure will give us the ability to cover all human queries: the ones that we can anticipate and the ones that we cannot.

So should the image above be given the LSCOM category 2753-Solid_Tangible_Thing? Sure. It's an image of a painting. That's a tangible object. But let's also let the image be found by shape and color. And be found how I found it on the Internet: with the query "Rothko". And let it also be found when we search for formative experiences. And for Chicago...

And what does this have to do with Halloween? Continue to the next post.

Thursday, October 20, 2011

Deep Link to Delft Technology Fellowship

Being educated in the US and being a scientist in Europe is sometimes quite tough. I need to continuously use a sort of filter that tells me that although I am hearing X, I need to pause, consider carefully, and realize that the person is really saying Y. One particularly painful example was unfortunately provided by our rector magnificus, the president of our university, in a recent interview. In promoting a new program to attract female scientists to the TU Delft, he said '...vrouwelijke wetenschappers zijn minstens zo talentvol als mannelijke wetenschappers.' which translates into English as 'female scientists are at least as talented as their male counterparts'. Ouch.

This statement does not work in the US academic context, because it fails gender symmetry. Gender symmetry can be diagnosed with the following test: flip the polarity of gender terms (e.g., 'woman', 'man', 'male', 'female') in a statement, and determine whether the resulting statement retains meaning within the context.

Let's try it. Flipping polarity of gender terms in his sentence yields, '...male scientists are at least as talented as their female counterparts'. This sentence is clearly interpretable, but no longer has a meaning that fits the context.

Contrast that with an alternate sentence such as: 'There is no discrepancy in talent between male and female scientists'. This sentence has the same declarative content, but it passes the gender symmetry test because you can substitute it with 'There is no discrepancy in talent between female and male scientists'.

Of course, in this case, a further problem arises. This sentence has the implicature that there is some reason for which this fact needs to be asserted in the first place. The act of pronouncing this sentence communicates that the speaker does not consider the point to be completely obvious, but rather feels that it needs to be explicitly asserted. One might choose against even this alternative sentence in order to avoid sending the message that one feels that there is someone out there that still needs to be convinced on the point of talent equivalence between male and female scientists. But on the whole, this alternative could be considered the 'best practices' formulation, should one indeed find oneself in a situation where it was necessary to make a statement comparing the relative scientific talent of men and women.
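The mechanical half of the gender symmetry test, flipping the polarity of gender terms, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term list `SWAPS` and the function name `flip_gender_terms` are hypothetical and deliberately tiny, and the second half of the test, judging whether the flipped sentence still means something sensible in context, remains a human decision that the code does not attempt.

```python
import re

# Assumed, deliberately small lexicon of paired gender terms (hypothetical).
SWAPS = {
    "female": "male", "male": "female",
    "woman": "man", "man": "woman",
    "women": "men", "men": "women",
}

# Word boundaries keep "male" from matching inside "female", and so on.
_PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)


def flip_gender_terms(sentence: str) -> str:
    """Return the sentence with every listed gender term swapped for its pair."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        flipped = SWAPS[word.lower()]
        # Preserve an initial capital, e.g. at the start of a sentence.
        return flipped.capitalize() if word[0].isupper() else flipped

    return _PATTERN.sub(swap, sentence)


original = "female scientists are at least as talented as their male counterparts"
alternative = "There is no discrepancy in talent between male and female scientists"

print(flip_gender_terms(original))
print(flip_gender_terms(alternative))
```

Flipping the first sentence changes the claim being made, whereas flipping the alternative only reorders the two terms, which is the difference the test is meant to expose.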

What my filter tells me is that although X was said in this case, what was meant is Y. And concerning Y, I rather suspect that our rector magnificus harbors the personal opinion that women have perhaps even a teensy bit more science talent than men and that in fact he is saying, "at least as (if not more) qualified". Whether or not that is true, it's safe to say that he is of the opinion that our university would, at this point in time, benefit from hiring additional women.

One of the research topics that I am interested in as a multimedia retrieval scientist is developing algorithms for the retrieval of jump-in points (JIPs) in video. JIPs allow the viewer to click directly to a certain relevant point in a video. On YouTube, they are called deep links. JIPs make it possible to share or to comment on particular points of a video, just as I am currently doing with this post. The deep link to the relevant section of the interview under discussion is the following:

http://youtu.be/wvto6MWXE6k?t=35s
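As a small illustration of what such a deep link encodes, the Python sketch below builds and parses links of the same shape as the one above: a video identifier in the path and a start offset in a `t` parameter given in seconds. The function names are hypothetical, and the URL pattern is simply copied from the link in this post; it is an assumption that other links follow the same form, not a statement about YouTube's full URL scheme.

```python
from typing import Tuple
from urllib.parse import urlparse, parse_qs


def make_deep_link(video_id: str, seconds: int) -> str:
    """Build a link of the form shown in this post: http://youtu.be/<id>?t=<n>s."""
    return f"http://youtu.be/{video_id}?t={seconds}s"


def parse_deep_link(url: str) -> Tuple[str, int]:
    """Split a link of that form into (video id, start offset in seconds)."""
    parts = urlparse(url)
    video_id = parts.path.lstrip("/")
    # Assumes the offset is written as '<n>s'; a missing offset is treated as 0.
    offset = parse_qs(parts.query).get("t", ["0s"])[0]
    return video_id, int(offset.rstrip("s"))


link = make_deep_link("wvto6MWXE6k", 35)
print(link)
print(parse_deep_link(link))
```

A jump-in point is thus just a pair of video and time offset, which is why it can be shared and commented on as easily as any other URL.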

The current status of technology on the Web is that it is possible to comment on JIPs or share them, but search engines don't return them as results. Together with colleagues within the Netherlands and across Europe, I am developing and helping to promote the development of JIP retrieval in the MediaEval Rich Speech Retrieval task (see the feature on MediaEval 2011 in MMRecords for a brief description). Such technology would allow search engines to return pointers to specific time points within video that are relevant to user queries.

At the end of the day, I am more interested in the scientific questions raised by the task of JIP multimedia retrieval than I am in the gender issue. Since grade school, I have frequently been the "only girl" involved in whatever activity fascinated me. You don't know it any other way, so you don't really notice. I contribute what I can to the discourse on promoting gender balance, not so much because of myself, but because I find it wasteful if I feel that women who I am mentoring are somehow holding themselves back.

When I first came to Delft, I contributed the following comment on improving the working climate at the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS). This is the point of view that I still stand by so I include it here to complete my comment on the deep link.

Response on the 2009 Challenging Gender survey
The way of improving the working climate at EEMCS would be to address the gender imbalance within a larger program of promoting diversity in the Faculty of EEMCS. A faculty that includes international scientists addressing multi- and trans-disciplinary questions is automatically going to be more comfortable for women, since gender differences become just one of many differences of background and perspective that make the faculty richer and more productive.

Any effort invested in promoting inclusion of scientists/researchers that have pursued non-traditional career tracks (e.g., completing their PhD at an older or younger age, taking time off, switching disciplines mid-career) will automatically make women feel more welcome. When women feel welcome, they will also feel confident that the effort that they invest will be rewarded by a long and productive career in the EEMCS, establishing a virtuous cycle.

Everyone benefits from the promotion of diversity. For example, in this kind of climate, a researcher who has worked in the faculty for years will feel more comfortable about taking the risk of investigating a new class of algorithms or applying expertise accumulated in one domain to solving a problem in a radically different domain.

Positive side-effect: If everyone benefits, then women will not be burdened by the (perceived) need to fight the prejudice that they have been hired due to their gender and not due to their competence.

By promoting diversity, both in terms of scientific expertise and also in terms of other characteristics (cultural, religious, linguistic, socio-economic, sexual orientation as well as gender), the faculty will draw on a larger pool of talent and increase its productivity and capacity for creation and invention.

Working at TU-Delft, you see "Challenge the future" written everywhere...sometimes in unexpected places. As a woman this speaks to me in a special way: it says that the future at the TU-Delft is not set up to be a carbon copy of the past. Because of the "challenge the future" attitude, I have confidence that the demographics of my department will shift naturally as the Faculty of EEMCS continues to mature, extend and innovate scientifically.