Wednesday, November 30, 2011

Using Narrative to Support Image Search


A strong system metaphor helps align the needs and expectations with which a user approaches a multimedia search engine with the functionality and the types of results that the engine actually provides. My conviction on this point is so firm that I found myself dressed up as Alice from Alice's Adventures in Wonderland, competing as a finalist in the Yahoo! Image Challenge at the ACM Multimedia 2011 Grand Challenge.

Essentially, the story of the book runs like this: after a long fall, Alice enters through a door into another world. There she encounters the fantastic and the unexpected, but her views are basically determined by two perspectives: the one she has when she grows to be very big, and the one she has when she shrinks to be very small. The book plays with language and with logic, and for this reason it has a strong intellectual appeal to adults while also holding the fascination of children.

We built a system based on this narrative, which offers users (in response to an arbitrary Flickr query) sets of unexpected yet fascinating images, created either from a "big" perspective or from a "small" perspective. The "Alice" metaphor tells the user to:
  • Expect the "big" and "small" perspectives.
  • Expect a system that can be understood at two levels: one that engages childlike curiosity, and one that merits serious intellectual attention for the way it uses language and statistics.
  • Expect a system that needs a little patience, since the images appear a bit slowly (a flood of live Flickr results is being processed in the background), fading in like the Cheshire Cat.

The Grand Challenge requires participants to present their idea in exactly three minutes in a presentation that addresses the following points:
  • What is your solution? Which challenge does it address? How does it address the challenge?
  • Does your solution work? Is there evidence that it works?
  • Can you demo the system?
  • Is the solution generalizable to other problems? What are the limits of your approach?
  • Can other people reproduce your results? How?
  • Did the audience and the jury understand and ENJOY your presentation?
We used the three minutes to cover these points in a dialogue between Alice and Christoph Kofler (CK), first author on the Grand Challenge paper:

Kofler, C., Larson, M., Hanjalic, A. Alice's Worlds of Wonder: Exploiting Tags to Understand Images in terms of Size and Scale. ACM Multimedia 2011, Grand Challenge paper.

During the dialogue we demonstrated the system running live. (We knew it was a risk to run a live demo, but luck was with us and the wireless network held up.)

Alice's Worlds of Wonder: Three Minute Dialogue

(showing a rather standard opening slide)
CK: Alice, look at them out there; their image search experience is dry and boring.

Alice: We should show them our answer to the Yahoo! Image Challenge on Novel Image Understanding.

(showing system interface)
CK: The Wonderlands system runs on top of Flickr and sorts search results for the user at search time.

(dialogue during live demo)
Alice: Let’s show them how it works. Do we trust the wireless network?
CK: Yes. We need a Flickr query.
Alice: Let’s do “car.”
CK: The Wonderlands system presents the user with the choice to enter “Alice’s Small World” or “Alice’s Big World.”
Alice: Let’s choose Small World.

Alice (to audience): If you know me from Alice in Wonderland, you know that in the story I shrink to become very small. This is the metaphor underlying the Small World of the Wonderlands system. It shrinks you, too, as a Flickr user, by putting you eye-to-eye with small objects pictured in small environments with limited range. You get the impression of having the perspective of a small being viewing the world from down low.

Alice (to CK): Let’s choose Big World now. In the book, I also grow to be very big. The Big World makes you grow again: objects are large and the perspective is broad.

You can imagine cases in which you are looking for person-sized cars rather than, say, toy cars; here, the Big World would help you focus your search on the images that you really want.

CK: Should we explain how it works?

Alice: Yes.

CK: (Displays "Implicit Physics of Language" slide) We exploit a combination of user tags and the implicit physics of language.

Alice: Exactly.

Alice: Basically, your search engine knows something about the physics of the real world because it indexes large amounts of human language.

Certain queries give you the real-world size of objects: “the flower in her hand” returns a large number of results, so you can infer that a flower is small.

CK: Oh yes! And “the factory in her hand” returns no results so you know a factory is large.

Alice: Basically, the search engine is telling us that a girl holding a flower in her hand is a common situation, but that her holding a factory is not. We get this effect because physics dictates that something commonly held in a human hand must be small.
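[Aside for blog readers: the trick described in this exchange can be sketched in a few lines of Python. Everything named below, from the hit_count() helper to the exact phrase templates, is my own illustration of the idea, not necessarily the precise method in the Grand Challenge paper.]

```python
import math

def hit_count(phrase):
    """Hypothetical helper: return the number of results a search
    engine reports for the exact phrase. A real implementation would
    call a web or Flickr search API."""
    raise NotImplementedError

# Illustrative phrase templates suggesting small or large objects.
SMALL_TEMPLATES = ["the {} in her hand", "a {} in his pocket"]
LARGE_TEMPLATES = ["she walked into the {}", "inside the {}"]

def size_score(noun):
    """Positive: language treats the noun as small; negative: large.
    log1p dampens the very large counts that common phrases return."""
    small = sum(math.log1p(hit_count(t.format(noun))) for t in SMALL_TEMPLATES)
    large = sum(math.log1p(hit_count(t.format(noun))) for t in LARGE_TEMPLATES)
    return small - large

# Per the dialogue: size_score("flower") should come out positive,
# since "the flower in her hand" is common, while size_score("factory")
# should come out negative, since nobody holds a factory in her hand.
```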

CK: (Displays the entry window with the two doors) The sorting algorithm is straightforward: Alice’s Small World contains images whose tags tend to designate smaller objects, and Alice’s Big World contains images whose tags tend to designate larger objects.

Alice: Exactly.
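[Aside: the sorting step CK describes reduces to a threshold on per-image tag scores. A minimal sketch, assuming size scores like the ones above have been precomputed into a dictionary; the aggregation rule here (a simple mean over known tags) is my own assumption.]

```python
def sort_into_worlds(images, size_scores):
    """images: iterable of (image_id, tags) pairs from a Flickr result.
    size_scores: dict mapping a tag to a precomputed size score
    (positive = small, negative = large). Images whose tags carry
    no size evidence are left in neither world."""
    small_world, big_world = [], []
    for image_id, tags in images:
        scores = [size_scores[t] for t in tags if t in size_scores]
        if not scores:
            continue  # no evidence either way
        mean = sum(scores) / len(scores)
        (small_world if mean > 0 else big_world).append(image_id)
    return small_world, big_world

# Toy example:
scores = {"flower": 2.1, "hand": 1.5, "factory": -3.0, "skyline": -2.2}
images = [("img1", ["flower", "macro"]), ("img2", ["factory", "skyline"])]
print(sort_into_worlds(images, scores))  # (['img1'], ['img2'])
```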

CK: So, Alice, the system takes a fanciful and engaging perspective. But in order to carry out a quantitative evaluation, we can look at it in terms of scale. We achieve a weighted precision nearly three times random chance.
(Flashes up under the two doors: "Evaluation on 1,633 Flickr images from the MIRFLICKR data set. 0.773 weighted precision")

Alice: So the scale numbers point to the conclusion that we are creating a genuine two-worlds experience for users.

CK: Right. But, Alice, do we need to stop at two worlds: big and small? Are there other worlds out there?

Alice: Well, Christoph, effectively the only limit is the speed at which we can query Flickr and Yahoo!. You know that the implicit physics of language works because of general physical principles. So, in theory, there are as many different worlds as there are interesting physical properties.

CK: But being Alice, you like the small and the big worlds, right?

Alice: Yes, I do. Shall we try another query?

CK: (Display final slide) Or we can just tell them where to download the system. You know, the code's online.

Alice: Yes, let them try it out! No more dry and boring image search for this group...(TIME UP!!)
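A note on the evaluation figures that flash up during the dialogue: the weighted precision of 0.773 is the number reported in the Grand Challenge paper. For readers who want a concrete reference point, one standard way of computing a class-frequency-weighted precision over big/small predictions looks like the sketch below; I am not claiming this is the exact formula used in the paper.

```python
from collections import Counter

def weighted_precision(predictions, truths):
    """predictions, truths: equal-length lists of labels such as
    'small'/'big'. Per-class precision is weighted by how often each
    class occurs in the ground truth. This is one standard definition;
    the paper's weighting may differ."""
    assert len(predictions) == len(truths)
    total = len(truths)
    score = 0.0
    for cls, count in Counter(truths).items():
        # Ground-truth labels of the items predicted as this class.
        predicted_as_cls = [t for p, t in zip(predictions, truths) if p == cls]
        if predicted_as_cls:
            precision = predicted_as_cls.count(cls) / len(predicted_as_cls)
            score += (count / total) * precision
    return score

preds = ["small", "small", "big", "big"]
truth = ["small", "big", "big", "big"]
print(round(weighted_precision(preds, truth), 3))  # 0.875
```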

Saturday, November 5, 2011

Affect and concepts for multimedia retrieval (Halloween III)

This Halloween I just kept noticing what I am calling "affect pumpkins": jack-o-lantern faces labeled with emotion words. Jack-o-lanterns, and decorations that depict them (such as the ones in this image), are typical of Halloween celebrations.

I don't remember having my jack-o-lanterns labeled with adjectives when I was a child, so I am rather curious about this phenomenon and have been observing it a bit. Apparently, the activity of giving jack-o-lanterns emotion words is quite fun and is, all in all, a harmonious process, characterized by a lack of disagreement or other interpersonal strife. If you have a happy jack-o-lantern, there appears to be a high degree of consensus about the applicability of the label 'happy'.

I contrast this smooth and fun pumpkin-labeling procedure with the disagreement in the multimedia community, which has apparently developed into full-fledged distaste for what are referred to as "subjective user tags": tags that express feelings or personal perspectives. Such tags have been called "imprecise and meaningless" in Liu et al. 2009, published at WWW (page 351), and my impression is that many, many researchers agree with this point of view. In the authors' defense, even had they used what I feel is the more appropriate formulation, "imprecise and meaningless with respect to a certain subset of multimedia retrieval tasks", the community would probably still be on a rampage against personal and affective tags.

Sometimes it seems everyone has simply made a spontaneous decision to take up arms against the insight of Rosalind Picard, who in 1995 wrote, "Although affective annotations, like content annotations, will not be universal, they will still help reduce time searching for the 'right scene.' Both types of annotation are potentially powerful; we should be exploring them in digital audio and visual libraries." (from "TR 321", p. 11). Do we have a huge case of sour grapes? Have we decided that we have irreversibly failed over the past 15+ years to exploit affective image labels, and concluded that we should therefore never have considered them potentially interesting in the first place?

Oh, I hope not. Just look at this wall and think about all the walls like it, all the jack-o-lantern pictures that were created this Halloween and posted to the Internet. There are too many pictures of Halloween pumpkins out there for us to overlook the chance to organize them by affect. Of course, some people might hold that this silly pumpkin should actually also be considered a happy pumpkin: we can anticipate some disagreement. However, it is important to keep two points in mind: (1) Labels that are ostensibly 'objective' and have nothing to do with affect are also subject to a lack of consensus on their applicability, e.g., the ambiguity over whether a depicted object is a 'pumpkin' or a 'jack-o-lantern', discussed in my previous post. (2) Even if we do not agree on the exact affective label, we have intuitions about where we disagree and about which other interpretations are plausible. For example, someone who insists on 'silly' will admit that someone else might consider this pumpkin 'happy', but would not expect anyone to find 'sad' the most appropriate label.

Interestingly, in my observations, the emotion words used to describe a jack-o-lantern seem to be chosen from one of two perspectives. Depicted in the image above are "pumpkin perspective" emotion words ('happy', 'silly', 'sad' and 'mad'), which designate the emotion the jack-o-lantern itself is experiencing, the emotion that explains its expression. The picture book page in the image from my previous post mixes this "pumpkin perspective" with a "people perspective". The book reads, "We'll make our jack-o-lanterns--it might be messy, but it's fun!" and then asks, "Will yours be scary?" A jack-o-lantern is scary if it causes fear from the perspective of the people looking at it. The book then goes on to ask "Happy? Sad?", which are "pumpkin perspective" words, and finally, "A sweet or silly one?" Other perspectives are also possible: the affect label could reflect what the carver of the jack-o-lantern intended to achieve by carving it.

In my own work, I tend to insist on the importance of distinguishing these different perspectives, with the idea that if the underlying model of affect is complete and sound, it will provide a more stable foundation for building a system of annotation. In practical use, however, the affect labels don't need to make the experiencer explicit, and users don't need to understand any principle of empathic sympathy: we simply know a happy pumpkin when we see one, and that of course makes us a little happy ourselves.
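To make the distinction concrete, here is a toy sketch of what perspective-aware affect annotations could look like. The field names and perspective values are purely my own illustration, not a published schema.

```python
from dataclasses import dataclass

@dataclass
class AffectLabel:
    image_id: str
    emotion: str      # e.g., 'happy', 'silly', 'scary'
    perspective: str  # 'depicted': felt by the jack-o-lantern itself;
                      # 'viewer': evoked in people looking at it;
                      # 'creator': intended by the carver.

annotations = [
    AffectLabel("wall_photo_3", "happy", "depicted"),
    AffectLabel("wall_photo_7", "scary", "viewer"),
    AffectLabel("wall_photo_7", "spooky", "creator"),
]

# A retrieval system can then answer perspective-specific queries,
# e.g., all jack-o-lanterns that viewers find scary:
scary_for_viewers = [a.image_id for a in annotations
                     if a.emotion == "scary" and a.perspective == "viewer"]
print(scary_for_viewers)  # ['wall_photo_7']
```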

Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang, and Hong-Jiang Zhang. 2009. Tag ranking. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 351-360.

Rosalind W. Picard. 1995. Affective Computing. MIT Media Laboratory Perceptual Computing Section, Technical Report 321, November 1995.