
Wednesday, May 24, 2017

Multimedia Meets Machine (Learning): Understanding images vs. Image Understanding

Today, I gave a talk at Radboud University's Good AIfternoon symposium for Artificial Intelligence students. I covered several papers that I have written with different subsets of my collaborators [1, 2, 3]. The goal was to show students the difference between the way humans understand images and the type of understanding that can be achieved by computers applying visual content analysis, particularly concept detection.

Human Understanding of Images
Consider the images below from [1]. The concept detection paradigm claims success if a computer algorithm can identify these images as depicting a woman wearing a turquoise blue sundress with water in the background. For bonus points, in one image the woman is wearing sunglasses.
A person looking at these images would not say that such a concept-based description of the images is wrong. In fact, if a person is presented with these pictures out of context and asked what they depict, "A woman wearing a blue sundress at the beach" would be an unsurprising response.

However, this response falls short of really characterizing the photos from the perspective of a human viewer. The shortcoming becomes clear when we consider contexts of use. For example, if we needed to choose one of the two as a photo for selling a turquoise blue dress in a web shop, the right-hand photo is clearly the photo we want; the left-hand photo is clearly unsuited for the job. Concept-based descriptions of these images fail to fully capture user perspectives on images. Upon reflection, a person looking at these images would conclude that the concept-based description is not wrong per se, but that it seriously misses the point of each image.

An often-heard argument is that you need to start somewhere, and that concept-based description is a good place to start. However, we need to keep in mind that this starting point represents a built-in limitation on the ability of systems that use automatic image understanding (such as image retrieval systems) to serve users.

Think of it this way: indexing images with a preset set of concepts is a bit like those parking garages that paint each floor a different color. If you remember the color, that color is effective at allowing you to find your car. However, the relationship between the color and your car is one of convenience. The parking-garage-floor color is an essential property of your car while you are looking for it in the garage, but outside of the garage, you wouldn't consider it an important property of your car at all.
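To make the analogy concrete, here is a minimal sketch of concept-based indexing in Python. The vocabulary, concept labels, and image names are hypothetical, chosen to echo the sundress example above; the fixed concept vocabulary plays the role of the garage-floor colors.

from collections import defaultdict

# A preset concept vocabulary: the "floor colors" of the index.
CONCEPT_VOCABULARY = {"woman", "sundress", "water", "sunglasses", "beach"}

def index_images(images):
    """Build an inverted index from concept label to image IDs."""
    index = defaultdict(set)
    for image_id, detected_concepts in images.items():
        for concept in detected_concepts & CONCEPT_VOCABULARY:
            index[concept].add(image_id)
    return index

# Two very different photos receive identical concept descriptions,
# so the index cannot distinguish the web-shop product photo from
# the casual snapshot.
images = {
    "casual_snapshot.jpg": {"woman", "sundress", "water"},
    "product_photo.jpg": {"woman", "sundress", "water"},
}
index = index_images(images)
print(sorted(index["sundress"]))  # both images come back together

The index retrieves by concept just fine, but, like the floor color, it says nothing about which photo fits a given context of use.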

In short, automatic image understanding underestimates the uniqueness of these images, although this uniqueness is of the essence for a human viewer.

Machine Image Understanding
Consider the images below from [4]. A human viewer would see these as two different images.
If the geo-location of the right-hand image is known, geo-location estimation algorithms [3] can correctly predict the geo-location of the left-hand image. In this case, a machine learning algorithm "understands" something about an image that is not particularly evident to a casual human viewer. Humans are largely unaware that the geo-location of their images is "obvious" to a computer algorithm that has access to other images known to have been taken at the same place.
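To give a flavor of how matching against images with known locations works, here is a toy nearest-neighbor sketch in Python. It assumes some visual feature extractor has already produced a vector for each image; the cosine-similarity matching shown here is my illustration, not the geo-distinctive visual element matching of [3], which is considerably more sophisticated.

import numpy as np

def estimate_geolocation(query_features, reference_features, reference_coords):
    """Toy geo-location estimator: return the coordinates of the most
    visually similar reference image.

    query_features:     1-D feature vector of the query image
    reference_features: (n, d) array of features for n images whose
                        geo-locations are known
    reference_coords:   list of n (latitude, longitude) pairs
    """
    # Cosine similarity between the query and every reference image.
    q = query_features / np.linalg.norm(query_features)
    r = reference_features / np.linalg.norm(reference_features, axis=1, keepdims=True)
    similarities = r @ q
    best_match = int(np.argmax(similarities))
    return reference_coords[best_match]

Even this crude baseline illustrates the point: as soon as other images known to have been taken at the same place are available, the location of the query image is no longer hidden.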

In short, human understanding of images overestimates the uniqueness of these images, and visual content analysis algorithms understand more than people realize.

Moving forward
Given the current state of the art in visual content analysis, "Multimedia Meets Machine" is perhaps a bit outdated, and we should be thinking in terms of titles like "Multimedia Has Already Met Machine".

The key question moving forward is whether machine understanding of images supports the people who take and use those images, or whether it merely provides a little convenience at the larger cost of personal privacy.


[1] Michael Riegler, Martha Larson, Mathias Lux, and Christoph Kofler. 2014. How 'How' Reflects What's What: Content-based Exploitation of How Users Frame Social Images. In Proceedings of the 22nd ACM international conference on Multimedia (MM '14). 

[2] Martha Larson, Christoph Kofler, and Alan Hanjalic. 2011. Reading between the tags to predict real-world size-class for visually depicted objects in images. In Proceedings of the 19th ACM international conference on Multimedia (MM '11).

[3] Xinchao Li, Alan Hanjalic, and Martha Larson. Geo-distinctive Visual Element Matching for Location Estimation of Images. Under review. http://arxiv.org/pdf/1601.07884v1.pdf

[4] Jaeyoung Choi, Claudia Hauff, Olivier Van Laere and Bart Thomee. 2015. The Placing Task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop.


Wednesday, November 30, 2011

Using Narrative to Support Image Search


A strong system metaphor helps to align the needs and expectations with which a user approaches a multimedia search engine with the functionality and types of results that the search engine provides. My conviction on this point is so firm that I found myself dressed up as Alice from Alice's Adventures in Wonderland, competing as a finalist in the Yahoo! Image Challenge at the ACM Multimedia 2011 Grand Challenge.

Essentially, the story in the book runs as follows: after a long fall, Alice enters through a door into another world. Here, she encounters the fantastic and the unexpected, but her views are basically determined by two perspectives: one that she has when she grows to be very big, and one that she has when she shrinks to be very small. The book plays with language and with logic, and for this reason it has a strong intellectual appeal to adults as well as holding the fascination of children.

We built a system based on this narrative, which offers users (in response to an arbitrary Flickr query) sets of unexpected yet fascinating images, created either from a "big" perspective or from a "small" perspective. The "Alice" metaphor tells the user to:
  • Expect the "big" and "small" perspectives.
  • Expect a system that can be understood at two levels: as both engaging childlike curiosity and also meriting serious intellectual attention due to the way in which it uses language and statistics.
  • Expect a system that will need a little bit of patience, since the images appear a bit slowly (we're processing a flood of live Flickr results in the background), like the fading in of the Cheshire Cat.

The Grand Challenge requires participants to present their idea in exactly three minutes in a presentation that addresses the following points:
  • What is your solution? Which challenge does it address? How does it address the challenge?
  • Does your solution work? Is there evidence that it works?
  • Can you demo the system?
  • Is the solution generalizable to other problems? What are the limits of your approach?
  • Can other people reproduce your results? How?
  • Did the audience and the jury understand and ENJOY your presentation?
We used the three minutes to cover these points in a dialogue between Alice and Christoph Kofler (CK), first author on the Grand Challenge paper:

Kofler, C., Larson, M., Hanjalic, A. Alice's Worlds of Wonder: Exploiting Tags to Understand Images in terms of Size and Scale. ACM Multimedia 2011, Grand Challenge paper.

During the dialogue we demonstrated the system running live. (We knew it was a risk to run a live demo, but luck was with us and the wireless network held up.) For readers who want a concrete picture of the ideas discussed in the dialogue, a small code sketch follows it.

Alice's Worlds of Wonder: Three Minute Dialogue

(showing a rather standard opening slide)
CK: Alice, look at them out there, their image search experience is dry and boring.

Alice: We should show them our answer to the Yahoo! Image Challenge on Novel Image Understanding.

(showing system interface)
CK: The Wonderlands system runs on top of Flickr and sorts search results for the user at search time.

(dialogue during live demo)
Alice: Let’s show them how it works. Do we trust the wireless network?
CK: Yes. We need a Flickr query.
Alice: Let’s do “car”
CK: The Wonderlands system presents the user with the choice to enter “Alice’s Small World” or “Alice’s Big World”
Alice: Let’s choose Small World.

Alice (to audience): If you know me from Alice in Wonderland, you know that in the story I shrink to become very small. This is the metaphor underlying the Small World of the Wonderlands system. It shrinks you, too, as a Flickr user, by putting you eye-to-eye with small objects pictured in small environments with limited range. You get the impression you have the perspective of a small being viewing the world from down low.

Alice (to CK): Let’s choose Big World now. In the book, I also grow to be very big. The Big World makes you grow again. Objects are large and the perspective is broad.

You can imagine cases in which you were looking for person-sized cars --- here, the Big World would help you focus your search on the images that you really want.

CK: Should we explain how it works?

Alice: Yes.

CK: (Displays "Implicit Physics of Language" slide) We exploit a combination of user tags and the implicit physics of language.

Alice: Exactly.

Alice: Basically, your search engine knows something about the physics of the real world because it indexes large amounts of human language.

Certain queries give you the real-world size of objects: “the flower in her hand” returns a large number of results, so you can infer that a flower is small.

CK: Oh yes! And “the factory in her hand” returns no results so you know a factory is large.

Alice: Basically, the search engine is telling us that a girl holding a flower in her hand is a common situation, but that her holding a factory is not. We get this effect because physics dictates that something commonly held in a human hand must be small.

CK: (Displays the entry window with the two doors) The sorting algorithm is straightforward. Alice’s Small World contains images whose tags tend to designate smaller objects and Alice’s Big World contains images whose tags tend to designate larger objects.

Alice: Exactly.

CK: So Alice, the system takes a fanciful and engaging perspective. But in order to carry out a quantitative evaluation, we can look at it in terms of scale. We achieve a weighted precision nearly three times random chance.
(Flashes up under the two doors: "Evaluation on 1,633 Flickr images from the MIRFLICKR data set. 0.773 weighted precision")

Alice: So the scale numbers point to the conclusion that we are creating a genuine two-worlds experience for users.

CK: Right. But, Alice, do we need to stop at two worlds: big and small? Are there other worlds out there?

Alice: Well, Christoph, effectively the only limit is the speed at which we can query Flickr and Yahoo!. You know that the implicit physics of language works because of general physical principles. So, in theory, there are as many different worlds as there are interesting physical properties.

CK: But being Alice, you like the small and the big worlds, right?

Alice: Yes, I do. Shall we try another query?

CK: (Displays final slide) Or we can just tell them where to download the system. You know, the code's online.

Alice: Yes, let them try it out! No more dry and boring image search for this group...(TIME UP!!)
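
Postscript for the technically curious: here is a minimal Python sketch of the "implicit physics of language" idea and the two-world sorting from the dialogue above. The web_hit_count function is a stand-in for whatever search API you use to count results for a phrase, the threshold is an arbitrary placeholder, and the majority vote over tags is my simplification of the scoring in the Grand Challenge paper.

def looks_small(tag, web_hit_count, threshold=1000):
    """Implicit physics of language: an object commonly reported as held
    in a hand ("the flower in her hand") must be small, while a phrase
    with (almost) no hits ("the factory in her hand") suggests a large
    object. web_hit_count is a hypothetical hit-count function."""
    return web_hit_count(f'"the {tag} in her hand"') >= threshold

def sort_into_worlds(tagged_images, web_hit_count):
    """Route each image to Alice's Small World or Big World according to
    whether its tags tend to designate smaller or larger objects."""
    small_world, big_world = [], []
    for image_id, tags in tagged_images.items():
        if not tags:
            continue  # skip untagged images
        small_votes = sum(looks_small(tag, web_hit_count) for tag in tags)
        if small_votes >= len(tags) / 2:
            small_world.append(image_id)
        else:
            big_world.append(image_id)
    return small_world, big_world

Swapping in other probe phrases would, in the spirit of Alice's closing remark, build further worlds: in theory, one for every interesting physical property that the implicit physics of language can reach.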