Wednesday, May 24, 2017

Multimedia Meets Machine (Learning): Understanding images vs. Image Understanding

Today, I gave a talk at Radboud University's Good AIfternoon symposium for Artificial Intelligence students. I covered several papers that I have written with different subsets of my collaborators [1, 2, 3]. The goal was to show students the difference between the way humans understand images and the type of understanding that can be achieved by computers applying visual content analysis, particularly concept detection.

Human Understanding of Images
Consider the images below from [1]. The concept detection paradigm claims success if a computer algorithm can identify these images as depicting a woman wearing a turquoise blue sundress with water in the background. For bonus points, in one image the woman is wearing sunglasses.
A person looking at these images would not say that such a concept-based description of the images is wrong. In fact, if a person is presented with these pictures out of context and asked what they depict, "A woman wearing a blue sundress at the beach" would be an unsurprising response.

However, this response falls short of really characterizing the photos from the perspective of a human viewer. This shortcoming becomes clear when we consider contexts of use. For example, if we needed to choose one of the two as a photo for selling a turquoise blue dress in a web shop, the right-hand photo is clearly the photo we want. The left-hand photo is clearly unsuited for the job. Concept-based descriptions of these images fail to fully capture user perspectives on images. Upon reflection, a person looking at these images would conclude that the concept-based description is not wrong per se, but that it seriously misses the point of the image.

An often-heard argument is that you need to start somewhere and that concept-based description is a good place to start. However, we need to keep in mind that this starting point represents a built-in limitation on the ability of systems that use automatic image understanding (such as image retrieval systems) to serve users.
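To make this limitation concrete, here is a minimal sketch of a concept-based image index. The file names and concept labels are hypothetical, chosen by me for illustration; they are not the output of any real concept detector, and the sketch leaves out the sunglasses. It only shows that when two photos are described by the same concepts, a retrieval system built on those concepts has nothing to tell them apart with.

```python
# Hypothetical concept annotations for the two photos. The file names and
# labels are illustrative assumptions, not the output of a real detector.
concept_index = {
    "left_photo.jpg": {"woman", "turquoise dress", "water"},
    "right_photo.jpg": {"woman", "turquoise dress", "water"},
}


def search(index, query_concepts):
    """Return the images whose concept set contains every query concept."""
    return sorted(
        name for name, concepts in index.items() if query_concepts <= concepts
    )


# A web shop editor looking for a product photo of the dress.
results = search(concept_index, {"turquoise dress"})
print(results)

# The index sees the two photos as identical, so it cannot say which one
# is suited to selling the dress.
same_description = (
    concept_index["left_photo.jpg"] == concept_index["right_photo.jpg"]
)
print(same_description)
```

Both photos come back for the query, and the index offers no basis for preferring one over the other. Whatever makes the right-hand photo the better choice for the web shop lies outside the concept vocabulary.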

Think of it this way. Indexing images with a preset set of concepts is a bit like those parking garages that paint each floor a different color. If you remember the color, that color is effective at allowing you to find your car. However, the relationship between the color and your car is one of convenience. The parking-garage-floor color is an essential property of your car when you are looking for it in the garage, but outside of the garage, you wouldn't consider it an important property of your car at all.

In short, automatic image understanding underestimates the uniqueness of these images, although this uniqueness is of the essence for a human viewer.

Machine Image Understanding
Consider the images below from [4]. A human viewer would see these as two different images.
If the geo-location of the right-hand image is known, geo-location estimation algorithms [3] can correctly predict the geo-location of the left-hand image. In this case, a machine learning algorithm "understands" something about an image that is not particularly evident to a casual human viewer. Humans are largely unaware that the geo-location of their images is "obvious" to a computer algorithm that has access to other images known to have been taken at the same place.
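The basic idea of predicting a location from other images with known locations can be sketched in a few lines. Note that this is a plain nearest-neighbor lookup, not the geo-distinctive visual element matching of [3]. The feature vectors and coordinates are made-up values for illustration, and the function names are my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def estimate_location(query_features, reference_images):
    """Return the location of the visually most similar reference image."""
    best = max(
        reference_images,
        key=lambda ref: cosine_similarity(query_features, ref["features"]),
    )
    return best["location"]


# Reference images with known geo-locations (latitude, longitude).
# Features and coordinates are hypothetical, for illustration only.
reference_images = [
    {"features": [0.9, 0.1, 0.3], "location": (52.3702, 4.8952)},
    {"features": [0.1, 0.8, 0.5], "location": (48.8566, 2.3522)},
]

# Visual features of a photo whose location is unknown.
query_features = [0.85, 0.15, 0.35]

predicted = estimate_location(query_features, reference_images)
print(predicted)
```

The query photo inherits the coordinates of the reference image it most resembles. A person comparing the two photos might see two different pictures, while the algorithm sees enough shared visual evidence to place them at the same spot.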

In short, human understanding of images overestimates the uniqueness of these images, and visual content analysis algorithms understand more than people realize that they do.

Moving Forward
Given the current state of the art in visual content analysis, "Multimedia Meets Machine" is perhaps a bit outdated, and we should be thinking in terms of titles like "Multimedia Has Already Met Machine".

The key question moving forward is whether machine understanding of images supports the people who take and use those images, or whether it provides a little convenience at the larger cost of personal privacy.

[1] Michael Riegler, Martha Larson, Mathias Lux, and Christoph Kofler. 2014. How 'How' Reflects What's What: Content-based Exploitation of How Users Frame Social Images. In Proceedings of the 22nd ACM international conference on Multimedia (MM '14). 

[2] Martha Larson, Christoph Kofler, and Alan Hanjalic. 2011. Reading between the tags to predict real-world size-class for visually depicted objects in images. In Proceedings of the 19th ACM international conference on Multimedia (MM '11).

[3] Xinchao Li, Alan Hanjalic, and Martha Larson. Geo-distinctive Visual Element Matching for Location Estimation of Images. Under review.

[4] Jaeyoung Choi, Claudia Hauff, Olivier Van Laere and Bart Thomee. 2015. The Placing Task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop.