Saturday, March 30, 2013

Multimedia Readymades

This blog post is a continuation of and a response to Cynthia Liem's call to collect "hidden gems" of internet multimedia, made during her TedX Delft talk "Every bit of it". Cynthia says that original bits of multimedia that we share on the Internet have the feel of "randomness" or of "low quality", but that what they really are is "entry points" to things of value, if we can discover them and polish them. In this blog post I relate discovered bits of multimedia to the surrealist concept of "Readymades".

An art exhibition that I visited some 25 years ago in Frankfurt was probably what set me first to thinking about hidden gems and the process by which value is assigned to what would conventionally be considered mundane.

My most recent hidden gem is a video that captures the comments of people on something that has been created not by an act of coming-into-being, but by an act of putting-into-context.

From one perspective, the video is about Fountain, which is one of the Readymades of Marcel Duchamp. Wikipedia gives us the definition of the Readymade from André Breton and Paul Éluard's Dictionnaire abrégé du Surréalisme: "an ordinary object elevated to the dignity of a work of art by the mere choice of an artist." 

The Fountain is a urinal that has been taken out of context in two ways: first, it lies flat rather than hanging on a wall in the position it would need to be in order to be used, and, second, it was submitted by Marcel Duchamp in 1917 as a work of art to an art exhibition.

One could quip, "Beauty lies in the eye of the beholder," and let it go at that. However, scratch a little deeper, look beyond the shock value of the object being a urinal, and a different message becomes clear: The artifact itself has little importance, and merely provides a trigger for the process of Fountain becoming something interesting and worthwhile.

Marcel Duchamp 'FOUNTAIN' - IS IT ART? from arlen figgis on Vimeo.

From another perspective, the video is about what happens when you take a widely-recognized work of art and display it to people out of context. In the video, the urinal is put into a public toilet in Liverpool and people, who recognize the urinal as Fountain, are asked to comment on it.
Of all the people in the video, only two comment on the fact that they are actually standing in a toilet looking at a toilet. The relative inattention to this point suggests that the fact that Fountain is finally in its "home" surroundings does not have much impact on people's opinions of it. No one in the video is looking at the urinal and having anything at all related to a "The-Emperor-has-no-clothes" moment. Instead, they seem to have the same thoughts and feelings as they did when they walked into the toilet: they are interested, and they acknowledge it as something worthwhile.

I consider this video to be rather unfinished (yet a "diamond in the rough") because it has the power to make a point that it does not explicitly make. The un-made point is this: it is non-trivial to undo the Fountain effect, i.e., the urinal stubbornly resists returning to un-interestingness. Once a gem has emerged from a mundane object, the gem-status is difficult to shake.

These two perspectives on what this video is about represent two poles of a spectrum of interactions that we (as users, uploaders and viewers) have with bits of online multimedia that people capture and upload to the internet.

In some moments, we amuse, educate and otherwise occupy ourselves through layers of rediscovery, and the popularity, quality, origin, and perhaps even topic of the original multimedia object could not be less relevant.

In other moments, we forget about rediscovering hidden gems and return to what is widely acknowledged to be interesting or worthwhile: we cling to The Canon or flock to watch the blockbusters.

Our goal as multimedia information retrieval researchers should be to develop techniques that support this entire spectrum.

In Frankfurt 25 years ago, my encounter with Readymades revealed to me that it is not particularly useful to assume that there is a sharp boundary between creation and discovery. It made clear that significance arises through a dialogue: an interplay between a large number of voices representing the general public and a more limited number of voices recognized as authorities.

These poles exist now, they existed 25 years ago (pre-Internet), and they existed in 1917; from there it seems a relatively straightforward extension to the idea that they were there before and that their importance will continue.

It is multimedia systems that give us access to the hits and classics, but that also allow us to discover the triggers ("entry points" in Cynthia's words) from which interesting and worthwhile objects emerge, and that support the interplay of contributions and opinions necessary for emergence.

It's a difficult problem. I rather think that the number of people that share my understanding of my hidden gem is quite limited, and that we would need a retrieval or recommendation system that would reach all of them in order to really add any polish. Or maybe it is enough for the gem to be personal, although I am far from sure that it is easy to ensure this video is still discoverable in future years when I again return to the topic of the emergent value of the mundane.

Friday, March 29, 2013

Multimedia Bits

Cynthia Liem recently gave a TedX Delft talk called "Every bit of it". I have the pleasure and the honor of being her colleague at the Delft Multimedia Information Retrieval Lab. I missed being at the talk in person, but finally found time today to watch it on YouTube.

What she is saying is so critical (it is at the same time both fresh and timeless) that the talk deserves a more in-depth reaction than a tweet or re-tweet. This post summarizes the message that I heard in Cynthia's talk.

Cynthia makes the point that every bit on the Internet has meaning to the original person who put it there. These are, of course, multimedia bits that she is speaking of: videos and images that have been captured by people and shared on the Internet. An important point in this era of Big Data: each of these "bits" of multimedia is captured by someone for some reason. Cynthia tells us that we can add worth to these bits by enhancing them with other bits; in particular, she shows us wondrous transformations that can be brought about by adding music to video.

Towards the end of the video, she says that a big question that we face with multimedia is "What is relevant?"

She points out that we tend to focus on the obvious in our understanding of relevance, e.g., popularity and quality, and that because of this focus, we miss material that people do not know exists.

Instead, the "bits" of multimedia on the Internet should be seen as an entrance to the world, not as the final product, but a place to start. A diamond in the rough that needs to be polished in order to add value.

At the end of the video, she asks the audience to keep their eyes open for a rediscovered hidden gem. I discuss my gem in my next post.

Saturday, March 2, 2013

Visual Relatedness is in the Eye of the Beholder: Remember Paris


How do we know if a tag is related to the visual content of an image? In this blogpost, I am going to argue that in order to answer that question, it is first necessary to decide who "we" is. In other words, it is necessary to first define who is the person or persons who are judging visual relatedness, and only then ask the question: is this tag related to the visual content of the image?

I'll start out by remarking that an alternate way of approaching the issue is to get rid of the human judge altogether. For example, this paper:

Aixin Sun and Sourav S. Bhowmick. 2009. Image tag clarity: in search of visual-representative tags for social images. In Proceedings of the first SIGMM workshop on Social media (WSM '09). ACM, New York, NY, USA, 19-26.

provides us with a clearly-defined notion of the "visual representativeness" of tags. A tag is considered to be visually representative if it describes the visual content of a photo. "Sunset" and "beach" are visually representative and "Asia" and "2008" may not be. Under the paper's model, a tag is visually representative if it is associated with images whose visual representations diverge from that of the overall collection. The model uses a visual clarity score, which is the Kullback-Leibler divergence of language models based on visual-bag-of-words representations.
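To make the idea of a clarity score concrete, here is a minimal sketch in Python. It is not the paper's implementation: the smoothing, the toy "visual words", and the function names (language_model, kl_divergence, tag_clarity) are all my own assumptions, chosen only to illustrate how a Kullback-Leibler divergence between a tag's language model and the collection's language model could be computed.

```python
import math
from collections import Counter


def language_model(images, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over visual words.

    `images` is a list of visual-word lists. The smoothing scheme is an
    assumption for this sketch, not taken from the paper.
    """
    counts = Counter(word for image in images for word in image)
    total = sum(counts.values()) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits for two distributions over the same vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)


def tag_clarity(tag, photos, vocab):
    """Divergence of the tag's visual-word model from the collection model."""
    tagged = [words for words, tags in photos if tag in tags]
    if not tagged:
        raise ValueError(f"no photos carry the tag {tag!r}")
    collection = [words for words, _ in photos]
    return kl_divergence(language_model(tagged, vocab),
                         language_model(collection, vocab))


# Toy collection: each photo is (visual words, tags). Entirely made up.
photos = [
    (["orange", "orange", "horizon"], {"sunset", "2008"}),
    (["orange", "horizon", "horizon"], {"sunset"}),
    (["street", "neon", "neon"], {"2008"}),
    (["street", "street", "tree"], {"2008"}),
    (["tree", "tree", "neon"], set()),
]
vocab = sorted({word for words, _ in photos for word in words})

sunset_clarity = tag_clarity("sunset", photos, vocab)
year_clarity = tag_clarity("2008", photos, vocab)
```

In this toy collection, "sunset" photos concentrate on a few visual words while "2008" photos look much like the collection as a whole, so "sunset" gets the higher score.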

Why don't we like this alternative? Well, this definition of visual representativeness does not reflect visual representativeness as perceived by humans. It's not clear that we really are helping ourselves build multimedia systems that serve human users if we make things less complicated by getting rid of the human judge.

The issue is the following: Humans have no problem confirming that an image depicting a pagoda at sunset and an image depicting a busy intersection with digital billboards both depict "Asia".  There is something about the visual content of these two images that is representative of "Asia", and it seems to be a simple leap from there to conclude that the tag "Asia" is related to the visual content of these images.

But there was a time in my life when I didn't know what a pagoda was. It was less long ago than one may think (although certainly before the workshop at which the paper above was presented, held at ACM Multimedia 2009 in Beijing), which prompts me to think further.

A solution might be the following: We could stipulate that in my pre-pagoda-awareness years, I should have been excluded from the set of people who get to judge if photos are related to Asia. But then we would have to worry about my familiarity with digital billboards, and then the next Asia indicator, and on and on until I and everyone that I know is excluded from the set of people who get to judge the visual relatedness of photos to tags. In short, this solution does not lead to a clearer definition of how we can know that a tag relates to the visual content of an image.

Why do things get so complicated? The problem, I argue, is that we ask the question of a pair: "For this image and this tag (i,t) is the visual content of the image related to this tag?"  This question does not lead to a well-defined answer.

The answer is, however, well defined if we ask the question of a triple: "For this image, this tag and this person or group of people (i,t,P): is the visual content of the image related to this tag in the judgement of this person or group of people?" In other words, we need to look for the relationship between tags and the visually depicted content of images in the eye of the beholder.

We can then perform a little computational experiment: Put person or people P in a room and expose them to the visual content of image i and ask the yes/no question "Is tag t related to image i?"
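The experiment can be sketched as a small data structure in which every answer is recorded against the full triple rather than against the image-tag pair alone. This is only an illustration: the Judgement class, the helper functions, and the example answers below are hypothetical names and made-up data, not results from any real study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One yes/no answer to "Is tag t related to image i?" given by judge P."""
    image: str
    tag: str
    judge: str
    related: bool


def answers_by_pair(judgements):
    """Group answers by (image, tag), keeping each judge's answer separate."""
    table = defaultdict(dict)
    for j in judgements:
        table[(j.image, j.tag)][j.judge] = j.related
    return dict(table)


def fraction_yes(answers):
    """Share of judges who said the tag is related to the image."""
    return sum(answers.values()) / len(answers)


# Made-up answers for one image and one tag from three different judges.
judgements = [
    Judgement("img_001", "paris", "photographer", True),
    Judgement("img_001", "paris", "frequent_visitor", True),
    Judgement("img_001", "paris", "never_been_there", False),
]

table = answers_by_pair(judgements)
paris_answers = table[("img_001", "paris")]
paris_yes = fraction_yes(paris_answers)
```

Keeping the judge in the key means that a disagreement between judges stays visible as data instead of being averaged away before anyone has looked at who answered what.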

The answer of P is going to depend on the method that P uses in order to reason from the visual content of i to the relatedness of tag t. Here's a list of different Ps who are able to identify Paris for different reasons.

(i, "paris", P1): I took the picture and when I see it, I remember it.
(i, "paris", P2): I was there when the picture was taken and when I see it, I remember this moment.
(i, "paris", P3): Someone told me about a picture that was taken in Paris and there is something that I see in this picture that tells me that this must be it.
(i, "paris", P4): I know of another picture that looks just like this one and it was labeled Paris.
(i, "paris", P5): I've seen other pictures like this and recognize it (the specific buildings that appear).
(i, "paris", P6): I've been there and recognize characteristics of the place (the type of architecture).
(i, "paris", P7): I am a multimedia forensic expert and have established a chain of logic that identifies the place as Paris.

Perhaps even more are possible. What is clear is the following: It would have been nice if we had ended up with two Ps: expert annotators and non-expert annotators. However, it looks like what we have is judgements that are based on quite a few differences in personal history, previous exposure, world knowledge, and expertise.

If we want to develop truly useful algorithms that validate the match between the visual content and the tag, we have a lot more work to do, in order to cover all the (i,t,P).

The key is to get a chance to question enough Ps. Multimedia research needs the Crowd.