
Saturday, March 2, 2013

Visual Relatedness is in the Eye of the Beholder: Remember Paris


How do we know whether a tag is related to the visual content of an image? In this blog post, I am going to argue that in order to answer that question, it is first necessary to decide who "we" is. In other words, we must first define the person or persons who are judging visual relatedness, and only then ask whether the tag is related to the visual content of the image.

I'll start out by remarking that an alternate way of approaching the issue is to get rid of the human judge altogether. For example, this paper:

Aixin Sun and Sourav S. Bhowmick. 2009. Image tag clarity: in search of visual-representative tags for social images. In Proceedings of the First SIGMM Workshop on Social Media (WSM '09). ACM, New York, NY, USA, 19-26.

provides us with a clearly defined notion of the "visual representativeness" of tags. A tag is considered visually representative if it describes the visual content of a photo: "sunset" and "beach" are visually representative, while "Asia" and "2008" may not be. Operationally, a tag is visually representative if it is associated with images whose visual representations diverge from that of the overall collection. The model in the paper uses a visual clarity score: the Kullback-Leibler divergence between language models built on bag-of-visual-words representations.
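To make the idea concrete, here is a minimal sketch of how such a clarity score might be computed. This is my own illustration, not the authors' code; the Jelinek-Mercer smoothing weight and the toy visual-word inputs are assumptions.

```python
import math
from collections import Counter

def clarity_score(tag_words, collection_words, lam=0.99):
    """Visual clarity of a tag, in the spirit of Sun & Bhowmick (2009):
    KL divergence between a language model estimated from the visual words
    of the tag's images and a model of the whole collection. Inputs are
    lists of visual-word IDs from bag-of-visual-words quantization."""
    coll_counts = Counter(collection_words)
    n_coll = sum(coll_counts.values())
    tag_counts = Counter(tag_words)
    n_tag = sum(tag_counts.values())

    score = 0.0
    for w, count in tag_counts.items():
        pw_coll = coll_counts[w] / n_coll
        if pw_coll == 0:          # tag images should be part of the collection
            continue
        # Jelinek-Mercer smoothing of the tag model with the collection model
        pw_tag = lam * (count / n_tag) + (1 - lam) * pw_coll
        score += pw_tag * math.log(pw_tag / pw_coll)
    return score  # higher = the tag's images diverge more from the collection

# A focused tag reuses a few visual words; a diffuse tag looks like everything
collection = list(range(1000)) * 2
print(clarity_score([1, 1, 2, 2, 3], collection))   # high clarity ("sunset")
print(clarity_score(list(range(500)), collection))  # low clarity ("2008")
```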

Why don't we like this alternative? Well, this definition of visual representativeness does not necessarily reflect visual representativeness as perceived by humans. It is not clear that we are really helping ourselves build multimedia systems that serve human users if we simplify the problem by getting rid of the human judge.

The issue is the following: Humans have no problem confirming that an image depicting a pagoda at sunset and an image depicting a busy intersection with digital billboards both depict "Asia".  There is something about the visual content of these two images that is representative of "Asia", and it seems to be a simple leap from there to conclude that the tag "Asia" is related to the visual content of these images.

But there was a time in my life when I didn't know what a pagoda was. It was less long ago than one might think (although certainly before the workshop at which the paper above was presented, held at ACM Multimedia 2009 in Beijing), which prompts me to think further.

A solution might be the following: we could stipulate that in my pre-pagoda-awareness years, I should have been excluded from the set of people who get to judge whether photos are related to Asia. But we would then have to worry about my familiarity with digital billboards, and then the next Asia indicator, and so on, until I and everyone I know are excluded from the set of people who get to judge the visual relatedness of photos to tags. In short, this solution does not lead to a clearer definition of how we can know that a tag relates to the visual content of an image.

Why do things get so complicated? The problem, I argue, is that we ask the question of a pair: "For this image and this tag (i, t), is the visual content of the image related to this tag?" This question does not lead to a well-defined answer.

The answer is, however, well defined if we ask the question of a triple: "For this image, this tag and this person or group of people (i,t,P): is the visual content of the image related to this tag in the judgement of this person or group of people?" In other words, we need to look for the relationship between tags and the visually depicted content of images in the eye of the beholder.

We can then perform a little computational experiment: put a person or group of people P in a room, expose them to the visual content of image i, and ask the yes/no question "Is tag t related to image i?"
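As a sketch, the experiment amounts to collecting one binary judgement per (i, t, P) triple. The Judge class and the toy panel below are hypothetical stand-ins for real human subjects:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Judge:
    """One person P, reduced to a name and a judgement strategy.
    The strategy stands in for whatever reasoning P actually uses."""
    name: str
    is_related: Callable[[str, str], bool]  # (image_id, tag) -> yes/no

def run_experiment(image_id: str, tag: str, panel: list[Judge]) -> dict:
    """Collect one answer per (i, t, P) triple."""
    return {(image_id, tag, judge.name): judge.is_related(image_id, tag)
            for judge in panel}

# Example: two judges who disagree about "paris" for the same image.
panel = [Judge("P1", lambda i, t: True),   # took the picture, remembers it
         Judge("P3", lambda i, t: False)]  # has only heard of such a picture
print(run_experiment("img_042", "paris", panel))
```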

The answer of P is going to depend on the method that P uses to reason from the visual content of i to the relatedness of tag t. Here's a list of different Ps who are able to identify Paris, each for a different reason.

(i, "paris" P1): I took the picture and when I see it, I remember it.
(i, "paris" P2): I was there when the picture was taken and when I see it, I remember this moment.
(i, "paris" P3): Someone told me about a picture that was taken in Paris and there is something that I see in this picture that tells me that this must be it.
(i, "paris" P4): I know of another picture that looks just like this one and it was labeled Paris.
(i, "paris" P5): I've seen other pictures like this an recognize it (the specific buildings that appear).
(i, "paris" P6): I've been there and recognize characteristics of the place (the type of architecture).
(i, "paris" P7): I am a multimedia forensic expert and have established a chain of logic that identifies the place as Paris.

Perhaps even more are possible. What is clear is the following: it would have been convenient to end up with just two Ps, expert annotators and non-expert annotators. Instead, it looks like what we have are judgements based on quite a few differences in personal history, previous exposure, world knowledge, and expertise.

If we want to develop truly useful algorithms that validate the match between visual content and tag, we have a lot more work to do in order to cover all the (i, t, P) triples.

The key is to get a chance to question enough Ps. Multimedia research needs the Crowd.

Saturday, August 13, 2011

Subjectivity vs. Objectivity in Multimedia Indexing

In the field of multimedia, we spend so much time in discussions about semantic annotations (such as tags, or concept labels used for automatic concept detection) and whether they are objective or subjective. Usually the discourse runs along the lines of "Objective metadata is worth our effort, subjective metadata is too personal to either predict or be useful." Somehow the underlying assumption in these discussions is that we all have access to an a priori understanding of the distinction between "subjective" and "objective" and that this distinction is of some specific relevance to our field of research.

My position is that, as engineers building multimedia search engines, if we want to distinguish between subjective and objective, we should do so using a model. We should avoid listening to our individual gut feelings on the issue (or wasting time talking about them). Instead, we should adopt the more modern notion of "human computational relevance" which, since the rise of crowdsourcing, has come within conceivable reach.

The underlying model is simple: given a definition of a demographic that can be used to select a set of human subjects, and a definition of a functional context in the real world inhabited by those subjects, the level of subjectivity or objectivity of an individual label is defined as the percentage of human subjects who would say "yes, that label belongs with that multimedia item". The model can be visualized as follows:

Fig. 1: The relevance of a tag to an object is defined as the proportion of human subjects (pictured as circles), drawn from a well-defined demographic and situated within a real-world functional context, who agree on the tag. I claim that this is the only notion of the objective/subjective distinction relevant to our work in developing multimedia search engines.

Under this view of the world, the distinction between subjective and objective reduces to inter-annotator agreement under controlled conditions. I maintain that the level of inter-annotator agreement will also reflect the usefulness the tag will have when deployed within a multimedia search engine designed for use within the domain defined by the functional context and by the people in the demographic. If we want to assimilate personalized multimedia search into this picture, we can define it within a functional context for a demographic consisting of only one person.
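Here is a minimal sketch of the model, assuming the yes/no judgements have already been collected from a pool screened for the demographic and functional context; the 0.8 agreement cutoff is my own arbitrary illustration, not part of the model.

```python
def relevance(judgements: list[bool]) -> float:
    """Relevance of a label to a multimedia item: the proportion of subjects,
    drawn from a fixed demographic and functional context, who said 'yes'."""
    return sum(judgements) / len(judgements)

def is_objective(judgements: list[bool], threshold: float = 0.8) -> bool:
    """Under this model, 'objective' just means high inter-annotator
    agreement (in either direction); the 0.8 cutoff is illustrative."""
    p = relevance(judgements)
    return max(p, 1 - p) >= threshold

# e.g. "sunset" on a beach photo vs. "beautiful" on the same photo
print(relevance([True] * 9 + [False]))         # 0.9 -> fairly objective
print(is_objective([True] * 5 + [False] * 5))  # False -> subjective
```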

This model reduces the subjective/objective difference to an estimate of the utility of a particular annotation within the system. The discussions we should be spending our time on are the ones about how to tackle the daunting task of implementing this model so as to generate reliable estimates of human computational relevance.

As mentioned above, the model is intended to be implemented on a crowdsourcing platform that will produce an estimate of the relevance of each label for each multimedia item. I am as deeply involved as I am with crowdsourcing HIT design because I am trying to find a principled manner of constraining worker pools with regard to demographic specifications and with regard to the specification of a real-world function for multimedia objects. At the same time, we need useful estimators of the extent to which the worker pool deviates from these idealized conditions.

These are daunting tasks and will, without doubt, require well-motivated simplifications of the model. It should be clear that I don't claim the model suddenly makes things 'easy'. However, it is clearly a more principled manner of moving forward than debating the subjectivity vs. objectivity difference.


Sunday, May 29, 2011

Tagging Love and Affection: Part II

Perhaps the even more interesting thing about the wedding, in terms of modern media, was the interaction between the professional photographers and the wedding guests who were taking pictures. These seemed like two completely different activities in terms of the results they were aiming to produce. The photographers created an amazing album of storybook moments -- and the guests -- well, speaking for myself at least -- took pictures of people as people they knew.

The professional photos actually included shots of members of the bridal party and guests photographing the bride and groom and each other. There was a particularly dramatic one of the best man, shot from behind, taking a picture of the groom; you can see the groom twice: once over the shoulder of the best man in the display of his cell phone, and once sitting as the main subject of the image.

There is another one in which one of the bridesmaids is taking a picture of the newlyweds. It's like an act of greeting: 'I was there with you in your moment of bliss and I was so, so happy for you.' Taking a picture is like smiling, waving, or winking at someone -- except that it is asynchronous, delayed in time. From the past, a shout out: "Congratulations!"

I was struck by how the professional photographers were able to use the act of photo-taking as a way to depict the love and affection between friends and family members. It's not just the places that we tag with our photos, as I've discussed in a previous blog post, but the people, too.

And this is where the difference between the professional photographers and the guests really became apparent. I was using my mom's little camera, so my pictures of the wedding are, qualitatively speaking, not very good. It would probably take some improvement of my photographic skills, and not just a better camera, to get high-quality pictures.

But there was something about them that struck us. I naturally looked for the people in our family whom we see infrequently, and took pictures of people talking who only get to see each other once every several years -- if at all. I took pictures of people holding the youngest baby in our extended family -- pictures that capture the moment when generations within the extended family meet for the first time. I took pictures that showed siblings engrossed in conversations with each other -- showing the intensity of how we speak and how we listen. The photographers didn't know us, and although the wedding pictures were beautiful, aesthetically not to be surpassed, we really love to look at the personal pictures, because they are somehow more "us".

Maybe the "real" pictures are the pictures that we take that mark love and affection. They encode our personalities, our common past and our hope for the future.

The implications for the field of multimedia retrieval are quite large, as revealed by the following line of reasoning: life is finite. We only live so long and can support close relationships with only so many people. If Dunbar is right, it is a very limited number indeed. If we take photos at particular moments, such as the moments of expressed affection I am describing here, then the total number of pictures we take is also limited. If we keep insisting that our multimedia retrieval algorithms must be able to handle millions and millions of photos, we run the danger of missing out on developing important techniques. Multimedia algorithms developed for relatively small numbers of photos can afford to be computationally more complex. If we ignore the "small set" problem, we run the danger of not developing the best possible algorithms for personal multimedia retrieval challenges.

As a final comment, I can add that during the wedding I was already challenged by an image retrieval problem. I took maybe 200 photos. I wanted to show a special photo of my mom -- taken a few minutes earlier -- to the cousins I was sitting with at the dinner table. It took me so long to flip through the index to find that photo on the small screen. Very disruptive to the dinner conversation.

I had the idea that I was the one who should have been wearing the GSR (galvanic skin response) sensor. I am sure that my affective peak was physiologically measurable when I took the picture of my mom, saw it on my camera display for the first time, and realized I had gotten a once-in-a-lifetime shot. If my camera display could take me right to peak pictures, it would be a much more functional device: transcending capture to support storytelling as well.

On second thought, skip the GSR. I'm sure I jumped up and down. The accelerometer in a mobile phone could have picked that up. If all else fails, make a photo-taking app that encourages me to shake the phone when I notice I like a picture. Better stop blogging and start implementing.
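For fun, here is what a first cut might look like: a minimal sketch assuming a hypothetical stream of accelerometer samples paired with photo IDs. The threshold is a guess that would need tuning on a real device.

```python
import math

SHAKE_THRESHOLD = 2.5  # g-force magnitude; tune on a real device

def shake_detected(sample):
    """sample is an (x, y, z) accelerometer reading in g."""
    return math.sqrt(sum(a * a for a in sample)) > SHAKE_THRESHOLD

def flag_favorites(photo_events):
    """photo_events: list of (photo_id, accelerometer_samples) pairs, where
    the samples cover the moments just after the shot was first reviewed.
    Returns the photos the user 'jumped up and down' about."""
    return [photo_id for photo_id, samples in photo_events
            if any(shake_detected(s) for s in samples)]

# A calm review vs. an excited one (hypothetical readings)
events = [("mom_portrait", [(0.1, 0.2, 1.0), (2.4, 1.8, 1.2)]),
          ("table_setting", [(0.0, 0.1, 1.0)])]
print(flag_favorites(events))  # ['mom_portrait']
```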

Sunday, October 24, 2010

MediaEval 2010 Workshop Report

We were delighted that Bill Bowles attended the MediaEval 2010 workshop and that he made us our own MediaEval video trailer, in which he tells the story of MediaEval from his own point of view. The MediaEval 2010 Affect Task was devoted to analyzing Bill's travelogue video from his Travel Project and ranking it by how boring viewers reported it to be. As a filmmaker, he could rationally have reacted with "Who are these people? What did they do to my video? I don't want to get anywhere near them!" But instead, he came, participated, and told us about ourselves using the very same medium we devote so much effort to studying.



I was amazed at how quickly this video accumulated views; it quickly outstripped any video I've ever posted to the Internet. However, if video is not your thing and you want the text version of what happened, here is the text of a workshop report written for a project newsletter.

MediaEval 2010 Workshop Report

The MediaEval 2010 workshop was held on Sunday, October 24, 2010 in Pisa, Italy at Santa Croce in Fossabanda. MediaEval is a benchmarking initiative for multimedia retrieval, focusing on speech, language and contextual aspects of multimedia (geographical and social context) and their combination with visual features. Its central sponsor is the PetaMedia Network of Excellence. In total, four tasks were run during MediaEval 2010. To approach the tasks, participants could make use of spoken, visual, and audio content as well as accompanying metadata. Two 'Tagging Tasks' (a version for professional content and one for Internet video) required participants to automatically predict the tags that humans assign to video content. An 'Affect Task' involved automatic prediction of viewer-reported boredom for travelogue video. Finally, a 'Placing Task' required participants to automatically predict the geo-coordinates of Flickr videos. The Placing Task was co-organized by PetaMedia and Glocal. It was also given special mention in the talk of Gerald Friedland entitled "Multimodal Location Estimation" in the "Brave New Ideas" session at ACM Multimedia 2010.

During the MediaEval 2010 workshop, researchers presented and discussed the algorithms developed and the results achieved on the MediaEval 2010 tasks. The workshop drew 29 participants from 3 continents. More information about the 2010 results, including participants' short working notes papers, is available at: http://www.multimediaeval.org/mediaeval2010
Currently, MediaEval 2010 participants are working towards a special session at the ACM International Conference on Multimedia Retrieval (ICMR 2011), which will be dedicated to presenting extended results on MediaEval 2010 tasks.

MediaEval 2011 will be organized again with sponsorship from PetaMedia and in collaboration with other projects from the Media Search Cluster. The task offering in 2011 will be decided on the basis of participants' interest, assessed, as last year, via a survey. At this time, we anticipate that we will run a Tagging Task and a Placing Task, as well as a couple of innovative new tasks, as dictated by popularity. If you are interested in participating in MediaEval 2011, or if your project would like to organize a task, please contact Martha Larson (m.a.larson@tudelft.nl). Additional information on MediaEval 2011 is available on the website: http://www.multimediaeval.org