Tuesday, September 28, 2010

Where's Wikipedia?

The ACM Multimedia Grand Challenge is a high-adrenaline event where researchers from the Multimedia community compete against each other to develop the best solutions to problems posed by industry. For example, Google formulated two challenges, Video Genre Classification and Personal Diaries, in this year's competition.

Today in Tokyo at Interspeech 2010, I stopped to chat with last year's Grand Challenge winner, who is competing once again this year. I was struck anew by the realization that in the pressure cooker of the Grand Challenge, creativity, raw intelligence, technical competence, competitive drive and off-beat thinking give rise to lines of attack that might never have emerged in a traditional R&D setting. Such solutions stand to benefit us all.

But should industry really be the only one formulating the challenges for such competitions? Where, for example, is Wikipedia? If any major player in the Internet information arena deserves a crowd-sourced solution from the research community, it is Wikipedia, the knowledge resource homegrown by collaborative effort.

Wikipedia does truly inspire the research community. Very recently I've witnessed up close how fired up scientists get about Wikipedia. The Tribler team, who sit on the ninth floor of our building, have been sinking unbelievable time and effort into the development of Swarmplayer V2.0. Their dedication is inspiring, and their belief in the power of a distributed solution for video on Wikipedia is infectious.

Datasets from Wikipedia have been used by multiple benchmarking initiatives, such as ImageCLEF and INEX, as well as by MediaEval, the benchmark I co-ordinate. We certainly enjoyed coming up with our own Wikipedia-related task. However, it would be great to hear directly from the Wikimedia Foundation, in the form of a Grand Challenge, which problems they see on the horizon in the next 2-5 years where the research community could help generate solutions. A Grand Challenge takes the form of a simple textual description of the problem, and researchers do the rest, presenting their solution as a system or system demo together with a paper describing it.

There's a lot out there, of course, that I don't know about. For example, I just read a post on the ECML PKDD 2010 Data Challenge: Measuring Web Data Quality. But I've never seen a clear Challenge originating from the Wikipedia community and published for the research community.

One aspect that researchers need to think seriously about, however, is the form in which solutions for Wikipedia or developed using Wikipedia data are published. ACM Multimedia Proceedings are not an open access publication. It's a contradiction to carry out research on a free knowledge resource and publish results under conventional copyright. Peer-reviewed open access journals such as the Journal of Digital Information should be preferred when publishing results obtained using Creative Commons licensed data.

Maybe that's one Challenge that the Wikimedia Foundation actually has to offer the research community: challenging us to break the habit of creating solutions in a rush of creative joy and technical muscle, and then publishing them where they cannot be accessed by everyone.

Saturday, September 18, 2010

MediaEval Tagging Task Professional

DIXIT, a Dutch-language journal for speech and language technology, invited me to write a piece on the "Tagging Task Professional", one of the four multimedia indexing and retrieval tasks that the MediaEval benchmarking initiative ran in 2010. I am posting an English version of the text here on my blog. The piece will appear in December, after the MediaEval 2010 workshop in October. (I mention this to explain why the text uses the past tense to describe an event that has not happened yet.)

The workshop will be held in a medieval convent called Santa Croce in Fossabanda, located in Pisa, Italy. The photo here is from Flickr user Marius B, licensed under the Creative Commons license by-nc-sa. I notice that I do well with attribution when I am going to print material (brochures etc.), but I get sloppy with PowerPoint. If this photo is on my blog, I will be able to quickly remind myself that it comes from Marius B in case I want to use it in future presentations.

Many Minds Make Light Work: Bringing Researchers Together to Work towards Automatic Indexing for Cultural Heritage Multimedia Collections

"Medieval", "mediaeval" and "MediaEval" are all pronounced the same. While "medieval" and "mediaeval" are alternate spellings of an adjective describing something from the Middle Ages, "MediaEval" is a benchmarking initiative that brings researchers together to tackle challenging tasks in the area of multimedia indexing and retrieval. In 2010, a group of researchers worked individually on these tasks and then met at the medieval convent of Santa Croce in Fossabanda in Italy. Can a group of MediaEval scientists solve today's challenges of automatically generating metadata for cultural heritage multimedia content?

Cultural heritage content often takes the form of multimedia, in particular audio and video recordings. Cultural heritage collections are often staggering in size. The archive of the Netherlands Institute for Sound and Vision houses a breathtaking 250,000 hours of video content and receives an additional 8,000 hours of content broadcast by national broadcasting companies each year. Material that is stored in such a huge collection but is not adequately annotated is useless, since it cannot be found by people who wish to view, reuse or otherwise study it. Professional archivists have developed a set of techniques for annotating material with metadata when it enters the archive, ensuring that it can later be found. These techniques have stood the test of time and will continue to be critical for finding multimedia content in large archives in the future. The ability to generate high-quality metadata, however, is not enough. Metadata production must also scale, so that incoming material can be annotated at the rate at which it arrives.

Techniques from the area of Speech and Language Technology hold promise for supporting archivists in the generation of archival metadata. Here, we look specifically at the problem of generating subject labels (or "keywords") for television broadcasts. Subject labels are terms drawn from the archive thesaurus. Examples of keywords are Archeology (archeologie), Architecture (architectuur), Chemistry (chemie), Dance (dansen), Film (film), History (geschiedenis), Music (muziek), Paintings (schilderijen), Scientific research (wetenschappelijk onderzoek) and Visual arts (beeldende kunst). Automatic generation of subject labels can help archivists in one of two ways: by providing a list of suggested subject labels for a video, thus narrowing their field of choice, or by automatically generating a best guess for material that would otherwise go unannotated because of the huge volume of incoming video and the time constraints of the archive staff.

Automatic generation of subject labels is accomplished by algorithms that make use of several data sources: production metadata for broadcasts, transcripts of the spoken content of broadcasts produced by automatic speech recognition technologies and analysis of the visual content of the broadcast recording. The algorithms apply statistical techniques including word-counts and co-occurrences and also machine learning methods. Current algorithms are, however, far from perfect and their further improvement requires sustained and concerted effort on the part of research scientists.
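To make "word counts and co-occurrences" a bit more concrete, here is a minimal Python sketch of one simple way such statistics could be used. It is an illustration under assumptions, not the method of any MediaEval participant: it assumes transcripts are already tokenized, and the toy training examples and the function names `train_cooccurrence` and `suggest_labels` are hypothetical. Each label is scored by how often it co-occurred with the transcript's words in broadcasts that archivists had already labeled.

```python
from collections import Counter, defaultdict

def train_cooccurrence(examples):
    """Count word/label co-occurrences in labeled training transcripts.

    examples: list of (transcript_tokens, subject_labels) pairs (hypothetical format).
    """
    word_label = defaultdict(Counter)  # word -> how often it appeared with each label
    word_total = Counter()             # word -> number of training transcripts containing it
    for tokens, labels in examples:
        for word in set(tokens):
            word_total[word] += 1
            for label in labels:
                word_label[word][label] += 1
    return word_label, word_total

def suggest_labels(tokens, word_label, word_total, k=3):
    """Rank thesaurus labels for a new transcript by summed co-occurrence rates."""
    scores = Counter()
    for word in set(tokens):
        if word_total[word] == 0:  # word never seen in training: no evidence
            continue
        for label, n in word_label[word].items():
            # fraction of training transcripts with this word that carried this label
            scores[label] += n / word_total[word]
    return [label for label, _ in scores.most_common(k)]

# Toy training data: Dutch transcript words, English thesaurus labels.
examples = [
    (["schilderij", "museum", "kunstenaar"], ["Paintings", "Visual arts"]),
    (["orkest", "concert", "muziek"], ["Music"]),
    (["opgraving", "museum", "romeins"], ["Archeology"]),
]
word_label, word_total = train_cooccurrence(examples)
print(suggest_labels(["museum", "kunstenaar", "doek"], word_label, word_total))
```

A real system would at least remove stop words, weight words by how informative they are, and combine the transcript evidence with production metadata and visual features, as the paragraph above describes.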

Many researchers are interested in working on the problem of automatically generating subject labels for cultural heritage material. However, in order for a researcher to begin working in this area, a number of problems must be faced.
  1. It is necessary to understand the problem, which requires a general knowledge of how subject labels are produced in the archive and what they are used for
  2. It is necessary to have access to a large amount of example data in order to develop and train algorithms
  3. It is necessary to have access to data sources such as speech recognition transcripts or visual features. In general, it is not possible to generate these resources in a lab that is not already specialized in these areas
  4. It is necessary to understand the work that has previously been carried out in the area in order not to duplicate techniques that have already been tried by other researchers
  5. It is necessary to know how well one's own algorithms compare to the current state of the art.
The purpose of a benchmarking initiative is to address these problems and let researchers concentrate their energy on the hard work and creative thinking that it takes to develop new algorithms for important tasks. MediaEval is one of several benchmarking initiatives that pursue this paradigm. The special topic area addressed by MediaEval is multimedia, with a focus on speech, language and social features and how they can be combined with visual features.

MediaEval promotes research progress in the area of automatically generating subject labels for cultural heritage material by running a "Task" devoted to subject labeling for professional archives. A Task comprises three parts: a description of the problem, a data set and a set of resources that can be used to solve the problem. Packaging the problem as a task gives researchers easy entry to understanding the issue from the perspective of the archive and allows the data to be licensed from the archive in a streamlined manner. Speech recognition transcripts supplied by the University of Twente make it possible for research groups without competence in Dutch-language speech recognition to contribute to developing improved approaches to the task. Information about the other tasks offered can be found on the MediaEval website: http://www.multimediaeval.org/

Researchers approach the tasks by first working to solve them individually. They submit their solutions, which are evaluated by the MediaEval organizing committee. Because all researchers working on the same task have used the same data set, the solutions are directly comparable with each other, and it is possible to see which approaches provide the best performance for the automatic generation of subject labels. Researchers then gather at a workshop to discuss the results, build collaborations and plan approaches for next year. The workshop fosters the friendly competition between sites that is necessary for progress, but it also builds collaboration, encouraging sites to bundle their efforts and to avoid duplicating investigation of approaches that have already been shown to be less fruitful.
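The post doesn't name the evaluation measure used for this task, so the following Python sketch assumes one common choice for ranked label lists: average precision against the archivists' own labels, averaged over broadcasts (mean average precision). It illustrates why a shared data set makes runs directly comparable, since every run is scored against the same reference labels. The function names and the toy run are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked label list against a set of reference labels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked, start=1):
        if label in relevant:
            hits += 1
            total += hits / rank  # precision at each rank where a correct label appears
    return total / len(relevant)

def mean_average_precision(run):
    """run: list of (ranked_labels, reference_labels) pairs, one per broadcast."""
    return sum(average_precision(r, g) for r, g in run) / len(run)

run = [
    (["Music", "Film", "Dance"], {"Music", "Dance"}),  # AP = (1/1 + 2/3) / 2
    (["Film", "Music"], {"Music"}),                    # AP = (1/2) / 1
]
print(round(mean_average_precision(run), 3))  # prints 0.667
```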

The MediaEval 2010 workshop was held in Pisa, Italy in October 2010, directly before ACM Multimedia, a large multimedia conference. It took place in the medieval convent of Santa Croce in Fossabanda, which has been converted into a hotel with seminar facilities. A site so evocative of the beauty and the value of cultural heritage was particularly suited to hosting researchers focused on the issues that will help improve automatic indexing of tomorrow's cultural heritage content.

Tuesday, September 14, 2010

Affordance

"People," continued the taxi driver driving me to the airport in Dublin, "do the strangest things with chocolate." He paused, reflectively, before adding, "I mean in private."

When I didn't immediately respond, he hurried to explain himself. "You know, a Bounty bar?" I did. "I pick the chocolate off of the outside and then eat the inside separately. Do you do that?" As politely as I could, I explained that I didn't like Bounty bars. "What do you do then?" he asked. The best thing I could come up with was Oreo cookies: I twist them open and eat out the middle. "A lot of people do that," I added. This puzzled him, until he brightened. "Oh, I heard about this biscuit in Australia and you bite off two of the corners and you drink your tea right through the biscuit. It has some sort of a cream filling that just melts as you drink. It's supposed to be just lovely." He thought for a moment. "It's Tam Tam or Yam Yam or something like that it's called."

I tried to imagine the Tam Tam or the Yam Yam and what it might look like. I was in Perth for about four days after SIGIR 2008, but didn't remember any cookies like that. "Do you suppose," I asked him, "that people just take the biscuit out of the package and look at it and think, 'oh, I should break off two of the corners and drink my tea through it' or was there one person who invented it and then it quickly spread as an idea throughout Australia?"

His response surprised me: he laughed! Then, "It's like the comedian," he pronounced. And then he filled me in: there is a comedian's one-liner about watching a chicken lay an egg. "Hey, I think I could eat that!" was the punch line.

And so I ended up discussing with a Dublin taxi driver the principle of affordance: the ability of an object to be acted on in its environment and, in a broader definition, to communicate its use via its appearance.

In multimedia information retrieval, I am obsessed with affordance in this latter sense. At first glance, or very quickly during interaction, the system should implicitly communicate to the user what it does, what the user can do with what it does and the extent to which it can be trusted to do what it does reliably in all cases.

A few years ago, I believed that a speech retrieval system should not show transcripts to users because users are disturbed by errors. Since then, things have moved on. People are used to reading relatively unedited or unconventional text in text messages, blogs (!) and comments. Now the level of error can signal to the user that the text was created by a speech recognizer, and how far that speech recognizer can be trusted to capture the spoken content of the audio signal.

But this is negative affordance: the message of what the system can't do. Negative affordance may well be much more challenging to communicate to the user, since the space of possible non-uses is not intuitively constrained.

And with biscuits, of course, comes the problem of distributed affordance. What works well once does not keep working well with repeated applications. The package of biscuits should tell you: individually, we are delicious, but if you eat the whole package you won't feel nice and full; instead you will have an unhappy stomach. Even if this were written explicitly on the package, I imagine I would mostly ignore the message.

This is Part III (final part!) of the "Irish Chocolate Discussion", reflections on the conversation I had with a Dublin taxi driver and how that relates to finding things and search systems in general.

Geotags and Geotrails

At the moment I am sitting in a German IC (InterCity) train. I'm doing "Duivendrecht" ... "Apeldoorn" ... "Hannover" ... "Berlin", and if I had a GPS logging my position, you would see that I was following a certain trajectory at a certain speed. For me there is something soothing about sitting in the train watching Germany roll by. You've made the decision to go there and now all you have to do is wait and it will happen. In this state of doing nothing while doing something, I can unleash my thoughts, do a bit of mental housekeeping and generally feel absorbed into the German landscape. It's always been like this for me, and I can highly recommend it. If you try it within the Netherlands, you probably have to go to Groningen, since the other trajectories would be too short.

In the last entry I mentioned Creative Tourism. This train ride to Berlin is my way of really feeling in touch with Germany: a way of living and a way of being. If you're reading the posts in chronological order, you'll recall that our Near2Me concept links up places: if enough people take pictures at places with two different geotags, then the places must be related. If you like one, then you'll like the other. But what I am doing right now is not associated with a single geotag; it is associated with an entire trajectory and also a very specific "train speed". Perhaps our systems would be richer if we included "geotrails", paths with parameters of time and space. These are the doodles we draw and redraw in our travel experiences, patterns that represent something we like to do, but that might also represent the type of thing we should try in the future when we feel inclined to "branch out".
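As a minimal sketch of what a geotrail might look like as data, the following Python code assumes a trail is simply a time-ordered list of GPS fixes (seconds, latitude, longitude). That representation, the sample coordinates and the names `TrailPoint` and `average_speed_kmh` are my assumptions for illustration. The code computes the trail length with the haversine great-circle formula and derives the average speed, the kind of "train speed" parameter mentioned above.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

@dataclass
class TrailPoint:
    t: float    # seconds since the start of the trail
    lat: float  # latitude in degrees
    lon: float  # longitude in degrees

def haversine_m(a, b):
    """Great-circle distance in meters between two trail points."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def trail_length_m(trail):
    """Sum of great-circle distances between consecutive fixes."""
    return sum(haversine_m(p, q) for p, q in zip(trail, trail[1:]))

def average_speed_kmh(trail):
    duration_s = trail[-1].t - trail[0].t
    if duration_s <= 0:
        raise ValueError("trail needs at least two points with increasing timestamps")
    return trail_length_m(trail) / duration_s * 3.6  # m/s -> km/h

# Hypothetical trail: one degree of latitude northward in one hour.
trail = [TrailPoint(0, 52.0, 10.0), TrailPoint(1800, 52.5, 10.0), TrailPoint(3600, 53.0, 10.0)]
print(f"{average_speed_kmh(trail):.1f} km/h")
```

Keeping the timestamps in the trail, rather than just a list of places, is what lets speed and time of day become part of the recommendation signal.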

Two other examples where geotrails could be helpful come to mind. The discussion about chocolate that I had with the taxi driver who drove me to the airport in Dublin on Saturday really stuck with me. But conversations like that strike me nearly every time I am there: you can really talk to the taxi drivers. Not always, but sometimes you have these amazing conversations, and that seems to be more important to them than the tip. Once a Dublin taxi driver even refused my tip.

I had one conversation where the taxi driver asked me what I did, and I said I worked in a multimedia information retrieval lab. "So what do you do there?" I didn't know where to start, so I described one of the systems at our lab that processes videos of soccer games and does highlight detection. He thought that was interesting, and I then asked him what he thought counted as a highlight in football. Is it only the goals and the penalties, or what other parts do you want to see if you are watching a summary of the game? He gave a thoughtful answer to this question. If you looked at my Dublin geotrails, you would see a series of characteristic doodles, often ending at the airport, that represent these experiences I have had in taxis.

My Amsterdam geotrails, on the other hand, show me giving the taxi stand a wide berth. The taxis will drive you in circles there. They sometimes have to, given the geometry of the city, but they'll add their own embellishments that drive your final price up. The only place that is worse is Brussels.

In Amsterdam, I prefer the bike. You'll see a lot of bike doodles. These are slower than the taxi doodles and also follow roads where cars don't drive. Biking is the authentic Amsterdam experience. Visitors to the city often participate in the local culture by renting bicycles and riding around. This is Creative Tourism the way it was conceived to be carried out.

However, you would see the difference in the geotrails. Amsterdamers bike every day from point A to B and back again. Maybe there is a C and a D as well, but in general they know their routes: every stop light, every bump, every place where another bike might come out from an unexpected direction. They also know their routes at the time of day that they characteristically ride them. For example, in the early morning, there was always a characteristic amount of traffic when I biked from the Lelygracht in the center of Amsterdam, where I used to live, out to the Science Park. Routes are optimized so that they do not take a minute longer than they absolutely have to.

Slipping back into the topic of my last blog post, I would like to make the point that tourist geotrails are totally different: they are slower, happen at different times of the day and involve indirect routes. Tourists wobble a bit; maybe they haven't ridden a bike since they were kids. They stop unexpectedly to consult their maps. They stop for puddles, whereas an Amsterdamer who saw the route dry the week before knows exactly how deep the puddle is going to be and rides right through it. Tourists also stop for red lights, while Amsterdamers know which lights are conventionally ignored by bicyclists.
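The differences between commuter and tourist trails (speed, stops, indirect routes) could be turned into simple trail features. Here is a hypothetical Python sketch, assuming positions have already been projected to a local metric grid as (seconds, x_m, y_m) tuples. Straightness is the straight-line distance divided by the path length (1.0 for a perfectly direct route), and a stop is a run of slow movement lasting at least a minimum duration. The thresholds and the two sample trails are illustrative guesses, not calibrated values or real data.

```python
import math

# A trail point is a (seconds, x_m, y_m) tuple in a local metric projection (assumed).

def path_length(points):
    """Total distance traveled along the trail, in meters."""
    return sum(math.dist(p[1:], q[1:]) for p, q in zip(points, points[1:]))

def straightness(points):
    """Straight-line distance divided by path length: 1.0 means perfectly direct."""
    length = path_length(points)
    return math.dist(points[0][1:], points[-1][1:]) / length if length > 0 else 0.0

def count_stops(points, speed_threshold=0.5, min_duration=5.0):
    """Count runs of movement slower than speed_threshold (m/s) lasting >= min_duration (s)."""
    stops, run_start = 0, None
    for p, q in zip(points, points[1:]):
        dt = q[0] - p[0]
        speed = math.dist(p[1:], q[1:]) / dt if dt > 0 else 0.0
        if speed < speed_threshold:
            if run_start is None:
                run_start = p[0]  # a slow run begins at p
        else:
            if run_start is not None and p[0] - run_start >= min_duration:
                stops += 1  # the slow run ended at p
            run_start = None
    if run_start is not None and points[-1][0] - run_start >= min_duration:
        stops += 1  # the trail ended while still slow
    return stops

commuter = [(0, 0, 0), (10, 50, 0), (20, 100, 0)]
tourist = [(0, 0, 0), (10, 30, 20), (20, 30, 20), (30, 30, 20), (40, 60, 0)]
for name, trail in [("commuter", commuter), ("tourist", tourist)]:
    print(name, round(straightness(trail), 2), count_stops(trail))
```

With features like these, a recommender could tell a wobbly sightseeing trail from an optimized commute even when both cover the same streets.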

Don't recommend a quick straight shot to the Science Park to me if I am on vacation. I want to do the trail with a little wobble that ends up at the Van Gogh Museum. Is that the authentic Amsterdam experience? Maybe not. But if you push the limit, authenticity is in the eye of the beholder. For the Amsterdamer, authentic bicycle culture includes complaining about how badly tourists ride bicycles. Take away commercial tourism and the city loses some of its characteristic spirit: the tension between those who live there and those who play there. I'll leave it to the reader to decide whether this would affect Amsterdam's charm.

This is Part II of the "Irish Chocolate Discussion", reflections on the conversation I had with a Dublin taxi driver and how that relates to finding things and search systems in general.

Authentic, Personalized Travel Recommendation



At the moment a lot of my time and attention is being devoted to Near2Me, a concept for a travel recommender developed by a colleague of mine, Luz Caballero, for the PetaMedia Network of Excellence. The cool thing about Near2Me is that it not only makes personalized travel recommendations, but also focuses on authenticity, the distinctive "spirit of place", as Luz likes to describe it.

The Near2Me concept tackles the problem of long-tail recommendation: it suggests that you go to places where relatively few other people have been. Most of the time we refer to this as Off-The-Beaten-Track. The concept links up with the "Creative Tourism" movement, which holds that travel should be participatory and that travelers should connect with the living culture of a place as manifested in the lives of the people living there.

How do we get at the authenticity of a place? Well, Luz points out that the act of taking a picture is something like adding a tag to a place. This tag is different from other sorts of tags. For example, it is not like the tags we use on Flickr, since it carries little explicit meaning. Taking a picture is basically a salience tag: it just says, this caught my attention as interesting. Luz did a study of Flickr in which she found that people who travel, even though they of course take pictures of the "must sees" (Notre Dame in Paris, for example), also take pictures of interesting things along the way.

Because the motivation of the people taking these pictures is closer to personal documentation than to touristic promotion, the pictures provide a valuable source of information about the authentic: the things that we stumble across when we are actually in a place, on the way to reaching another goal.

The next step is simple: you look at the geotags of the pictures (the little piece of metadata, produced by a growing number of cameras, that records where the picture was taken) and you apply the Amazon principle: people who liked X also liked Y. You look for people who have taken pictures where you have taken pictures and see where else they go. Luz uses the example of markets: if someone liked the "Borough Market" in London, the system would recommend the "Marché Wilson" in Paris. The technique is called "wormholes" and was developed by Maarten Clements.
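The "people who liked X also liked Y" idea can be sketched with plain co-occurrence counting. Note that this is not Maarten Clements' wormholes technique, whose details the post does not describe; it is a simpler stand-in, assuming each user is represented by the set of places where they took pictures. The users and photo data below are invented toy data, and `recommend` is a hypothetical name; only the market names come from Luz's example.

```python
from collections import Counter

def recommend(seed_places, photos_by_user, k=3):
    """Rank places photographed by users who also photographed the seed places."""
    seed = set(seed_places)
    scores = Counter()
    for places in photos_by_user.values():
        overlap = len(seed & places)  # how strongly this user's pictures match the seed
        if overlap == 0:
            continue
        for place in places - seed:
            scores[place] += overlap
    return [place for place, _ in scores.most_common(k)]

# Invented toy data: user -> set of places where they took pictures.
photos_by_user = {
    "user_a": {"Borough Market", "Marché Wilson", "Tower Bridge"},
    "user_b": {"Borough Market", "Marché Wilson"},
    "user_c": {"Tower Bridge", "Eiffel Tower"},
}
print(recommend({"Borough Market"}, photos_by_user))  # prints ['Marché Wilson', 'Tower Bridge']
```

Here the Marché Wilson ranks first because both users who photographed the Borough Market also photographed it. Raw counts like these favor popular places, so a long-tail recommender would need some way of discounting popularity.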

At this point, you encounter possibly the most challenging part of long-tail recommendation. You want to give people something new and interesting, but none of the place recommendations are going to be obvious, since you want to stay away from the bestsellers, the commercial destinations. For this reason, it's important to be able to explore recommendations further: OK, the system thinks I should like the Nusantara Museum in Delft, but would I really like it? Why? What is it anyway? What can I see and do there?

Two of the ways that users can explore a place are by browsing pictures and browsing people. The Near2Me concept offers a selection of pictures from the recommended place that are carefully chosen to be both diverse and representative of the place. It also offers a selection of people who, by virtue of the pictures they take and the popularity of those pictures, emerge as a sort of community expert on a particular topic. How these algorithms work, and whether they are useful in a working prototype, is work in progress targeted for publication in future PetaMedia papers.
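Since the Near2Me selection algorithms are still unpublished, the following is only a generic illustration of balancing diversity against representativeness: a greedy selection in the style of maximal marginal relevance (MMR). It assumes each picture is described by a numeric feature vector (hypothetical; the features Near2Me uses aren't stated), treats cosine similarity to the centroid of all pictures as representativeness, and penalizes similarity to pictures already chosen. The function name `select_photos` and the weight `lam` are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_photos(features, k, lam=0.5):
    """Greedily pick up to k photo indices, trading representativeness against redundancy."""
    n, dim = len(features), len(features[0])
    centroid = [sum(f[i] for f in features) / n for i in range(dim)]
    rep = [cosine(f, centroid) for f in features]  # how typical each photo is of the place
    selected = []
    while len(selected) < min(k, n):
        best, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            # similarity to the closest photo already chosen (0.0 for the first pick)
            redundancy = max((cosine(features[i], features[j]) for j in selected), default=0.0)
            score = lam * rep[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy feature vectors: photos 0 and 1 are near-duplicates, photo 2 is different.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(select_photos(features, 2))
```

The first pick is the most typical photo; the second skips its near-duplicate in favor of the one that shows something different.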

On Saturday, I flew back from Dublin after having spent a couple of days at DCU. On the way to the airport, the taxi driver asked me where I was going.
"Back to Amsterdam," I said. "I came Wednesday and now I am already going back."
"Oh, Amsterdam. Did you bring tulips?"
"No, chocolate," I replied, and then added, "The funny thing is, is that I'll probably get some chocolate to bring the other direction as well."
"I'll tell you what you got to get," he said. And then he launched into a description of how amazing the Baileys Irish Cream chocolates are. I've been to Dublin several times, but I still sometimes have to ask people to repeat words because of the Irish English, which he did quite patiently. In the end, the picture emerged that the Baileys Irish Cream chocolates are amazing because they are not just chocolatey, but offer a real Baileys experience as well.

The last time I flew back from Dublin, I had seen the Baileys Irish Cream chocolates at the airport and had steered clear: putting Baileys in chocolate was something, I assumed, they had come up with for the tourists, and I would do well to avoid it. However, since the taxi driver had no particular reason to tell me this, and since he was, to me, clearly Irish himself, I shifted my opinion of the Baileys Irish Cream chocolates from "mainstream tourist" to "genuine Irish souvenir". Since I was the one who brought up the chocolate, the taxi driver was certainly not doing product placement. Perhaps Baileys is simply very, very smart about this. But even if they are, the Baileys Irish Cream chocolates will remain authentic to me, because the whole story happened at a moment when I was devoting a lot of time and attention to Near2Me and thinking a lot about authenticity of place.

This is Part I of the "Irish Chocolate Discussion", reflections on the conversation I had with a Dublin taxi driver and how that relates to finding things and search systems in general.