Sunday, December 19, 2010
Jetlag, bad Internet connectivity, behind on my to-do list -- I haven't been myself recently, but I'm working to get the balance back. However, there's one thing over which I have no control: the deteriorating political situation in the Cote d'Ivoire. The worries turn over and over in my mind and I find myself scouring the Internet for information about the current situation. In a moment of clarity, it struck me that what I am doing has larger ramifications for search:
Emotional stress changes search. I am not myself, and my search-self also appears to be different.
There seem to be two main dimensions to this difference. First, the reason I am searching is different: I want information, not because I can use it in any particular way, but because knowing has the potential to give me back a sense of control. Second, my search strategies are completely different. Effectively, it's like I've forgotten most of what I know about how to find things on the Internet.
In particular, I found myself going back to the New York Times over and over, like I expected them to publish more than one article a day on the subject. I gradually expanded, just typing "Ivory Coast" into Google, and then clicking the news option.
I don't know how many searches I fired off before realizing that I was only reading English results, and could expect to find more, and perhaps newer, information in the French press. That is ridiculous! I spent two years of my life working on a project called MultiMatch dedicated to multilingual, multimodal information access. If there is something I should do automatically, it is turn to cross-lingual or multilingual search to give me more results and more variety. I should remember, though, that our current search engines don't give us an easy way to tell them which languages we would like to receive results in.
I started searching "Ivory Coast", "Cote d'Ivoire", "Elfenbeinkueste". Sorry, Google, I can't type the umlaut easily with this keyboard -- argh, and you seem not to interpret my workaround; let me try again: "Elfenbeinkuste". Each of these steps that I would usually take without thinking seems particularly painful.
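What I was groping for is something a search engine could support directly: fan the same information need out over several languages and pool the results. A minimal sketch of the idea, assuming a hypothetical search(query, language) wrapper around whatever search API happens to be available (the function and its results are illustrative, not a real library call):

    # Minimal sketch of multilingual query fan-out; purely illustrative.
    QUERY_VARIANTS = {
        "en": "Ivory Coast",
        "fr": "Cote d'Ivoire",
        "de": "Elfenbeinkueste",  # 'ue' standing in for the umlaut
    }

    def search(query, language):
        """Hypothetical placeholder: wire up a real search API here,
        restricted to `language`."""
        return []

    def multilingual_search(variants):
        # Fire the same information need at the engine once per language
        # and pool the hits, remembering which language each came from.
        results = []
        for lang, query in variants.items():
            for hit in search(query, language=lang):
                results.append((lang, hit))
        return results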
And then I remember social media. Twitter. This is ridiculous! My current project, PetaMedia, has a significant social media component.
I know full well that if a country's government clamps down on the press, then you back off to who's tweeting. I add "twitter" to my Google query and find an article discussing the emergence of a Twitter hashtag: #CIV2010. But is there information here coming directly from the Cote d'Ivoire? Is this really (still) the most authoritative and informative hashtag?
Hmm. If only I could compare what else is out there. Can I search Twitter directly for other hashtags that are related to this one so that I can compare? Apparently not. How can Twitter not let me search for related hashtags? Usually, I would be able to address this question in a sensible fashion -- but now, I just leave it as a search dead-end and zap off to look at individual tweets.
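For the record, the comparison I wanted is not hard to compute given a sample of tweets. A rough sketch, assuming the tweet texts have already been collected from some search endpoint (the data source is an assumption):

    import re
    from collections import Counter

    HASHTAG = re.compile(r"#\w+")

    def related_hashtags(tweets, seed="#CIV2010", top_n=10):
        """Count hashtags co-occurring with `seed` in a sample of tweet texts."""
        counts = Counter()
        seed = seed.lower()
        for text in tweets:
            tags = {t.lower() for t in HASHTAG.findall(text)}
            if seed in tags:
                counts.update(tags - {seed})
        return counts.most_common(top_n)

A high co-occurrence count would at least suggest which other hashtags are carrying the same conversation.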
I end up just kind of reading through individual tweets and finally find one that contains a recent release from Reuters: "UPDATE 1-U.N. peacekeepers won't leave Ivory Coast - Ban". Long exhale. It looks like at the moment the world is still with us and won't look away. It's a thin thread, but it's enough to let me break out of the cycle. I didn't know that this was what I was looking for, but when I find it, I also stop searching. At least for today.
Search under stress also happened to me in 2009 in Corfu after the CLEF workshop. There was a storm, the airport was closed down and my flight was canceled. I needed to get to Athens to connect to another ticket I had to Rome... but everyone wanted to get out and the planes were booked full. It was one in the morning and I was hooked up to some vague wisps of wireless at a nearly empty airport. I don't remember the details, but about the only query I could come up with was "Is there a boat out of here tomorrow?"
Search under stress is probably correlated with topic: violence, health, shelter, mobility, citizens' rights. These are critical topics for our well-being and happiness. Search under emotional stress happens when search matters most. It seems to make sense to tackle the challenge of search under stress as different from search in other states of mind. But first -- let me get some sleep.
Friday, November 19, 2010
Search your own dogfood
How many hours do I spend writing deliverables and reports? I'd rather not count. Here I am on Friday night with a to-do list left over from the week that seems only very vaguely connected to my main mission as a researcher, namely to improve multimedia access systems, especially for spoken audio and video with a speech track.
Sometimes it takes writing a blog entry to refocus on the core values of multimedia search. I was going through the pictures from the Searching Spontaneous Conversational Speech workshop in order to find a good one to add to the latest newsletter report, and dang it if there weren't so many speaker pictures ruined because Florian is crouching in the middle at the front, tending to the laptop that we were using to capture the sound.
At Interspeech we discussed the idea of simply recording all the spoken audio at both the MediaEval 2010 workshop and the SSCS 2010 workshop in order to start an audio corpus of workshops to use for research on meeting retrieval. It sounded like a good idea that we would never have the time to pull off, but sure enough, there we were in Italy, and a network of people came together, brought sound equipment from all over, and we had ourselves a system for audio capture. I remember the satisfaction in Florian's voice when he announced, "We are now recording six channels". Actually, I remember it because I listened to it on the recording afterwards as we started the laborious process of post-processing, and I wondered, "Gee, what kinds of things were we talking about next to the main presentations?"
So here's the refocus. Florian isn't actually ruining the picture. His presence actually underlines what the speaker is talking about -- the slide reads "The ACLD: Speech-based Just-in-Time Retrieval of Meeting Transcripts, Documents and Websites". We have made such a huge step in this direction that in our own lives we can simply decide to capture our spoken content; everyone at the workshop says, "OK, that's cool," and bang, we have more data than we know what to do with.
We also did this at SSCS 2008 in Singapore. The videos were online for a while -- we transcribed them using the Nuance Audiomining SDK for speech recognition and made them searchable with a Lemur-based search engine. For a while, we could visit a website and search our own dogfood, as it were. It seems, however, that the multimedia lifecycle got the better of our content: the system was not maintained and now the videos are no longer available online. I don't know if we'll do much better this year, but the point is that we keep on trying. And we have Florian in the middle of the workshop picture reminding us that this attempt may be time consuming, but it constitutes the core of our research mission.
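In spirit the pipeline is simple, even if keeping it alive is not: run speech recognition, index the transcripts, point a search box at the index. A toy sketch of the indexing side, with a plain inverted index standing in for Lemur (the one-transcript-per-file layout is an assumption):

    import os
    from collections import defaultdict

    def build_index(transcript_dir):
        """Toy inverted index over ASR transcripts, one .txt file per talk."""
        index = defaultdict(set)
        for name in os.listdir(transcript_dir):
            if name.endswith(".txt"):
                with open(os.path.join(transcript_dir, name)) as f:
                    for word in f.read().lower().split():
                        index[word].add(name)
        return index

    def search(index, query):
        """Return the talks whose transcripts contain every query word."""
        hits = [index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*hits) if hits else set()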
Friday, October 29, 2010
ACM Multimedia SSCS 2010 Workshop on Searching Spontaneous Conversational Speech
The Fourth Workshop on Searching Spontaneous Conversational Speech took place on 29 October 2010 at ACM Multimedia. Papers were presented about techniques for speech retrieval, speaker role recognition, spoken term detection and concept detection. Invited speakers addressed challenges for the future of spoken content retrieval, including interview data, multimedia archives and the Spoken Web. The demonstrations were a highlight of the workshop. These were first introduced in a boaster session and then presented to workshop participants in an interactive session. Here's the Wordle word cloud made from the titles and abstracts of all the papers presented!
Currently, we are getting ready for an upcoming special issue on searching speech in ACM Transactions on Information Systems.
Sunday, October 24, 2010
MediaEval 2010 Workshop Report
We were delighted that Bill Bowles attended the MediaEval 2010 workshop and that he made us our own MediaEval video trailer, in which he tells the story of MediaEval from his own point of view. The MediaEval 2010 Affect Task was devoted to analyzing Bill's travelogue videos from his Travel Project and ranking them by how boring viewers reported them to be. As a filmmaker, another rational reaction would have been, "Who are these people? What did they do to my video? I don't want to get anywhere near them!" But instead, he came, participated and told us about ourselves using the very same medium we devote so much effort to studying.
I was amazed at how quickly this video accumulated views; it rapidly outstripped any video I've ever posted to the Internet. However, if video is not your thing and you want the text version of what happened, here is the text of a workshop report written for a project newsletter.
MediaEval 2010 Workshop Report
The MediaEval 2010 workshop was held on Sunday, October 24, 2010 in Pisa, Italy at Santa Croce in Fossabanda. MediaEval is a benchmarking initiative for multimedia retrieval, focusing on speech, language and contextual aspects of multimedia (geographical and social context) and their combination with visual features. Its central sponsor is the PetaMedia Network of Excellence. In total, four tasks were run during MediaEval 2010. To approach the tasks, participants could make use of spoken, visual, and audio content as well as accompanying metadata. Two "Tagging Tasks" (a version for professional content and one for Internet video) required participants to automatically predict the tags that humans assign to video content. An "Affect Task" involved automatic prediction of viewer-reported boredom for travelogue video. Finally, a "Placing Task" required participants to automatically predict the geo-coordinates of Flickr video. The Placing Task was co-organized by PetaMedia and Glocal. It was also given special mention in Gerald Friedland's talk "Multimodal Location Estimation" in the "Brave New Ideas" session at ACM Multimedia 2010.
During the MediaEval 2010 workshop, researchers presented and discussed the algorithms developed and the results achieved on the MediaEval 2010 tasks. The workshop drew 29 participants from 3 continents. More information about the 2010 results, including participants' short working notes papers, is available at: http://www.multimediaeval.org/mediaeval2010
Currently, MediaEval 2010 participants are working towards a special session at the 2011 ACM International Conference on Multimedia Retrieval (ICMR 2011), which will be dedicated to presenting extended results on MediaEval 2010 tasks.
MediaEval 2011 will be organized again with sponsorship from PetaMedia and in collaboration with other projects from the Media Search Cluster. The task offering in 2011 will be decided on the basis of participants' interest, assessed, as last year, via a survey. At this time, we anticipate that we will run a Tagging Task and a Placing Task, as well as a couple of innovative new tasks, as dictated by popularity. If you are interested in participating in MediaEval 2011 or if your project would like to organize a task, please contact Martha Larson (m.a.larson@tudelft.nl). Additional information on MediaEval 2011 is available on the website: http://www.multimediaeval.org
Labels:
affect,
benchmarking,
Flickr video,
Italy,
MediaEval,
Pisa,
tagging,
VideoCLEF
Saturday, October 9, 2010
Drink recommendation
Within the last ten days I've been in Asia, Europe and North America. I've taken jetlag to a new level. Usually there is a reference point; you can say, "It's past midnight in the Netherlands at the moment, my internal clock thinks it's past my bedtime and that's why I am so tired." Now I have no clue what time my internal clock reads.
At the grocery store, I just picked out a four-pack of energy drinks in order to try to jump-start myself and get re-aligned with the cycle of the sun at my current location. I stood for ten minutes in front of the selections, looking at the cans and then reading the labels. I wanted something not too expensive, sugar-free and also with guarana. A Brazilian colleague had recommended guarana as one of the best "pick up" ingredients you can get in an energy drink.
What I could use is a good drink recommendation system. The Asian part of this odyssey took place in Tokyo, and the following video was what YouTube listed there as a popular video. It had received 44,466 views in the one day since it had been uploaded.
Uploaded 1 day ago; viewed 44,466 times
It is a news report on a drink vending machine (a Tokyo fixture) that recommends drinks by taking your picture and doing a little bit of multimedia content analysis that gives it clues as to your age and gender.
In my current situation, age and gender wouldn't have been enough. Rather, the system would need information about my internal state -- the camera would have to have noticed the unfocused glaze of my tired eyes. In this situation, internal-state information could be inferred if the system had access to information about my geo-coordinates within the last ten days. Access to a recent history of my sleeping-waking pattern would provide an even better source of evidence.
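As a thought experiment, the inference described here is not far-fetched. A rough sketch, assuming a log of (utc_timestamp, lat, lon) fixes from the last days and using longitude as a crude proxy for local solar time (the whole representation is invented for illustration):

    from datetime import datetime, timedelta

    def timezone_churn(geo_log, window_days=10):
        """Sum the rough hour shifts implied by recent longitude changes.

        `geo_log` is a list of (utc_timestamp, lat, lon) fixes; lon / 15
        serves as a crude local-time offset. Purely illustrative.
        """
        cutoff = datetime.utcnow() - timedelta(days=window_days)
        recent = sorted((t, lon) for t, _lat, lon in geo_log if t >= cutoff)
        return sum(abs(b / 15.0 - a / 15.0)
                   for (_, a), (_, b) in zip(recent, recent[1:]))

A large value would be the machine's cue that the customer in front of the camera needs the guarana, not the chamomile.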
However, another key bit of information, critical to arriving at the correct drink, would be that at the moment I do not want to be tired. I can't be tired. I don't want something that will relax me -- no chamomile, not yet. I need to work.
The bottom line is clear: barring a system that has access to all that information and the ability to use it in the right way, the Brazilian colleague remains the best source of drink recommendations.
And it looks like the drink is working already, since I have already reached a level of alertness to attempt a blog post.
Tuesday, September 28, 2010
Where's Wikipedia?
The ACM Multimedia Grand Challenge is a high-adrenaline event where researchers from the Multimedia community compete against each other to develop the best solutions to problems posed by industry. For example, Google formulated two challenges, Video Genre Classification and Personal Diaries, in this year's competition.
Today in Tokyo at Interspeech 2010, I stopped to chat with last year's Grand Challenge winner, who is competing once again this year. I was struck anew by the realization that in the pressure-cooker of the Grand Challenge, creativity, raw intelligence, technical competence, competitive drive and off-beat thinking give rise to lines of attack that might never have emerged in a traditional R&D setting. Such solutions stand to benefit us all.
But is it really only industry that should be formulating the challenges for such competitions? Where, for example, is Wikipedia? If there is any major player in the Internet information arena that deserves a crowd-sourced solution from the research community, it is Wikipedia, the knowledge resource homegrown by collaborative effort.
Wikipedia truly does inspire the research community. Very recently I've witnessed up close how fired up scientists get about Wikipedia. The Tribler team, who sit on the ninth floor of our building, have been sinking unbelievable amounts of time and effort into the development of the Swarmplayer V2.0. Their dedication is inspiring and their incredible belief in the power of a distributed solution for videos on Wikipedia is infectious.
Datasets from Wikipedia have been used by multiple benchmarking initiatives such as ImageCLEF and INEX, as well as in MediaEval, the benchmark I co-ordinate. We certainly enjoyed coming up with our own Wikipedia-related task. However, it would be great to hear directly from the Wikimedia Foundation, in the form of a Grand Challenge, what problems they see on the horizon in the next 2-5 years for which the research community could be helpful in generating solutions. The Challenge takes the form of a simple textual description of the problem and researchers do the rest, presenting the solution in the form of a system or system demo and a paper describing it.
There's a lot out there, of course, that I don't know about. For example, just read this post on the ECML PKDD 2010 Data Challenge: Measuring Web Data Quality. But I've never seen a clear Challenge originating from the Wikipedia community and published for the research community.
One aspect that researchers need to think seriously about, however, is the form in which solutions for Wikipedia, or solutions developed using Wikipedia data, are published. The ACM Multimedia proceedings are not an open-access publication. It's a contradiction to carry out research on a free knowledge resource and publish the results under conventional copyright. Peer-reviewed open-access journals such as the Journal of Digital Information should be preferred when publishing results obtained using Creative Commons licensed data.
Maybe that's one Challenge that the Wikimedia Foundation really does have to offer the research community: challenging us to break the habit of creating solutions in a rush of creative joy and technical muscle, and then publishing them where they cannot be accessed by everyone.
Labels:
benchmarking,
Google,
grandchallenge,
MediaEval,
Tokyo,
Wikipedia
Saturday, September 18, 2010
MediaEval Tagging Task Professional
DIXIT, a Dutch-language journal for speech and language technology, invited me to do a piece on the "Tagging Task Professional", one of the four multimedia indexing and retrieval tasks that the MediaEval benchmarking initiative ran in 2010. I am posting an English version of the text here on my blog. The piece will appear in December, after the MediaEval 2010 workshop in October (I note this in order to explain the past tense used to describe an event that has not yet happened).
The workshop will be held in a medieval convent called Santa Croce in Fossabanda, located in Pisa, Italy. The photo here is from Flickr user Marius B, licensed under the Creative Commons License by-nc-sa. I notice that I do well with attribution if I am going to print material (brochures etc.), but I get sloppy with PowerPoint. If I know this photo is on my blog, I will be able to quickly remind myself that it comes from Marius B in case I want it in future presentations.
Many Minds Make Light Work: Bringing Researchers Together to Work towards Automatic Indexing for Cultural Heritage Multimedia Collections
"Medieval", "mediaeval" and "MediaEval" are all pronounced the same. While "medieval" and "mediaeval" are alternate spellings for a adjective describing something that occurred in the Middle Ages, "MediaEval" is a benchmark initiative that brings researchers together to tackle challenging tasks in the area of multimedia indexing and retrieval. In 2010, a group of researchers worked individually and then met at a medieval convent "Santa Croce in Fossabanda" in Italy. Can a group of MediaEval scientists solve today's challenges of automatic generation of metadata for cultural heritage multimedia content?
Cultural heritage content often takes the form of multimedia, and in particular of audio and video recordings. Cultural heritage collections are often staggering in size. The archive of the Netherlands Institute for Sound and Vision houses a breathtaking 250,000 hours of video content and receives an additional 8,000 hours of content broadcast by national broadcasting companies each year. Material that is stored in such a huge collection but is not adequately annotated is useless, since it can no longer be found by people who wish to view, reuse or otherwise study it. Professional archivists have developed a set of techniques for annotating material with metadata for storage in the archive that will ensure that it can later be found. These techniques have stood the test of time and will continue to be critical for finding multimedia content in large archives in the future. The ability to generate high-quality metadata, however, is not enough. Rather, metadata production must be scaled so that incoming material can be appropriately annotated at the rate at which it arrives.
Techniques from the area of Speech and Language Technology hold promise to support archivists in the generation of archival metadata. Here, we specifically look at the problem of generating subject labels (or "keywords") for television broadcasts. Subject labels are terms drawn from the archive thesaurus. Examples of keywords are Archeology (archeologie), Architecture (architectuur), Chemistry (chemie), Dance (dansen), Film (film), History (geschiedenis), Music (muziek), Paintings (schilderijen), Scientific research (wetenschappelijk onderzoek) and Visual arts (beeldende kunst). Automatic generation of subject labels can help archivists in one of two ways: by providing a list of suggested subject labels for a video, thus narrowing their field of choice, or by automatically generating a best guess in order to label material which would otherwise go un-annotated due to the huge volume of incoming video material and the time constraints of the archive staff.
Automatic generation of subject labels is accomplished by algorithms that make use of several data sources: production metadata for broadcasts, transcripts of the spoken content of broadcasts produced by automatic speech recognition technologies, and analysis of the visual content of the broadcast recording. The algorithms apply statistical techniques, including word counts and co-occurrences, as well as machine learning methods. Current algorithms are, however, far from perfect, and their further improvement requires sustained and concerted effort on the part of research scientists.
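To make the flavor of such an algorithm concrete, here is a minimal sketch of one standard approach: a one-vs-rest classifier over bag-of-words features from the transcripts, which ranks thesaurus labels for a new broadcast. The training examples are invented, and this illustrates the general technique rather than any system that ran in the benchmark:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # Invented training data: ASR transcript text plus the subject
    # labels that archivists assigned to each broadcast.
    transcripts = [
        "opgraving romeinse villa archeologie vondst scherven",
        "orkest symfonie dirigent muziek concertgebouw uitvoering",
    ]
    labels = [["archeologie"], ["muziek"]]

    binarizer = MultiLabelBinarizer()
    y = binarizer.fit_transform(labels)
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(transcripts)

    model = OneVsRestClassifier(LogisticRegression())
    model.fit(X, y)

    def suggest_labels(transcript, top_n=3):
        """Rank thesaurus terms for a new broadcast; the archivist decides."""
        scores = model.predict_proba(vectorizer.transform([transcript]))[0]
        ranked = sorted(zip(binarizer.classes_, scores), key=lambda p: -p[1])
        return ranked[:top_n]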
Many researchers are interested in working on the problem of automatically generating subject labels for cultural heritage material. However, in order for a researcher to begin working in this area, a number of problems must be faced:
- It is necessary to have an understanding of the problem -- this requires a general knowledge of how subject labels are produced in the archive and what they are used for
- It is necessary to have access to a large amount of example data in order to develop and train algorithms
- It is necessary to have access to data sources such as speech recognition transcripts or visual features. In general, it is not possible to generate these resources in a lab that is not already specialized in these areas
- It is necessary to understand the work that has previously been carried out in the area in order not to duplicate techniques that have already been tried by other researchers
- It is necessary to know how well one's own algorithms compare to the current state of the art.
MediaEval promotes research progress in the area of automatically generating subject labels for cultural heritage material by running a "Task" devoted to subject labeling for professional archives. A Task is comprised of three parts: a description of the problem, a data set and a set of resources that can be used to solve the problem. Having the problem packaged as a task gives researchers easy entry to understanding the issue from the perspective of the archives and allows licensing of the data from the archive to occur in a streamlined manner. The University of Twente supplies speech recognition transcripts, which makes it possible for research groups without competence in Dutch-language speech recognition to contribute to developing improved approaches to the task. Information about the other tasks offered can be found on the MediaEval website: http://www.multimediaeval.org/
Researchers approach the tasks by first working to solve them individually. They submit their solutions, which are evaluated by the MediaEval organizing committee. Because all researchers working on the same task have used the same data set, the solutions are directly comparable with each other, and it is possible to see which approaches provide the best performance for the automatic generation of subject labels. Researchers then gather at a workshop in order to discuss the results, build collaborations and plan approaches for next year. The workshop fosters the friendly competition between sites that is necessary for progress on the issues, but it also builds collaboration, encouraging sites to bundle their efforts and to avoid duplicating investigation of approaches that have already been shown to be less fruitful.
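Because every site labels the same test videos, scoring runs side by side is straightforward. A minimal sketch of the kind of comparison this enables (set-based precision and recall are my illustrative choice here; each MediaEval task defines its own official measures):

    def precision_recall(predicted, gold):
        """Set-based precision and recall for one video's subject labels."""
        predicted, gold = set(predicted), set(gold)
        tp = len(predicted & gold)
        return (tp / len(predicted) if predicted else 0.0,
                tp / len(gold) if gold else 0.0)

    def evaluate(run, gold_standard):
        """Average precision and recall of one site's run over all videos.

        `run` and `gold_standard` map video ids to lists of labels.
        """
        pairs = [precision_recall(run.get(vid, []), labels)
                 for vid, labels in gold_standard.items()]
        n = max(len(pairs), 1)
        return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n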
The MediaEval 2010 workshop was held in Pisa, Italy in October 2010, directly before ACM Multimedia, a large multimedia conference. It was held in a medieval convent, "Santa Croce in Fossabanda", that had been converted into a hotel with seminar facilities. A site so evocative of the beauty and value of cultural heritage was particularly suited to hosting researchers focused on the issues that will help improve automatic indexing of tomorrow's cultural heritage content.
Tuesday, September 14, 2010
Affordance
"People," continued the taxi driver driving me to the airport in Dublin, "do the strangest things with chocolate." He paused, reflectively, before adding, "I mean in private."
When I didn't immediately respond, he hurried to explain himself. "You know, a Bounty bar?" I did. "I pick the chocolate off of the outside and then eat the inside separately. Do you do that?" As politely as I could, I explained that I didn't like Bounty bars. "What do you do then?" he asked. The best thing that I could come up with was Oreo cookies: I twist them open and eat out the middle. "A lot of people do that," I added. This puzzled him, until he brightened. "Oh, I heard about this biscuit in Australia and you bite off two of the corners and you drink your tea right through the biscuit. It has some sort of a cream filling that just melts as you drink. It's supposed to be just lovely." He thought for a moment. "It's Tam Tam or Yam Yam or something like that it's called."
I tried to imagine the Tam Tam or the Yam Yam and what it might look like. I was in Perth for about four days after SIGIR 2008, but didn't remember any cookies like that. "Do you suppose," I asked him, "that people just take the biscuit out of the package and look at it and think, 'oh, I should break off two of the corners and drink my tea through it' or was there one person who invented it and then it quickly spread as an idea throughout Australia?"
His response surprised me: he laughed! Then, "It's like the comedian," he pronounced. And then he filled me in: there is a comedian one-liner about watching a chicken lay an egg. "Hey, I think I could eat that!" was the punch line.
And so I end up discussing with a Dublin taxi driver the principle of affordance: the ability of an object to be acted on in its environment and, in a broader definition, to communicate its use via its appearance.
In multimedia information retrieval, I am obsessed with affordance in this latter sense. At first glance, or very quickly during interaction, the system should implicitly communicate to the user what it does, what the user can do with what it does, and the extent to which it can be trusted to reliably do what it does in all cases.
A few years ago, I believed that a speech retrieval system should not show transcripts to users because users are disturbed by errors. Now we are all a lot further. People are used to reading relatively unedited or unconventional text in text messages, blogs (!) and comments. Now, the level of error can signal to the user that the text has been created by a speech recognizer and how well that speech recognizer can be trusted to capture the spoken content of the audio signal.
But this is negative affordance, the message of what the system can't do. It is quite possible that negative affordance is much more challenging to communicate to the user, since the space of possible non-uses is not intuitively constrained.
And with biscuits, of course, comes the problem of distributed affordance. What works well once does not continue working well with repeated application. The package of biscuits should tell you: individually, we are delicious, but if you eat the whole package you won't feel nice and full; instead you will have an unhappy stomach. Even if it were written explicitly on the package, I imagine I would mostly ignore that message.
This is Part III (final part!) of the "Irish Chocolate Discussion", reflections on the conversation I had with a Dublin taxi driver and how that relates to finding things and search systems in general.
Geotags and Geotrails
At the moment I am sitting in a German IC train. I'm doing "Duivendrecht" ... "Apeldoorn" ... "Hannover" ... "Berlin", and if I had a GPS that was logging my position, you would see that I was doing a certain trajectory at a certain speed. For me there is something soothing about sitting in the train watching Germany roll by. You've made the decision to go there and now all you have to do is wait and it will happen. In this state of doing nothing while doing something, I can unleash my thoughts, do a bit of mental housekeeping and just generally feel like I'm absorbed into the German landscape. It's always been like this for me, and I can highly recommend it. If you try it within the Netherlands, you probably have to go to Groningen, since the other trajectories would be too short.
In the last entry I mentioned Creative Tourism. This train ride to Berlin is my way of really feeling in touch with Germany -- a way of living and a way of being. If you're reading posts in chronological order, you'll recall that our Near2Me concept links up places via people: if enough people take pictures at two places with different geotags, then the places must be related. If you like one, then you'll like the other. But what I am doing right now is not associated with a single geotag; it is associated with an entire trajectory and also a very specific "train speed". Perhaps our systems will be richer if we include "geotrails", a path with parameters of time and space. These are the doodles we draw and redraw in our travel experiences, patterns that represent something we like to do, but that might also represent the type of thing we should try in the future when we feel inclined to "branch out".
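To pin the idea down: a geotrail could be as simple as a time-ordered sequence of GPS fixes, from which speed and rhythm fall out directly. A minimal sketch (the representation is invented for illustration):

    import math

    # A geotrail: a time-ordered list of (utc_datetime, lat, lon) fixes.

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two fixes, in kilometers."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dl = math.radians(lon2 - lon1)
        a = (math.sin((p2 - p1) / 2) ** 2
             + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(a))

    def segment_speeds_kmh(trail):
        """Speed per segment: a crude signature of train vs. bike vs. stroll."""
        speeds = []
        for (t1, la1, lo1), (t2, la2, lo2) in zip(trail, trail[1:]):
            hours = (t2 - t1).total_seconds() / 3600.0
            if hours > 0:
                speeds.append(haversine_km(la1, lo1, la2, lo2) / hours)
        return speeds

A trail that holds a steady 140 km/h is a train doodle; one that wobbles along at 12 km/h with frequent stops is a tourist on a rented bike.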
Two other examples for which geotrails could be helpful come to mind. The discussion about chocolate that I had with the taxi driver who drove me to the airport in Dublin on Saturday really stuck with me. But it is one of the things that strikes me nearly every time I am there: you can really talk to the taxi drivers. Not always, but sometimes you have these amazing conversations, and that seems to be more important to them than the tip. Once I had a Dublin taxi driver refuse the tip.
I had one conversation where the taxi driver asked me what I did and I said I worked in a multimedia information retrieval lab. "So what do you do there?" I didn't know where to start, so I described to him one of the systems at our lab that processes videos of soccer games and does highlight detection. He thought that was interesting, and I then asked him what he thought counted as a highlight in football. Is it only the goals and the penalties, or what other parts do you want to see if you are watching a summary of the game? He gave a thoughtful answer to these questions. If you were to look at my Dublin geotrails, there is a series of characteristic doodles, often ending at the airport, that represent these experiences I have had in taxis.
My Amsterdam geotrails, on the other hand, show me giving the taxi stand a wide berth. The taxis will drive you in circles there. They have to, sometimes, given the geometry of the city, but they'll add their own embellishments that drive your final price up. The only place that is worse is Brussels.
In Amsterdam, I prefer the bike. You'll see a lot of bike doodles. These are slower than the taxis and also go on roads where cars don't drive. Biking is the authentic Amsterdam experience. Visitors to the city often participate in the local culture by renting bicycles and riding around. This is Creative Tourism the way it was conceived to be carried out.
However, you would see the difference in the geotrails. Amsterdammers bike every day from point A to B and back again. Maybe there is a C and a D as well, but in general they know their routes: every stop light, every bump, every place where another bike might come out from an unexpected direction. They also know their routes at the time of day that they characteristically ride them; for example, in the early morning, there was always a characteristic amount of traffic when I biked out from the Leliegracht in the center of Amsterdam, where I used to live, to the Science Park. Routes are optimized so that they do not take a minute longer than they absolutely have to.
Slipping back into the topic of my last post, I would like to make the point that tourist geotrails are totally different: they are slower, occur at different times of the day and involve indirect routes. Tourists wobble a bit; they maybe haven't ridden a bike since they were kids. They stop unexpectedly to consult their maps. They stop for puddles -- an Amsterdammer has seen the route dry the week before and, knowing exactly how deep the puddle is going to be, rides through it. Tourists also stop for red lights -- Amsterdammers know which lights are conventionally ignored by bicyclists.
Don't recommend to me a quick straight shot to the Science Park if I am on vacation. I want to do the trail with a little wobble that lands me at the Van Gogh Museum. Is that the authentic Amsterdam experience? Maybe not. But if you push the limit, authenticity is in the eye of the beholder. For the Amsterdammer, the authentic bicycle culture consists of complaining about how badly tourists ride bicycles. Take away commercial tourism and the city loses some of its characteristic spirit, the tension between those who live there and those who play there. I'll leave it to the reader to decide if this would affect Amsterdam's charm.
This is Part II of the "Irish Chocolate Discussion", reflections on the conversation I had with a Dublin taxi driver and how that relates to finding things and search systems in general.
In the last entry I mentioned Creative Tourism. This train ride to Berlin is my way of really feeling in touch with Germany -- a way of living and a way of being. If your reading posts in chronological order, you'll recall that our Near2Me concept links up two places people. If enough people take pictures at places with two different geotags, then the place must be related. If you like one, then you like the other. But what I am doing right now is not associated with a single geotag, it is associated with an entire trajectory and also a very specific "train speed". Perhaps our systems will be richer if we include "geotrails", a path with parameters of time and space. These are the doodles we draw and redraw in our travel experiences, patterns that represent something we like to do, but might also represent the type of thing that we should try in the future when we feel inclined to "branch out".
Two other examples for which geotrails could be helpful come to mind. The discussion about chocolate that I had with the taxi driver that drove me to the airport in Dublin on Saturday really stuck with me. But it is one of the things that strikes me nearly every time I am there. You can really talk to the taxi drivers. Not always, but sometimes you have these amazing conversations and that seems to be more important to them than the tip. Once I had a Dublin taxi driver refuse the tip.
I had one conversation where the taxi driver asked me what I did and I said I worked in a multimedia information retrieval lab. "So what do you do there?" I didn't know where to start, so I described to him one of the systems at our lab that processes videos of soccer games and does highlight detection. He thought that was interesting, and I then asked him what he thought counted as a highlight in football. Is it only the goals and the penalties, or what other parts do you want to see if you are watching a summary to the game. He gave a thoughtful answer to this questions. If you would look at my Dublin geotrails, there are a series of characteristic doodles, often ending at the airport that represent these experiences I have had in taxis.
My Amsterdam geotrails, on the other hand, show me giving the taxi stand wide berth. The taxis will drive you in circles there. They have to, sometimes, given the geometry of the city, but they'll add their own embellishments that drive your final price up. The only place that is worse is Brussels.
In Amsterdam, I prefer the bike. You'll see a lot of bike doodles. These are slower than the taxis and also go on roads where cards don't drive. Biking is the authentic Amsterdam experience. Visitors to the city often participate in the local culture by renting bicycles and riding around. This is Creative Tourism the way it was conceived to be carried out.
However, you would see the difference in the geotrails. Amsterdamers bike every day from point A to point B and back again. Maybe there is a C and a D as well, but in general they know their routes: every stop light, every bump, every place where another bike might come out from an unexpected direction. They also know their routes at the time of day that they characteristically ride them; for example, in the early morning there was always a characteristic amount of traffic when I biked out from the Lelygracht in the center of Amsterdam, where I used to live, to the Science Park. Routes are optimized so that they do not take a minute longer than they absolutely have to.
Slipping back here into the topic of my last blog post, I would like to make the point that tourist geotrails are totally different: they are slower, occur at different times of the day, and involve indirect routes. Tourists wobble a bit; they maybe haven't ridden a bike since they were kids. They stop unexpectedly to consult their maps. They stop for puddles -- an Amsterdamer has seen the route dry the week before and, knowing exactly how deep the puddle is going to be, rides through it. Tourists also stop for red lights -- Amsterdamers know which lights are conventionally ignored by bicyclists.
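If you wanted to operationalize this difference, average speed and the directness of the path are two crude signals. A sketch, reusing TrailPoint, haversine_km and average_speed_kmh from the sketch above; the thresholds are made up for illustration, not calibrated against any real trails:

```python
def directness(trail: list[TrailPoint]) -> float:
    """Straight-line distance divided by path length: a commuter's optimized
    route scores close to 1.0; a wobbling tourist trail scores lower."""
    path = sum(haversine_km(p, q) for p, q in zip(trail, trail[1:]))
    return haversine_km(trail[0], trail[-1]) / path if path > 0 else 1.0

def looks_like_tourist(trail: list[TrailPoint]) -> bool:
    # Threshold values are invented for illustration only.
    return average_speed_kmh(trail) < 12.0 and directness(trail) < 0.7
```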
Don't recommend to me a quick straight shot to the Science Park if I am on vacation. I want to do the trail with a little wobble that lands me at the Van Gogh Museum. Is that the authentic Amsterdam experience? Maybe not. But if you push the limit, authenticity is in the eye of the beholder. For the Amsterdamer, the authentic bicycle culture consists of complaining about how badly tourists ride bicycles. Take away commercial tourism and the city loses some of its characteristic spirit, the tension between those who live there and those who play there. I'll leave it to the reader to decide whether this would affect Amsterdam's charm.
This is Part II of the "Irish Chocolate Discussion", reflections on the conversation I had with a Dublin taxi driver and how that relates to finding things and search systems in general.
Authentic, Personalized Travel Recommendation
At the moment a lot of my time and attention is being devoted to Near2Me, which is a concept for a travel recommender developed by a colleague of mine, Luz Caballero, for the PetaMedia Network of Excellence. The cool thing about Near2Me is that it not only makes personalized travel recommendations, but it also focuses on authenticity -- the distinctive "spirit of place", as Luz likes to describe it.
The Near2Me concept tackles the problem of long-tail recommendation: it suggests that you go places where relatively few other people have been. Most of the time we refer to this as Off-The-Beaten-Track. The concept links up with the "Creative Tourism" movement, which holds that travel should be participatory and that travelers should hook up with the living culture of the place as manifested in the lives of the people living there.
How do we get at the authenticity of a place? Well, Luz points out that the act of taking a picture is something like adding a tag to a place. This tag is different from other sorts of tags. For example, it is not like the tags we use on Flickr, since it doesn't have a lot of semantics. Taking a picture is basically a salience tag: it just says, this came to my attention as interesting. Luz did a study of Flickr in which she determined that people who travel, even though they of course take pictures of the "must sees" -- Notre Dame in Paris, for example -- also take pictures of interesting things along the way.
Because the motivation of the people taking these pictures is something more akin to personal documentation than to touristic promotion, they provide a valuable source of information about the authentic: the things that we stumble across when we actually are in a place, along the way to reaching another goal.
The next step is simple: you look at the geotags of the pictures (that little piece of metadata, produced by a growing number of cameras, that records where the picture was taken) and you use the Amazon principle: people who liked X also liked Y. You look for people who have taken pictures where you have taken pictures and see where else they go. Luz uses the example of markets: if someone liked the "Borough Market" in London, the system would recommend the "Marche Wilson" in Paris. The technique is called "wormholes" and was developed by Maarten Clements.
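The co-occurrence idea itself fits in a few lines. The toy Python sketch below is my own illustration of the "people who photographed X also photographed Y" principle, with made-up data; it is not Maarten Clements' actual wormholes technique:

```python
from collections import Counter

def recommend(photos: dict[str, set[str]], liked_place: str, k: int = 5) -> list[str]:
    """Recommend places that co-occur with liked_place across users' photo sets."""
    co_counts = Counter()
    for places in photos.values():
        if liked_place in places:
            co_counts.update(places - {liked_place})
    return [place for place, _ in co_counts.most_common(k)]

# Made-up data: each user maps to the set of places where they took pictures.
photos = {
    "user_a": {"Borough Market", "Marche Wilson", "Notre Dame"},
    "user_b": {"Borough Market", "Marche Wilson"},
    "user_c": {"Borough Market", "Eiffel Tower"},
}
print(recommend(photos, "Borough Market"))  # ['Marche Wilson', ...]
```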
At this point, you encounter possibly the most challenging part of long-tail recommendation. You want to give people something new and interesting, but none of the place recommendations are going to be obvious, since you want to stay away from the bestsellers, the commercial destinations. For this reason, it's important to be able to explore recommendations further: OK, the system thinks I should like the Nusantara Museum in Delft, but would I really like it? Why? What is it anyway? What can I see and do there?
Two of the ways that users can explore a place are by browsing pictures and browsing people. The Near2Me concept offers a selection of pictures from the recommended place that are carefully chosen to be both diverse and representative of the place. It also offers a selection of people who, by virtue of the pictures that they take and the popularity of those pictures, emerge as a sort of community expert for a particular topic. How these algorithms work -- and whether they are of use in a working prototype -- is work in progress targeted for publication in future PetaMedia papers.
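For the curious: one generic way to balance "representative" against "diverse" is a greedy, maximal-marginal-relevance-style selection. The sketch below is exactly that -- a generic stand-in, not the algorithm being developed in PetaMedia; the feature vectors and the tradeoff parameter are assumptions:

```python
import numpy as np

def select_pictures(features: np.ndarray, k: int, tradeoff: float = 0.5) -> list[int]:
    """Greedily pick k pictures, trading representativeness (closeness to the
    centroid of the place's pictures) against diversity (distance to picks so far)."""
    rep = -np.linalg.norm(features - features.mean(axis=0), axis=1)
    chosen: list[int] = []
    for _ in range(min(k, len(features))):
        best, best_score = -1, -np.inf
        for i in range(len(features)):
            if i in chosen:
                continue
            div = min((np.linalg.norm(features[i] - features[j]) for j in chosen),
                      default=0.0)  # first pick is purely the most representative
            score = tradeoff * rep[i] + (1.0 - tradeoff) * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```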
On Saturday, I flew back from Dublin after having spent a couple days at DCU. On the way to the airport, the taxi driver asked me where I was going.
"Back to Amsterdam." I said. "I came Wednesday and now I am already going back."
"Oh, Amerdam. Did you bring tulips?"
"No, chocolate." I replied, and then added "The funny thing is, is that I'll probably get some chocolate to bring the other direction as well."
"I'll tell you what you got to get." he said. And then launched into description of how amazing the Baileys Irish Cream chocolates are. I've been to Dublin several times, but I still have to ask people to repeat words sometimes because of the Irish English, which he did quite patiently. In the end the picture emerged that the Baileys Irish Cream chocolates are amazing because they are not just chocolatey, but they offer a real Baileys experience as well.
The last time I flew back from Dublin, I had seen the Baileys Irish Cream chocolates at the airport and steered clear -- putting Baileys in chocolate was something, I assumed, they had come up with for the tourists, and that I would do well to avoid. However, because the taxi driver had no particular reason to tell me this, and because he was himself, to me, clearly Irish, I shifted my opinion of the Baileys Irish Cream chocolates from "mainstream tourist" to "genuine Irish souvenir". Since I was the one who brought up the chocolate, the taxi driver was certainly not doing product placement. Perhaps Baileys is simply very, very smart with this. But if they are, the Baileys Irish Cream chocolates will still remain authentic to me, because the whole story happened at a moment when I was devoting a lot of time and attention to Near2Me and thinking a lot about authenticity of place.
This is Part I of the "Irish Chocolate Discussion", reflections on the conversation I had with a Dublin taxi driver and how that relates to finding things and search systems in general.
Thursday, August 19, 2010
Internet of Hearts and Minds
The flood disaster in Pakistan moved slowly from the periphery of my vision to central focus. An e-mail arrived from the IEEE Foundation calling for donations to the IEEE Pakistan Engineering Educational and Professional Development Rebuilding Fund. A4 posters asking for support have been posted in the elevators. But today things got really real when I was asked to help edit a piece, written by a TU-Delft PhD student studying flooding in Pakistan, that described the disaster. The piece gave a succinct overview of the current situation and its potential for further deterioration and long-term damage, within Pakistan and internationally, and, hopefully, will reach a wide readership. It seems that my social network needs to reach out and grab and shake me before I can turn my attention to an event of such a staggering scale and figure out how I can add my humble little building block towards an overall solution.
The vision formulated by The Social Computer initiative says that we should expect more. Basically, the Social Computer is the act of collaborative computation by the people, of the people, for the people. Human-computer convergence on this scale has the power to solve problems: we link up not only our individual intellectual capacities, but also our individual abilities to interact with and influence our direct environments. The goal is to "tackle large scale social problems that are beyond our current capabilities".
It shouldn't take weeks for the Pakistan disaster to filter through to me and for me to start understanding what I might possibly do. But the fact that it eventually does filter through at all, that in some way I end up really feeling connected and do my little bit, illustrates the untapped power of The Social Computer. To focus on internet-scale social computation as the force that revolutionizes humanity as a whole by fostering the best impulses of individuals, I call it "The Internet of Hearts and Minds." The next miracle of modern technology is the one we make ourselves by linking ourselves together to form The Social Computer -- it may be a slow and silent revolution, but it is one with the power to help us all.
Monday, July 26, 2010
Advances in Multimedia Retrieval Tutorial at ACM Multimedia 2010
I worked a long day today and now I'm home, have eaten dinner, and am thinking about how to relax. I'd like to watch an episode of Merlin on the Internet. Preferably legal and one I've never seen before -- I wouldn't mind paying if it was a site I trusted. That seems like a pretty complicated search, and my prediction is that it will lead to frustration. So here I am writing in my blog instead.
Today the longplay description of our upcoming tutorial on Frontiers in Multimedia Search went online. We want to start out by addressing the question of how multimedia search can benefit people's daily lives, at work and otherwise. I'm feeling a rather strong need for the benefits of multimedia at the moment. If I can't have my Merlin, right now I wouldn't mind browsing back through recordings of the SIGIR presentations that I heard last week -- and maybe some of the ones that I missed.
Then we are planning to go on to take a look at new approaches to multimedia retrieval that we divide into three categories (I include a couple of my own notes on each):
- Making the most of the user: In motion on the Internet, we dribble information behind us. We tag, we query, we click, we brush over a page without a second glance. We have the capacity to glance at a set of snippets, glazing over what does not interest us to find what does. Making the most of the user is about letting the search engine turn the computational crank and do the look-ups, leaving the fine-grained semantic judgments to the human brain.
- Making the most of the collection: Sometimes the collection can speak for itself. Pseudo-relevance feedback may dilute our queries, but it is also a valuable tool for increasing recall (see the sketch after this list). And then there is collaborative filtering: making use of the patterns that we as users leave behind -- but now at the collection or community level.
- Making the most of individual items: What is important here is how to do the best you can with noisy sources of features (speech recognition, visual concept detection) to represent items. You don't necessarily need to provide a complete representation of an item -- information that helps distinguish items, or keeps them from getting confused with each other, can sometimes be a big help.
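Since I mention pseudo-relevance feedback above, here is how bare-bones the mechanics can be. A sketch under the assumption that documents arrive as lists of tokens; it is not tied to any particular system from the tutorial:

```python
from collections import Counter

# Minimal pseudo-relevance feedback: expand the query with the most frequent
# terms from the top-ranked documents of an initial retrieval run.
def expand_query(query: list[str], top_docs: list[list[str]], n_terms: int = 3) -> list[str]:
    counts = Counter(term for doc in top_docs for term in doc)
    for term in query:          # don't re-add terms already in the query
        counts.pop(term, None)
    return query + [term for term, _ in counts.most_common(n_terms)]

# e.g. expand_query(["merlin", "episode"], top_docs=first_pass_results)
```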
Friday, July 23, 2010
SIGIR 2010 Crowdsourcing for Search Evaluation Workshop
We used Amazon Mechanical Turk (MTurk) to gather annotations for the video corpus to be used in the Affect Task at the MediaEval 2010 benchmark evaluation. The task involves automatically identifying videos that viewers report to be particularly boring. We wrote the corpus development up and submitted it to the Crowdsourcing for Search Evaluation Workshop at SIGIR 2010. Initially we wondered a bit if the paper was appropriate for the workshop, since we were working on affect and not directly on search. But we were glad that we took the risk and went for it. The paper was accepted and the workshop was great -- right on target with our interests.
Soleymani, M. and Larson, M. Crowdsourcing for Affective Annotation of Video: Development of a Viewer-reported Boredom Corpus. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation.
We also received the Runner Up for the Most Innovative Paper Award, which was sponsored by Microsoft Bing. Thank you! We are already considering how to get the most bang for our Bing bucks. Probably it will flow directly back into MTurk for our next crowdsourcing project.
Labels: affect, crowdsourcing, MediaEval, SIGIR, Switzerland, VideoCLEF
Sunday, July 18, 2010
Emotive speech and navigation systems
During a recent family weekend, my best friend, my mother and I found ourselves in a car using my aunt's navigation system to guide us to our destination. We quickly developed a love-hate relationship with the device -- our feelings of annoyance generally outweighing our gratefulness at having been guided efficiently to our destination.
My mom then circulated an article from CNN entitled Why GPS voices are so condescending. And my aunt mailed back, 'Hey, isn't this what you work on?'
The answer to that question is: yes, well, not quite. I go the other direction. Instead of automatically producing emotive speech, I start with a recording of emotive speech and automatically analyze how it was produced. We just got a paper accepted in a session entitled "Paralanguage" at Interspeech 2010:
Jochems, B., Larson, M., Ordelman, R., Poppe, R. and Thruong, K. Towards Affective State Modeling in Narrative and Conversational Settings. Proceedings of Interspeech 2010 (to appear).
The CNN article also falls into the category of paralanguage. Paralanguage is basically the things that we do with speech that modify the factual content or conventional meaning of what we are saying. In this case, it's adding emotive nuance.
The designers of navigation systems are stuck with the following impasse: if a navigation system is good, it will always be right. Socially, there is a taboo against always being right. An always-right system will always be perceived as condescending, be its voice ever so loving and sweet. That's simply the way that social behavior works -- we count on each other to act responsibly, but not to pretend that we're perfect. The implication is that we, as humans, will never truly adopt the metaphor of "it's just a person telling me where to drive" for a navigation system whose understood purpose is to deliver infallibility.
In my opinion, what the designers of navigation systems should do is use the voice of someone who enjoys special social status and, as such, "gets away" with being always right. For example, theoretical physicist Stephen Hawking. His smarts are generally acknowledged to transcend the smarts of the rest of us mere mortals. Interestingly, he also speaks using a computer voice, because of a motor neurone disease. It wouldn't take a whole lot of memory space on your little navigation device to produce a believable rendition of his speech.
The issue also has a huge safety aspect (which is also raised in the CNN article). If the navigation system uses emotive speech in a very convincing manner, it is smooth sailing. However, what if something goes wrong? To the driver, it will be like a thunderbolt out of a blue sky. Everything was going fine, and all of a sudden the device turned and lashed out with an emotively inappropriate direction. Possibly, this would happen at a critical driving point. The driver shouldn't be so comfortable with the device as to completely exclude the possibility that it goes way off the mark.
Basically, a car navigation system presents us with another instance of the Paradox of 'Simplicity'. It takes a lot of very complicated innards to make a device that drivers perceive simply as a human telling us how to get there. The paradox comes in when that device does something wrong and all of a sudden the human is stuck both solving the immediate driving issue and also compensating for the apparently inexplicable (those complicated innards!) failure of the system. In this case, for example, a beautifully real rendition of a plaintive tone pleading "Turn back! Turn back!" when actually we find ourselves stuck in the express lane in heavy traffic.
The theoretical physicist persona would help to lessen the impact of such errors. Sorry, Stephen Hawking, but theoretical physicists can get away with being socially inappropriate once in a while without throwing us into a state of shock -- we assume that they are simply busy on a higher plane and don't mean to really insult or confuse us.
However, instead of talking to Stephen Hawking about a deal to have him donate his authority to make navigation systems safer, navigation system companies (according to CNN) are looking into fitting the systems with drivers' own voices. It sounds cool, until you think about some of the implications.
First, there are probably people who don't react well to their own voices. Perhaps I could accept my own voice reminding myself of the route to somewhere I've been before, but my own voice directing me to somewhere I have never been, for example, Makuhari, Japan (where Interspeech 2010 will take place in September) is absolutely implausible. I know I can't trust myself on that one.
Second, drivers need to be encouraged not to turn off their human intelligence when driving with a navigational system. The system doesn't tell you, "Stop here, the light is red". Listening to your own voice is probably not the right way to ensure that you are actively applying the underlying rules and your own common sense to driving.
Third, it's not uncommon to rent a car, or to borrow someone else's car or navigation system, on a single-use basis. For example, my aunt lent us her device for one trip. Wouldn't we like our devices more if they were one-size-fits-all? Just as Walter Cronkite provided widespread satisfaction as the voice of the evening news, what's wrong with a generally acceptable central voice for all navigation systems?
Fourth, it's not only the driver who needs to listen to the navigation system. With several people in the car, navigation often involves pooling knowledge of the route and negotiating consensus. If the driver's voice is talking on the navigation system, the passengers are shut out of the process. For maximally safe driving, you don't want a "back seat driver", but a co-pilot who is engaged in the process is very helpful.
Fifth, it is not clear that the navigation system companies are the ones who should be deciding how navigation system personae can be made more acceptable to drivers. Convincing individual drivers that they need a personalized voice for their system would open up an incredible new opportunity for profit for navigation device companies. On top of the system and the route information, they would also be able to sell you your own persona.
Additionally, a universal "Stephen Hawking" solution, which I am arguing may actually be safer, would make it impossible for navigation system companies to distinguish themselves from each other on the basis of the differential appeal of their navigation personae, and is simply not in the companies' best business interest.
My suggestion is simply to learn to love the condescending dead-pan delivery of your current navigation system -- demanding anything different may prompt the designers of navigation systems to make the situation a whole lot worse.
Don't we do this already? How often have you ever been directed somewhere by a fellow human issuing emotively inappropriate directions? You've reminded yourself to take some deep breaths, stay concentrated on the road and gotten there in the end. We shouldn't demand from our automatic devices more than what we get from our fellow human beings.
P.S. Whoa, this claims to be a blog on the topic of search, what does this have to do with search? OK. You've caught me. Sometimes I just write things here because I know that I can find them again.
Saturday, July 17, 2010
IF discouraged THEN write good reviews
Ever get a Bad Review? I don't mean one where the reviewer gives constructive criticism and recommends rejection. I mean one that is really bad in the sense that it is unhelpful, off-topic, lacking in rigor, poorly written, pedantic or pompous. It takes a lot of energy to sort through these sorts of reviews, find the wheat, discard the chaff, and make sure that the experience doesn't drag you down to the point of derailing a potentially productive scientific endeavor. Bad Reviews sometimes even recommend acceptance. Acceptance leads, perhaps not to disappointment at the moment, but rather to more general scientific disheartenment: is this really the level of intellectual standards that characterizes the field to which I have chosen to devote my career?
A surprisingly satisfying way to push back against Bad Reviews is to engage in reflection upon one's own reviewing skills and strive to improve them. This course of action is not going to have an immediately measurable effect on the system as a whole, but it does restore a sense of balance. Especially if you interact with a lot of students, you have an amazing opportunity to teach them to review. There's something cheering about knowing that the scientists you have mentored are not going to be the ones generating the Bad Reviews of the next generation.
In order to be able to tell people quickly about my own reviewing values and my campaign to constantly improve my own reviews, I have packed the points I consider while reviewing into a scheme that I call IF THEN:
- I is for Issue: Does the paper motivate the issue that it addresses and then close the loop in the end, convincing the reader that it has accomplished what it set out to do?
- F is for Fit: Does the paper fit with the call for papers of the conference or scope of the journal to which it was submitted?
- T is for Technical soundness: Do the authors apply solid, state-of-the-art experimental and/or analytical methods?
- H is for Historical context: Do the authors present the context of their work? (Including both the related work and the outlook onto future work.)
- E is for Exposition: Is the paper clearly written and a pleasure to read? Is the information it contains complete and comprehensive?
- N is for Novelty: Is the idea new in the field? Is it the sort of innovation that is destined to make an impact?
But what if you are already a world-class reviewer? What then? Like musicians, we must remember that excellence is not static. Moshe Vardi introduced a rule for reviewing in an editorial in the current edition of Communications of the ACM. The rule reads, "Write a review as if you are writing it to yourself." He calls it The Golden Rule of Reviewing. Most people are still working on putting into practice the Golden Rule they learned in their childhoods. The Bad Reviews will keep on coming, and about the only thing we have control over is how we review back.