Saturday, September 18, 2010

MediaEval Tagging Task Professional

DIXIT, a Dutch-language journal for speech and language technology, invited me to do a piece on the "Tagging Task Professional", one of the four multimedia indexing and retrieval tasks that the MediaEval benchmarking initiative ran in 2010. I am posting an English version of the text here on my blog. The piece will appear in December, after the MediaEval 2010 workshop in October (I note that in order to explain the past tense used to describe an event that has not happened yet).

The workshop will be held in a medieval convent called Santa Croce in Fossabanda, located in Pisa, Italy. The photo here is from Flickr user Marius B, licensed under Creative Commons License by-nc-sa. I notice that I do well with attribution if I am going to print material (brochures etc.), but I get sloppy with PowerPoint. If I know this photo is on my blog, I will be able to quickly remind myself that it comes from Marius B in case I want it in future presentations.

Many Minds Make Light Work: Bringing Researchers Together to Work towards Automatic Indexing for Cultural Heritage Multimedia Collections

"Medieval", "mediaeval" and "MediaEval" are all pronounced the same. While "medieval" and "mediaeval" are alternate spellings for an adjective describing something that occurred in the Middle Ages, "MediaEval" is a benchmark initiative that brings researchers together to tackle challenging tasks in the area of multimedia indexing and retrieval. In 2010, a group of researchers worked individually and then met at a medieval convent, "Santa Croce in Fossabanda", in Italy. Can a group of MediaEval scientists solve today's challenges of automatic generation of metadata for cultural heritage multimedia content?

Cultural heritage content often takes the form of multimedia and in particular of audio and video recordings. Cultural heritage collections are often staggering in size. The archive of the Netherlands Institute for Sound and Vision houses a breathtaking 250,000 hours of video content and receives an additional 8,000 hours of content broadcast by national broadcasting companies each year. Material that is stored in such a huge collection, but is not adequately annotated, is useless since it can no longer be found by people who wish to view, reuse or otherwise study it. Professional archivists have developed a set of techniques for annotating material with metadata for storage in the archive that will ensure that it can later be found. These techniques have stood the test of time and will continue to be critical for finding multimedia content in large archives in the future. The ability to generate high quality metadata, however, is not enough. Rather, metadata production must be scaled so that incoming material can be appropriately annotated at the rate at which it arrives.

Techniques from the area of Speech and Language Technology hold promise to support archivists in the generation of archival metadata. Here, we specifically look at the problem of generating subject labels (or "keywords") for television broadcasts. Subject labels are terms drawn from the archive thesaurus. Examples of keywords are Archeology (archeologie), Architecture (architectuur), Chemistry (chemie), Dance (dansen), Film (film), History (geschiedenis), Music (muziek), Paintings (schilderijen), Scientific research (wetenschappelijk onderzoek) and Visual arts (beeldende kunst). Automatic generation of subject labels can help archivists in one of two ways: by providing a list of suggested subject labels for a video, thus narrowing their field of choice, or by automatically generating a best guess in order to label material which would otherwise go un-annotated due to the huge volume of incoming video material and the time constraints of the archive staff.

Automatic generation of subject labels is accomplished by algorithms that make use of several data sources: production metadata for broadcasts, transcripts of the spoken content of broadcasts produced by automatic speech recognition technologies and analysis of the visual content of the broadcast recording. The algorithms apply statistical techniques including word-counts and co-occurrences and also machine learning methods. Current algorithms are, however, far from perfect and their further improvement requires sustained and concerted effort on the part of research scientists.
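To make the word-count idea concrete, here is a minimal sketch in Python of how subject labels might be suggested from a speech recognition transcript. It is not one of the algorithms used in the benchmark. The mini-thesaurus, its indicator words, and the function name `suggest_labels` are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: subject label -> indicator words.
# These word lists are invented for illustration and do not come from the archive.
THESAURUS = {
    "muziek": {"muziek", "orkest", "concert", "zanger"},
    "geschiedenis": {"geschiedenis", "oorlog", "eeuw", "middeleeuwen"},
    "dansen": {"dans", "dansen", "ballet", "choreograaf"},
}


def suggest_labels(transcript, thesaurus=THESAURUS, top_n=2):
    """Rank subject labels by how often their indicator words occur in the transcript."""
    # Count every lowercased word in the transcript.
    words = Counter(re.findall(r"\w+", transcript.lower()))
    # Score each label by summing the counts of its indicator words.
    scores = {
        label: sum(words[term] for term in terms)
        for label, terms in thesaurus.items()
    }
    # Keep only labels with at least one hit; sort by score, then alphabetically.
    ranked = sorted(
        (label for label in scores if scores[label] > 0),
        key=lambda label: (-scores[label], label),
    )
    return ranked[:top_n]


transcript = "Het orkest speelde een concert met muziek uit de achttiende eeuw"
suggestions = suggest_labels(transcript)
print(suggestions)
```

A ranked list like this corresponds to the first use case mentioned earlier, in which the archivist receives a short list of suggestions to choose from. A realistic system would add co-occurrence statistics, production metadata and machine learning on top of raw counts.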

Many researchers are interested in working on the problem of automatically generating subject labels for cultural heritage material. However, in order for a researcher to begin working in this area, a number of problems must be faced.
  1. It is necessary to have an understanding of the problem, which requires general knowledge of how subject labels are produced in the archive and what they are used for
  2. It is necessary to have access to a large amount of example data in order to develop and train algorithms
  3. It is necessary to have access to data sources such as speech recognition transcripts or visual features. In general, it is not possible to generate these resources in a lab that is not already specialized in these areas
  4. It is necessary to understand the work that has previously been carried out in the area in order not to duplicate techniques that have already been tried by other researchers
  5. It is necessary to know how well one's own algorithms compare to the current state of the art.
The purpose of a benchmarking initiative is to address these problems and let researchers concentrate their energy on the hard work and creative thinking that it takes to develop new algorithms for important tasks. MediaEval is one of several benchmarking initiatives that pursue this paradigm. The special topic area addressed by MediaEval is multimedia, with a focus on speech, language and social features and how they can be combined with visual features.

MediaEval promotes research progress in the area of automatically generating subject labels for cultural heritage material by running a "Task" devoted to subject labeling for professional archives. A Task consists of three parts: a description of the problem, a data set and a set of resources that can be used to solve the problem. Having the problem packaged as a task gives researchers easy entry to understanding the issue from the perspective of the archives and allows licensing of the data from the archive to occur in a streamlined manner. The University of Twente supplies speech recognition transcripts, which makes it possible for research groups without competence in Dutch-language speech recognition to contribute to developing improved approaches to the task. Information about the other tasks offered can be found on the MediaEval website.

Researchers approach the tasks by first working to solve them individually. They submit their solutions, which are evaluated by the MediaEval organizing committee. Because all researchers working on the same task have used the same data set, the solutions are directly comparable with each other and it is possible to see which approaches provide the best performance for the automatic generation of subject labels. Researchers then gather at a workshop in order to discuss the results, build collaborations and plan approaches for next year. The workshop fosters the friendly competition between sites that is necessary for progress on the issues, but it also builds collaboration, encouraging sites to bundle their efforts and to avoid duplicating investigation of approaches that have already been shown to be less fruitful.
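As a small illustration of why a shared data set makes submissions comparable, the sketch below scores two hypothetical runs against the same reference labels. The text does not say which measure the organizers use, so the mean F1 score here is an assumption chosen for simplicity. The video identifiers, runs and function names are also invented.

```python
def f1_score(predicted, reference):
    """F1 overlap between predicted labels and archivist-assigned labels for one video."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def mean_f1(run, ground_truth):
    """Average the per-video F1 over every video in the shared reference set."""
    total = sum(
        f1_score(run.get(video, []), labels)
        for video, labels in ground_truth.items()
    )
    return total / len(ground_truth)


# Invented reference labels and two invented submissions, for illustration only.
ground_truth = {"video1": ["muziek", "dansen"], "video2": ["geschiedenis"]}
run_a = {"video1": ["muziek"], "video2": ["geschiedenis"]}
run_b = {"video1": ["film"], "video2": ["geschiedenis", "muziek"]}

score_a = mean_f1(run_a, ground_truth)
score_b = mean_f1(run_b, ground_truth)
print(f"run A: {score_a:.3f}, run B: {score_b:.3f}")
```

Because both runs are scored against the same videos and the same reference labels, the two numbers can be compared directly.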

The MediaEval 2010 workshop was held in Pisa, Italy in October 2010, directly before ACM Multimedia, a large multimedia conference. It was held in a medieval convent, "Santa Croce in Fossabanda", that had been converted into a hotel with seminar facilities. A site so evocative of the beauty and the value of cultural heritage was particularly suited to host researchers focused on the issues that will help improve automatic indexing of tomorrow's cultural heritage content.