Sunday, November 10, 2013

The Importance of Information Friction: Safer Societies and Subtle Satisfactions

Wet leaves abound on the ground these days in Europe, and when I walk the streets they lead me to think about the critical role that friction plays in our worlds. I own a special pair of dance shoes that give just the right amount of friction to let me glide across the floor. More would prevent me from dancing. But less, oh, less could be disastrous---movement from one spot to another would be uncontrolled and dancing would be impossible.

If you stop to think about it, it really is a similar phenomenon that makes the stories that we tell interesting. Imagine that there were no friction and that the punchline of a joke would always simply slip out of our mouths with the first line. It's not as nasty as taking a tumble during the tango, but if we collectively lost the ability to tell jokes and stories, it would not make the world a better place.

For us to entertain each other in this way, some kind of an "information friction" is necessary, i.e., a force that holds back everyone knowing everything all the time.

This observation may seem a bit trivial, given that concerns about eroding "information friction" are more frequently related to "big issues", e.g., safety in society.

For example, people worry about private information being leaked that can do them serious damage. This is a legitimate concern. We are thankful when the dangers of sharing are treated constructively by the news media, e.g., ConsumerWatch: Social Media Users May Be Revealing Too Much About Location « CBS San Francisco

Here, the benefit of information friction is clear. On Twitter information about people's location travels as far and as fast as light. Unlike light, it's not gone once the source stops emitting, but rather hangs around and reveals patterns capable of haunting or even hurting, should they be traced and exploited by people with evil intent. We still want to share with others around us, but this sharing should have some natural limits, just like the physical forces that keep us from flying around without control in the real world.

We share the vision of a world in which governments can exploit patterns in real-time personal data to prevent or mitigate the effects of epidemics and catastrophes.
But another "big issue" is that we don't want to trust every government with all the data in the world. We would appreciate a bit of information friction if it could keep our data a bit closer to home and in hands that we feel we can count on because we know them well.

Here, I would like to explore another, less dramatic aspect of information friction: I would like to highlight the critical role that friction plays in the fun, loveliness and delight of life. In addition to jokes, as I described above, think also about movies. We see online communities carefully creating information friction: adding spoiler alerts so that viewers don't come to know the end of the movie until the time is perfectly right for it.

Browsers offer private browsing, which seems to me the only way that it is possible to buy anyone a gift these days---anyone (family member, roommate, colleague) who might happen to glance at the screen of your computer, that is. Currently, my browser is happily showering me with recommendations for flights to the destinations that I have recently investigated using my search engine. Some may see benefit in this added inconvenience for people who are trying to deceive their loved ones. But the romantics among us regret the passing of the days in which we could book a surprise getaway trip for two without having to remember to adjust the browser settings.

We need a certain amount of information friction to make it possible to give gifts to each other. To surprise each other with unexpected kindnesses. To perform magic tricks, pass on secret recipes to the next generation, to have fun with puzzles and riddles and generally create a sense of joy and wonder.

So yes, limits to the flow of information will keep us safer from the dangers of misuse or misinterpretation of private data by evil others. But it will also keep the underlying exchanges that bind us together in relationships, social groups and societies alive and well. We need to be able to tell each other stories and to give each other little presents. For these exchanges to work, information friction must be balanced to admit flow, but also to restrict it in just the right way.

It's not a simple balance to achieve. In the real world, so much is simply handed to us by physics. Friction simply exists and doesn't have to be created for a specific purpose. The digital world, on the other hand, provides a serious challenge: we can't count on social gravity giving us a constant acceleration.

I'll end by noting the reason that I think it is so important to relate the need for the right amount of information friction not only to "safer societies" but also to smaller things like jokes and surprises, what I call "subtle satisfactions" in the title of this post. It is fatiguing to police our level of social sharing with alarmist thoughts about revealing information that could lead to burglary or kidnapping. Our brains prefer focusing on the enjoyable and comforting parts of life. We can exercise a gentler rein by considering the positive: contributions to protecting individuals' privacy online also uphold a world where your loved one can give you the most wonderful gift of your life on your birthday without having to execute technical gymnastics.

Rather than only working towards a world in which bad things are impossible, we should realize that we are also working towards a world in which the good things stay possible. Achieving and maintaining information friction balance is a lot of work, so I do not doubt that we can use both motivators.

Sunday, October 27, 2013

Power behind MediaEval: Reflecting on 2013

Power behind MediaEval 2013
The MediaEval 2013 Workshop was held in the Barri Gòtic, the Gothic Quarter, of Barcelona on 18-19 October, just before ACM Multimedia 2013 (also held in Barcelona). The workshop was attended by 100 participants, who presented and discussed the results that they had achieved on 11 tasks: 6 main tasks and 5 "Brave New Tasks" that ran for the first time this year.

The workshop produced a working notes proceedings containing 96 short working papers, and has been published by

The purpose of MediaEval is to offer tasks to the multimedia community that support research progress on multimedia challenges that have a social or a human aspect. The tasks are autonomous and each run by a different team of task organizers. My role within MediaEval is to guide the process of running tasks, which involves providing feedback to task organizers and sending out the cues that keep the tasks running smoothly and on time. Today I did a quick count that revealed that during the 2013 season, I wrote 1529 personal emails containing the keyword "MediaEval".

What makes MediaEval work, however, cannot be expressed in numbers. Rather, it is the dedication and intensive effort of a large group of people, who propose and organize tasks and carry out the logistics that make the workshop come together. My motivation to continue MediaEval year after year stems largely from an underlying sense of awe at what these people do: both at the work that I am aware of and also at the many things they do behind the scenes that remain largely invisible. These people are the power behind MediaEval. Here I represent them with the picture above, which shows the power plugs arranged by Xavi Anguera from Telefonica with the assistance of Bart Thomee from Yahoo! Research. The process involved a combination of precision car driving and applied electrical engineering.

In the airplane back from Barcelona yesterday, I finished processing the responses that we received from the participant surveys (collected during the workshop), input from the organizers meeting (held on Sunday after the workshop), and feedback that people gave me verbally during ACM Multimedia (last week). These points are summarized below.

Thus endeth MediaEval 2013, but at the same time beginneth the season of MediaEval 2014. Hope to have you aboard.

Community Feedback from the MediaEval 2013 Multimedia Benchmarking Season + Workshop

The most important feedback point this year concerned the new structure of the workshop, which was very well received. This year the workshop was faster paced and we introduced poster sessions. We were happy that people liked the short talks and that the poster sessions were considered to be useful and productive. There is a clear preference for more discussion time at the workshop, both in the presentation sessions and in the poster sessions. An idea for the future is to separate passive poster time (posters are hanging and people can look at them but the presenter need not be present) from active poster time (presenter is standing at the poster).

The number one most frequent request was for MediaEval to provide more detailed information. This request was made with respect to a range of areas: descriptions of the tasks should always strive to be maximally explicit; descriptions of the evaluation methods should be detailed and available in a timely manner; task overview talks at the workshop should contain examples and descriptions that allow a general audience (i.e., people who did not participate in the task) to understand the task easily.

Other suggestions were to increase consistency checks and to continue to promote industry involvement. Finally, there were requests for more time to prepare presentations and for explicitly inviting (and supporting) groups to present demos with their posters.

The organizers meeting on Sunday was the source of additional feedback. Task organization requires a huge amount of time and dedication from task organization teams and it is important that this is distributed as evenly as possible across the year and across people. In general, tasks would benefit from additional practical guidance on organization. This includes task management and evaluation methodologies. Since MediaEval is a decentralized system, the source of this guidance must be people with past experience with task organization and communication between tasks. Here, the bi-weekly telcos for organizers are an important tool.

In the coming year, the awards and sponsorship committee can expect an expanded role. The outreach to early-career researchers and to researchers located outside of Europe (in the form of travel grants) is seen by the organizers to be not merely a "nice-to-have", but rather a central part of MediaEval's mission. There is solid consensus about the usefulness of the MediaEval Distinctive Mentions (MDMs). MDMs are peer-to-peer certificates awarded by task organizers to each other or to the participants of their tasks. The MDMs allow the community to send public messages between members of the community, and especially to point out participant submissions that are highly innovative or have particularly high potential (although they may not have been top scorers according to the official evaluation metric). It is important to make clear that the MediaEval Distinctive Mention is not an "award", since the process by which they are chosen is intentionally kept very informal. In the coming year, we will be investigating whether MediaEval should introduce a five-year impact award, which would be more formal in nature. The peer-to-peer MDMs will be maintained, although an effort will be made to make them increasingly transparent.

In general we were satisfied with the process used to produce the proceedings. Having groups do an online check of their metadata was helpful. If future years also involve proceedings with 50+ papers, we will need to further streamline the schedule for submission---with the ultimate goal of having the proceedings online at the moment that the workshop opens.

Tuesday, October 22, 2013

CrowdMM 2013: Crowdsourcing in Multimedia: Emerged or Emerging?

CrowdMM 2013, the 2nd International ACM Workshop on Crowdsourcing for Multimedia, was held in conjunction with ACM Multimedia 2013 on 22 October 2013. This workshop is the second edition of the CrowdMM series, which I have previously written about here. This year it was organized by Wei-Ta Chu (National Chung Cheng University in Taiwan) and Kuan-Ta Chen (Academia Sinica in Taiwan) and myself, with critical support from Tobias Hossfeld (University of Wuerzburg) and Wei Tsang Ooi (NUS). The workshop received support from two projects funded by the European Union, Qualinet and CUbRIK.

During the workshop, we had an interesting panel discussion with the topic "Crowdsourcing in Multimedia: Emerged or Emerging?" The members of the panel were Daniel Gatica-Perez from Idiap (who keynoted the workshop with a talk entitled, "When the Crowd Watches the Crowd: Understanding Impressions in Online Conversational Video"), Tobias Hossfeld (who organized this year's Crowdsourcing for Multimedia Ideas Competition) and Mohammad Soleymani from Imperial College London (together we presented a tutorial on Crowdsourcing for Multimedia Research the day before). The image above was taken of the whiteboard where I attempted to accumulate the main points raised by the audience and the panel members during the discussion. The purpose of this post is to give a summary of these points.

At the end of the panel, the panel together with the audience decided that crowdsourcing for multimedia has not yet reached its full potential, and therefore should be considered "emerging" rather than already "emerged". This conclusion was interesting in light of the fact that the panel discussion revealed many areas in which crowdsourcing represents an extension of previously existing practices, or stands to benefit from established techniques or theoretical frameworks. These factors are arguments that can be marshaled in support of the "emerged" perspective. However, in the end the arguments for "emerging" had the clear upper hand.

Because the ultimate conclusion was "emerging", i.e., that the field is still experiencing development, I decided to summarize the panel discussion not as a series of statements, but rather as a list of questions. Please note that this summary is from my personal perspective and may not exactly represent what was said by the panelists and the audience during the panel. Any discrepancies, I hope, rather than being bothersome, will provide seeds for future discussion.

Summary of the CrowdMM 2013 Panel Discussion
"Crowdsourcing for Multimedia: Emerged or Emerging"

Understanding: Any larger sense of purpose that can be shared in the crowdsourcing ecosystem could be valuable to increase motivation and thereby quality. What else can we do to fight worker alienation? Why don't taskaskers ask the crowdworkers who they are? And vice versa?

Best Practices: There is no magic recipe for crowdsourcing for multimedia. Couldn't the research community be doing more to share task design, code and data? Would that help? Factors that contribute to the success of crowdsourcing are watertight task design (test, test, test, test, test, test the task design before running a large-scale experiment), detailed examples or training sessions, inclusion of verification questions, and making workers aware of the larger meaning of their work. Do tasks have to be fun? Certainly, they should run smoothly so that crowdworkers can hit a state of "blissful productivity".

History: Many of the issues of crowdsourcing are the same ones encountered when we carry out experiments in the lab. Do we make full use of the carry over? Crowdsourcing experiments can be validated by corresponding experiments that have been carried out in a lab environment. Do we do this often enough?

Markets: Many of the issues of crowdsourcing are related to economics, and in particular to the way that the laws of supply and demand operate in an economic market. Have we made use of theories and practices from this area?

Diversity: Why is crowdsourcing mostly used for image labeling by the community? What about other applications such as system design and test? What about techniques that combine human and conventional computing in online systems?

Reproducibility: Shouldn't reproducibility be the ultimate goal of crowdsourcing? Are we making the problem too simple in the cases where we struggle with reproducibility? Understanding the input of the crowd as being influenced by multiple dimensions can help us to better design crowdsourcing experiments that are highly replicable.

Reliability: Have we made use of reliability theory? How about test/retest reliability used in psychology?

Uncertainty: Are we dealing with noisier data? Or has crowdsourcing actually allowed us to move to more realistic data? Human consensus is the upper limit of what we can derive from crowdsourcing, but does human consensus not in turn depend on how well the task has been described to crowdworkers? Why don't we do more to exploit the whole framework of probability theory?

Gamification: Will it solve the issues of how to effectively and appropriately incentivize the crowd? Should the research community be the ones to push forward gamification? (Do any of us realize how many people and resources it takes to make a really successful commercial game?)

Design: Aren't we forgetting about a lot of work that has been done in interaction design?

Education: Can we combine crowdsourcing systems with systems that help people learn skills that are useful in real life? In this way, crowdworkers receive more from the system in exchange for their crowdwork, in addition to just money.

Cats: Labeling cats is not necessarily an easy task. Is a stuffed animal a cat? How about a kitten?
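Of the best practices listed above, verification questions are concrete enough to sketch in code: seed the task with items whose answers are known in advance ("gold" questions) and filter out workers who fail them. Everything in the snippet below, the data, the helper names, and the 0.75 threshold, is an illustrative assumption of mine, not a prescription from the panel.

```python
# Sketch of the verification-question practice: hide gold questions
# (known answers) among real tasks and drop workers whose gold accuracy
# is low. Data and threshold are invented for illustration.

GOLD = {"img_01": "cat", "img_07": "dog", "img_12": "cat"}

def worker_accuracy(answers, gold=GOLD):
    """Fraction of a worker's gold-question answers that are correct."""
    hits = sum(answers.get(q) == label for q, label in gold.items())
    return hits / len(gold)

def trusted(answers, threshold=0.75):
    """Keep only workers who pass enough verification questions."""
    return worker_accuracy(answers) >= threshold

# A careful worker and a spammer answering the same batch:
careful = {"img_01": "cat", "img_07": "dog", "img_12": "cat", "img_03": "dog"}
spammer = {"img_01": "cat", "img_07": "cat", "img_12": "dog", "img_03": "cat"}

print(trusted(careful))  # True: 3/3 on the gold questions
print(trusted(spammer))  # False: 1/3 on the gold questions
```

In a real deployment the gold questions would of course be indistinguishable from the ordinary items, which is part of what makes watertight task design so much work.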

Wednesday, September 4, 2013

Towards Responsible and Sustainable Crowdsourcing

Humans are the ultimate intelligent systems. Units of human work can be used to address the problems studied in the fields of pattern recognition and artificial intelligence. After years of research to crack certain tough problems, mere utterance of the phrase "human cycle" makes it seem like someone turned on a light in the room. Suddenly, we feel we are no longer feeling our way forward in darkness as we develop solutions. Instead, a bright world of new possibilities has been opened.

The excitement that crowdsourcing has generated in computer science is related to the fact that large crowdsourcing platforms make it possible to apply abstraction to human input to the system. It is not necessary to consider who exactly provides the input, or how they "compute" it; rather, the human processor can be treated as a black box. The magic comes when it is possible to make a "call to the crowd" and be sure that there will be a crowdworker there to return a value in response to that call.
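To make the abstraction concrete, here is a minimal sketch of what a "call to the crowd" looks like from the caller's side. All names here (CrowdClient, ask, the toy workers) are invented for illustration and do not correspond to any real platform API; the point is only that human judgment is wrapped in an ordinary function call and the worker behind it stays a black box.

```python
# Hypothetical sketch: human judgment behind a function-call abstraction.
import random

class CrowdClient:
    """Stand-in for a crowdsourcing platform; routes each call to some worker."""

    def __init__(self, workers):
        self.workers = workers  # each "worker" is just a function here

    def ask(self, question):
        # The caller neither knows nor cares which worker answers.
        worker = random.choice(self.workers)
        return worker(question)

def cautious_worker(q):
    return "cat" if "whiskers" in q else "unknown"

def eager_worker(q):
    return "cat"

crowd = CrowdClient([cautious_worker, eager_worker])
answer = crowd.ask("Does the image show whiskers and pointy ears?")
print(answer)  # "cat": both toy workers happen to agree on this question
```

The abstraction is exactly what makes the ethical questions below easy to forget: the function call hides the human.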

However, crowdsourcing raises a whole new array of issues. At the same time that we excitedly pursue the potential of "Artificial artificial intelligence" (as it's called by MTurk), it is necessary to also remember "Human human computation".

I am not an ethicist, and my first foray into crowdsourcing ethics was relatively recent and necessarily superficial. In fact, I started by typing the word "ethics" into my favorite mainstream search engine and picking a definition to study that seemed to me to be authoritative. However, I am convinced that the community of crowdworkers and taskaskers together form an ecosystem and that the main threat to this ecosystem is that we treat it irresponsibly.

In other words, we should not throw out everything that we have learned over centuries of human civilization about creating healthy and happy societies, stable economies and safe and fulfilled individuals in our quest to create new systems. Ultimately, these systems must serve humanity as a whole, and not disproportionately or detrimentally lean on the portion of the population that serves as crowdworkers.

Because of this conviction, I have put together a set of slides about responsible crowdsourcing that serve as notes on the ethical aspects of crowdsourcing. At a recent Dagstuhl seminar entitled, "Crowdsourcing: From Theory to Practice and Long-Term Perspectives" I used the slides to make a presentation intended to serve as a basis for opening a discussion on the ethical issues of crowdsourcing.

The hopeful part of this undertaking is that it revealed many solutions to address ethical aspects of crowdsourcing. Some of them pose challenges that are just as exciting as the ones that motivated us to turn to crowdsourcing in the first place.

Please see the references in the slides, and also these links:

Saturday, August 24, 2013

What is multimedia?

These days a mirrored ceiling is an unambiguous call to take a photo.
Yesterday evening after the conclusion of a very successful First Workshop on Speech, Language, and Audio in Multimedia (SLAM 2013) in Marseille, participants naturally drifted to various scenic spots for debriefing, including the Vieux Port. There, over a glass of pastis (predictable, given the locale) the conversation naturally moved to the question, "What is multimedia?"

One obvious answer to the question is the Wikipedia definition of multimedia, "Multimedia is media and content that uses a combination of different content forms". The classic example is, of course, video, which has a visual modality and also an audio modality. Other examples include social images (for example, images on Flickr have a visual modality, but also have tags and geo-tags) and podcasts (which have an audio modality and also a textual modality included in their RSS feeds).

One can argue about that answer, for example, by pointing out that some people define multimedia as being any non-text media. For example, an image, like the one above. The image was taken at the new events pavilion in the Vieux Port in Marseille. The events pavilion is basically a set of columns with a plane laid on top of them; the bottom of the plane is shiny, so that when you stand under it you are looking up at a ceiling that is a large mirror.

In my view, an image in isolation cannot be taken to be multimedia, since it includes a single medium, namely pixels. It becomes multimedia in conjunction with this blog post, which adds an additional medium, namely text.

Another dimension for the debate on multimedia is whether it must necessarily involve human communication. The combination of this image and this blogpost were created by me with the intent to communicate a message to a certain audience, i.e., the readership of my blog (which, as I have previously mentioned, is largely a few fellow researchers in conjunction with future instantiations of myself).

Researchers who share my view of multimedia insist on the point that multimedia must contain a message. It must come into being as an act of communication and also be consumed in a process that involves the interpretation of meaning. This definition excludes a set of geo-tagged surveillance videos from being multimedia, although they would involve two different media, namely video content and geo-coordinates.

Note that when you require multimedia to contain a message created with explicit human intent, you enter a bit of a slippery slope. If a human being had set up the surveillance cameras with the specific intent of creating a body of information that would give fellow humans information on current street conditions, then we are back to multimedia.

The slipperiness of the definition of multimedia reveals to us something very important about our field. In order to know whether or not something is multimedia, it is not sufficient to examine the multimedia data, rather it is also necessary to look at the production and consumption chain. Multimodal data in some contexts remains just data, but if an act of encoding and decoding meaning is involved, then the same multimodal data must be considered to be multimedia.

A report on SLAM 2013 has appeared in SIGMM Records. The SLAM 2014 website went online immediately after SLAM 2013 and the anticipation is already building.

Monday, August 12, 2013

SMTH vs. Catchy: The Science Inside

Catchy logo 
While all the hype is going on about SMTH, students here in the Netherlands have their heads down and are focusing on their science.

The number one SMTH competitor is Catchy, and you can read about how it works in this paper:

Rijnboutt, Eric, Hokke, Olivier, Kooij, Rob, and Bidarra, Rafael. 2013. A robust throw detection library for mobile games. Proceedings of Foundations of Digital Games (FDG 2013).

The Catchy team created Catchy as the final thesis project in their undergraduate Computer Science program at Delft University of Technology in the Netherlands.

Rumor has it that the Catchy app was such an irresistible hit that they had the whole committee standing around playing Catchy as they defended their thesis (...a happy end to a theoretical discussion on assumptions concerning initial accelerations).
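For readers curious about the physics behind those "assumptions concerning initial accelerations": while a phone is in flight, its accelerometer reads close to 0 g rather than the usual 1 g, and a sustained near-zero run is the signature of a throw. The toy detector below illustrates only this free-fall principle under simplified assumptions of mine (noise-free samples, a fixed threshold); it is not the algorithm from the paper above.

```python
# Toy illustration of free-fall-based throw detection, NOT the published
# Catchy algorithm: in free fall the accelerometer magnitude drops from
# roughly 1 g toward 0 g for the duration of the flight.
import math

G = 9.81  # gravity, m/s^2

def magnitude(sample):
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)

def detect_throw(samples, threshold=0.3 * G, min_samples=5):
    """True if the magnitude stays near zero for a sustained run."""
    run = 0
    for s in samples:
        if magnitude(s) < threshold:
            run += 1
            if run >= min_samples:
                return True
        else:
            run = 0
    return False

resting = [(0.0, 0.0, G)] * 20                           # phone on a table: ~1 g
in_flight = resting + [(0.0, 0.0, 0.1)] * 8 + resting    # brief free fall

print(detect_throw(resting))    # False
print(detect_throw(in_flight))  # True
```

Real sensor data is far noisier than this, which is presumably why making the detection "robust" merited a paper of its own.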

You will notice that the thesis is dated June 2012, so if you called Catchy the original SMTH you would not be factually inaccurate.

Which one is installed on my own phone? In the choice between SMTH and Catchy, I go for the game that the science geeks play.

Oh, and in case in addition to the scientific paper, you also wanted the download link:

Sunday, August 11, 2013

Semantics for Multimedia Research: Getting the Definitions of "Semantics" Right

Mossy Tree

There seem to be two different definitions of semantics in use in the multimedia community. Having been trained as a linguist, I have a strong opinion about which one of these is more useful for multimedia research.

The first I call semantics as interpretation. Basically, here, semantics is considered to be anything that a human might remark when viewing and trying to interpret multimedia. Concentrating, for the sake of example, on images, this first definition would consider the semantic analysis of the image above to result in the conclusion that the image depicts a tree. The image is associated with the semantics "tree" because someone looking at the picture can imaginably say, "Oh, this picture shows a tree."

The second I call semantics as signification. This is the definition of semantics that I argue is the more useful one. Signification has to do with the use of systems of signs. Systems of signs arise because humans interact with each other and in doing so develop communication conventions. Human language is the quintessential example of a type of system of signs.

These systems of signs are shaped by the nature and the capacity of our brains. However, in contrast to perception, the process of interpretation of a sign (or set of signs) requires making reference to a set of "guidelines of use" that are not primarily a product of human physiology, but rather owe their existence to human interaction. Some sign systems are established quickly, or even in the course of a single conversation or exchange, but in general the process of human language evolution is measured in millennia rather than minutes.

This semantics-as-signification definition can be stated more formally as
Semantics is the creation and interpretation of individual and/or inter-playing signs guided by human cognition and communication convention.
Note that dictionary definitions of "semantics" mention "signification" and "signs" (see Merriam-Webster, the OED, and Wikipedia), suggesting that the semantics-as-signification definition is the more generally accepted version, and that the semantics-as-interpretation definition may be rather particular to the field of multimedia research. I will argue that it is not only particular to the field, but that it is one "innovation" that we are actually better off without.

Let's revisit the image above in light of the semantics-as-signification definition. Under this definition the image above is also associated with the semantics "tree", but not merely because someone looking at the picture can imaginably say, "Oh, this picture shows a tree." Rather, critically, it is because, also, "tree" belongs to a larger system of signs. It is a concept that humans agree is part of the way in which we communicate about the world.

Let's turn now to consider the status of the moss growing on the tree in the image. Both definitions would associate the image with the semantics "moss", since "moss" is a concept just like "tree". However, we can imagine a human looking at the moss in the picture and "seeing" something different. That person might remark about this image "Oh, this picture shows north." This remark is enough to confirm that under the semantics-as-interpretation definition, the image can be associated with the semantics "northiness".

However, in order to come to a conclusion concerning the status of "northiness" under the semantics-as-signification definition, we need to look a little further.

Specifically, for the semantics-as-signification definition, we need to be able to assume that "northiness" belongs to a larger system of signs used to communicate. In other words, we need to convince ourselves of the existence of a system of signs under which this kind of image communicates "northiness".

OK. That's easy. Effectively, we have just made up exactly such a system. I tell you that this image depicts "northiness" and you can judge the next image that I show you. We will both agree that this new image is associated with "northiness". We have just invented a new convention and are using it to interact. We may not want to claim that we have created an entirely new system of signs, but we have just "re-newed" the existing system by extending it with a new sign.

Is it always possible to identify a system of signs such that we are justified in applying the semantics-as-signification definition? Is the difference between the two definitions irrelevant?

We can safely assume that, yes, it is always possible to identify such a system of signs. Human brains are creative and flexible. If I gave you another picture similar to the one above, would you be able to apply our new concept of "northiness" to it? I would imagine that you would apply it flawlessly. Also, human interactions are directed towards communication. If I use a word that you don't know, your first reaction is to try to figure out what I meant by that word. Unlike a conventional computer, your cognitive system does not shut down and throw an "input unknown" message, but instead it makes an attempt to integrate the new word into its inventory of conventions.

In sum, if no existing system of signs is already available, presto, we can create one. Further, our natural tendencies predispose us to create one automatically, frequently without even realizing that we are doing it.

However, even if we agree we can always force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, it might not always be a good idea to do so. The difference between the two definitions is very relevant indeed. 

Look at the problem with "northiness". In order for someone to be able to use "northiness" to communicate, that person would have to have read this blog post. If we want to consider "northiness" as part of the semantics of the image, the fact that people need a specialized background knowledge imposes a rather constricting limitation on the number of people who would have access to the "semantics" of the image (i.e., who could productively make use of "northiness" semantics), and thus on the applicability of our multimedia analysis.

Further, "northiness" can be seen as a rather irresponsible invention on my part. The connection between moss growth and direction is tenuous at best, and I shouldn't be suggesting that moss is a reliable indicator for finding one's way out of a woods. That could go very wrong. Personal (in contrast to conventional) interpretations threaten to limit the applicability of our multimedia analysis.

The key issue is this: If we force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, we lose track of our assumptions about the nature of the underlying system of signs. We don't know if we should now throw research effort into creating visual detectors to identify "northiness" semantics in images, or if "tree" detectors are more important. By forcing ourselves to reason carefully about the underlying systems of signs, we can make those kinds of decisions in a more informed manner.
Note that I am not suggesting that it is possible to sit down and enumerate all the signs that are part of any given system of signs. These systems are not finite; nor are they mutually exclusive. (My suggestion for how to best attack this issue is to turn to pre-lexical semantics, previously discussed here.)

What I am arguing is that the "mere" act of explicitly acknowledging that the system of signs must necessarily exist provides a productive constraint on our models because it prevents us from extending semantic interpretations unconsciously or arbitrarily (e.g., it prevents me from suddenly declaring the reality and importance of "northiness" without further substantiation.)

Adopting the semantics-as-signification definition implies understanding meaning to be something that arises via a process of negotiation between two or more human communicators with respect to a set of established conventions. In the absence of consensus between multiple humans, there is no meaning, i.e., no semantics. This is the theoretical basis for choices to focus multimedia research on analyzing those aspects of images and video which have a high inter-annotator agreement.
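The "high inter-annotator agreement" criterion can be made concrete with a small sketch. This is my own toy illustration (the annotator data and the simple pairwise-agreement measure are invented for the example), not a procedure described in the post:

```python
from itertools import combinations

def percent_agreement(labels_by_annotator):
    """Mean pairwise agreement across annotators.

    labels_by_annotator: a list of equal-length label lists,
    one list per annotator (toy data below is hypothetical)."""
    total = hits = 0
    for a, b in combinations(labels_by_annotator, 2):
        for x, y in zip(a, b):
            total += 1
            hits += (x == y)
    return hits / total

# Three annotators judge whether five images depict "tree" (1 = yes).
tree = [[1, 1, 0, 1, 0], [1, 1, 0, 1, 0], [1, 0, 0, 1, 0]]
# The same annotators judge "northiness" on the same five images.
northiness = [[1, 0, 1, 0, 0], [0, 1, 0, 1, 1], [1, 1, 0, 0, 1]]

print(percent_agreement(tree))        # high agreement: shared convention
print(percent_agreement(northiness))  # low agreement: no shared meaning
```

On this toy data "tree" scores far higher than "northiness", which is exactly the kind of evidence that would justify spending detector-building effort on the former.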

Am I opening an "If a tree falls in the forest and no one hears it does it really make a noise" debate? The image I have chosen above perhaps even invites that. However, let's close with another related image and question: "If this picture shows northiness and no one else sees it, do we really want to consider northiness to be a semantic concept for the purposes of multimedia research?" My point is that we get a lot further a lot faster if we just accept that the answer is "no".

It's not that we don't find "northiness" interesting. In fact, visual detectors capable of predicting the direction of the compass faced by the camera at the moment the image is captured have a number of potential applications. It's just that unless we stop for careful consideration before considering "northiness" to be part of the image semantics, we are in danger of sliding down a slippery slope. This slope leads to the quite unscientific habit of inventing our systems of signs as we go along, convenience-driven and possibly largely unconsciously.

Sunset straight on

Saturday, July 13, 2013

Event Detection in Multimedia: Different definitions, different research challenges.

Yesterday, someone asked me for a pointer to work in the area of event detection in multimedia content. This mail prompted me to finally get out a blog post that explains the distinction between the different sorts of underlying challenges that researchers are referring to when they discuss events in multimedia.

A simple definition of an event is a triple (t, p, a), consisting of a moment in time t, a place in space p, and one or more actors a. For example, at (t=) 2pm 13 July 2013, at the (p=) Faculty of Electrical Engineering, Mathematics and Computer Science at Delft University of Technology, (a=) I am now involved in a "blog-post writing" event.

Let's look at that definition of event in terms of aspects that matter to us as multimedia researchers. If you took a video of me (here and now), another human looking at your video might notice that I am also eating a salad. Consequently, the video could equally be considered to depict a "lunch-eating event". For this reason, it makes sense to also introduce a fourth variable "v", to arrive at (t, p, a, v). The "v" stands for the name of the action or the activity part of the event. I use "v" for "verb" since these names correspond to verbs or can be expressed by phrases involving verbs.
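A minimal sketch of the (t, p, a, v) tuple as a data structure. The field names and example values are my own invention for illustration; the sketch also shows how one (t, p, a) can carry more than one activity label:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    """An event as a (t, p, a, v) tuple."""
    t: datetime   # a moment in time
    p: str        # a place in space
    a: tuple      # one or more actors
    v: str        # "verb": the name of the action or activity

# The blog-post-writing event from the text, under one labeling:
writing = Event(
    t=datetime(2013, 7, 13, 14, 0),
    p="EEMCS, Delft University of Technology",
    a=("the author",),
    v="writing a blog post",
)

# The same (t, p, a) seen by another observer as a different event:
lunch = Event(writing.t, writing.p, writing.a, "eating lunch")
```

Keeping v separate from (t, p, a) is what lets a system maintain multiple labels for what is physically one and the same occurrence.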

Note that there are at least three basic ways to name (or label) events when it comes to multimedia: (1) name the event from the perspective of the/an actor (I, the actor, call it a lunch-eating event, because I know it is lunch.) (2) name the event from the perspective of the person recording the multimedia (The person sees me engaged in an eating event, but does not necessarily know or care that it is lunch.) and (3) name the event from the perspective of the/a person looking at the multimedia in a time and place other than when and where it was captured (The person sees me sitting at a computer, but does not notice or want to pay attention to the salad.) Many times these three perspectives collapse and there is only a single label that would be relevant, but it should be kept in mind that they do not necessarily do so. We risk over-simplifying the world and losing valuable information if we assume that they can be conflated. Instead, multimedia systems must be careful to maintain multiple views, i.e., a video that for one person (e.g., a government official) depicts a riot, might for another (e.g., a concerned citizen) depict a demonstration.

The (t, p, a, v) definition of an event is sometimes constrained by a further factor, namely, advance human planning. Multimedia research that looks at planned events focuses on events that humans organize for social purposes and that can therefore be anticipated in advance of their occurrence. This group of events includes events like concerts, games, conferences and parties. It is generally referred to as "Social Event Detection".

The Social Event Detection (SED) task at the MediaEval Multimedia Benchmark started in 2011 and has been drawing a steadily increasing number of participants each year.  MediaEval SED 2013 offers the most ambitious and interesting SED task to date. The SED Task Organizers have organized workshops and special sessions at various conferences, for example, recently the Special Session on Social Events in Web Multimedia at ICMR 2013. The MediaEval bibliography includes a relatively up-to-date list of the papers that have been published regarding the MediaEval SED task.

The SED task is defined such that its multimedia aspect arises because addressing the task requires combining different information sources (text, photos, videos) from different social communities on the Web. Note that it is the use of the (t, p, a, v) definition of an event and not per se the social nature of the data that distinguishes SED from other types of event detection in multimedia.

Another important type of event detection is defined as involving not the full (t, p, a, v), but rather (v). In other words, this variant of event detection is interested not in any specific event, but in detecting the occurrence of instances of a particular event type. This type of event detection is referred to as Multimedia Event Detection (MED) and has been offered as a task in TRECVid since 2010. Examples of these sorts of events are "Birthday party" (from TRECVid MED 2011) and "Giving directions to a location" and "Winning a race without a vehicle" (from TRECVid MED 2012).

If you consider only the labels that they use to refer to events, SED and MED look very much the same. However, it is important to remember that for MED, multimedia that is considered relevant to the event "birthday party" can depict any birthday party at any time, at any place around the world. In other words, for MED "birthday party" is an instantiation of any event of the type birthday party. Only (v) and not the full (t, p, a, v) is part of the definition. For SED, "birthday party" would be, for example, my birthday party, held on my birthday in 2013, at the particular place at which I celebrated.
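The contrast can be sketched as two matching rules: MED matches on (v) alone, while SED requires the full (t, p, a, v). The dictionaries, field values and function names below are invented for illustration:

```python
def med_match(event, query_v):
    """MED: relevant if the event is of the given type,
    regardless of when, where, or with whom it happened."""
    return event["v"] == query_v

def sed_match(event, query):
    """SED: relevant only if it is one specific (t, p, a, v) instance."""
    return all(event[k] == query[k] for k in ("t", "p", "a", "v"))

party = {"t": "2013-07-13", "p": "Delft", "a": "me", "v": "birthday party"}
other = {"t": "2012-01-01", "p": "Tokyo", "a": "someone", "v": "birthday party"}

# Both events are relevant for a MED "birthday party" query...
assert med_match(party, "birthday party") and med_match(other, "birthday party")
# ...but only the specific instance satisfies the SED query.
assert sed_match(party, party) and not sed_match(other, party)
```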

Again, make note of the task definition of MED. The MED task is defined such that its multimedia aspect arises because addressing the task requires combining different modalities within the same video (visual + audio channel). Typically, the data is not social video per se. Note that it is the use of the (v) definition of an event and not per se the nature of the data that distinguishes MED from other types of event detection in multimedia.

My impression is that some researchers in the community are convinced that the way forward for the research community is to first detect (v) (i.e., apply the MED event definition) and then filter the detection results to be constrained to (t, p, a, v) (i.e., generate a list of results that follows the SED definition).

Before making this assumption, I would urge researchers to carefully contemplate the use scenario of their applications. For example, if I have one picture from a birthday party I attended and I want to search for other pictures of the same birthday party on the Internet, it does not make any sense at all to solve the complete MED birthday party detection problem as the first step in the process.

As far as weddings go, if we are using visual features it's tempting to rely on the presence of that beautiful white wedding gown to detect instances of v = "wedding". However, despite an initial impression that the gown provides a stable visual indicator, it's not going to get you very far in a real-world data set:

Leaving courthouse on first day of gay marriage in Washington
Ultimately, we need to look at both (t, p, a, v) and (v) and all the definitions of event detection in multimedia that lie in between. Luckily there are events in which researchers with all perspectives come together. I have now finished both my lunch and my blog post and, as a final note, I leave you with an example of just such an event:

Vasileios Mezaris, Ansgar Scherp, Ramesh Jain, Mohan Kankanhalli, Huiyu Zhou, Jianguo Zhang, Liang Wang, and Zhengyou Zhang. 2011. Modeling and representing events in multimedia. In Proceedings of the 19th ACM international conference on Multimedia (MM '11). ACM, New York, NY, USA, 613-614

Friday, June 28, 2013

Discriminative Email Writing

The best email writers that I know write discriminatively. I don't mean that they are discriminating in what they write or how they phrase it. Rather, I mean that when they write they try to discriminate the email that they are writing from a thousand other emails that they have also written and also from a thousand other emails that they can imagine have recently flooded the in-boxes of the people that they are writing to. They try to make sure that every email is self-contained and does not contain dependencies that need to be traced through its predecessors in order to ensure full interpretation.

At the moment, our computer science bachelors students are carrying out their final presentations and my inbox is flooded with emails that all read "Hi, here is our thesis." OK. I allowed myself to exaggerate that statement just a bit for dramatic effect. It isn't, however, that gross of an exaggeration.

I decided that I would develop a list of emailing rules, and if I can get them refined enough, I will use them to help guide students on how they can most efficiently communicate with me (hoping, in the process, to help them towards writing more professional emails and at the same time keeping my own email skills honed).
  • Please put the name of the project that you are communicating about in the subject line of the email.
  • Please put a keyword that reveals the nature of the issue in the subject line of the email. For urgent issues, you can also include the words "time sensitive".
  • Please start the email with one sentence that states what you need or what you are asking. Then, if there are additional explanations, put them in the second paragraph. The supporting information is very welcome, but please do not bury the main question or request deep in the email.
  • If you are referencing a past email, please include a copy of the past email in your email. 
  • If you are referring to a date, please write Friday 28 June and not "next Friday". If you need to refer to a time, please repeat the relevant dates, e.g., "A week before our presentation, which is scheduled for Thursday 4 July."
  • If you are referring to a URL mentioned in a past email, please repeat the URL in every future mail that needs to make the same reference.
  • If you write an email and expect that the answer will be relevant to the whole project team, please put all the members of the team on cc so that the answer can go via reply-to-all instead of forwarding them the answer afterwards. (This lets everyone involved in the communication know exactly who knows what.)
It is curious to me, because thinking discriminatively seems to come naturally to our students. The question "Which keyword query do you need to use in order to find, on your hard drive, a document that you read last week?" is not a particularly hard one. Put differently, as humans it seems that we can think of a document and relatively easily come up with a word or phrase that will discriminate that document from hundreds of others we read. It would be nice if we could more consistently apply the inverse sort of thinking when we write email, so that we ourselves generate documents that are very easy to discriminate.

Tuesday, April 30, 2013

The Five Runs Rule: Less is More in Multimedia Benchmarking

MediaEval is a multimedia benchmarking initiative that offers tasks in the area of multimedia access and retrieval that have a human or a social aspect to them. Teams sign up to participate, carry out the tasks, submit results, and then present their findings at the yearly workshop.

I get a lot of questions about something in MediaEval that is called the "five-runs rule". This blogpost is dedicated to explaining what it is, where it came from, and why we continue to respect it from year to year.

Participating teams in MediaEval develop solutions (i.e., algorithms and systems) that address MediaEval tasks. The results that they submit to a MediaEval task are the output generated by these solutions when they are applied to the task test data set. For example, this output might be a set of predicted genre labels for a set of test videos. A complete set of output generated by a particular solution is called a "run". You can think of a run as the results generated by an experimental condition. The five-runs rule states that any given team can submit at most five sets of results to a MediaEval task in any given year.

The simple answer to why MediaEval tasks respect the five-runs rule is "They did it in CLEF". CLEF is the Cross Language Evaluation Forum, now the Conference and Labs of the Evaluation Forum. MediaEval began as a track of CLEF in 2008 called VideoCLEF, and at that time we adopted the five-runs rule and have been using it ever since.

The more interesting answer to why MediaEval tasks respect the five-runs rule is "Because it makes us better by forcing us to make choices". Basically, the five-runs rule forces participants, during the process of developing their task solutions, to think very carefully about what the best possible approach to the problem would be and focus their effort there. The rule encourages them to use a development set (if the task provides one) in order to inform the optimal design of their approach and select their parameters.

The five-runs rule discourages teams from "trying everything" and submitting a large number of runs, as if evaluation were a lottery. Not that we don't like playing the lottery every once in a while; however, if we choose our best solutions, rather than submitting them all, we help to avoid over-fitting the new technologies that we develop to a particular data set and a particular evaluation metric. Also, if we think carefully about why we choose a certain approach when developing a solution, we will have better insight into why the solution worked or failed to work...which gives us a clearer picture of what we need to try next.

The practical advantage of the five-runs rule is that it allows the MediaEval Task Organizers to more easily distill a "main message" from all the runs that are submitted to a task in a given year: the participants have already provided a filter by submitting only the techniques that they find most interesting or predict will yield the best performance.

The five-runs rule also keeps Task Organizers from demanding too much of the participants. Many tasks discriminate between General Runs ("anything goes") and Required Runs that impose specific conditions (such as "exploit metadata" or "speech only"). The purpose of having these different types of runs is to make sure that there are a minimum number of runs submitted to the task that investigate specific opportunities and challenges presented by the data set and the task. For example, in the Placing Task, not a lot of information is to be gained from comparing metadata-only runs directly with pixel-only runs. Instead, a minimum number of teams have to look at both types of approaches in order for us to learn something about how the different modalities contribute to solving the overall task. Obviously, if there are too many required runs, participating teams will be constrained in the dimensions along which they can innovate, and that would hold us back.

Another practical advantage of the five-runs rule has arisen in the past years when tasks, led by Search and Hyperlinking and also Visual Privacy, have started carrying out post hoc analysis. Here, in order to deal with the volumes of the runs that need to be reviewed by human judges (even if we exploit crowdsourcing approaches), it is very helpful to have a small focused set of results.

Many tasks will release the ground truth for their test sets at the workshop, so teams that have generated many runs can still evaluate them. We encourage the practice of extending the two-page MediaEval working notes paper into a full paper for submission to another venue, in particular international conferences and journals. In order to do this, it is necessary to have the ground truth. Some tasks do not release the ground truth for their test sets because the test set of one year becomes the development set of the next year, and we try to keep the playing field as level as possible for new teams that are joining the benchmark (and may not have participated in previous years' editions of a task).

In the end, people generally find that when they are writing up their results in their two-page working notes paper, and trying to get to the bottom of what worked well and why in their failure analysis, they are quite happy to be dealing with no more than five runs.

Saturday, March 30, 2013

Multimedia Readymades

This blog post is a continuation of and a response to Cynthia Liem's call to collect "hidden gems" of internet multimedia, made during her TedX Delft talk "Every bit of it". Cynthia says that original bits of multimedia that we share on the Internet have the feel of "randomness" or of "low quality", but that they are in fact "entry points" to things of value, if we can discover them and polish them. In this blog post I relate discovered bits of multimedia to the surrealist concept of "Readymades".

An art exhibition that I visited some 25 years ago in Frankfurt was probably what set me first to thinking about hidden gems and the process by which value is assigned to what would conventionally be considered mundane.

My most recent hidden gem is a video that captures the comments of people on something that has been created not by an act of coming-into-being, but by an act of putting-into-context.

From one perspective, the video is about Fountain, which is one of the Readymades of Marcel Duchamp. Wikipedia gives us the definition of the Readymade from André Breton and Paul Éluard's Dictionnaire abrégé du Surréalisme: "an ordinary object elevated to the dignity of a work of art by the mere choice of an artist." 

The Fountain is a urinal that has been taken out of context in two ways: first, it lies flat rather than hanging on a wall in the position it would need to be in order to be used, and, second, it was submitted by Marcel Duchamp in 1917 as a work of art to an art exhibition.

One could quip, "Beauty lies in the eye of the beholder." and let it go at that. However, scratch a little deeper, look beyond the shock-value of the object being a urinal, and a different message becomes clear: The artifact itself has little importance, and merely provides a trigger for the process of Fountain becoming something interesting and worthwhile.

Marcel Duchamp 'FOUNTAIN' - IS IT ART? from arlen figgis on Vimeo.

From another perspective, the video is about what happens when you take a widely-recognized work of art and display it to people out of context. In the video, the urinal is put into a public toilet in Liverpool and people, who recognize the urinal as Fountain, are asked to comment on it.
Of all the people in the video, only two comment on the fact that they are actually standing in a toilet looking at a toilet. The relative inattention to this point suggests that the fact that Fountain is finally in its "home" surroundings does not have much impact on people's opinions of it. No one in the video is looking at the urinal and having anything at all related to a "The-Emperor-has-no-clothes" moment. Instead, they seem to have the same thoughts and feelings as they did when they walked into the toilet: they are interested, and they acknowledge it as something worthwhile.

I consider this video to be rather unfinished (yet a "diamond in the rough") because it has the power to make a point that it does not explicitly make. The un-made point is this: it is non-trivial to undo the Fountain effect, i.e., the urinal stubbornly resists returning to un-interestingness. Once a gem has emerged from a mundane object, the gem-status is difficult to shake.

These two perspectives on what this video is about represent two poles of a spectrum of interactions that we (as users, uploaders and viewers) have with bits of online multimedia that people capture and upload to the internet.

In some moments, we amuse, educate and otherwise occupy ourselves through layers of rediscovery, and the popularity, quality, origin, and perhaps even topic of the original multimedia object could not be less relevant.

In other moments, we forget about rediscovering hidden gems and return to what is widely acknowledged to be interesting or worthwhile; we cling to The Canon or flock to watch the blockbusters.

Our goal as multimedia information retrieval researchers should be to develop techniques that support this entire spectrum.

In Frankfurt 25 years ago, my encounter with Readymades revealed to me that it is not particularly useful to assume that there is a sharp boundary between creation and discovery. It made clear that significance arises through a dialogue: an interplay between a large number of voices representing the general public and a more limited number of voices recognized as authorities.

These poles exist now, they existed 25 years ago (pre-Internet), and they were at work in 1917; from there it seems a relatively straightforward extension to suppose that they existed before then and that their importance will continue.

It is multimedia systems that give us access to the hits and classics, but that also allow us to discover the triggers ("entry points" in Cynthia's words) from which interesting and worthwhile objects emerge and to support the interplay of contributions and opinions necessary for emergence.

It's a difficult problem. I rather think that the number of people that share my understanding of my hidden gem is quite limited, and that we would need a retrieval or recommendation system that would reach all of them in order to really add any polish. Or maybe it is enough for the gem to be personal, although I am far from sure that it is easy to ensure this video is still discoverable in future years when I again return to the topic of the emergent value of the mundane.

Friday, March 29, 2013

Multimedia Bits

Cynthia Liem recently gave a TedX Delft talk called "Every bit of it". I have the pleasure and the honor of being her colleague at the Delft Multimedia Information Retrieval Lab. I missed being at the talk in person, but finally found time today to watch it on YouTube.

What she is saying is so critical---it is at the same time both fresh and timeless---that the talk deserves a more in-depth reaction than a tweet or re-tweet. This post summarizes the message that I heard in Cynthia's talk.

Cynthia makes the point that every bit on the Internet has meaning to the original person who put it there. It is, of course, multimedia bits that she is speaking of: videos and images that have been captured by people and shared on the Internet. This is an important point in this era of Big Data: each of these "bits" of multimedia was captured by someone for some reason. Cynthia tells us that we can add worth to these bits by enhancing them with other bits; in particular, she shows us wondrous transformations that can be brought about by adding music to video.

Towards the end of the video, she says that a big question that we face with multimedia is "What is relevant?"

She points out that we tend to focus on the obvious in our understanding of relevance, e.g., popularity and quality, and that because of this focus, we miss material that people do not know exists.

Instead, the "bits" of multimedia on the Internet should be seen as an entrance to the world, not as the final product, but a place to start. A diamond in the rough that needs to be polished in order to add value.

At the end of the video, she asks the audience to keep their eyes open for a rediscovered hidden gem... I discuss my gem in my next post.

Saturday, March 2, 2013

Visual Relatedness is in the Eye of the Beholder: Remember Paris


How do we know if a tag is related to the visual content of an image? In this blogpost, I am going to argue that in order to answer that question, it is first necessary to decide who "we" is. In other words, it is necessary to first define the person or persons who are judging visual relatedness, and only then ask whether this tag is related to the visual content of the image.

I'll start out by remarking that an alternate way of approaching the issue is to get rid of the human judge altogether. For example, this paper:

Aixin Sun and Sourav S. Bhowmick. 2009. Image tag clarity: in search of visual-representative tags for social images. In Proceedings of the first SIGMM workshop on Social media (WSM '09). ACM, New York, NY, USA, 19-26.

provides us with a clearly-defined notion of the "visual representativeness" of tags. A tag is considered to be visually representative if it describes the visual content of a photo. "Sunset" and "beach" are visually representative; "Asia" and "2008" may not be. A tag is visually representative if it is associated with images whose visual representations diverge from that of the overall collection. The model in this paper uses a visual clarity score, which is the Kullback-Leibler divergence of language models based on visual-bag-of-words representations.
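A rough sketch of the clarity idea. This is my loose reading of the approach, not the paper's exact model: the smoothing scheme, the toy visual-word IDs, and all names here are my own simplifications:

```python
import math
from collections import Counter

def visual_clarity(tag_bows, collection_bows, mu=0.5):
    """KL divergence of a tag's visual-word model from the collection model.

    tag_bows / collection_bows: lists of visual-bag-of-words, each a list
    of visual-word IDs. A high score means the tag's images cluster on
    distinctive visual words; a score near zero means they look like the
    collection at large."""
    tag_counts = Counter(w for bow in tag_bows for w in bow)
    coll_counts = Counter(w for bow in collection_bows for w in bow)
    tag_total = sum(tag_counts.values())
    coll_total = sum(coll_counts.values())
    kl = 0.0
    for w, c in coll_counts.items():
        p_coll = c / coll_total
        # Smooth the tag model with the collection model (Jelinek-Mercer
        # style) so that unseen visual words do not get zero probability.
        p_tag = (1 - mu) * (tag_counts[w] / tag_total) + mu * p_coll
        kl += p_tag * math.log(p_tag / p_coll)
    return kl

# Toy data: "sunset" images concentrate on a few visual words...
sunset = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]
# ...while this sample stands in for the collection as a whole.
collection = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 0, 1]]

print(visual_clarity(sunset, collection))      # noticeably above zero
print(visual_clarity(collection, collection))  # zero: no divergence
```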

Why don't we like this alternative? Well, this definition of visual representativeness does not reflect visual representativeness as perceived by humans. It's not clear that we really are helping ourselves build multimedia systems that serve human users if we make things less complicated by getting rid of the human judge.

The issue is the following: Humans have no problem confirming that an image depicting a pagoda at sunset and an image depicting a busy intersection with digital billboards both depict "Asia".  There is something about the visual content of these two images that is representative of "Asia", and it seems to be a simple leap from there to conclude that the tag "Asia" is related to the visual content of these images.

But there was a time in my life where I didn't know what a pagoda was. It was less long ago than one may think (although certainly before the workshop at which the paper above was presented, held at ACM Multimedia 2009 in Beijing), which prompts me to think further.

A solution might be the following: We could stipulate that in my pre-pagoda-awareness years, I should have been excluded from the set of people who get to judge if photos are related to Asia. But then we would have to worry about my familiarity with digital billboards, and then the next Asia indicator, and on and on, until I and everyone that I know is excluded from the set of people who get to judge the visual relatedness of photos to tags. In short, this solution does not lead to a clearer definition of how we can know that a tag relates to the visual content of an image.

Why do things get so complicated? The problem, I argue, is that we ask the question of a pair: "For this image and this tag (i,t) is the visual content of the image related to this tag?"  This question does not lead to a well-defined answer.

The answer is, however, well defined if we ask the question of a triple: "For this image, this tag and this person or group of people (i,t,P): is the visual content of the image related to this tag in the judgement of this person or group of people?" In other words, we need to look for the relationship between tags and the visually depicted content of images in the eye of the beholder.

We can then perform a little computational experiment: Put person or people P in a room and expose them to the visual content of image i and ask the yes/no question "Is tag t related to image i?"
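The experiment amounts to indexing judgments by the triple (i, t, P) rather than by the pair (i, t). A toy sketch, with invented image IDs, persons, and a simple majority threshold of my own choosing:

```python
# Hypothetical collected answers: (image, tag, person) -> yes/no.
judgments = {
    ("img_042", "paris", "P1"): True,   # took the picture herself
    ("img_042", "paris", "P4"): True,   # knows a similar labeled picture
    ("img_042", "paris", "P8"): False,  # has never seen Paris depicted
}

def related(image, tag, group, threshold=0.5):
    """Relatedness is defined per (i, t, P): the same (image, tag)
    pair can be related for one group of beholders and not another."""
    answers = [judgments[(image, tag, p)] for p in group
               if (image, tag, p) in judgments]
    return bool(answers) and sum(answers) / len(answers) >= threshold

# Related in the eyes of one group, not in the eyes of another:
assert related("img_042", "paris", ["P1", "P4"])
assert not related("img_042", "paris", ["P8"])
```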

The answer of P is going to depend on the method that P uses in order to reason from the visual content of i to the relatedness of tag t. Here's a list of different Ps who are able to identify Paris for different reasons.

(i, "paris" P1): I took the picture and when I see it, I remember it.
(i, "paris" P2): I was there when the picture was taken and when I see it, I remember this moment.
(i, "paris" P3): Someone told me about a picture that was taken in Paris and there is something that I see in this picture that tells me that this must be it.
(i, "paris" P4): I know of another picture that looks just like this one and it was labeled Paris.
(i, "paris" P5): I've seen other pictures like this an recognize it (the specific buildings that appear).
(i, "paris" P6): I've been there and recognize characteristics of the place (the type of architecture).
(i, "paris" P7): I am a multimedia forensic expert and have established a chain of logic that identifies the place as Paris.

Perhaps even more are possible. What is clear is the following: It would have been nice if we had ended up with just two P's: expert annotators and non-expert annotators. However, it looks like what we have is judgements that are based on quite a few differences in personal history, previous exposure, world knowledge, and expertise.

If we want to develop truly useful algorithms that validate the match between the visual content and the tag, we have a lot more work to do in order to cover all the (i,t,P).

The key is to get a chance to question enough Ps. Multimedia research needs the Crowd.

Friday, February 15, 2013

Crowdsourcing for Multimedia: ACM Multimedia and The Crowd

Crowdsourcing for multimedia is a set of techniques that leverage human intelligence and a large number of individual contributors in order to tackle challenges in multimedia that are conventionally approached using automatic methods. Exploiting the crowd means taking advantage of human computation where it can help support multimedia algorithms and multimedia systems the most.

ACM Multimedia 2013 has introduced a new Crowdsourcing for Multimedia area:
The area cuts clean across traditional multimedia areas, touching upon nearly every topic relevant for multimedia. The area casts a wide net to include the full range of research results and novel ideas in multimedia that are made possible by the crowd, i.e., they exploit crowdsourcing principles and techniques.

How did this new area arise? Crowdsourcing's grand debut at ACM Multimedia was CrowdMM. On October 29, 2012, CrowdMM 2012, the International ACM Workshop on Crowdsourcing for Multimedia, was held in conjunction with ACM Multimedia 2012 in Nara, Japan. The workshop kicked off with a keynote entitled "PodCastle and Songle: Crowdsourcing-Based Web Services for Spoken Content Retrieval and Active Music Listening" by Masataka Goto of the National Institute of Advanced Industrial Science and Technology (AIST), Japan. These two systems dazzled the audience and gave us a foretaste of the possibilities that the power of the crowd opens for the multimedia community. An interesting day of talks, posters, and discussion ensued, culminating in a panel (summarized below).

The organizers of CrowdMM 2012 hope that both the ACM Multimedia area (focused on groundbreaking research results) and the CrowdMM workshop (focused on methodology, exploratory work, and researcher interaction) will provide a solid foundation that allows crowdsourcing for multimedia to grow within the multimedia community and reach its full potential.

"Crowdsourcing for multimedia: At a crossroad or on a superhighway?"

Summary of the Panel Discussion at CrowdMM 2012 

What is the potential of crowdsourcing for ACM Multimedia?
We need The Crowd to allow us to build larger, up-to-date dictionaries for multimedia annotation. We also need The Crowd to create ground truth at a large scale.

Combining techniques for active learning with techniques for incentivizing human contributions will contribute to many different specific multimedia problems.

In all cases, both quality control and making it fun for The Crowd to contribute will be important, e.g., continuing to build entertaining games to collect Crowd contributions. User engagement breeds quality: for example, Songle provides services that are enjoyable to use and naturally attracts good workers.
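Quality control for crowd contributions often starts with something as simple as majority voting over redundant labels. A minimal sketch of that idea (the function name, threshold, and example labels are illustrative, not from any system discussed at the workshop):

```python
from collections import Counter

def majority_label(labels, min_agreement=0.6):
    """Return the majority label if enough workers agree, else None.

    A simple quality-control baseline: accept a crowd label only when
    a clear majority of contributors chose it.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None

# Three of four workers agree, so the label is accepted.
print(majority_label(["paris", "paris", "london", "paris"]))  # paris
# A 50/50 split falls below the threshold and is rejected.
print(majority_label(["paris", "london"]))  # None
```

Redundancy plus a threshold is only a baseline; richer schemes weight workers by their track record, which is exactly where engagement and worker cultivation come in.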
What are the limitations of crowdsourcing for multimedia?
Data from non-experts is valuable, but for some tasks we need experts. We need methods that will allow us to identify experts, for example, with domain knowledge. The multimedia community can potentially address the problem of having access to "the right crowd", by joining forces to cultivate a community of crowdsourcing workers who deliver high quality annotations for specific multimedia domains.

How would ACM Multimedia be different had crowdsourcing been invented 20 years ago? If crowdsourcing had existed 20 years ago, we would now be making much more effective use of active learning paradigms, i.e., algorithms that interactively query human annotators to obtain new labels for certain multimedia items.
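The simplest active-learning query strategy, uncertainty sampling, can be sketched in a few lines. The clip names and confidence scores below are invented stand-ins for a real classifier's output:

```python
def most_uncertain(items, predict_proba):
    """Pick the item whose predicted label the model is least sure about.

    Uncertainty sampling, the simplest active-learning query strategy:
    the item with probability closest to 0.5 is sent to a human
    annotator, so each crowd judgment is spent where it helps most.
    """
    return min(items, key=lambda x: abs(predict_proba(x) - 0.5))

# Illustrative stand-in for a classifier's confidence that a clip is "music".
scores = {"clip_a": 0.95, "clip_b": 0.52, "clip_c": 0.10}
query = most_uncertain(list(scores), scores.get)
print(query)  # clip_b: the model is nearly undecided, so ask the crowd
```

The point of pairing this with a crowd is the feedback loop: the model nominates the items it is least sure about, the crowd labels them, and the model is retrained on the new labels.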

Crowdsourcing makes large-scale multimedia annotation possible. Even if crowdsourcing had existed 20 years ago, however, we may not have had the tools and techniques to deal with large-scale data.

The challenge today is to realize the potential of multimedia, both in venturing into new domains for research and also in scaling up our systems to exploit larger amounts of human labeled data for training and also for evaluation.

In short, the panel concluded, it’s up to ACMMM to catch up with the crowd.

A big thank you to my fellow organizers for the work that they put into making CrowdMM 2012 such a success:

Wednesday, January 2, 2013

Brave New Tasks: Incubating Innovation in the MediaEval Multimedia Benchmark

An important innovation of MediaEval 2012 was the "Brave New Task" track. MediaEval is a multimedia benchmark that offers and promotes challenging new tasks in the area of multimedia access and retrieval. We focus on tasks that emphasize multiple modalities ("The 'multi' in multimedia") and that have social and human relevance.

Brave New Tasks were introduced because we noticed that there is rather a tension between benchmark evaluation and innovation. Benchmarking is essentially a conservative activity: we want to compare algorithms on the same task using the same data set and the same evaluation procedure. This sameness allows us to chart progress with respect to the state of the art, especially over the course of time. How do we innovate, when the key strength of benchmarking is that we repeatedly do the same thing?

We innovate by tackling new problems. However, in order to create a successful benchmarking task from a new problem, a number of questions must be answered. Is the problem suitable for evaluation in a benchmark setting? What sort of data is needed to evaluate solutions developed by benchmark participants? How much effort is needed to create ground truth? Do we need to refine our definition of the task and of the evaluation procedure? Is there an actual chance that algorithms can be developed to solve the task and what resources are needed? Is there a critical mass of interest in solving this problem? Are the solutions appropriate for application?

The easy way forward would be to insist on clear answers to all of these questions before running a task. In some cases, it will be possible to gather the answers. In others, however, it will not. Forcing tasks to have answers before attempting to create a benchmark poses a serious risk that researchers will avoid the truly challenging and innovative tasks because they receive the message that they need to "play it safe."

People in the MediaEval community rattle their swords and shields when they are told that they need to "play it safe." Brave New Tasks support innovation in MediaEval by incubating tasks in their first year, allowing the task organizers to answer these questions. We value the advantages that the conservative aspects of benchmarking bring to the community, but we also thrive by taking risks. The Brave New Task track is a lightly protected space that allows us to take the risks that allow our benchmark to continue to renew itself.

Because people have asked me about Brave New Tasks in MediaEval 2013, and specifically "How did you do it?", I am providing here a more detailed description of how the track works and how we anticipate it will develop in 2013:

To start, let me write a few words about what makes a "mainstream" task in MediaEval (i.e., a task that is not a Brave New Task). At the end of the calendar year, MediaEval solicits proposals from teams who are interested in organizing a task in the next MediaEval season. For an example, see the MediaEval 2013 call for task proposals. Whether a proposal is accepted as a MediaEval task depends on the interest expressed in the MediaEval survey. The survey is published in the first days of January and circulated widely to the larger research community.

During the survey, task proposers gather information on who is interested in carrying out their tasks. By the time the survey concludes, the proposers must have promises from five core participants (who are not themselves organizers) who will cross the finish line of the task (including submitting results, writing the working notes paper, and attending the MediaEval workshop) come "hell or high water." This selection criterion is set up so that we have a minimum number of results to compare across sites for any given task---if there are only one or two, we don't get the "benchmark effect."

Tasks that the community finds interesting and promising, but that do not necessarily meet these stringent selection criteria, can be selected as Brave New Tasks. The difference between a Brave New Task and other MediaEval tasks is that these tasks are new, and ideally also scientifically risky (in the responsible sense of "risky").

Brave New Tasks are run "by invitation only". The "invitation only" clause does not make the task exclusive: anyone who asks the task organizers can be granted an invitation. Instead, the clause allows the tasks to handle unexpected situations by, if necessary, decoupling their schedules from the main task schedule to accommodate unforeseen delays in data set releases. Participants of past editions of MediaEval will recognize the usefulness of a mechanism that makes the benchmark robust to unexpected situations.

Further, Brave New Tasks do not require their participants to submit working notes papers or attend the workshop. The "only" requirement that the task must fulfill is to contribute an overview paper to the MediaEval working notes proceedings that sums up the task and presents an outlook for future years. One or more of the organizers attends the workshop to make the presentation and participate in the discussion about whether the task should aim to develop into a mainstream task in the next year.

A "Brave New Task" is encouraged to go far beyond the minimum requirement. In fact, 2012 saw one of the Brave New Tasks "Search and Hyperlinking" achieve the scope of a mainstream task, with six working notes papers from task participants appearing in the MediaEval 2012 working notes proceedings. The task was effectively indistinguishable from mainstream tasks in its contribution to the benchmark.

In 2013, we plan to strengthen the Brave New Task track by providing them with more central support. The tasks will be run under the same infrastructure as the mainstream tasks and decoupled from the schedule only if it is absolutely necessary. They will also be given the option of using the central registration system.

Brave New Tasks have been a successful innovation in MediaEval 2012 and one that we hope to strengthen in the future. I'd like to end by pointing out that it is not so much the "rules" of Brave New Tasks that have made them such a success, but rather the efforts of the Brave New Task organizers. Success is dependent on having a group of devoted researchers with a vision for a new task idea and the capacity and stamina to see it through the first year...including long hours spent reading related work, developing new evaluation metrics (if necessary), contacting and following up with participants, collecting data and creating ground truth. It is not so much the tasks themselves that are brave, but the organizers who are fearless and relentless in their pursuit of innovation.

Forward, charge!