Saturday, July 13, 2013

Event Detection in Multimedia: Different definitions, different research challenges.

Yesterday, someone asked me for a pointer to work in the area of event detection in multimedia content. This mail prompted me to finally get out a blog post that explains the distinction between the different sorts of underlying challenges that researchers are referring to when they discuss events in multimedia.

A simple definition of an event is a triple (t, p, a), consisting of a moment in time t, a place in space p, and one or more actors a. For example, at (t=) 2pm on 13 July 2013, at the (p=) Faculty of Electrical Engineering, Mathematics and Computer Science at Delft University of Technology, (a=) I am now involved in a "blog-post writing" event.

Let's look at that definition of event in terms of aspects that matter to us as multimedia researchers. If you took a video of me (here and now), another human looking at your video might notice that I am also eating a salad. Consequently, the video could equally be considered to depict a "lunch-eating event". For this reason, it makes sense to also introduce a fourth variable "v", to arrive at (t, p, a, v). The "v" stands for the name of the action or activity part of the event. I use "v" for "verb" since these names correspond to verbs or can be expressed by phrases involving verbs.
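The (t, p, a, v) tuple can be written down as a small data structure. The following is only a minimal sketch in Python: the class name `Event`, its field types, and the free-text place representation are illustrative assumptions, not part of any standard or benchmark definition.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet


@dataclass(frozen=True)
class Event:
    """Hypothetical container for the (t, p, a, v) view of an event."""
    t: datetime          # moment in time
    p: str               # place in space (free text here; could be coordinates)
    a: FrozenSet[str]    # one or more actors
    v: str               # "verb": name of the action or activity


blog_writing = Event(
    t=datetime(2013, 7, 13, 14, 0),
    p="Faculty of EEMCS, Delft University of Technology",
    a=frozenset({"me"}),
    v="blog-post writing",
)

# Same moment, place and actor, but a different verb: a different event label.
lunch_eating = Event(blog_writing.t, blog_writing.p, blog_writing.a, "lunch eating")
```

The point of the sketch is that (t, p, a) alone cannot tell these two events apart; only "v" does.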

Note that there are at least three basic ways to name (or label) events when it comes to multimedia: (1) name the event from the perspective of the/an actor (I, the actor, call it a lunch-eating event, because I know it is lunch); (2) name the event from the perspective of the person recording the multimedia (the person sees me engaged in an eating event, but does not necessarily know or care that it is lunch); and (3) name the event from the perspective of the/a person looking at the multimedia in a time and place other than when and where it was captured (the person sees me sitting at a computer, but does not notice or want to pay attention to the salad). Many times these three perspectives collapse and there is only a single label that would be relevant, but it should be kept in mind that they do not necessarily do so. We risk over-simplifying the world and losing valuable information if we assume that they can be conflated. Instead, multimedia systems must be careful to maintain multiple views. For example, a video that for one person (e.g., a government official) depicts a riot might for another (e.g., a concerned citizen) depict a demonstration.
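One way a system could avoid conflating these perspectives is to store labels per perspective rather than as a single flat tag list. This is a rough sketch under my own assumptions: the perspective names and the `LabeledItem` class are hypothetical and only serve to illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical perspective names, following the three cases described above.
ACTOR, RECORDER, VIEWER = "actor", "recorder", "viewer"


class LabeledItem:
    """A multimedia item that keeps event labels separate per perspective."""

    def __init__(self, item_id: str) -> None:
        self.item_id = item_id
        self.labels: Dict[str, Set[str]] = defaultdict(set)

    def add_label(self, perspective: str, label: str) -> None:
        self.labels[perspective].add(label)

    def all_labels(self) -> Set[str]:
        """Union over all perspectives; no label is discarded."""
        return set().union(*self.labels.values())

    def perspectives_agree(self) -> bool:
        """True only if every recorded perspective has the same label set."""
        label_sets = list(self.labels.values())
        return all(s == label_sets[0] for s in label_sets)


video = LabeledItem("video-001")
video.add_label(ACTOR, "lunch eating")
video.add_label(RECORDER, "eating")
video.add_label(VIEWER, "sitting at a computer")
```

Here `video.perspectives_agree()` is false, which is exactly the case in which collapsing to a single label would lose information.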

The (t, p, a, v) definition of an event is sometimes constrained by a further factor, namely, advance human planning. Multimedia research that looks at planned events focuses on events that humans organize for social purposes and that can therefore be anticipated in advance of their occurrence. This group of events includes events like concerts, games, conferences and parties. Detecting such events is generally referred to as "Social Event Detection".

The Social Event Detection (SED) task at the MediaEval Multimedia Benchmark started in 2011 and has been drawing a steadily increasing number of participants each year. MediaEval SED 2013 offers the most ambitious and interesting SED task to date. The SED Task Organizers have organized workshops and special sessions at various conferences, most recently the Special Session on Social Events in Web Multimedia at ICMR 2013. The MediaEval bibliography includes a relatively up-to-date list of the papers that have been published regarding the MediaEval SED task.

The SED task is defined such that its multimedia aspect arises because addressing the task requires combining different information sources (text, photos, videos) from different social communities on the Web. Note that it is the use of the (t, p, a, v) definition of an event and not per se the social nature of the data that distinguishes SED from other types of event detection in multimedia.

Another important type of event detection is defined as involving not the full (t, p, a, v), but rather only (v). In other words, this variant of event detection is interested not in any specific event, but in detecting the occurrence of instances of a particular event type. This type of event detection is referred to as Multimedia Event Detection (MED) and has been offered as a task in TRECVid since 2010. Examples of these sorts of events are "Birthday party" (from TRECVid MED 2011) and "Giving directions to a location" and "Winning a race without a vehicle" (from TRECVid MED 2012).

If you consider only the labels that they use to refer to events, SED and MED look very much the same. However, it is important to remember that for MED, multimedia that is considered relevant to the event "birthday party" can depict any birthday party at any time, at any place around the world. In other words, for MED "birthday party" stands for any event of the type birthday party. Only (v) and not the full (t, p, a, v) is part of the definition. For SED, "birthday party" would be, for example, my birthday party, held on my birthday in 2013, at the particular place at which I celebrated.
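The difference between the two definitions can be made concrete with a toy matcher. This is a sketch only: it assumes that each photo already comes with clean (t, p, a, v) annotations, which real detection systems would have to infer, and the `Photo` class, the function names, and all of the example data are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List


@dataclass(frozen=True)
class Photo:
    """Hypothetical photo with already-known event annotations."""
    photo_id: str
    t: date              # day the photo was taken
    p: str               # place name
    a: FrozenSet[str]    # people involved
    v: str               # event type label


def med_style_match(photos: List[Photo], v: str) -> List[Photo]:
    """MED view: any instance of the event type, anywhere, at any time."""
    return [ph for ph in photos if ph.v == v]


def sed_style_match(photos: List[Photo], t: date, p: str,
                    a: FrozenSet[str], v: str) -> List[Photo]:
    """SED view: one specific event, fixed in time, place and actors."""
    return [ph for ph in photos
            if ph.v == v and ph.t == t and ph.p == p and ph.a == a]


# Made-up example data.
photos = [
    Photo("p1", date(2013, 3, 2), "Delft", frozenset({"alice"}), "birthday party"),
    Photo("p2", date(2013, 3, 2), "Delft", frozenset({"alice"}), "birthday party"),
    Photo("p3", date(2012, 8, 9), "Lisbon", frozenset({"bob"}), "birthday party"),
    Photo("p4", date(2013, 3, 2), "Delft", frozenset({"alice"}), "concert"),
]

any_party = med_style_match(photos, "birthday party")
alices_party = sed_style_match(
    photos, date(2013, 3, 2), "Delft", frozenset({"alice"}), "birthday party"
)
```

With the same label "birthday party", the MED-style query returns photos from both parties, while the SED-style query returns only the photos of one particular party. Exact equality on place and actors is of course a simplification.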

Again, take note of the task definition of MED. The MED task is defined such that its multimedia aspect arises because addressing the task requires combining different modalities within the same video (visual + audio channel). Typically, the data is not social video per se. Note that it is the use of the (v) definition of an event and not per se the nature of the data that distinguishes MED from other types of event detection in multimedia.

My impression is that some researchers in the community are convinced that the way forward for the research community is to first detect (v) (i.e., apply the MED event definition) and then filter the detection results to be constrained to (t, p, a, v) (i.e., generate a list of results that follows the SED definition).

Before making this assumption, I would urge researchers to carefully contemplate the use scenario of their applications. For example, if I have one picture from a birthday party I attended and I want to search for other pictures of the same birthday party on the Internet, it does not make any sense at all to solve the complete MED birthday party detection problem as the first step in the process.
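To illustrate, here is one way such a query-by-example search could skip event-type detection entirely. It is a sketch under strong assumptions that are mine and not part of any benchmark: every photo carries a reliable capture time and GPS position, and the thresholds (`max_hours`, `max_km`) are arbitrary placeholder values.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class GeoPhoto:
    """Hypothetical photo with capture time and GPS metadata."""
    photo_id: str
    taken_at: datetime
    lat: float
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))


def same_event_candidates(query: GeoPhoto, collection: List[GeoPhoto],
                          max_hours: float = 6.0,
                          max_km: float = 0.5) -> List[GeoPhoto]:
    """Keep photos close to the query in time and space; no use of (v)."""
    window = timedelta(hours=max_hours)
    return [
        ph for ph in collection
        if ph.photo_id != query.photo_id
        and abs(ph.taken_at - query.taken_at) <= window
        and haversine_km(query.lat, query.lon, ph.lat, ph.lon) <= max_km
    ]


# Made-up example data.
query = GeoPhoto("q", datetime(2013, 7, 13, 15, 0), 52.0000, 4.3700)
collection = [
    GeoPhoto("near", datetime(2013, 7, 13, 16, 0), 52.0005, 4.3705),
    GeoPhoto("far", datetime(2013, 7, 13, 15, 0), 48.8500, 2.3500),
    GeoPhoto("late", datetime(2013, 7, 20, 15, 0), 52.0000, 4.3700),
]
hits = same_event_candidates(query, collection)
```

Nothing in this search needs to know that the event is a birthday party. Whether time and place metadata are good enough in practice depends on the data, but the example shows that starting from (t, p) can be a more natural first step than starting from (v).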

As far as weddings go, if we are using visual features it's tempting to rely on the presence of that beautiful white wedding gown to detect instances of v = "wedding". However, despite an initial impression that the gown provides a stable visual indicator, it's not going to get you very far in a real-world data set:

[Image: Leaving courthouse on first day of gay marriage in Washington]

Ultimately, we need to look at both (t, p, a, v) and (v) and all the definitions of event detection in multimedia that lie in between. Luckily there are events in which researchers with all perspectives come together. I have now finished both my lunch and my blog post and, as a final note, I leave you with an example of just such an event:

Vasileios Mezaris, Ansgar Scherp, Ramesh Jain, Mohan Kankanhalli, Huiyu Zhou, Jianguo Zhang, Liang Wang, and Zhengyou Zhang. 2011. Modeling and representing events in multimedia. In Proceedings of the 19th ACM International Conference on Multimedia (MM '11). ACM, New York, NY, USA, 613-614.