Saturday, August 24, 2013

What is multimedia?

These days a mirrored ceiling is an unambiguous call to take a photo.
Yesterday evening after the conclusion of a very successful First Workshop on Speech, Language, and Audio in Multimedia (SLAM 2013) in Marseille, participants naturally drifted to various scenic spots for debriefing, including the Vieux Port. There, over a glass of pastis (predictable, given the locale) the conversation naturally moved to the question, "What is multimedia?"

One obvious answer to the question is the Wikipedia definition of multimedia, "Multimedia is media and content that uses a combination of different content forms". The classic example is, of course, video, which has a visual modality and also an audio modality. Other examples include social images (for example, images on Flickr have a visual modality, but also have tags and geo-tags) and podcasts (which have an audio modality and also a textual modality included in their RSS feeds).

One can argue with that answer, for example by pointing out that some people define multimedia as any non-text media, such as an image like the one above. The image was taken at the new events pavilion in the Vieux Port in Marseille. The events pavilion is basically a set of columns with a plane laid on top of them. The bottom of the plane is shiny, so that when you stand under it you are looking up at a ceiling that is one large mirror.

In my view, an image in isolation cannot be taken to be multimedia since it includes a single medium, namely pixels. It becomes multimedia in conjunction with this blog post, which adds an additional medium, namely text.

Another dimension of the debate on multimedia is whether it must necessarily involve human communication. The combination of this image and this blog post was created by me with the intent to communicate a message to a certain audience, i.e., the readership of my blog (which, as I have previously mentioned, is largely a few fellow researchers in conjunction with future instantiations of myself).

Researchers who share my view of multimedia insist on the point that multimedia must contain a message. It must come into being as an act of communication and also be consumed in a process that involves the interpretation of meaning. This definition excludes a set of geo-tagged surveillance videos from being multimedia, although they involve two different media, namely video content and geo-coordinates.

Note that when you require multimedia to contain a message created with explicit human intent, you step onto a bit of a slippery slope. If a human being had set up the surveillance cameras with the specific intent of creating a body of information that would give fellow humans information on current street conditions, then we are back to multimedia.

The slipperiness of the definition of multimedia reveals something very important about our field. In order to know whether or not something is multimedia, it is not sufficient to examine the data itself. Rather, it is also necessary to look at the production and consumption chain. Multimodal data in some contexts remains just data, but if an act of encoding and decoding meaning is involved, then the same multimodal data must be considered to be multimedia.

A report on SLAM 2013 has appeared in SIGMM Records. The SLAM 2014 website went online immediately after SLAM 2013 and the anticipation is already building.

Monday, August 12, 2013

SMTH vs. Catchy: The Science Inside

Catchy logo 
While all the hype is going on about SMTH, students here in the Netherlands have their heads down and are focusing on their science.

The number one SMTH competitor is Catchy, and you can read about how it works in this paper:

Rijnboutt, Eric, Hokke, Olivier, Kooij, Rob, and Bidarra, Rafael. 2013. A robust throw detection library for mobile games. Proceedings of Foundations of Digital Games (FDG 2013).

The Catchy team created Catchy as the final thesis project in their undergraduate Computer Science program at Delft University of Technology in the Netherlands.

Rumor has it that the Catchy app was such an irresistible hit that they had the whole committee standing around playing Catchy as they defended their thesis (...a happy end to a theoretical discussion on assumptions concerning initial accelerations).

You will notice that the thesis is dated June 2012, so if you called Catchy the original SMTH you would not be factually inaccurate.

Which one is installed on my own phone? In the choice between SMTH and Catchy, I go for the game that the science geeks play.

Oh, and in case you also wanted the download link in addition to the scientific paper:

Sunday, August 11, 2013

Semantics for Multimedia Research: Getting the Definitions of "Semantics" Right

Mossy Tree

There seem to be two different definitions of semantics in use in the multimedia community. Having been trained as a linguist, I have a strong opinion about which one of these is more useful for multimedia research.

The first I call semantics as interpretation. Basically, here, semantics is considered to be anything that a human might remark when viewing and trying to interpret multimedia. Concentrating, for the sake of example, on images, this first definition would consider the semantic analysis of the image above to result in the conclusion that the image depicts a tree. The image is associated with the semantics "tree" because someone looking at the picture can imaginably say, "Oh, this picture shows a tree."

The second I call semantics as signification. This is the definition of semantics that I argue is the more useful one. Signification has to do with the use of systems of signs. Systems of signs arise because humans interact with each other and in doing so develop communication conventions. Human language is the quintessential example of a type of system of signs.

These systems of signs are shaped by the nature and the capacity of our brains. However, in contrast to perception, the process of interpretation of a sign (or set of signs) requires making reference to a set of "guidelines of use" that are not primarily a product of human physiology, but rather owe their existence to human interaction. Some sign systems are established quickly, or even in the course of a single conversation or exchange, but in general the process of human language evolution is measured in millennia rather than minutes.

This semantics-as-signification definition can be stated more formally as:

Semantics is the creation and interpretation of individual and/or inter-playing signs guided by human cognition and communication convention.

Note that dictionary definitions of "semantics" mention "signification" and "signs" (see Merriam-Webster, OED, and Wikipedia), suggesting that the semantics-as-signification definition is the more generally accepted version, and that the semantics-as-interpretation definition may be rather particular to the field of multimedia research. I will argue that it is not only particular to the field, but that it is one "innovation" that we are actually better off without.

Let's revisit the image above in light of the semantics-as-signification definition. Under this definition the image above is also associated with the semantics "tree", but not merely because someone looking at the picture can imaginably say, "Oh, this picture shows a tree." Rather, critically, it is because, also, "tree" belongs to a larger system of signs. It is a concept that humans agree is part of the way in which we communicate about the world.

Let's turn now to consider the status of the moss growing on the tree in the image. Both definitions would associate the image with the semantics "moss", since "moss" is a concept just like "tree". However, we can imagine a human looking at the moss in the picture and "seeing" something different. That person might remark about this image "Oh, this picture shows north." This remark is enough to confirm that under the semantics-as-interpretation definition, the image can be associated with the semantics "northiness".

However, in order to come to a conclusion concerning the status of "northiness" under the semantics-as-signification definition, we need to look a little further.

Specifically, for the semantics-as-signification definition, we need to be able to assume that "northiness" belongs to a larger system of signs used to communicate. In other words, we need to convince ourselves of the existence of a system of signs under which this kind of image communicates "northiness".

OK. That's easy. Effectively, we have just made up exactly such a system. I tell you that this image depicts "northiness" and you can judge the next image that I show you. We will both agree that this new image is associated with "northiness". We have just invented a new convention and are using it to interact. We may not want to claim that we have created an entirely new system of signs, but we have just "re-newed" the existing system by extending it with a new sign.

Is it always possible to identify a system of signs such that we are justified in applying the semantics-as-signification definition? Is the difference between the two definitions irrelevant?

We can safely assume that, yes, it is always possible to identify such a system of signs. Human brains are creative and flexible. If I gave you another picture similar to the one above, would you be able to apply our new concept of "northiness" to it? I would imagine that you would apply it flawlessly. Also, human interaction is directed towards communication. If I use a word that you don't know, your first reaction is to try to figure out what I meant by that word. Unlike a conventional computer, your cognitive system does not shut down and throw an "input unknown" message, but instead it makes an attempt to integrate the new word into its inventory of conventions.

In sum, if no existing system of signs is already available, presto, we can create one. Further, our natural tendencies predispose us to create one automatically, frequently without even realizing that we are doing it.

However, even if we agree we can always force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, it might not always be a good idea to do so. The difference between the two definitions is very relevant indeed. 

Look at the problem with "northiness". In order for someone to be able to use "northiness" to communicate, that person would have to have read this blog post. If we want to consider "northiness" as part of the semantics of the image, the fact that people need a specialized background knowledge imposes a rather constricting limitation on the number of people who would have access to the "semantics" of the image (i.e., who could productively make use of "northiness" semantics), and thus on the applicability of our multimedia analysis.

Further, "northiness" can be seen as a rather irresponsible invention on my part. The connection between moss growth and direction is tenuous at best, and I shouldn't be suggesting that moss is a reliable indicator for finding one's way out of a woods. That could go very wrong. Personal (in contrast to conventional) interpretations threaten to limit the applicability of our multimedia analysis.

The key issue is this: If we force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, we lose track of our assumptions about the nature of the underlying system of signs. We don't know if we should now throw research effort into creating visual detectors to identify "northiness" semantics in images, or if "tree" detectors are more important. By forcing ourselves to reason carefully about the underlying systems of signs, we can make those kinds of decisions in a more informed manner.

Note that I am not suggesting that it is possible to sit down and enumerate all the signs that are part of any given system of signs. These systems are not finite; nor are they mutually exclusive. (My suggestion for how to best attack this issue is to turn to pre-lexical semantics, previously discussed here.)

What I am arguing is that the "mere" act of explicitly acknowledging that the system of signs must necessarily exist provides a productive constraint on our models, because it prevents us from extending semantic interpretations unconsciously or arbitrarily (e.g., it prevents me from suddenly declaring the reality and importance of "northiness" without further substantiation).

Adopting the semantics-as-signification definition implies understanding meaning to be something that arises via a process of negotiation between two or more human communicators with respect to a set of established conventions. In the absence of consensus between multiple humans, there is no meaning, i.e., no semantics. This is the theoretical basis for choices to focus multimedia research on analyzing those aspects of images and video which have a high inter-annotator agreement.
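
To make "high inter-annotator agreement" concrete, here is a minimal sketch of one common way to measure it: Cohen's kappa for two annotators. This is only an illustration. The function name, the toy judgments, and the choice of Cohen's kappa (rather than some other agreement statistic) are my assumptions for the example and are not taken from any particular study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Kappa compares the observed agreement with the agreement expected
    by chance, given how often each annotator uses each label.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")

    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' label frequencies,
    # summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments for six images: does the image show the concept?
tree_a = [True, True, True, True, False, False]
tree_b = [True, True, True, True, False, False]

north_a = [True, True, False, False, True, False]
north_b = [False, True, True, False, False, True]

print("tree kappa:", cohen_kappa(tree_a, tree_b))
print("northiness kappa:", cohen_kappa(north_a, north_b))
```

In this toy data the two annotators agree perfectly on "tree", while their agreement on "northiness" is worse than chance. That is the kind of signal that would suggest there is no shared convention behind the label.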

Am I opening an "If a tree falls in the forest and no one hears it does it really make a noise" debate? The image I have chosen above perhaps even invites that. However, let's close with another related image and question: "If this picture shows northiness and no one else sees it, do we really want to consider northiness to be a semantic concept for the purposes of multimedia research?" My point is that we get a lot further a lot faster if we just accept that the answer is "no".

It's not that we don't find "northiness" interesting. In fact, visual detectors capable of predicting the direction of the compass faced by the camera at the moment the image is captured have a number of potential applications. It's just that unless we stop for careful consideration before considering "northiness" to be part of the image semantics, we are in danger of sliding down a slippery slope. This slope leads to the quite unscientific habit of inventing our systems of signs as we go along, convenience-driven and possibly largely unconsciously.

Sunset straight on