N-grams: semantics

There seem to be two different definitions of semantics in use in the multimedia community. Having been trained as a linguist, I have a strong opinion about which one of these is more useful for multimedia research.

The first I call semantics as interpretation. Basically, here, semantics is considered to be anything that a human might remark when viewing and trying to interpret multimedia. Concentrating, for the sake of example, on images, this first definition would consider the semantic analysis of the image above to result in the conclusion that the image depicts a tree. The image is associated with the semantics "tree" because someone looking at the picture can imaginably say, "Oh, this picture shows a tree."

The second I call semantics as signification. This is the definition of semantics that I argue is the more useful one. Signification has to do with the use of systems of signs. Systems of signs arise because humans interact with each other and in doing so develop communication conventions. Human language is the quintessential example of a type of system of signs.

These systems of signs are shaped by the nature and the capacity of our brains. However, in contrast to perception, the process of interpretation of a sign (or set of signs) requires making reference to a set of "guidelines of use" that are not primarily a product of human physiology, but rather owe their existence to human interaction. Some sign systems are established quickly, or even in the course of a single conversation or exchange, but in general the process of human language evolution is measured in millennia rather than minutes.

This semantics-as-signification definition can be stated more formally as:

Semantics is the creation and interpretation of individual and/or inter-playing signs guided by human cognition and communication convention.

Note that dictionary definitions of "semantics" mention "signification" and "signs" (see Merriam-Webster, OED and Wikipedia). This suggests that the semantics-as-signification definition is the more generally accepted version, and that the semantics-as-interpretation definition may be rather particular to the field of multimedia research. I will argue that it is not only particular to the field, but that it is one "innovation" that we are actually better off without.

Let's revisit the image above in light of the semantics-as-signification definition. Under this definition the image above is also associated with the semantics "tree", but not merely because someone looking at the picture can imaginably say, "Oh, this picture shows a tree." Rather, critically, it is because, also, "tree" belongs to a larger system of signs. It is a concept that humans agree is part of the way in which we communicate about the world.

Let's turn now to consider the status of the moss growing on the tree in the image. Both definitions would associate the image with the semantics "moss", since "moss" is a concept just like "tree". However, we can imagine a human looking at the moss in the picture and "seeing" something different. That person might remark about this image "Oh, this picture shows north." This remark is enough to confirm that under the semantics-as-interpretation definition, the image can be associated with the semantics "northiness".

However, in order to come to a conclusion concerning the status of "northiness" under the semantics-as-signification definition, we need to look a little further.

Specifically, for the semantics-as-signification definition, we need to be able to assume that "northiness" belongs to a larger system of signs used to communicate. In other words, we need to convince ourselves of the existence of a system of signs under which this kind of image communicates "northiness".

OK. That's easy. Effectively, we have just made up exactly such a system. I tell you that this image depicts "northiness" and you can judge the next image that I show you. We will both agree that this new image is associated with "northiness". We have just invented a new convention and are using it to interact. We may not want to claim that we have created an entirely new system of signs, but we have just "re-newed" the existing system by extending it with a new sign.

Is it always possible to identify a system of signs such that we are justified in applying the semantics-as-signification definition? Is the difference between the two definitions irrelevant?

We can safely assume that, yes, it is always possible to identify such a system of signs. Human brains are creative and flexible. If I gave you another picture similar to the one above, would you be able to apply our new concept of "northiness" to it? I would imagine that you would apply it flawlessly. Also, human interaction is directed towards communication. If I use a word that you don't know, your first reaction is to try to figure out what I meant by that word. Unlike a conventional computer, your cognitive system does not shut down and throw an "input unknown" message, but instead it makes an attempt to integrate the new word into its inventory of conventions.

In sum, if no existing system of signs is already available, presto, we can create one. Further, our natural tendencies predispose us to create one automatically, frequently without even realizing that we are doing it.

However, even if we agree we can always force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, it might not always be a good idea to do so. The difference between the two definitions is very relevant indeed.

Look at the problem with "northiness". In order for someone to be able to use "northiness" to communicate, that person would have to have read this blog post. If we want to consider "northiness" as part of the semantics of the image, the fact that people need a specialized background knowledge imposes a rather constricting limitation on the number of people who would have access to the "semantics" of the image (i.e., who could productively make use of "northiness" semantics), and thus on the applicability of our multimedia analysis.

Further, "northiness" can be seen as a rather irresponsible invention on my part. The connection between moss growth and direction is tenuous at best, and I shouldn't be suggesting that moss is a reliable indicator for finding one's way out of a woods. That could go very wrong. Personal (in contrast to conventional) interpretations threaten to limit the applicability of our multimedia analysis.

The key issue is this: If we force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, we lose track of our assumptions about the nature of the underlying system of signs. We don't know if we should now throw research effort into creating visual detectors to identify "northiness" semantics in images, or if "tree" detectors are more important. By forcing ourselves to reason carefully about the underlying systems of signs, we can make those kinds of decisions in a more informed manner.

Note that I am not suggesting that it is possible to sit down and enumerate all the signs that are part of any given system of signs. These systems are not finite; nor are they mutually exclusive. (My suggestion for how to best attack this issue is to turn to pre-lexical semantics, previously discussed here.)

What I am arguing is that the "mere" act of explicitly acknowledging that the system of signs must necessarily exist provides a productive constraint on our models because it prevents us from extending semantic interpretations unconsciously or arbitrarily (e.g., it prevents me from suddenly declaring the reality and importance of "northiness" without further substantiation).

Adopting the semantics-as-signification definition implies understanding meaning to be something that arises via a process of negotiation between two or more human communicators with respect to a set of established conventions. In the absence of consensus between multiple humans, there is no meaning, i.e., no semantics. This is the theoretical basis for choices to focus multimedia research on analyzing those aspects of images and video which have a high inter-annotator agreement.

Am I opening an "If a tree falls in the forest and no one hears it does it really make a noise" debate? The image I have chosen above perhaps even invites that. However, let's close with another related image and question: "If this picture shows northiness and no one else sees it, do we really want to consider northiness to be a semantic concept for the purposes of multimedia research?" My point is that we get a lot further a lot faster if we just accept that the answer is "no".

It's not that we don't find "northiness" interesting. In fact, visual detectors capable of predicting the direction of the compass faced by the camera at the moment the image is captured have a number of potential applications. It's just that unless we stop for careful consideration before considering "northiness" to be part of the image semantics, we are in danger of sliding down a slippery slope. This slope leads to the quite unscientific habit of inventing our systems of signs as we go along, convenience-driven and possibly largely unconsciously.

In the field of multimedia, we spend so much time in discussions about semantic annotations (such as tags, or concept labels used for automatic concept detection) and whether they are objective or subjective. Usually the discourse runs along the lines of "Objective metadata is worth our effort, subjective metadata is too personal to either predict or be useful." Somehow the underlying assumption in these discussions is that we all have access to an a priori understanding of the distinction between "subjective" and "objective" and that this distinction is of some specific relevance to our field of research.

My position is that, as engineers building multimedia search engines, if we want to distinguish between subjective and objective we should do so using a model. We should avoid listening to our individual gut feelings on the issue (or wasting time talking about them). Instead, we should adopt the more modern notion of "human computational relevance" which, since the rise of crowdsourcing, has entered into conceivable reach.

The underlying model is simple: Given a definition of a demographic that can be used to select a set of human subjects and a definition of a functional context in the real world inhabited by those subjects, the level of subjectivity or objectivity of an individual label is defined as the percentage of human subjects who would say "yes, that label belongs with that multimedia item". The model can be visualized as follows:

Fig. 1: The relevance of a tag to an object is defined as the proportion of human subjects (pictured as circles) within a real-world functional context and drawn from a well-defined demographic that agree on a tag. I claim that this is the only notion of the objective/subjective distinction relevant for our work in developing multimedia search engines.

Under this view of the world, the distinction between subjective and objective reduces to the inter-annotator agreement under controlled conditions. I maintain that the level of inter-annotator agreement will also reflect the usefulness that the tag will have when deployed within a multimedia search engine designed for use, within the domain defined by the functional context, by the people in the demographic. If we want to assimilate personalized multimedia search into this picture we can define it within a functional context for a demographic consisting only of one person.
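To make the definition concrete, here is a minimal sketch in Python of relevance as the proportion of subjects who say "yes" to a label. The `Judgment` record, the `relevance` function and the example data are all hypothetical names invented for illustration. The sketch assumes that the subjects have already been selected from the demographic and placed in the functional context, and that each subject judges a given item/label pair at most once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One subject's yes/no answer for one (item, label) pair (hypothetical record)."""
    subject_id: str
    item_id: str
    label: str
    agrees: bool


def relevance(judgments, item_id, label):
    """Proportion of subjects who said the label belongs with the item.

    Returns None when no subject judged this (item, label) pair.
    """
    votes = [j.agrees for j in judgments
             if j.item_id == item_id and j.label == label]
    if not votes:
        return None
    return sum(votes) / len(votes)


# Made-up judgments from four subjects about a single image.
example_judgments = [
    Judgment("s1", "img1", "tree", True),
    Judgment("s2", "img1", "tree", True),
    Judgment("s3", "img1", "tree", True),
    Judgment("s4", "img1", "tree", True),
    Judgment("s1", "img1", "northiness", True),
    Judgment("s2", "img1", "northiness", False),
    Judgment("s3", "img1", "northiness", False),
    Judgment("s4", "img1", "northiness", False),
]

tree_relevance = relevance(example_judgments, "img1", "tree")
northiness_relevance = relevance(example_judgments, "img1", "northiness")
```

With this made-up data, "tree" comes out fully agreed upon and "northiness" is accepted by only one subject in four. The personalized case corresponds to passing in the judgments of a single subject, in which case the proportion can only be 0 or 1.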

This model reduces the subjective/objective difference to an estimation of the utility of a particular annotation within the system. The discussions we should be spending our time on are the ones about how to tackle the daunting task of implementing this model so as to generate reliable estimates of human computational relevance.
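One ingredient of a "reliable estimate" is knowing how much to trust a proportion computed from a small number of subjects. The sketch below uses the Wilson score interval, a standard statistical tool that I am swapping in here purely for illustration; it is not part of the model itself. The function name `wilson_interval` and the default `z=1.96` (roughly 95% confidence) are my assumptions, and the calculation treats the subjects as an independent random sample from the demographic, which a real worker pool will only approximate.

```python
import math


def wilson_interval(yes, n, z=1.96):
    """Wilson score interval for a proportion of `yes` answers out of `n`.

    Assumes n independent subjects sampled from the demographic;
    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("need at least one judgment")
    if not 0 <= yes <= n:
        raise ValueError("yes must lie between 0 and n")
    p = yes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Same observed proportion (0.5), different numbers of subjects.
small_pool = wilson_interval(5, 10)
large_pool = wilson_interval(50, 100)
```

Both calls see the same observed agreement of one half, but the interval for the larger pool is narrower. This is one simple way to express that ten subjects tell us less about the demographic than a hundred do.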

As mentioned above, the model is intended to be implemented on a crowdsourcing platform that will produce an estimate of the relevance of each label for each multimedia item. I am as deeply involved as I am with crowdsourcing HIT design because I am trying to find a principled manner to constrain worker pools with regard to demographic specifications and with regard to the specifications of a real-world function for multimedia objects. At the same time, we need useful estimators of the extent to which the worker pool deviates from the idealized conditions.

These are daunting tasks and will, without doubt, require well-motivated simplifications of the model. It should be clear that I don't claim that the model makes things suddenly 'easy'. However, it is clearly a more principled manner of moving forward than debate on the subjectivity vs. objectivity difference.


Sunday, August 11, 2013

Semantics for Multimedia Research: Getting the Definitions of "Semantics" Right

Saturday, August 13, 2011

Subjectivity vs. Objectivity in Multimedia Indexing
