There seem to be two different definitions of semantics in use in the multimedia community. Having been trained as a linguist, I have a strong opinion about which one of these is more useful for multimedia research.
The first I call semantics as interpretation. Basically, here, semantics is considered to be anything that a human might remark when viewing and trying to interpret multimedia. Concentrating, for the sake of example, on images, this first definition would consider the semantic analysis of the image above to result in the conclusion that the image depicts a tree. The image is associated with the semantics "tree" because someone looking at the picture can imaginably say, "Oh, this picture shows a tree."
The second I call semantics as signification. This is the definition of semantics that I argue is the more useful one. Signification has to do with the use of systems of signs. Systems of signs arise because humans interact with each other and in doing so develop communication conventions. Human language is the quintessential example of a type of system of signs.
These systems of signs are shaped by the nature and the capacity of our brains. However, in contrast to perception, the process of interpretation of a sign (or set of signs) requires making reference to a set of "guidelines of use" that are not primarily a product of human physiology, but rather owe their existence to human interaction. Some sign systems are established quickly, or even in the course of a single conversation or exchange, but in general the process of human language evolution is measured in millennia rather than minutes.
This semantics-as-signification definition can be stated more formally as:
Semantics is the creation and interpretation of individual and/or inter-playing signs guided by human cognition and communication convention.
Note that dictionary definitions of "semantics" mention "signification" and "signs" (see Merriam-Webster, OED and Wikipedia). This suggests that the semantics-as-signification definition is the more generally accepted version, and that the semantics-as-interpretation definition may be rather particular to the field of multimedia research. I will argue that it is not only particular to the field, but that it is one "innovation" that we are actually better off without.
Let's revisit the image above in light of the semantics-as-signification definition. Under this definition the image above is also associated with the semantics "tree", but not merely because someone looking at the picture can imaginably say, "Oh, this picture shows a tree." Rather, critically, it is because, also, "tree" belongs to a larger system of signs. It is a concept that humans agree is part of the way in which we communicate about the world.
Let's turn now to consider the status of the moss growing on the tree in the image. Both definitions would associate the image with the semantics "moss", since "moss" is a concept just like "tree". However, we can imagine a human looking at the moss in the picture and "seeing" something different. That person might remark about this image "Oh, this picture shows north." This remark is enough to confirm that under the semantics-as-interpretation definition, the image can be associated with the semantics "northiness".
However, in order to come to a conclusion concerning the status of "northiness" under the semantics-as-signification definition, we need to look a little further.
Specifically, for the semantics-as-signification definition, we need to be able to assume that "northiness" belongs to a larger system of signs used to communicate. In other words, we need to convince ourselves of the existence of a system of signs under which this kind of image communicates "northiness".
OK. That's easy. Effectively, we have just made up exactly such a system. I tell you that this image depicts "northiness" and you can judge the next image that I show you. We will both agree that this new image is associated with "northiness". We have just invented a new convention and are using it to interact. We may not want to claim that we have created an entirely new system of signs, but we have just "re-newed" the existing system by extending it with a new sign.
Is it always possible to identify a system of signs such that we are justified in applying the semantics-as-signification definition? Is the difference between the two definitions irrelevant?
We can safely assume that, yes, it is always possible to identify such a system of signs. Human brains are creative and flexible. If I gave you another picture similar to the one above, would you be able to apply our new concept of "northiness" to it? I would imagine that you would apply it flawlessly. Also, human interaction is directed towards communication. If I use a word that you don't know, your first reaction is to try to figure out what I meant by that word. Unlike a conventional computer, your cognitive system does not shut down and throw an "input unknown" message, but instead it makes an attempt to integrate the new word into its inventory of conventions.
In sum, if no existing system of signs is already available, presto, we can create one. Further, our natural tendencies predispose us to create one automatically, frequently without even realizing that we are doing it.
However, even if we agree we can always force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, it might not always be a good idea to do so. The difference between the two definitions is very relevant indeed.
Look at the problem with "northiness". In order for someone to be able to use "northiness" to communicate, that person would have to have read this blog post. If we want to consider "northiness" as part of the semantics of the image, the fact that people need a specialized background knowledge imposes a rather constricting limitation on the number of people who would have access to the "semantics" of the image (i.e., who could productively make use of "northiness" semantics), and thus on the applicability of our multimedia analysis.
Further, "northiness" can be seen as a rather irresponsible invention on my part. The connection between moss growth and direction is tenuous at best, and I shouldn't be suggesting that moss is a reliable indicator for finding one's way out of a woods. That could go very wrong. Personal (in contrast to conventional) interpretations threaten to limit the applicability of our multimedia analysis.
The key issue is this: If we force semantics that complies with the semantics-as-interpretation definition to fit the semantics-as-signification definition, we lose track of our assumptions about the nature of the underlying system of signs. We don't know if we should now throw research effort into creating visual detectors to identify "northiness" semantics in images, or if "tree" detectors are more important. By forcing ourselves to reason carefully about the underlying systems of signs, we can make those kinds of decisions in a more informed manner.
Note that I am not suggesting that it is possible to sit down and enumerate all the signs that are part of any given system of signs. These systems are not finite; nor are they mutually exclusive. (My suggestion for how to best attack this issue is to turn to pre-lexical semantics, previously discussed here.)
What I am arguing is that the "mere" act of explicitly acknowledging that the system of signs must necessarily exist provides a productive constraint on our models, because it prevents us from extending semantic interpretations unconsciously or arbitrarily (e.g., it prevents me from suddenly declaring the reality and importance of "northiness" without further substantiation).
Adopting the semantics-as-signification definition implies understanding meaning to be something that arises via a process of negotiation between two or more human communicators with respect to a set of established conventions. In the absence of consensus between multiple humans, there is no meaning, i.e., no semantics. This is the theoretical basis for choices to focus multimedia research on analyzing those aspects of images and video which have a high inter-annotator agreement.
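To make the idea of focusing on concepts with high inter-annotator agreement a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names, the toy annotations, and the 0.6 threshold are all hypothetical, and the score is a raw share of annotators rather than a chance-corrected statistic such as kappa.

```python
from collections import defaultdict


def concept_agreement(annotations):
    """Return, per concept, the mean share of annotators who used it,
    averaged over the images where at least one annotator used it."""
    shares = defaultdict(list)
    for label_sets in annotations.values():
        # All concepts that any annotator attached to this image.
        concepts = set().union(*label_sets)
        for concept in concepts:
            votes = sum(concept in labels for labels in label_sets)
            shares[concept].append(votes / len(label_sets))
    return {c: sum(v) / len(v) for c, v in shares.items()}


def consensual_concepts(annotations, threshold=0.6):
    """Keep only concepts whose agreement score reaches the threshold.
    The threshold value is an arbitrary, hypothetical choice."""
    scores = concept_agreement(annotations)
    return sorted(c for c, s in scores.items() if s >= threshold)


# Toy data (invented): one set of labels per annotator, per image.
annotations = {
    "img1": [{"tree", "moss"}, {"tree", "moss", "northiness"}, {"tree"}],
    "img2": [{"tree"}, {"tree", "moss"}, {"tree", "moss"}],
}

print(concept_agreement(annotations))
print(consensual_concepts(annotations))
```

In this toy example, "tree" and "moss" clear the threshold while "northiness", used by a single annotator on a single image, does not. That is the sense in which a personal interpretation fails to count as semantics under the semantics-as-signification definition.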
Am I opening an "If a tree falls in the forest and no one hears it, does it really make a noise" debate? The image I have chosen above perhaps even invites that. However, let's close with another related image and question: "If this picture shows northiness and no one else sees it, do we really want to consider northiness to be a semantic concept for the purposes of multimedia research?" My point is that we get a lot further a lot faster if we just accept that the answer is "no".
It's not that we don't find "northiness" interesting. In fact, visual detectors capable of predicting the direction of the compass faced by the camera at the moment the image is captured have a number of potential applications. It's just that unless we stop for careful consideration before considering "northiness" to be part of the image semantics, we are in danger of sliding down a slippery slope. This slope leads to the quite unscientific habit of inventing our systems of signs as we go along, convenience-driven and possibly largely unconsciously.