Saturday, August 13, 2011

Subjectivity vs. Objectivity in Multimedia Indexing

In the field of multimedia, we spend so much time in discussions about semantic annotations (such as tags, or concept labels used for automatic concept detection) and whether they are objective or subjective. Usually the discourse runs along the lines of "Objective metadata is worth our effort, subjective metadata is too personal to either predict or be useful." Somehow the underlying assumption in these discussions is that we all have access to an a priori understanding of the distinction between "subjective" and "objective" and that this distinction is of some specific relevance to our field of research.

My position is that, as engineers building multimedia search engines, if we want to distinguish between subjective and objective we should do so using a model. We should avoid listening to our individual gut feelings on the issue (or wasting time talking about them). Instead, we should adopt the more modern notion of "human computational relevance" which, since the rise of crowdsourcing, has come within conceivable reach.

The underlying model is simple: Given a definition of a demographic that can be used to select a set of human subjects and a definition of a functional context in the real world inhabited by those subjects, the level of subjectivity or objectivity of an individual label is defined as the percentage of human subjects who would say "yes, that label belongs with that multimedia item". The model can be visualized as follows:

Fig. 1: The relevance of a tag to an object is defined as the proportion of human subjects (pictured as circles) within a real-world functional context and drawn from a well-defined demographic that agree on a tag. I claim that this is the only notion of the objective/subjective distinction relevant for our work in developing multimedia search engines.
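
To make the definition concrete, here is a minimal sketch in Python of how the proportion could be estimated from a finite set of yes/no judgments. The `Judgment` record, the function names and the use of a Wilson score interval to express uncertainty are my own illustrative assumptions; the model itself only specifies the proportion.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Judgment:
    """One subject's answer for one (item, label) pair (hypothetical record)."""
    worker_id: str
    item_id: str
    label: str
    agrees: bool  # True means "yes, that label belongs with that item"


def relevance(judgments, item_id, label):
    """Proportion of subjects who agree that `label` belongs with `item_id`.

    Returns None when nobody judged the pair.
    """
    votes = [j.agrees for j in judgments
             if j.item_id == item_id and j.label == label]
    if not votes:
        return None
    return sum(votes) / len(votes)


def wilson_interval(yes, n, z=1.96):
    """Wilson score interval for a proportion (an assumed choice of interval)."""
    if n <= 0:
        raise ValueError("need at least one judgment")
    p = yes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy data: four subjects judge the tag "sunset" for one photo.
judgments = [
    Judgment("w1", "photo42", "sunset", True),
    Judgment("w2", "photo42", "sunset", True),
    Judgment("w3", "photo42", "sunset", False),
    Judgment("w4", "photo42", "sunset", True),
]
score = relevance(judgments, "photo42", "sunset")  # 0.75
low, high = wilson_interval(yes=3, n=4)
```

With only four subjects the interval is wide, which is a reminder that the number produced is an estimate of the proportion in the demographic, not the proportion itself.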

Under this view of the world, the distinction between subjective and objective reduces to inter-annotator agreement under controlled conditions. I maintain that the level of inter-annotator agreement will also reflect the usefulness that the tag will have when deployed within a multimedia search engine designed for use, within the domain defined by the functional context, by the people in the demographic. If we want to assimilate personalized multimedia search into this picture, we can define it within a functional context for a demographic consisting of only one person.
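
The personalized case can be sketched with the same computation, simply by shrinking the demographic. In the Python fragment below, the function name `relevance_for`, the predicate argument and the worker names are hypothetical, chosen only for illustration.

```python
def relevance_for(votes, in_demographic):
    """Proportion of agreeing subjects among those inside the demographic.

    votes: list of (worker_id, agrees) pairs for a single (item, label) pair.
    in_demographic: predicate deciding whether a worker belongs to the demographic.
    Returns None if no judged worker falls inside the demographic.
    """
    selected = [agrees for worker_id, agrees in votes if in_demographic(worker_id)]
    if not selected:
        return None
    return sum(selected) / len(selected)


votes = [("ann", True), ("bob", False), ("cem", True), ("dee", True)]

# A broad demographic: every worker counts.
everyone = relevance_for(votes, lambda w: True)        # 0.75

# Personalized search: the demographic consists of one person only.
only_bob = relevance_for(votes, lambda w: w == "bob")  # 0.0
```

The same tag that is fairly "objective" for the group turns out to be irrelevant for one individual, without the definition having to change.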

This model reduces the subjective/objective difference to an estimation of the utility of a particular annotation within the system. The discussions we should be spending our time on are the ones about how to tackle the daunting task of implementing this model so as to generate reliable estimates of human computational relevance.

As mentioned above, the model is intended to be implemented on a crowdsourcing platform that will produce an estimate of the relevance of each label for each multimedia item. I am as deeply involved as I am with crowdsourcing HIT design because I am trying to find a principled manner to constrain worker pools with regard to demographic specifications and with regard to the specifications of a real-world function for multimedia objects. At the same time, we need useful estimators of the extent to which the worker pool deviates from the idealized conditions.
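
As one possible illustration of such an estimator, the Python sketch below compares the demographic make-up of an actual worker pool with the target specification using total variation distance. This particular measure, the function name and the age brackets are my assumptions for the sake of an example, not a settled part of the model.

```python
from collections import Counter


def total_variation(target, workers):
    """Total variation distance between a target demographic and a worker pool.

    target: dict mapping a demographic category to its intended proportion
            (proportions are assumed to sum to 1).
    workers: list with one category entry per worker actually recruited.
    Returns a value in [0, 1]; 0 means the pool matches the target exactly.
    """
    if not workers:
        raise ValueError("worker pool is empty")
    counts = Counter(workers)
    n = len(workers)
    categories = set(target) | set(counts)
    return 0.5 * sum(abs(target.get(c, 0.0) - counts.get(c, 0) / n)
                     for c in categories)


# Hypothetical target: half of the subjects in each of two age brackets.
target = {"18-29": 0.5, "30-49": 0.5}
pool = ["18-29", "18-29", "18-29", "30-49"]
deviation = total_variation(target, pool)  # 0.25
```

A measure of this kind covers only the demographic side; how far the HIT departs from the real-world functional context would need its own estimator.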

These are daunting tasks and will, without doubt, require well-motivated simplifications of the model. It should be clear that I don't claim that the model makes things suddenly 'easy'. However, it is clearly a more principled manner of moving forward than debate on the subjectivity vs. objectivity difference.