Friday, November 30, 2012

ImageNet and the Edge of the World: On visual concept labels for images

Google Image Search results for the query "two-year-old horse"
ImageNet is a collection of images depicting concepts in the lexical database WordNet. ImageNet consists of groups of images that illustrate the same WordNet concept. In WordNet, a concept is a set of cognitive synonyms, or words that are understood to express the same thing.

ImageNet is very cool. If I were a kid, this would have been better than any of those other picture books. It is fun just to click through and explore what exists in the world.

I recently made a video about this paper:

Jia Deng; Wei Dong; Socher, R.; Li-Jia Li; Kai Li; Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," Proc. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248-255.

The video's down below, in case you want to get a quick overview of the paper...but in the video I am mostly focusing on discussing the crowdsourcing methods applied to create ImageNet.

Here, I will discuss the "Edge of the ImageNet Image World". I identify the edge with the idea of a concept being "difficult to be illustrated". I found this idea mentioned in footnote 1 of the paper:

About 20% of the synsets have very few images, because either there are very few web images available, e.g. “vespertilian bat”, or the synset by definition is difficult to be illustrated by images, e.g. "two-year-old horse". (p. 249)

I would like to make the point that we may be radically underestimating the importance of "difficult to be illustrated" visual concepts in our multimedia information indexing systems.

Difficult to be illustrated? It is quite obvious that it is not inherently difficult to have a picture of a two-year-old horse.

I have a horse, two years ago I watched it being born, and I take a picture of it. Finished.

What is difficult is to find a group of people (annotators or users) who will look only at the picture (knowing nothing about me and the horse) and agree that the horse is two years old.

It is difficult for two reasons:
  1. Context of use: The concept "two-year-old horse" is difficult to pin down exactly. Does a horse that is two years and one day old still count as a two-year-old horse? It depends on what you are using the picture for. If you are using it for a collection of "horses on their second birthdays" it won't count. However, if you are using it to illustrate horses that are less than full grown, that day doesn't matter.
  2. Background of user: You have to know something about horses to distinguish a horse that is a foal (under one year) from one that is a colt or a filly (which Wikipedia tells us are terms that may be used until the horse is 3 or 4).
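
To make the first point concrete, here is a minimal sketch of how one and the same picture gets a different answer depending on the use. The function names, the dates, and the cutoff of four years for "full grown" are my own assumptions for illustration; they do not come from ImageNet or WordNet.

```python
from datetime import date


def age_in_years(born: date, taken: date) -> int:
    """Completed years between the birth and the day the picture was taken."""
    had_birthday = (taken.month, taken.day) >= (born.month, born.day)
    return taken.year - born.year - (0 if had_birthday else 1)


def is_second_birthday(born: date, taken: date) -> bool:
    """Use 1: a collection of 'horses on their second birthdays'."""
    same_day = (taken.month, taken.day) == (born.month, born.day)
    return same_day and taken.year - born.year == 2


def is_not_full_grown(born: date, taken: date, full_grown_at: int = 4) -> bool:
    """Use 2: horses that are less than full grown.

    full_grown_at=4 is an assumed cutoff, chosen only for this example.
    """
    return age_in_years(born, taken) < full_grown_at


# A hypothetical horse photographed at two years and one day old.
born = date(2010, 11, 29)
taken = date(2012, 11, 30)

print(is_second_birthday(born, taken))  # the extra day matters here
print(is_not_full_grown(born, taken))   # the extra day does not matter here
```

The picture and the horse are the same in both calls. Only the context of use changes, and with it the label.
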
The ImageNet paper claims that "ImageNet aims to provide the most comprehensive and diverse coverage of the image world".

As multimedia researchers, we seem to assume that these "difficult to illustrate concepts" represent some marginal part of multimedia meaning. I mean, it's less than 20% of the concepts in WordNet that have this problem, so isn't it a good first approximation to just ignore them and focus on the 80% that are easily illustrated by images?

Context of use: We can just concentrate on the formal definitions of concepts. It's about delivering precise results lists when we search for images isn't it? Under that view, we can solve point 1. by deciding to use the most restrictive definition possible: the horse that turned three yesterday is no longer a two-year-old horse.

OK. So we all are totally annoyed at the guy who just sits immobile when we say, "Hand me that red screwdriver?" You climb down from the ladder just to hear him say, "I see a crimson screwdriver, but no red screwdriver." We are annoyed because we know that language is built to be used, and part of that use is the fact that we accommodate the meanings of words within their contexts of use.

But we learn to live with it. We realize the guy is literally right, so we grab the screwdriver ourselves and climb back up the ladder. We could learn to live with image search engines that behave like that as well, couldn't we?

Background of user: We can just concentrate on what the "man on the street" thinks about the image. It's about delivering results that are generally recognizable and not results that require some expert insight, isn't it? Under that view we can solve point 2. by deciding to use what a member of the general public would say about the image: it's a horse, probably not a grown up horse, but there's no telling if it's two years old.

Whoa. Hold your horses right there! Who then gets to decide who constitutes the "man on the street" or the "general public"?

Many people that I meet on the street in my daily life are not going to know the difference between quite obvious concepts like "bananas" vs. "plantains". It depends on what street I choose to look at.

With respect to many streets in Western Europe, "plantain" would be "difficult to be illustrated": people that can identify them are somehow considered experts. Not so in West Africa.

Irresponsible intuitions: In a split second, we as multimedia researchers can make a decision that seems "obvious", but that on closer consideration has potential to come back and haunt us.

We are reinforced to make these "obvious" decisions because they are the ones that allow us to continue on with our research with a minimal investment of resources in creating labeled image sets.

If I use restrictive, formal decisions, I don't have to turn to actual users of image search engines to try to understand the "language of concepts" that they use when they search.

I also don't have to try to dig down to more subtle forms of cultural bias that exist in WordNet. Who of us has time to read a volume on cultural bias in dictionaries with contributions from 40 scholars?

In the end, although "difficult to be illustrated concepts" may constitute 20% of the concepts in WordNet, we have no idea of what percent of actual user image need might be related to these concepts. It could be huge!
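
A small sketch of why the share of concepts tells us nothing about the share of need: the same 20% of concepts can account for almost none or for most of the queries, depending on what people actually search for. The query counts below are invented purely for illustration and are not measurements of any real search log; the function name is also my own.

```python
def share_of_need(query_counts: dict, hard_concepts: set) -> float:
    """Fraction of all queries that ask for a 'difficult to be illustrated' concept."""
    total = sum(query_counts.values())
    hard = sum(n for concept, n in query_counts.items() if concept in hard_concepts)
    return hard / total


# One of five concepts (20%) is "difficult to be illustrated" in both logs.
hard = {"two-year-old horse"}

# Hypothetical log A: users rarely ask for the hard concept.
log_a = {"horse": 10, "banana": 10, "screwdriver": 10, "ladder": 10,
         "two-year-old horse": 1}

# Hypothetical log B: users ask for the hard concept most of the time.
log_b = {"horse": 10, "banana": 10, "screwdriver": 10, "ladder": 10,
         "two-year-old horse": 60}

print(round(share_of_need(log_a, hard), 3))
print(round(share_of_need(log_b, hard), 3))
```

Without looking at what users really ask for, we cannot tell which of these two made-up situations is closer to the truth.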

Edge of the world: Google somehow gets it right. The search results at the top are returned by Google Images in response to the query "two-year-old horse". The first image occurs on the Internet in conjunction with the text "2 Year old Buckskin Quarter horse Colt". Someone apparently took a picture of their two-year-old horse and that seems to be right.

In the next picture, it's the kid and not the horse that's two, but that's pretty obviously wrong, and even amusing.

At the very least, this discussion allows us to conclude that if ImageNet covers "The Image World", that is a very flat world indeed. It is easy to follow a "difficult to be illustrated" concept to the end of that world and stand there looking over the edge...

...ImageNet is a valuable research tool and serves the community well. However, we should all be aware of exactly where the edge of the ImageNet world is, not that we want to avoid it, but perhaps because that is exactly the place from which we want to leap off.