Friday, May 7, 2010

Relevant to the query "List of Internet Video Genres"

From the perspective of automatic multimedia content analysis, there is a vast difference in visual content between video that was produced for the purpose of being understood and absorbed by an audience and video that was captured with no explicit communicative intent. If two people shake hands in a film, the audience will know that they are shaking hands, and it will fit with the dialogue in the soundtrack and with the overall story. If two people shake hands on a surveillance video, one can occlude the other, or perhaps they're passing a cigarette lighter. Who knows?

It seems like creator intent is an important clue for visual indexing of video for retrieval. But the example above is simplistic. If we want to find video on the internet, there is a whole range of intents between film and surveillance. What are these varieties? Today I thought I could type "list of internet video genres" into my favorite mainstream search engine and have it spit me out a list of things that we as Internet users do when we make video. That didn't happen. I spent some time in amused pondering over the juxtaposition in "The six most baffling genres on YouTube". But I wanted something a bit more comprehensive than that (with a different tone), so I'm posting my own list.

Captured video: Walking through the room and not knowing the camera is on. Interesting as a curiosity, but serves no specific purpose.

Life-log: I know the camera is on but don’t really think about it. Serves the purposes of off-line memory.

Surveillance video: I put the camera in some particular place to capture the scene, but the people in the scene don’t pay attention to it. Serves the purpose of providing extra ears and eyes.

Home video: I am obviously holding a camera and pointing it at people. Not trying to do anything but “get the feel of the situation on video”. Serves the purposes of off-line memory. The act of making the video is also inherently entertaining, and the video might not necessarily be watched.

Event: I am documenting an event, like a wedding. The video is meant to portray both the event's conformity to social convention and its uniqueness. Ideally, I want to see the faces of the bride and groom and hear the “I do”. Serves the purpose of memory, but may also be considered a public document that attests to status.

Meeting video/lecture video: I know the camera is on but don’t really think about it. Serves the purposes of off-line memory and possibly institutional record. Used to rewatch things that might have been missed the first time. The camera has a specific position – other material such as slides or whiteboard shots might be present.

Testimony: A narrator recounts an experience. Spoken audio is unscripted but consists of declarative factual statements. The narrative is usually temporally organized. The visuals are “convenience visuals”, but there are typical camera angles: frontal shot, shot of interviewer in dialogue with the interviewee. Viewer acquires some declarative knowledge, but basically, it’s an impression of the situation.

How-to video: Demonstrates how to do something. The visuals are key, with the camera angle chosen to give maximal information. Items depicted in the visual track have a high probability of being named in the speech track. After watching this video the viewer is intended to have procedural knowledge of the task.

Learning video: The video acts out scenes and viewers are invited to put themselves into the scene. This video is a surrogate for experience. Here, the camera angle is carefully chosen and any spoken audio must be clearly captured (esp. in the case of language learning videos). Again, the viewer acquires procedural knowledge, but it is via vicarious doing and not via showing. In contrast to how-to video, learning videos are only intended to be watched once.

Interview/review: I want to get someone’s view or opinion or tell my own. Can be planned or relatively unscripted. Contains opinions or attitudes. Factual statements are made to support opinions. The visuals may be convenience visuals, but may be planned to convey the feel of that person.

Report: Following a script, I report a certain event that happened. The statements I make are factual. The visuals provide depictions of the objects mentioned (broadcast news). In the end the viewer has acquired declarative knowledge.

Documentary: Reports a sequence of events organized according to some sort of ordering. Temporal ordering is common, or the events may be arranged to support a thesis. Documentaries include a narrative line: they open questions and resolve them. In the end the viewer has acquired declarative knowledge.

Film (or TV Series): Narrative created for the purposes of entertainment, but may have other elements (didactic, community memory). The basis of film is a complex “contract” between the filmmaker and the viewer, which rests on a series of established conventions that have been developed over the history of filmmaking (the literature traces this system of conventions back into novels). The nature of this contract varies from film genre to film genre. Narratives are created by setting up viewer expectations and then either fulfilling the expectations or failing to fulfill them. Scenes are carefully composed to set the scene, introduce characters and depict events. Organization can be temporal or otherwise. Shots are set up so that the viewers understand what is going on (in all but a few exceptions, the main action in the shot will be readily visible, for example, the moment that he passes her the gun, the moment that she kisses him). In addition to conveying the plot, the film aims to create a mood for the viewer. Conventions used to create mood include music, timing shifts (a quick shot sequence used to portray the passage of time), lighting and camera angles. Further, films are created to delight the viewers with their film craft, which can involve strict adherence to filmmaking principles (including references to other films) or creative breaks with convention.

Art: With art the contract between the viewer and the creator is not as complex as with film. In fact, it can be considered utterly simple. The viewer simply has to agree, “this is art”. The impression the video has on the viewer is decoupled from the intention of the creator to a greater extent than in film. Art closes the circle and resembles captured video in that the video is an object in and of itself. It assumes a purpose in the act of viewing. Art events are not necessarily depicted so that the viewer understands the “plot”; the conveyance of a certain mood may be highly viewer dependent and there is no narrative. If there is a speech track, there is no predictable coupling between the speech track and the visual channel.

Object: Sometimes we make video and we don't know why. Neither videographer nor viewer would readily commit to the "art" label. Even videos that we have made ourselves become objects that inhabit the world of objects, things we come across, think about, try to fit into the larger picture. Some of these videos are the ones whose existence in and of itself explains and expands the role of video. Here perhaps there is only a person and a camera and it is not appropriate to speak of intent. Video happens.

In sum, the intent of the creator can be used as the basis of a typology containing many different kinds of video. Each is different from the others in several important respects, including (1) what information is packaged by the creator into the visual channel and (2) what the relationship is between the visual channel and the audio track. In light of this typology, it is rather curious that we consider multimedia information retrieval to be a single discipline. Instead, every genre presents us with a unique set of challenges -- an entirely different range of issues that need to be faced to provide retrieval algorithms that succeed in meeting users' needs.
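
One way to make the typology concrete is to write a few of the genres down as records. The sketch below is only an illustration, not an established schema: the field names (`scripted`, `visuals`, `knowledge`) and the helper `names_by_knowledge` are hypothetical labels, and a field is left as `None` wherever the list above does not commit to a value.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Knowledge(Enum):
    """What the viewer is meant to come away with."""
    DECLARATIVE = "declarative"
    PROCEDURAL = "procedural"


@dataclass(frozen=True)
class Genre:
    name: str
    scripted: Optional[bool]        # None: the list above does not say
    visuals: Optional[str]          # "convenience", "key", or None
    knowledge: Optional[Knowledge]


# Only genres for which the list above states the knowledge type.
GENRES = [
    Genre("testimony", scripted=False, visuals="convenience",
          knowledge=Knowledge.DECLARATIVE),
    Genre("how-to", scripted=None, visuals="key",
          knowledge=Knowledge.PROCEDURAL),
    Genre("learning", scripted=None, visuals=None,
          knowledge=Knowledge.PROCEDURAL),
    Genre("report", scripted=True, visuals=None,
          knowledge=Knowledge.DECLARATIVE),
    Genre("documentary", scripted=None, visuals=None,
          knowledge=Knowledge.DECLARATIVE),
]


def names_by_knowledge(kind: Knowledge) -> List[str]:
    """Return genre names, in list order, whose viewers acquire `kind`."""
    return [g.name for g in GENRES if g.knowledge is kind]


if __name__ == "__main__":
    for kind in Knowledge:
        print(kind.value, "->", names_by_knowledge(kind))
```

Even this toy version shows the point of the typology: a retrieval system would want to treat the "procedural" group (where the visuals carry the content) differently from the "declarative" group (where much of the content sits in the speech track).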

Thursday, May 6, 2010

Knowing where to search

At the end of last year, VideoCLEF became MediaEval. I thought it was a great name for a multimedia retrieval benchmark evaluation and mashed up a new logo in an enthused rush. When I needed to go back and find the original illuminated "M" that I used, it seemed to be the perfect job for content-based image search. I recalled the Best Paper from ACM Multimedia 2009 on Visual Query Suggestion and headed off to Bing image search to try my hand at some combined text and image search.

I quickly found myself wishing that I had more options. In particular, I wanted to choose more than a single image at a time that was related to my query. The VIPER group at the University of Geneva has a Cross-Modal search engine that lets you select multiple relevant images for each feedback iteration. You can also select a set of images for negative feedback, which would have been helpful.
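
To illustrate what feedback with several positive and negative examples can look like, here is a minimal sketch of Rocchio-style query refinement, a textbook relevance feedback technique. This is an assumption for illustration only: nothing here describes what Bing or the VIPER engine actually do internally, and the feature vectors, the function names and the weights `alpha`, `beta` and `gamma` are made up.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Component-wise mean; a zero vector if no examples were selected."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query: Vector,
            positives: Sequence[Vector],
            negatives: Sequence[Vector],
            alpha: float = 1.0,
            beta: float = 0.75,
            gamma: float = 0.15) -> Vector:
    """Move the query toward the selected relevant images and away
    from the images marked as non-relevant."""
    dim = len(query)
    pos = mean_vector(positives, dim)
    neg = mean_vector(negatives, dim)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, pos, neg)]


if __name__ == "__main__":
    # Hypothetical 2-d image features, purely for illustration.
    query = [1.0, 0.0]
    liked = [[0.0, 1.0], [0.0, 3.0]]
    disliked = [[2.0, 0.0]]
    print(rocchio(query, liked, disliked, alpha=1.0, beta=0.5, gamma=0.5))
```

Each feedback round pulls the query further toward what has already been seen and liked, which is exactly why it can steer a search away from an item that looks different along the chosen feature dimensions.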

But for this particular search, the Bing option for limiting the search to black & white proved helpful. After a few iterations, I came up with some nice-looking results that gave me a sense that I was really moving in the right direction.

However, my search did not return the "M" that I had originally used. I went to Google images, formulated and reformulated. "Letter M illuminated", "Medieval manuscript M", "Illuminated medieval letter"...nothing seemed to help. Arg! Isn't this task easy? Shouldn't this just be duplicate detection?

Then I remembered that when I was looking for the original "M" I wanted to make sure that there would be no licensing issues so that MediaEval could use it freely. I had been experimenting at the time with the Creative Commons search engine so I went back there and put in the simplest of all possible queries "Illuminated M."

Bingo. The original M from the Chronica Polonorum on Wikimedia Commons.

How often when we are searching do we remember that Web search is all about recall? Multimodal relevance feedback may expand our queries, but it also limits our results. If I hadn't been engaged in known-item search, I would never have known about the "M" I was missing. Similarity along radically simplistic visual dimensions is useful, but inevitably something will fall between the cracks. Thankfully it seldom seems to matter, but we shouldn't let our awareness that we might be missing something slip from our consciousness.

The more interesting observation was that the key to re-finding my image was reconstructing the way that I found it in the first place. Not only knowledge about the "M", but also detailed knowledge of where and how I should be looking for it turned out to be critical.

The search process is entertaining in and of itself. I am not going to reveal how much time I was willing to devote to finding that "M" and browsing through the images that Bing came up with as similar. The visual feedback did turn up a useful by-product -- not the direct target of my search: a beautiful high-resolution "M" that should satisfy gripes about the low quality of our MediaEval logo.