Sunday, June 6, 2010

Looking for a Scientific Programmer

I'm starting a cool new project on speech-based access to images, but I am in need of a programmer. I'm trying to find the right person -- ideally I would like someone who also has an interest in the process of design and evaluation of the system. The person has probably just finished their masters and is trying to get an idea of an area for a PhD, or just generally thinks that one year of experience in the Multimedia Information Retrieval Lab at Delft University of Technology would be enriching. Unfortunately, trying to get someone like this just led to me losing my dream candidate to a PhD program in Groningen. Yikes! The project's starting on 1 September 2010.

The formal qualifications are listed below. Thanks in advance if you help me find a match between my needs and a candidate.

Profile for scientific programmer in the Delft Multimedia Information Retrieval Lab at the TU-Delft
Contact: Martha Larson m.a.larson@tudelft.nl
  • Experience developing web applications using a web development stack (one of LAMP, Java/Tomcat, ASP.NET/C#)
  • Experience with designing HTTP-based server APIs
  • Alternatively or in addition: Experience with HTML/JavaScript/AJAX/Flash/Silverlight
  • Alternatively or in addition: Interest or experience with Android
  • Experience with speech recognition, dialog systems, audio spatialization a benefit
  • Experience with programming in a research environment a plus
  • Proficiency in English, both spoken and written
The person needs to be an EU citizen, but if I find a great candidate who is not, I am willing to attempt to "battle the system" to get him/her.

P.S. This post represents an experiment in making use of my social network. If I were more up-to-date I suppose I'd be using LinkedIn or Facebook, but neither comes naturally to me, somehow.

Friday, May 7, 2010

Relevant to the query "List of Internet Video Genres"

From the perspective of automatic multimedia content analysis, there is a vast difference in visual content between video that was produced for the purpose of being understood and absorbed by an audience and video that was captured with no explicit communicative intent. If two people shake hands in a film, the audience will know that they are shaking hands and it will fit with the dialogue in the sound track and with the overall story. If two people shake hands on a surveillance video, one can occlude the other, or perhaps they're passing a cigarette lighter. Who knows?

It seems like creator intent is an important clue for visual indexing of video for retrieval. But the example above is simplistic. If we want to find video on the internet, there is a whole range of intents between film and surveillance. What are these varieties? Today I thought I could type "list of internet video genres" into my favorite mainstream search engine and have it spit me out a list of things that we as Internet users do when we make video. That didn't happen. I spent some time in amused pondering over the juxtaposition in The six most baffling genres on YouTube. But I wanted something a bit more comprehensive than that (with a different tone), so I'm posting my own list.

Captured video: Walking through the room and not knowing the camera is on. Interesting as a curiosity, but serves no specific purpose.
Life-log: I know the camera is on but don’t really think about it. Serves the purposes of off-line memory.


Surveillance video: I put the camera in some particular place to capture the scene, but the people in the scene don’t pay attention to it. Serves the purpose of providing extra ears and eyes.


Home video: I am obviously holding a camera and pointing at people. Not trying to do anything but “get the feel of the situation on video”. Serves the purposes of off-line memory. The act of making the video is also inherently entertaining and it might not necessarily be watched.


Event: I am documenting an event, like a wedding. The video is meant to portray both the compliance of the event with social convention and also its uniqueness. Ideally, I want to see the faces of the bride and groom and hear the “I do”. Serves the purpose of memory, but may also be considered a public document that attests to status.


Meeting video/lecture video: I know the camera is on but don’t really think about it. Serves the purposes of off-line memory and possibly institutional record. Used to rewatch things that might have been missed the first time. The camera has a specific position – other material such as slides or white board shots might be present.


Testimony: A narrator recounts an experience. Spoken audio is unscripted but consists of declarative factual statements. The narrative is usually temporally organized. The visuals are “convenience visuals”, but there are typical camera angles: frontal shot, shot of interviewer in dialogue with the interviewee. Viewer acquires some declarative knowledge, but basically, it’s an impression of the situation.


How-to video: Demonstrates how to do something. The visuals are key, with the camera angle chosen to give maximal information. Items depicted in the visual track have a high probability of being named in the speech track. After watching this video the viewer is intended to have procedural knowledge of the task.


Learning video: The video acts out scenes and viewers are invited to put themselves into the scene. This video is a surrogate for experience. Here, the camera angle is carefully chosen and any spoken audio must be clearly captured (esp. in the case of language learning videos). Again, the viewer acquires procedural knowledge, but it is via vicarious doing and not via showing. In contrast to how-to video, learning videos are only intended to be watched once.

Interview/review: I want to get someone’s view or opinion or tell my own. Can be planned or relatively unscripted. Contains opinions or attitudes. Factual statements are made to support opinions. The visuals may be convenience visuals, but may be planned to convey the feel of that person.


Report: Following a script, I report a certain event that happened. The statements I make are factual. The visuals provide depictions of the objects mentioned (broadcast news). In the end the viewer has acquired declarative knowledge.


Documentary: Reports a sequence of events subordinate to some sort of ordering. Temporal ordering is common, or they may be ordered in order to support a thesis. Documentaries include a narrative line: they open questions and resolve them. In the end the viewer has acquired declarative knowledge.


Film (or TV Series): Narrative created for the purposes of entertainment, but may have other elements (didactic, community memory). The basis of film is a complex “contract” between the filmmaker and the viewer, which rests on a series of established conventions that have been developed over the history of filmmaking (the literature traces this system of conventions back into novels). The nature of this contract varies from film genre to film genre. Narratives are created by setting up viewer expectations and then either fulfilling the expectations or failing to fulfill them. Scenes are carefully composed to carry out scene setting, introduce characters and to depict events. Organization can be temporal or otherwise. Shots are set up so that the viewers understand what is going on (in all but a few exceptions, the main action in the shot will be readily visible, for example, the moment that he passes her the gun, the moment that she kisses him). In addition to conveying the plot, the film aims to create a mood for the viewer. Conventions used to create mood include music, timing shifts (quick shot sequence used to portray the passage of time), lighting, camera angles. Further, films are created to delight the viewers with their film craft, which can involve strict adherence to filmmaking principles (including references to other films) or creative breaks with convention.


Art: With art the contract between the viewer and the creator is not as complex as with film. In fact, it can be considered utterly simple. The viewer simply has to agree, “this is art”. The impression the video has on the viewer is decoupled from the intention of the creator to a greater extent than in film. Art closes the circle and resembles captured video in that the video is an object in and of itself. It assumes a purpose in the act of viewing. Art events are not necessarily depicted so that the viewer understands the “plot”, the conveyance of a certain mood may be highly viewer dependent and there is no narrative. If there is a speech track, there is no predictable coupling between the speech track and the visual channel.



Object: Sometimes we make video and we don't know why. Neither videographer nor viewer would readily commit to the "art" label. Even videos that we have made ourselves become objects that inhabit the world of objects, things we come across, think about, try to fit into the larger picture. Some of these videos are the ones whose existence in and of itself explains and expands the role of video. Here perhaps there is only a person and a camera and it is not appropriate to speak of intent. Video happens.

In sum, the intent of the creator can be used as a basis of a typology containing many different kinds of video. Each is different from the other in several important respects, including (1) what information is packaged by the creator into the visual channel and (2) what the relationship is between the visual channel and the audio track. In light of this typology, it is rather curious that we consider multimedia information retrieval to be a single discipline. Instead, every genre presents us with a unique set of challenges -- an entirely different range of issues that need to be faced to provide retrieval algorithms that succeed in meeting users' needs.

Thursday, May 6, 2010

Knowing where to search

At the end of last year, VideoCLEF became MediaEval. I thought it was a great name for a multimedia retrieval benchmark evaluation and mashed up a new logo in an enthused rush. When I needed to go back and find the original illuminated "M" that I used, it seemed to be the perfect job for content-based image search. I recalled the Best Paper from ACM Multimedia 2009 on Visual Query Suggestion and headed off to Bing image search to try my hand at some combined text and image search.

I quickly found myself wishing that I had more options. In particular, I wanted to choose more than a single image at a time that was related to my query. The VIPER group at the University of Geneva has a Cross-Modal search engine that lets you select multiple relevant images for each feedback iteration. You can also select a set of images for negative feedback, which would have been helpful.
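To make the idea of positive and negative feedback concrete, here is a minimal sketch of one classic way to fold selected examples back into a query: the Rocchio update. It is an illustration only. I am not claiming this is what Bing or the VIPER engine does, and the feature vectors, the function names and the weights alpha, beta and gamma are all assumptions made up for the example.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors (all zeros if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the images marked relevant and away from
    the images marked non-relevant (illustrative weights, not tuned)."""
    dim = len(query)
    pos = mean_vector(positives, dim)
    neg = mean_vector(negatives, dim)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


# Hypothetical 2-dimensional image features, purely for illustration.
query = [1.0, 0.0]
relevant = [[0.0, 1.0], [0.0, 3.0]]   # several images picked in one iteration
non_relevant = [[2.0, 0.0]]           # negative feedback
print(rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.5))
```

Selecting several relevant images per iteration simply gives the update a better average to move toward, which is why I was wishing for more than one click at a time.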

But for this particular search, the Bing option for limiting search to black & white proved helpful. After a few iterations, I came up with some nice looking results that gave me a sense that I was really moving in the right direction.

However, my search did not return the "M" that I had originally used. I went to Google images, formulated and reformulated. "Letter M illuminated", "Medieval manuscript M", "Illuminated medieval letter"...nothing seemed to help. Arg! Isn't this task easy? Shouldn't this just be duplicate detection?

Then I remembered that when I was looking for the original "M" I wanted to make sure that there would be no licensing issues so that MediaEval could use it freely. I had been experimenting at the time with the Creative Commons search engine so I went back there and put in the simplest of all possible queries "Illuminated M."

Bingo. The original M from the Chronica Polonorum on Wikimedia Commons.

How often when we are searching do we remember that Web search is all about recall? Multimodal relevance feedback may expand our queries, but it also limits our results. If I weren't engaging in known-item search I would never have known the "M" I was missing. Similarity along radically simplistic visual dimensions is useful, but inevitably something will fall between the cracks. Thankfully it seldom seems to matter, but we shouldn't let our awareness that we might be missing something slip from our consciousness.

The more interesting observation was that the key to re-finding my image was reconstructing the way that I found it in the first place. Not only knowledge about the "M", but also detailed knowledge of where and how I should be looking for it turned out to be critical.

The search process is entertaining in and of itself. I am not going to reveal how much time I was willing to devote to finding that "M" and browsing through the images that Bing came up with as similar. The visual feedback did turn up a useful by-product -- not the direct target of my search: a beautiful high-resolution "M" that should satisfy gripes about the low quality of our MediaEval logo.

Saturday, April 10, 2010

Speechless? Not us.


Our proposal for a fourth workshop on Searching Spontaneous Conversational Speech at ACM Multimedia 2010 was accepted today.

ACM Multimedia 2010 Workshop
Searching Spontaneous Conversational Speech (SSCS 2010)
29 October 2010, Firenze, Italy

http://www.searchingspeech.org/

The SSCS 2010 workshop is a forum for presentation of recent research results concerning advances and innovation in the area of spoken content retrieval and in the area of multimedia search that makes use of automatic speech recognition technology. Spontaneous, conversational speech occurs in a wide variety of domains and the workshop is relevant for lectures, meetings, interviews, debates, conversational broadcast (e.g., talkshows), podcasts, call center recordings, cultural heritage archives, social video on the Web and spoken natural language queries. The objective of the workshop is to bring together researchers in the areas of speech recognition, audio processing, multimedia analysis and information retrieval for exchange and interaction.

Thursday, April 1, 2010

Living with defeats

In 2006, I moved to Amsterdam and in 2007, I started a blog entitled "Living with defeats," which lasted exactly four entries before I abandoned it. The title plays with the word fiets, pronounced "feats". It's the Dutch word for bike, de fiets meaning "the bike" and pronounced approximately "defeats". A bicycle was my constant companion and the beast of burden that got me there and back from my apartment near Dam Square in the center of Amsterdam to the Science Park where I worked in the Intelligent Systems Lab Amsterdam.

The blog's title "Living with defeats" probably both anticipated and contributed to its quick demise. I thought I'd reflect a bit more about what went wrong, especially since my second attempt, i.e., this blog, is now celebrating its one year anniversary. This time, as a blogger I am doing something differently. What?

An important aspect is my understanding of my audience. In "Living with defeats", I was attempting to write a blog of the genre "American living in Amsterdam." Who was going to read it? Here, I write about things that keep my brain busy. It's more of an attempt to empty my head. To put ideas on the back burner. The best typification: this blog is a conversation with my future self. In other words, now, I really know my audience.

Then there are the practical aspects. Blogger is easy to use -- I have a browser open most of the time anyway. "Living with defeats" didn't use a blogging platform and required quite a few extra clicks to write and publish.

Also, I have a different perspective on how frequently I need to blog. I find that I have time to do it about once or twice a month. In Amsterdam, nearly every day generated a new bike-related anecdote and I had a constant feeling of being behind. For all the cute word plays, I failed to be "dam square" with myself on how much time I actually had to devote to the project.

Most importantly, I am incurably passionate about search and especially about spoken content search. I catch myself in my free time thinking about finding things, especially bits of interesting multimedia, and also reflecting on the subject of "what I know about the things I don't know", i.e., the possibility of finding information or a resource that I don't know exists.

It's not that I don't still think about bikes. Although the Triumph finally needed to be replaced and the commute to Delft University of Technology is much shorter, I still accumulate plenty of Dutch bike anecdotes. When I get the urge to write about my bike, however, I control myself, at least in this forum.

Probably the major contributor to the relative longevity of this blog has been my work on coherence measures.

He, J., Weerkamp, W., Larson, M. and de Rijke, M., An Effective Coherence Measure to Determine Topical Consistency in User Generated Content. International Journal on Document Analysis and Recognition, Vol. 12, No. 3, pages 185-203, October 2009.

Blogger consistency is a blog characteristic that is independent of the topic of a blog, "orthogonal to topic", as it were. Yet, it really impacts building of blogger credibility and the relevance of blog material to user information needs.

From this distance, I have come to regard "Living with defeats" as an exercise in cheap failure. I'm not exactly sure how long my new blogging project will last. But already, I've been able to consolidate my thinking by going back and reading old posts. Since my future self is the target audience, this constitutes nothing short of a rave review.

Actually, I'm living with my Triumph
Wednesday January 10, 2007



Although this blog claims to be about defeats it is actually about triumph, or should I say a Triumph. De fiets, pronounced more or less "defeats", is Dutch for "the bicycle", and the bike that inspired me to go on-line with my musings about Amsterdam is a metallic blue Triumph.

Currently, the Triumph is being stored at Central Station. "Your tire is flat", the attendant pointed out as I brought my bike down to the bike storage to lock it up. In my rush to make my train on time I hadn't noticed.

New Year's resolutions: fix the tire, get a picture of the Triumph on to the blog, learn how to spot a flat before it is pointed out to me by a more bicycle-aware Dutch denizen.

Key to Triumph
Sunday January 28, 2007

The Triumph (tire now repaired) spends a lot of time tied with a chain to railings of bridges over canals.

The first time I locked my bike up in this way, I realized that even a little slip on my part could lead to my whole ring full of keys flying through the bars of the railing and into the opaque water.

How many people in Amsterdam, I asked myself, have experienced such a mishap? If other people are as inclined to fumbling as I am, the bottom of those canals must be strewn with bundles of keys. I wondered how many busy schedules have lurched to a standstill because of keys flying into canals. People who have suddenly found themselves locked out of house and office and unexpectedly deprived of their means of transport, now immobilized on a bridge. Hopeful job applicants fatally late to that big interview. Romantics with nervous fingers never forgiven for missing a first assignation.

Anything capable of such disruption of the directed bustle of Amsterdam, must surely be a widely-foreseen threat.

There must be a service, I thought, someone to call in case this happens. I wondered if this would be a key-retrieval type service, which would rescue keys from any of the places they can fall in Amsterdam (train tracks, too, for example.) Or would this be a special service to retrieve things from the bottom of canals?

I addressed this question to an Amsterdam expert I know, and got quite a practical reply. Of course there is no rescue service. No one puts a bike key on a bundle with other important keys. A bike key needs its own ring. And naturally, you have a spare bike key somewhere handy.

It will be interesting to see how I do, administering now two key rings in my life along with mobile phone, memory stick, wallet and pocket calendar. Of course, I have particular motivation to get it right, since the chain to the Triumph has no spare key.

Have a light?
Friday March 16, 2007

What at first appears to be a snarl of cars, bikes and pedestrians has sorted itself out in my mind into three distinct streams whose twining is regulated by a set of rules. One learns to reconstruct the rules by asking questions, by observing the signs and markings on the bike paths and by watching how more experienced bikers avoid running into each other. Some of the rules I don't think I will ever be in danger of breaking. For example, the fact that there are occasionally speed bumps built into bike paths leads me to believe that there must be some kind of a bike speed limit. My Triumph is, however, permanently stuck in third gear. Without doubt it is able to achieve significant speed, but the protests of my knees will keep me well below ever being the kind of biker that those speed bumps were built to rein in.

One rule I do watch out for is the lights rule. You're supposed to have lights to be visible after the sun sets. You can imagine that if my gears do not work properly it is something of a challenge to maintain the rather delicate set of wires that runs from the dynamo to the head light and back to the tail light. Extricating one's own bike from a forest of other bikes locked to the same rack tends to get wires and things caught on other people's pedals and brake levers. I am careful. I monitor my bulbs. And at night my pedal power is transformed not only into forward motion but into twin beacons, one ahead and one behind.

I was rounding the corner to go under the bridge at Muiderpoort and two police agents stepped on to the bike path one on either side. Oh, I thought, this is how the police stop bikes in Amsterdam and I stopped.

"Mevrouw" they said "we have stopped you because of the example you are providing for the neighborhood." I was a bit flustered. What kind of an example? What was I doing wrong? They looked at an age to be relatively new to the force and their faces were solemn. Doubtlessly they had those bike rules I was trying to reconstruct freshly memorized down to the last footnote.

Thankfully the dramatic pause was short and they continued on to explain, "You are using your bike lights. To show our appreciation we would like to present you with this light-up police keychain."

My fluster fled. "Oh," I exclaimed. "Bedankt." I took the keychain. "Echt leuk." But you have to imagine that I used the American English intonation of "Hey, thank you, that's soooooo nice."

They looked at each other. Then back at me. They were probably not too thrilled about getting assigned to keychain distribution detail. My Anglophone ebullience was not making the assignment any easier.

"You're welcome," in English, the staccato reply.

Now, the light is on my keychain. It has a police logo and a button to push to make it light up. I mean to show it to the next police agent who pulls me over, since it will without doubt be for doing something contrary to those bike rules.

Fiets feats: Breakfast by bike
Sunday May 13, 2007

In Amsterdam, you don't interrupt your life just because you are on your bike. An mp3 player or a cell phone conversation makes a long ride go by much quicker. Now I automatically get on my bike with my cell phone somewhere I can answer it easily.

But I am definitely still a newbie. Around me feats of unimagined coordination and balance are integral to the lives of Dutch denizens. One kid in the back one kid in the front. Everyone gets to school and work on time. One kid in the front one kid in the back add the grocery shopping and we're home in time for dinner. No kids, on the cell phone explaining the delay to the baby sitter.

Left hand steering right hand clutching cell phone thumb tapping in text message to boss. Call friend to talk about stressful day, right hand on cell phone left hand holding cigarette still sneaks in adequate steering. I haven't actually seen this combination, but I strongly suspect that the smoking phone talker could easily add an umbrella to the mix in case of rain.

The part of my life that is moving onto my bike is breakfast. I don't wake up hungry in the morning. I am not a breakfast person and my body does not wake up with the realization that there is going to be a half hour of hard pedaling between leaving the house and arriving at work. So half way to work I have taken to stopping and picking myself up a scone.

The scone I put in my pocket, since I need two hands to get going. But once under way I can do handle bar left hand, scone right hand, mouth munching. And no one stares. Quite to the contrary. Already on my second bike breakfast a passerby wished me bon appetit.

I am getting quite a lot of practice with the breakfast scone and I hope in a couple months to be able to add a cup of coffee. Someone in Amsterdam must sell handle-bar mountable cup holders.

Wednesday, March 31, 2010

and I have transcended

Here's a screen shot from my YouTube video about the intelligent multimedia player. I've had YouTube do the automatic transcription, which you can see displayed as a caption at the bottom.

Captioning my video with the YouTube speech recognition service got me thinking about an op-ed piece that appeared in the International Herald Tribune last month: Typing with a Voice by Stewart Wachs. As the title suggests it's about using the computer via a speech recognition interface rather than a keyboard. Unsurprisingly, his speech recognition software generates some wildly off-the-mark transcriptions of what he's saying -- and these serve to make the op-ed quite amusing. But what sticks and doesn't let me go is the final turn of the article, where he talks about a flash of insight that allowed him to start extending empathy to the software. This change of attitude made the mis-recognitions less frustrating, allowing him to work with the technology instead of pulling in the opposite direction.

He worries that his compassion for the software is misplaced, citing the pathetic fallacy, the mistaken attribution of human characteristics to an inanimate object. The compassion is useful, nonetheless, he observes.

Does this explain my own fascination with speech recognition that has extended over more than a decade now? Is it some sort of innate ability, a gift I have to empathize with the software? Perhaps. If there is indeed compassion involved, more likely it arises not from working with inanimate software, but rather from the accumulation of interactions with speech recognition researchers I have had over the years. It stems from an appreciation for their creativity and hard work and especially their patience with an external world where word error rate rules and where anything less than 99% accuracy is declared a priori useless. Humans, they patiently point out, don't achieve 99% accuracy in transcription and speech transcripts can be useful for speech retrieval even with error rates over 50%.
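For readers who have not met word error rate before, here is a minimal sketch of how it is usually computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name and both example sentences are made up for illustration; they are not taken from any real recognizer.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference and recognizer output, invented for this example.
reference = "we search spoken content in large video archives"
hypothesis = "we church broken content in lodge video are hives"
print(word_error_rate(reference, hypothesis))
```

In this invented example more than half of the words are wrong, yet "content" and "video" survive intact, which gives a feel for why a transcript with a high error rate can still support retrieval.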

Sometimes you administer a bit of tough love, however, like on the day I decided to test drive the new YouTube automatic transcription service. The video I transcribed was recorded using the built-in mic on my Canon PowerShot, which was about 2 meters away from me, and there was also quite an echo in the kitchen where I recorded the video. If you were squinting at the small screen shot above to read the transcript generated by Google I'll now reveal that it reads, "and I have transcended weeks." Hmm, maybe I have transcended -- reached some sort of a plane where I embrace rather than push away the unexpected outputs of automatic speech recognition. They say, after all, that the speech recognition problem is AI-complete, as difficult as artificial intelligence. Maybe there is no other reasonable choice.

In reality, what I am saying at this point in the video is "...and our friends send us links!" What does this have to do with search? I say "send", YouTube hears "scend": not much of a common ground, but with a syllable-level indexing system it would be enough to help locate this quote within the video. That's a lot better than what we have right now.
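Here is a minimal sketch of why matching below the word level helps in this case. It compares plain word matching with matching on subword units, using phoneme sequences as a simple stand-in for syllables. The tiny pronunciation lexicon is hand-written and hypothetical, and the function names are made up for this example; a real system would index the recognizer's own subword output rather than look words up in a table.

```python
# Hypothetical hand-written pronunciation lexicon, for illustration only.
LEXICON = {
    "and": ["AE", "N", "D"],
    "i": ["AY"],
    "have": ["HH", "AE", "V"],
    "transcended": ["T", "R", "AE", "N", "S", "EH", "N", "D", "IH", "D"],
    "weeks": ["W", "IY", "K", "S"],
    "send": ["S", "EH", "N", "D"],
}


def to_phonemes(text):
    """Turn a string of words into one flat phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def word_match(query, transcript):
    """True if the query appears as a whole word in the transcript."""
    return query.lower() in transcript.lower().split()


def subword_match(query, transcript):
    """True if the query's phoneme sequence occurs anywhere in the transcript's."""
    q = to_phonemes(query)
    t = to_phonemes(transcript)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


asr_output = "and I have transcended weeks"
print(word_match("send", asr_output))     # word-level search misses the quote
print(subword_match("send", asr_output))  # the sounds of "send" sit inside "transcended"
```

The word-level lookup fails because "send" is not one of the recognized words, while the subword lookup succeeds because the sounds of "send" are contained in "transcended".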

Thursday, March 18, 2010

Time to play

What's the difference between a single image and a video? Well, video is a timed medium, otherwise called a time-continuous or time-synchronized medium. It's impossible to take in at a glance, and in order to appreciate it, or derive any benefit from it at all, you have to devote time to watching it.

Watching would be more efficient if we didn't need to watch from end to end, but rather had some sort of a road map for the video: an intelligent player that would give us signposts, provide us with an indication of where the video could be most interesting.

I was invited to give a position statement at a session on search at ICTDelta 2010, a larger IT event in the Netherlands. I took the position that we need to have intelligent multimedia players that give us clues as to where the video is the most interesting. The position statement was in Dutch, but I made an English version in the form of a YouTube video.