Wednesday, March 31, 2010

and I have transcended

Here's a screen shot from my YouTube video about the intelligent multimedia player. I've had YouTube do the automatic transcription, which you can see displayed as a caption at the bottom.

Captioning my video with the YouTube speech recognition service got me thinking about an op-ed piece that appeared in the International Herald Tribune last month: Typing with a Voice by Stewart Wachs. As the title suggests, it's about using the computer via a speech recognition interface rather than a keyboard. Unsurprisingly, his speech recognition software generates some wildly off-the-mark transcriptions of what he's saying -- and these serve to make the op-ed quite amusing. But what sticks and doesn't let me go is the final turn of the article, where he talks about a flash of insight that allowed him to start extending empathy to the software. This change of attitude made the mis-recognitions less frustrating, allowing him to work with the technology instead of pulling in the opposite direction.

He worries that his compassion for the software is misplaced, citing the pathetic fallacy, the mistaken attribution of human characteristics to an inanimate object. The compassion is useful, nonetheless, he observes.

Does this explain my own fascination with speech recognition, which has extended over more than a decade now? Is it some sort of innate ability, a gift I have to empathize with the software? Perhaps. If there is indeed compassion involved, more likely it arises not from working with inanimate software, but rather from the accumulation of interactions I have had with speech recognition researchers over the years. It stems from an appreciation for their creativity and hard work, and especially their patience with an external world where word error rate rules and where anything less than 99% accuracy is declared a priori useless. Humans, they patiently point out, don't achieve 99% accuracy in transcription, and speech transcripts can be useful for speech retrieval even with error rates over 50%.

Sometimes you administer a bit of tough love, however, like on the day I decided to test drive the new YouTube automatic transcription service. The video I transcribed was recorded using the built-in mic on my Canon PowerShot, which was about 2 meters away from me, and there was also quite an echo in the kitchen where I recorded it. If you were squinting at the small screen shot above to read the transcript generated by Google, I'll now reveal that it reads, "and I have transcended weeks." Hmm, maybe I have transcended -- reached some sort of a plane where I embrace rather than push away the unexpected outputs of automatic speech recognition. They say, after all, that the speech recognition problem is AI-complete, as difficult as artificial intelligence. Maybe there is no other reasonable choice.

In reality, what I am saying at this point in the video is "...and our friends send us links!" What does this have to do with search? I say "send" and YouTube hears "scend": not much common ground, but with a syllable-level indexing system it would be enough to help locate this quote within the video. That's a lot better than what we have right now.
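To make the "send" versus "scend" point a bit more concrete, here is a toy sketch in Python of what I mean by syllable-level indexing. Everything in it is an assumption made for illustration: the syllable split, the timestamps, and the crude `sound_key` normalization are invented, and none of this reflects how YouTube or any real recognizer works.

```python
from collections import defaultdict

# Hypothetical recognizer output as (syllable, start time in seconds).
# The syllable boundaries and timings are made up for this example.
RECOGNIZED = [
    ("and", 12.0), ("i", 12.3), ("have", 12.5),
    ("tran", 12.8), ("scend", 13.1), ("ed", 13.4), ("weeks", 13.6),
]


def sound_key(syllable):
    """Crude stand-in for phonetic normalization (an assumption, not a
    real phonetic algorithm): treat "sc" as sounding like "s"."""
    return syllable.lower().replace("sc", "s")


def build_index(syllables):
    """Map each normalized syllable to the times at which it was heard."""
    index = defaultdict(list)
    for syllable, start in syllables:
        index[sound_key(syllable)].append(start)
    return index


def lookup(index, query_syllable):
    """Return the start times whose syllable sounds like the query."""
    return index.get(sound_key(query_syllable), [])


index = build_index(RECOGNIZED)
print(lookup(index, "send"))  # time of the "scend" syllable inside "transcended"
```

A word-level index built from the same transcript would contain "transcended" but not "send", so the query would come back empty. At the syllable level the mis-recognized word still shares a unit with what I actually said, and that shared unit is enough to point back into the video.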

Thursday, March 18, 2010

Time to play

What's the difference between a single image and a video? Well, video is a timed medium, otherwise called a time-continuous or time-synchronized medium. It's impossible to take in at a glance, and in order to appreciate it, or derive any benefit from it at all, you have to devote time to watching it.

Watching would be more efficient if we didn't need to watch from end to end, but rather had some sort of a road map for the video: an intelligent player that would give us signposts and provide us with an indication of where the video could be most interesting.

I was invited to give a position statement at a session on search at ICTDelta 2010, a large IT event in the Netherlands. I took the position that we need intelligent multimedia players that give us clues as to where the video is the most interesting. The position statement was in Dutch, but I made an English version in the form of a YouTube video.