The final day of Interspeech 2009 here in Brighton. It's been a great conference and each and every keynote has been well worth getting up for. This morning, Mari Ostendorf talked about "Transcribing Speech for Spoken Language Processing." Interspeech encompasses a staggeringly broad spectrum of perspectives on speech research and technology. For every point here, there is an immediate counterpoint, and it was without doubt under the influence of this chorus that the opening slide of the keynote this morning displayed a long-play version of the title, reminding the audience that they would be hearing about transcribing human-directed human speech, as opposed to speech that humans produce to communicate with computers.
The message from the keynote that will ring longest in my ears was, "The goal of speech transcription is information access." This leaves open, of course, the question of what is the information and what is the access when it comes to content that contains the spoken word. I find myself compiling little lists of domains in which information encoded in spoken audio could be important: podcasts, video diaries, lifelogs, meetings, call center recordings, social video networks, Web TV, conversational broadcast, lectures, discussions, debates, interviews and cultural heritage archives, home videos, photo annotations, video conferences. These lists invariably end with etc. etc. etc. And what constitutes access (keyword search, retrieval, question answering, browsing, recommendation...) is another question to which we can't give a closed-set answer.
My personal experience doesn't really support the idea that we need to push the envelope. The last video I watched I found because a link was sent to me by my cousin. The content of the video was a short clip of her new cat purring. No real access problem there. No information either. The purr did not inform me in the conventional sense. In fact, there wasn't much human speech involved at all. Nonetheless, I found the content supremely worthy of my watching time. Although my own multimedia access needs are a string of examples of this pre-solved sort, I do agree that the challenge of access to speech-based information is a serious one and will require a great deal of effort to address.
The full phrase on Ostendorf's slide read, "The goal of speech transcription is information access, not just getting the words right." But maybe it is about "getting the words right". The words referred to are, presumably, whitespace-delimited grapheme strings, lexical words, citation forms. But we can also see a word as the totality of knowledge that a human needs to possess in order to deploy it in human-to-human communication. There may be a limit to how far we can go beyond that sort of word and still remain within what is meaningful in the context of our information access needs.
We can go for prosody, for speech act, subjectivity, affect, but in the end we'll never capture the "you had to have been there" component of understanding. And already the moment of that particular purr video has passed and my next need for video content will be for a new one.