N-grams: A Collection of Text on Tech, by marlar (<a href="http://www.blogger.com/profile/11034479200143623575">Blogger profile</a>)<br /><h2 style="text-align: left;">Generative Pottery Transmogrifier: An Allegory (2023-03-11)</h2><h2 style="text-align: left;">The new machine </h2><p> The boy shifted his attention to the large screen. He had seen the word “pottery” in the opening title and wanted to follow the news report. At school, he was taking a class in the pottery studio. He liked to make things from clay. <br /><br />On the screen, people were gathered around a large, shiny box on wheels. The box radiated innovation. It was cube-like, but streamlined—like it was meant to move, like it was capable of guiding itself to its own destination. <br /><br />From one side extended a flexible silver tube with a nozzle. The nozzle now hung down gracefully, but it was clear that it was designed to produce wonders.<br /><br />One person punched a sequence of buttons on a remote control panel. The Machine rolled forward slightly and its nozzle lifted.<br /><br />After a pause it gave a delicate quiver and the nozzle launched an object made of clay. The form traversed a neat arc and landed on a nearby table. What was it?<br /><br />The camera zoomed in and the boy could see that the form was actually a coffee mug, a beautiful mug. Such a mug would have taken him several hours to make in the studio, if he could even achieve the curves, the symmetry. It was produced by the Machine in a flash, and it seemed perfect.<br /><br />The nozzle of the Machine was in motion again, raising and firing another object. Another flawless landing on the table. This one was a coffee cup with two handles. It was fascinating to put two handles on a coffee cup. 
What could the second handle be used for?<br /><br />The people in the news report were still intently examining the first mug, passing it from hand to hand and pointing out aspects of its lines and surfaces to each other.<br /><br />The boy noticed that they seemed unaware that the Machine continued to shoot out objects. A group of clay forms appeared on the table and continued to grow. Mug after creative mug: two handles, no handles, three handles, five handles. Then, it produced a cross between a mug and an elegant flower vase, and, then, three ashtrays joined like a clover. It would be too much for the people in the news report to ever have time to look at each piece.<br /><br />They didn’t seem interested in the other objects, but remained focused on the first mug and were now demonstrating something they had discovered. The mug had no bottom. One raised it to the camera and peered through it. The boy could see an eye staring from the screen at him where the bottom of the mug should have been. The others nodded with approval. <br /><br />Why were they so happy with a mug without a bottom?<br /></p><h2 style="text-align: left;">The Machine enters the studio</h2><p>The boy sat in his work station in the pottery studio. The other students were also at their stations, and they were chatting among themselves.<br /><br />The boy was impatient for the teacher to arrive and the class to start. He had an idea for what he wanted to make from clay today. Lately, his thoughts had been full of dogs. He wanted to make a water dish for a dog. It would be for a dog to drink from, but also shaped like a dog—with a head, feet and tail coming from the sides. He had never seen anyone make a dog dish in the studio, and certainly not one that actually looked like a dog.<br /><br />The door opened and the pottery teacher entered the room. She held the door wide and peered expectantly into the hall. 
The students fell silent, their gaze following the teacher’s.<br /><br />There was a low musical whir and through the door a silver tube with a nozzle became visible and it soon became evident that the tube was part of a large, shiny box. The Machine rolled into the room. It was the Machine that the boy had seen on the news. <br /><br />The school’s studio housed a few electric potter's wheels, but had never seen such advanced technology as the Machine. Its shiny metal surface, its futuristic shape, attracted admiration and fascination. <br /><br />“Students,” the pottery teacher announced, “today we begin a new form of learning. Our school has put out a huge sum for this Machine and has also found the budget to pay the monthly subscription fee. The Machine will help you learn pottery.”<br /><br />She glanced around the room. The students were listening closely. <br /><br />“You will no longer need to learn to hand-build or throw pots. Coil technique, slab technique, mold technique are no longer necessary for you to make a high quality pot. Instead, the Machine will produce an initial piece of pottery for you that is already at the level of an experienced potter, and you will learn by perfecting it. You will learn to produce masterly pieces by starting with high-level work.”<br /><br />The teacher punched some buttons on a remote control panel. She was focused on the exact sequence of buttons, and seemed to tremble with excitement.<br /><br />The Machine started to roll around the room. At each workstation, the Machine paused, trembled, and the nozzle lifted. Out of the nozzle flew a clay form, which arced through the air and landed precisely in the middle of the workstation. Each student’s eyes widened with wonder as they each received their own form to work on.<br /><br />The Machine halted at the workstation of the boy, who watched it intently. The others seemed to trust that the clay form would land on their workstation and not hit them in the face. 
He wasn’t sure how they were so certain, but he braced himself not to recoil as the Machine quivered and the nozzle lifted and fired in his direction.<br /><br />With a loud thud, his form arrived on the workstation precisely in front of him. It was a teapot. He observed it with enchantment. It was round and squat, but avoided looking heavy. Its spout was lifted proudly. He peered into the hole at the top to reassure himself it had a bottom. It did. The teapot seemed perfect.<br /><br />As he studied the teapot, he recalled that he had wanted to make a dog dish. He would have to flatten the beautiful teapot that the Machine had created for him to make a totally different form. He tried to picture how he would completely remold the clay. Tentatively he pinched at the side of the pot—but he couldn’t remember his idea for the dog dish clearly anymore. When he looked at the clay form, all he could see was a teapot.<br /><br />The teacher said that they would produce masterly pieces. Maybe the dog dish had not been a good idea. Maybe what he had really wanted all along was a teapot. Perhaps a teapot with ears.<br /><br />The boy set to work and spent the class period pinching out a jaunty pair of dog ears, one on each side of the teapot. He turned his work on his workstation and regarded it from every angle. He certainly could not have produced such a piece from scratch. Would the teacher consider it masterly?<br /><br />The pottery teacher walked by his desk, and the boy looked up.<br /><br />“Lovely piece. The ears give the teapot a sense of added lightness. You did a good job.” <br /><br />The teacher started to walk towards the next student, but then looked back.<br /><br />“Don’t forget to deposit your pot in the bin of the Machine as you leave the studio. Just open the lid of the box and throw it in. 
The Machine gives pottery to us, but it is only because we give pottery back to the Machine.”<br /></p><h2 style="text-align: left;">The Machine leaves the studio</h2><p>The boy walked into the studio and closed the door behind him. The other students were all sitting quietly at their workstations. It was too quiet. He was missing the musical whirring of the Machine. The large, shiny box was no longer standing in its corner.<br /><br />He sat down at his workstation just as the door opened again. The students expected to see their pottery teacher, but in walked the geology teacher. The teacher greeted the class and then pointed to three of the students.<br /><br />“There are some pottery supplies in the hall, please bring them in.”<br /><br />The students looked at each other in surprise, but quickly scrambled out of their seats and through the door. After a moment, they returned lugging a large, rough wooden box. <br /><br />“The Machine, as you see, is not with us anymore. Your pottery teacher has asked me to take over the lesson for the day.”<br /><br />The boy looked around and saw disappointment in the faces of the other students. The box looked like it had been standing in the garage of the geology teacher for twenty years. The outside was streaked with reddish brown. What could they learn from this box?<br /><br />The teacher motioned the students to move the box to the corner where the Machine had stood when it was not producing clay objects. <br /><br />“Today,” he announced, “I am going to teach you how to make pottery with wild clay.”<br /><br />A murmur traveled through the studio. What was “wild clay”?<br /><br />“I gathered the clay from the bank of a stream that runs behind my house. In this box you will find a bucket of clay soaking in water for each of you. Before you can use it, you need to mix it to make a smooth slip and then sieve it. You’ll find your sieve in your bucket. 
Then we will let it dry to a consistency with which you can work.”<br /><br />The students crowded around the box, removed the lid and distributed the buckets. Soon they were all back at their workstations mixing and sieving.<br /><br />“Here are cloth pillowcases,” the teacher said, moving from workstation to workstation, “take one each for the next phase.”<br /><br />The boy poured his sieved liquidy clay into his pillowcase, wrapping it around the bucket handle so that it would hang into the bucket and drip. He positioned the bucket in the middle of the workstation. It would be there waiting for him for next week’s class. He gently patted its side, sealing the promise.<br /><br />The geology teacher passed by his desk. “Looking good. What do you think?”<br /><br />The boy didn’t reply, but continued to gaze at the bucket. Finally, he looked up and asked, “What happened to the Machine?”<br /><br />“Oh, the Company that makes the Generative Pottery Transmogrifier is facing some challenges,” the teacher responded. “Their Machine works by processing thousands and thousands of pots. The Machine needs many, many pots so it can mix and match potters’ styles and produce new forms.” The boy knew the basics of how the Machine works from the news, but he hadn’t yet thought deeply about what it meant.<br /><br />“The Machine spits out clay objects that are delightful and sometimes even dazzling,” the teacher elaborated. “However, it cannot create a pot from formless clay. It needs to start with pots that have been created by potters. All these pots get thrown into its bin, and the machine mixes and matches their functions and their shapes. Plus the Machine consumes whatever you yourself produce. You also feed the Machine by throwing your own work into its bin.<br /><br />“In the beginning the Company managed to get all of these pots for free,” the teacher went on. The boy listened closely. “But this is changing. Potters from far and wide have joined forces. 
They are pointing out that they were not being paid for the time and effort they had invested in making the pots used to feed the Machine. They had not intended their work to be used in this way.” </p><p>The boy lightly bit his lower lip. He thought he could understand the rage and frustration of the potters.<br /><br />“They pressured the Company to publicly acknowledge that the Machine would not function without potters who make pots,” said the teacher. “The general public grew disenamored with the Machine. Respected museums and galleries released reports on how the Machine was consuming rare pottery from Africa, from the precolonial Americas, from neolithic Europe and Asia. The Company was forced to act.”<br /><br />The teacher sighed. “The Company is currently raising the price of the subscription for the Machine as it tries to respond to the public outcry. Our school can’t afford the Machine any more after the latest price increase. Your pottery teacher is with the school principal at this moment trying to get a refund for the original Machine. Don’t worry, she’ll be back for the next class.”<br /><br />The boy considered what the teacher had said. After a moment, he remarked, “We’re lucky there’s a stream running behind your house.” <br /><br />“Yes. I like my stream. There’s enough clay there to make anything you like. We certainly won’t miss the Machine.” He paused before adding, “And there are enough potters around the globe to make whatever we want and need. The world wouldn’t miss the Machine either.” <br /><br />“I want to be a potter when I grow up,” said the boy.<br /><br />The boy saw encouragement in the smile the teacher gave him before moving on to the next student.<br /><br />Standing up to move to the studio sink, the boy realized it would take him longer than usual to get his hands clean. There was nothing to be done about his clothes. He would simply go to his next class still splotched with reddish brown. 
He didn’t mind.<br /><br />He glanced at the rough wooden box that held the wild clay supplies standing in the corner where it had replaced the Machine. He couldn’t imagine a group of people on a news report gathered excitedly around this wooden box, like they had gathered around the silver, streamlined, musically purring Machine. <br /><br />He hoped that the wild clay would remain part of pottery class. While he mixed and sieved the wild clay slip, he understood where pottery comes from. He felt a connection to other potters who had done the same in the past, for thousands of years. Preparing the wild clay raised images in his mind of the pieces he could make in the future. <br /><br />The boy shook his head. Why would a Machine make a mug without a bottom, even if a potter was waiting to fix it?<br /><br />If he wanted to re-form an already formed teapot, the boy reflected, he could still do that without the Machine. One of the other students could make a teapot and he would add the dog ears. He smiled to himself. Working together in this way, they would be very sure that the finished piece was theirs and theirs to keep. <br /></p><h2 style="text-align: left;">Why should recommender system researchers care about platform policy? (2020-10-02)</h2><p>In this post, I reflect on why recommender system researchers should
care about platform policy. These reflections are based on a talk I gave
last week at the Workshop on Online Misinformation- and Harm-Aware Recommender Systems (OHARS 2020) at <a href="https://recsys.acm.org/recsys20/" target="_blank">ACM RecSys 2020</a>, which was entitled "Moderation Meets Recommendation: Perspectives on the
Role of Policies in Harm-Aware Recommender Ecosystems." <br /></p><p>Every online platform has a policy that specifies what is and what is
not allowed on the platform. Platform users are informed of the policy
via platform guidelines. All major platforms have guidelines, e.g., <a href="https://www.facebook.com/communitystandards" target="_blank">Facebook Community Standards</a>, <a href="https://help.twitter.com/en/rules-and-policies" target="_blank">Twitter Rules and Policies</a>, <a href="https://help.instagram.com/477434105621119" target="_blank">Instagram Community Guidelines</a>. Amazon's guidelines are sprawling and a bit more difficult to locate, but can be found at pages like <a href="https://sellercentral.amazon.com/gp/help/external/11621?language=en-US" target="_blank">Amazon Product Guidelines</a> and <a href="https://sellercentral.amazon.com/gp/help/external/help-page.html?itemID=200164330" target="_blank">Amazon restricted products</a>. <br /></p><p>Policy
is important because it is the language in which the platform and users
communicate about what constitutes harm and needs to be kept off the
platform. Communicating via policy, which is expressed in everyday
language, ensures that everyone can contribute to the discussion of what
is and is not appropriate. Communication via technical language or
computer code would exclude people from the discussion. The language of
policy offers the possibility (one that should be used more often) for us to reach consensus on what is appropriate. It
also acts as a measuring stick to make specific judgements in specific
cases, which is necessary in order to enforce that consensus completely and
consistently. <br /></p><h4 style="text-align: left;"><b>Policy is closer to recommender system research than we realize<br /></b></h4><p>On
the front lines of enforcing platform policy are platform
moderators. Moderation is human adjudication of content on the basis of
policy. Moderators keep inappropriate content off the platform. (Read
more about moderation in Sarah T. Roberts' <a href="https://www.behindthescreen-book.com" target="_blank">Behind the Screen</a> and Tarleton Gillespie's <a href="http://www.custodiansoftheinternet.org" rel="nofollow">Custodians of the Internet</a>.)</p><p>Historically, there has been a separation between moderators and the online platforms that they patrol. Moderators are often contractors, rather than regular employees. It is easy to develop the habit of placing both responsibility for policy enforcement and the blame for enforcement failure outside of the platform (which would also make it distant from the recommender algorithms). An example of such distancing occurred this summer, when Facebook failed to remove a post that encouraged people with guns to come to Kenosha in the wake of the <a href="https://en.wikipedia.org/wiki/Shooting_of_Jacob_Blake" rel="nofollow">shooting of Jacob Blake</a>. The <a href="https://www.washingtonpost.com/technology/2020/08/28/facebook-kenosha-militia-page" target="_blank">Washington Post reported</a> that Zuckerberg said: "The contractors, the reviewers who the initial complaints were funneled to, didn’t, basically, didn’t pick this up." He refers to "the contractors", implicitly holding moderators at arm's length from Facebook. It is important that we as recommender system researchers resist absorbing this historic separation between "them" and "us".<br /></p><p>Recommender system researchers, as computer scientists, live by the wisdom of GIGO (Garbage In Garbage Out). In order to produce harm-free lists of recommended items, we need an underlying item collection that does not contain harmful items. This is achieved via policy, and the help of moderators enforcing policy.<br /></p><p>Second, recommender systems are <i>systems</i>. Recommender system research understands them not only as systems, but as <i>ecosystems</i>, encompassing both human and machine components. 
When we think of the human component of recommender systems we generally think of users. However, moderators are also a part of the larger ecosystem, and we should include them and their important work in our research.</p><h4 style="text-align: left;">Connecting recommendation and moderation opens new directions for research <br /></h4>Currently, most of the interest in moderation has been around how to combine human judgement and machine learning in order to quickly, and at large scale, decide what needs to be removed from the platform. At
the end of the talk at the workshop, I introduced a case study of a
system that can
translate the nuanced judgments of moderators into automatic
classifiers. I discussed the potential of these classifiers for helping
platforms to keep up with the fast change of content and quickly
evolving policy. The work has not yet been published, but is currently
still under preparation (hope to be able to add a reference here at some later point). <br /><div><p>However, not all policy enforcement involves removal. Some examples of how platform policy interacts with ranking are mentioned in the recent Wired article <a href="https://www.wired.com/story/youtube-algorithm-silence-conspiracy-theories/" rel="nofollow">YouTube's Plot to Silence Conspiracy Theories</a>. It is worth noting that even if downranking can be largely automated it is important to keep human eyes in the loop to ensure that the algorithms are having their intended effects. We should strive to understand how this collaboration can be designed to be most effective.<br /></p><p>Finally, I will mention that together with Manel Slokom, I have previously proposed the concept of <a href="https://impactrs19.github.io/papers/short5.pdf" target="_blank">hypotargeting for recommender systems</a> (hyporec), a recommender system algorithm that produces a constrained number of recommended lists (or groups, sets, sequences). Such an algorithm would make it easier to enforce platform policy not only for individual items, but also for associations between items (which are created when the recommender produces a group, list or stream of recommendations).<br /></p><p>In order to understand the argument for hypotargeting, consider the following observation: There is a difference between a situation in which I view one conspiracy book online as an individual book, and a situation in which I view one book online and am immediately offered a discount to purchase a set of three books promoting the same conspiracy. 
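The hypotargeting idea can be sketched in a few lines of code. This is a hypothetical illustration, not the algorithm from the cited paper: it assumes a simple setup in which recommendations are drawn from a small pool of precomputed lists, one per user prototype (e.g. a cluster centroid), so that a moderator can review every list, and every item association it creates, before it is ever served. All function and variable names here are invented for the sketch.

```python
# Sketch of "hypotargeting" (hyporec): serve one of a small, fixed pool
# of recommendation lists instead of a unique list per user, so every
# list shown to users can be reviewed by a human moderator.

def build_list_pool(prototype_scores, list_length):
    """Precompute one recommendation list per user prototype.

    prototype_scores: {prototype_id: {item_id: score}} -- scores for a
    handful of user prototypes, not for every individual user.
    """
    pool = {}
    for proto_id, scores in prototype_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        pool[proto_id] = tuple(ranked[:list_length])
    return pool

def recommend(user_profile, prototypes, pool):
    """Serve the pooled list of the most similar prototype (dot product)."""
    def similarity(proto_id):
        proto = prototypes[proto_id]
        return sum(user_profile.get(i, 0.0) * w for i, w in proto.items())
    best = max(pool, key=similarity)
    return pool[best]

def lists_pending_review(pool, approved):
    """Moderation hook: the pool is small enough to inspect exhaustively."""
    return [lst for lst in pool.values() if lst not in approved]
```

Because the pool contains tens of lists rather than one list per user, judging the associations between co-recommended items becomes a tractable manual task.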
<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-MhTTg7EDI-SOOxJTbEF8Da4E7bEHOB1SE0gxA4PgEQhNhbnnFS_SGKBSc0zy8ptmSOvxCbicPPGJOuO2jYxt3UcDPeUQqpNPWQZIafraZDGWGaT4XofOs9P71euT_KOTP6lkBKFmSpw/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="379" data-original-width="1347" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-MhTTg7EDI-SOOxJTbEF8Da4E7bEHOB1SE0gxA4PgEQhNhbnnFS_SGKBSc0zy8ptmSOvxCbicPPGJOuO2jYxt3UcDPeUQqpNPWQZIafraZDGWGaT4XofOs9P71euT_KOTP6lkBKFmSpw/w446-h125/20201001QAnon.png" width="446" /></a></div>The difference lies in the impact that the recommender has on the user. Associations of items can be easily interpreted as "a trail of crumbs"
leading the user to assume broader supporting evidence for an idea
than is actually justified. If the recommender produced a constrained number of sets, it would be easier to review them manually, and to make the subtle judgement of whether it is appropriate to be incentivizing purchase of these items.<br /><p></p><p>Ultimately these ideas open new possibilities for policy as well: the e-commerce site should be transparent not only about which items they remove, but also about the items they prevent from occurring together in lists, groups, or streams.<br /></p>There are no silver-bullet solutions to the problem of harm caused by recommender systems. However, it
does seem like there is a great deal of potential in researching algorithms that can be steered
by humans in order to enforce policy.<p><br /></p><br /></div><h2 style="text-align: left;">Three Laws of Robotic Language (2020-08-10)</h2><div><div><i>This post is a draft on which I am currently eliciting feedback. Changes may be made in the future. </i><br /><br />Artificial Intelligence that can produce language is improving in leaps and bounds (cf. the recent GPT-3 as reported on, e.g., in <a href="https://www.economist.com/science-and-technology/2020/08/08/a-new-ai-language-model-generates-poetry-and-prose" target="_blank">The Economist</a>). However, it is still early enough to think seriously about how we should guide the development of language AI in order to maintain influence over the large-scale, long-term effects of automatic language generation. Asimov’s <a href="https://en.wikipedia.org/wiki/Three_Laws_of_Robotics" target="_blank">Three Laws of Robotics</a> have inspired AI research towards conscious design choices during the early stages of new AI technologies. Parallel to these laws, this post proposes Three Laws of Robotic Language. We understand robotic language as language (written, spoken, or signed) that was generated partially or entirely by an automatic system. Because such a system can be seen as a machine engaging in a conventionally human activity, we refer to it as a language robot. These laws are intended to support researchers developing AI for natural language generation. The laws are formulated to help lay a solid foundation for what is to come by inspiring careful reflection about what we need to get right from the beginning, and the mistakes we need to avoid.</div><div style="text-align: left;"><h4>The Three Laws of Robotic Language </h4></div><b>First Law: </b>A language robot must declare its identity. 
<br /></div><div style="text-align: left;"><b>Second Law: </b>A language robot’s identity must be easy to verify. <br /><b>Third Law:</b> A language robot’s identity must be difficult to counterfeit.<br /><br /><h3>Practical benefits</h3><p style="text-align: left;">Adopting these Three Laws would support desirable practical properties of robotic language as its use becomes more widespread:<br /></p><ul style="text-align: left;"><li>People (readers, consumers) will be able to identify content as robotic language (as opposed to language produced by other people) without relying on sophisticated technology.</li><li>People will be able to confirm the source of the content without relying on sophisticated technology.</li><li>Entities (organizations, companies) that generate high-quality, reliable robotic language can be sure that consumers can recognize and trust their content. </li><li>Entities that generate robotic language can more easily ensure that they don’t unwittingly train their language generation systems on previously generated robotic language.<br /></li></ul></div><div style="text-align: left;">Like the Three Laws of Robotics, these laws depend on adoption by the people and organizations that develop and control technology. For many, the practical properties delivered by the laws will be convincing enough. For others, it will be important to understand the link between these laws and the nature of human language, which is explained next.<br /></div><div style="text-align: left;"><br /><h3>Moving robotic language towards human language</h3><p style="text-align: left;">Currently, the success of robotic language is judged by its ability to fool a reader into mistaking it for language generated by a human. This criterion seems sensible for judging individual sentences, paragraphs or documents. 
Adopting this criterion implies that we, effectively, regard human language as the generation and exchange of sequences of words and that we consider the aim of language robots to be approximating these sequences. However, if we look at the larger picture of how people actually use language, we see that language goes beyond word sequences. What interests us here is how language conveys the connection between the creator (i.e., who is speaking or writing) and the language content that they create (i.e., what is spoken or written). The Three Laws of Robotic Language state that when language robots generate language content, information about the creator must be inextricable from that content. Adding the criterion of creator-content inextricability should not be considered a nice-to-have functionality that can optionally be added to language robots at some future point. Rather, this feature must be planned from the beginning, before language robots establish themselves as a major source of the language content that we consume. <br /></p></div><div style="text-align: left;">For some, the idea that the connection between creator and content is an important part of language is surprising. It is not, however, radically new, but rather an observation, perhaps so obvious that it is easily overlooked. Think about speaking to a baby or an animal: they react to the you-ness of your voice, although they might not understand your words. Our voices identify us as individuals. On top of that, when we hear a voice we may not recognize the specific person speaking, but we still hear something about them. Speech is produced by the human body, and is given its form by our mouths and nasal cavities. Our voices identify something about us, e.g., how big we might be. The Three Laws of Robotic Language are, at their root, a proposal to give language robots a “sound of voice” that would carry information about the origin of the language content that they produce. 
Language robots must identify themselves, or at least reveal enough about themselves so that it is clear (without the need for sophisticated technology) that they are robots.<br /><br />In order to better grasp why the inextricability of creator from content is a fundamental characteristic of human language, it helps to look back in time. Throughout most of the history of language, speech could not exist independently of a speaker (and sign language could not exist independently of a signer). It was impossible to decouple the words and the source of the words. It is only with the rise of written language that we have the option of breaking the content-creator association, allowing language content to float free of the person who produced it. Most recently, speech synthesis or sign synthesis can also disassociate the speaker from what is spoken. This possibility of content-without-creator now feels so natural to us that it is hard to imagine that it was not originally a property of human language. However, the age of speech-only language was tens of thousands of years (possibly more) longer than the current era of written language. It may seem strange from the perspective of today, but the original state of human language is one of inextricability: speech could not exist without a speaker. </div><div style="text-align: left;"> </div><div style="text-align: left;">In short, we know that language works well with inextricability: that’s the way in which human language was originally developed and used. 
For this reason, the Three Laws of Robotic Language should not be considered an unnatural imposition, but rather a gentle requirement that language robots behave in a way that is closer to the original state of language.</div><div style="text-align: left;"><br /><h3 style="text-align: left;">An important design choice</h3><p style="text-align: left;">It is important to note that when humans use language they creatively manipulate the connection between who is creating and what is created. We imitate others’ voices. We quote other people. We love the places where we can yell and hear our voices echoed back. Once written language introduced the possibility of extricating the creator from language content, we started to take advantage of the option of hiding our identities: we use pen names and we write anonymous messages. The Three Laws of Robotic Language constrain the ability of language robots to engage in these kinds of activities. For example, the laws prevent them from generating anonymous content or producing imitations that are impossible to detect.<br /></p></div><div style="text-align: left;">At first consideration, it seems that the Three Laws of Robotic Language represent an unnecessary hindrance or constraint. However, human language is characterized by strong constraints. On further thought, it becomes clear that language robots need to be subject to some form of constraint if they are to interact productively with naturally constrained humans over the long run in large-scale information spaces. <br /> </div><div style="text-align: left;">The constraints on human language are human mortality and limited physical strength. When we focus on a small, local scale, thinking about individual texts and short periods of time, we risk overlooking these constraints. However, they are there and their effect is important.<br /><br />First, think about human mortality: A given person can produce and consume only so many words in their lifetime. 
Our deaths represent a hard limit, and force us to choose, over the course of our existence, what we say and what we don’t say, what we listen to and read, and what we don’t. A language robot needs shockingly little time to generate the same amount of language content that a human would produce (or could consume) in a lifetime. <br /><br />Second, think about human physical strength. Language is the means by which humans as a species have pooled their physical strength. Language allows us to engage in coordinated action towards a common goal. We use language to convince other people to adopt our opinions or follow our plans. The power of our language to convince is limited by our physical ability to act consistently with our opinions or to contribute to carrying out our plans. People speaking empty words put themselves at risk of ostracism or physical harm. A language robot can generate language that is finely tuned to be convincing, and is unconstrained by the need to follow up words with action. Language robots risk nothing.<br /><br />Considering again Asimov’s Three Laws of Robotics, human mortality and limited physical strength are what make the laws necessary in the face of robots with superior strength and stamina. The laws level the playing field, so to speak. The Three Laws of Robotic Language serve a similar function. They do not protect humans as directly as Asimov’s laws. However, they make the actions of language robots traceable, which provides a lever that allows humans to maintain influence on the large-scale, long-term impact of robotic language on our information sphere.<br /><br />At this point, we don't know enough to predict this influence exactly. What is clear, however, is that we need some kind of constraint. It is also clear, as argued above, that the Three Laws of Robotic Language are consistent with a functioning form of human language, which is actually its original form. 
Further, we know that the laws have some already-obvious advantages. Recall from above the desirable practical properties: inextricability delivers convenience, i.e., following the Three Laws of Robotic Language will prevent AI researchers from inadvertently training language robots on automatically generated text, causing feedback loops (resulting, possibly, in systems drifting away from human-interpretable syntax and semantics). Further, as we struggle to gain control of malicious bots and disinformation online, it would be helpful if language robots with honorable intent declared themselves. Inextricability would make it easier to build a case against ill-intentioned actors. <br /> </div><div style="text-align: left;">The Three Laws of Robotic Language are not a silver-bullet solution, but rather a well-informed design choice. Currently, AI researchers have defaulted to the <i>extricability</i> of creator from content. The Three Laws will already be a success if they inspire AI researchers to pause and consider whether <i>inextricability</i>, rather than extricability, should be considered the
default choice for systems that automatically generate natural language (text,
speech and sign).</div><div style="text-align: left;"> <h3 style="text-align: left;">An example</h3><p style="text-align: left;">Let’s consider a language robot that generates text sentences. We will call this language robot DP-bot, because it declares its identity by upholding the double prime (DP) rule with every sentence that it produces. The language robot can generate the sentence:<br /></p></div><div style="text-align: left;"> <i>We adore language.</i><br /><br />The double prime rule states that a prime number of letters must occur a prime number of times in a sentence. The rule is upheld by this sentence since ‘e’,’a’,’g’ (3 and only 3 letters; 3 being a prime number) each occur in the sentence a prime number of times (3, 3, and 2 times respectively; 2 and 3 being prime numbers).<br /><br />This sentence expresses the same sentiment:<br /><br /> <i>We love language.</i><br /><br />The sentence, however, does <i>not</i> respect the double prime rule. ‘e’,’a’,’g’ all occur a prime number of times (3, 2, and 2 times respectively), but ‘l’ also occurs 2 times. This means that 4 letters occur a prime number of times (4 not being a prime number). <br /><br />At first consideration, it may seem that DP-bot is a bit too constrained in the semantics that it can express, since the match in meaning between the two sentences is approximate. However, if sentences get longer, or if the rule is defined to apply at a higher level (e.g., paragraph and not the sentence level), it will be easier to encode semantics into a text that respects the double prime rule without burdensome constraints.<br /><br />DP-bot upholds the First Law of Robotic Language in that all language content generated by DP-bot respects the double prime rule and is thus identifiable as having been generated by DP-bot. DP-bot upholds the Second Law because it is easy to validate that a sentence respects the double prime rule. 
The only knowledge that is needed for validation is the natural language sentence that states the double prime rule, i.e., “a prime number of letters must occur a prime number of times in a sentence”. DP-bot does <i>not</i> do very well with the Third Law, since it is easy to create a sentence that respects the double prime rule, thereby counterfeiting DP-bot language. Even manually constructing a sentence that complies with the double prime rule is not difficult. Currently, we are working on formulating rules that are more sophisticated than the double prime rule and that require a large amount of computational power or specialized training data in order to embed them into natural language sentences. <br /><br />Note that the language robot DP-bot produces text that encodes a mark, but that this mark is not a watermark. Let’s call it a sourcemark, since it marks a language robot as having been the source of the text. A watermark is also a pattern that is embedded into content, like text or an image. Its purpose is to identify ownership. A watermark is designed to be robust to change. For example, if a text is paraphrased or excerpted the mark should still remain. A sourcemark, however, is meant to identify the original text and associate it with a creator (the source). A small change in text might compromise the meaning, e.g., <i>We do not adore language.</i> A creator can no longer claim responsibility for text once it has changed, and should not be identified with the changed text. Unlike a watermark, a sourcemark must disappear when the text has been changed.<br /><br />Note that the double-prime rule has nothing to do with encryption. Prime numbers are used because they are a relatively small set of numbers that are easy to describe. 
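To make concrete how lightweight this validation is, here is one possible sketch of a double prime checker. (The function names and the choice to lowercase the text and ignore non-letter characters are my own assumptions, not part of the rule as stated.)

```python
from collections import Counter

def is_prime(n):
    """Primality check by trial division; fine for letter counts."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def upholds_double_prime(sentence):
    """True if a prime number of letters each occur a prime number of times."""
    counts = Counter(ch for ch in sentence.lower() if ch.isalpha())
    letters_with_prime_count = [ch for ch, n in counts.items() if is_prime(n)]
    return is_prime(len(letters_with_prime_count))

upholds_double_prime("We adore language.")  # True: 'e', 'a', 'g' -> 3 letters
upholds_double_prime("We love language.")   # False: 'e', 'a', 'g', 'l' -> 4 letters
```

Running the checker on the two example sentences reproduces the analysis above: the first sentence upholds the rule and the second does not.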
If the rule can be expressed in a single sentence, “a prime number of letters must occur a prime number of times in a sentence”, then it is easy to confirm the rule without any sophisticated technology, such as a machine learning classifier or a key (with enough patience it can be done without even using a computer). If we used a form of encryption, the ability to verify the identity of a language robot would be restricted to the subset of people who have the appropriate technology (requiring software installation and maintenance, computation, passing of keys).<br /><br />Following the Three Laws of Robotic Language means designing language robots that embed sourcemarks in all the content that they generate. Here, we have presented a simple (and not yet completely successful) example of a sourcemark. We expect that any number of sourcemarks could be developed. An interesting overall property is that even if we do not have knowledge of the presence of a sourcemark, computing some simple statistics could reveal the difference between marked and unmarked language content. This signal would reflect a “suspected” language robot, and trigger deeper investigation. As further sourcemarks are developed, desirable properties of marks going beyond the Three Laws of Robotic Language can be innovated. <br /></div>marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-50294843100728991472019-11-11T11:08:00.000+01:002019-11-11T15:08:57.649+01:00Reflections on Discrimination by Data-based Systems<i>A student wrote to ask if he could interview me about discrimination in text mining and classification systems. He is working on his bachelor thesis, and plans to concentrate on gender discrimination. I wrote him back with an informal entry into the topic, and posted it here, since it may be of more general interest.</i> <br />
<br />
Dear Student,<br />
<br />
Discrimination in IR, classification, or text mining systems is caused by the mismatch between what is assumed to be represented by data and what is helpful, healthy and fair for people and society.<br />
<br />
Why do we have this mismatch and why is it so hard to fix?<br />
<br />
Data is never a perfect snapshot of a person or a person's life. There is no single "correct" interpretation inherent in data. Worse, data creates its own reality. Let's break it down.<br />
<br />
<b>Data keeps us stuck in the past</b>. Data-based systems make the assumption that predictions made for use in the future can be meaningfully based on what has happened in the past. With physical science, we don't mind being stuck in the past. A ballistic trajectory or a chemical reaction can indeed be predicted by historical data. With data science, when we build systems based on data collected from people, shaking off the past is a problem. Past discrimination perpetuates itself, since it gets built into predictions for the future. Skew in how data points are collected also gets built into predictions. Those predictions in turn get encoded into the data and the cycle continues.<br />
<br />
In short, the expression "it's not rocket science" takes on a whole new interpretation. Data science really is <b>not</b> rocket science, and we should stop expecting it to resemble physical science in its predictive power.<br />
<br />
<b>Inequity is exacerbated by information echo chambers</b>. In information environments, we have what are known as rich-get-richer effects, i.e., videos with many views gain more views. This means that small initial tendencies are reinforced. Again, the data creates its own reality. There is a difference between data collected in online environments and data collected via a formal poll.<br />
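The rich-get-richer dynamic is easy to simulate. The sketch below is purely illustrative (the numbers are invented, not data from any real platform): each new view goes to a video with probability proportional to its current views, and one video starts with a single extra view.

```python
import random

def simulate(views, steps, rng):
    """Each new view goes to a video with probability proportional to its views."""
    views = list(views)
    for _ in range(steps):
        winner = rng.choices(range(len(views)), weights=views)[0]
        views[winner] += 1
    return views

rng = random.Random(0)
# Video A starts with 2 views, video B with 1: a tiny initial tendency.
runs = [simulate([2, 1], 1000, rng) for _ in range(500)]
share_a = [a / (a + b) for a, b in runs]
wins_a = sum(s > 0.5 for s in share_a)  # runs in which A ends up ahead
```

In most simulated runs the video with the one-view head start ends up with the majority of views, and the final shares vary wildly from run to run: the small initial tendency, rather than any underlying difference in quality, drives the outcome.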
<br />
Other important issues:<br />
<br />
<b>"Proxy" discrimination</b>: for example, when families move, they tend to follow the employment opportunities of the father and not the mother. The trend can be related to the father often earning more because he tends to be just a bit older (more work experience) and also tends to have spent less time on pregnancy and kid care. This means that the mother's CV will be full of non-progressive job changes (i.e., gaps or changes that didn't represent career advancement), and gets down-ranked by a job candidate ranking function. The job ranking function generalizes across the board over non-progressive CVs, and does not differentiate between the reasons that the person was not getting promoted. In this case, this non-progressiveness is a proxy for gender, and down-ranking candidates with non-progressive CVs leads to reinforcing gender inequity. Proxy discrimination means that it is not possible to address discrimination by looking at explicit information; implicit information also matters.<br />
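A toy illustration of the proxy effect (all candidates, numbers, and the scoring rule are invented for illustration): even though gender is never given to the ranking function, scoring CVs by career progression reproduces a gender gap whenever non-progressive CVs are unevenly distributed across genders.

```python
# Hypothetical candidates: (gender, number of non-progressive job changes).
# Gender is recorded here only to audit the outcome; the ranking
# function below never sees it.
candidates = [
    ("f", 3), ("f", 2), ("f", 2), ("f", 0),
    ("m", 1), ("m", 0), ("m", 0), ("m", 2),
]

def score(non_progressive_changes):
    """Toy ranking function: penalize each non-progressive job change."""
    return 10 - 2 * non_progressive_changes

def mean_score(gender):
    scores = [score(n) for g, n in candidates if g == gender]
    return sum(scores) / len(scores)

gap = mean_score("m") - mean_score("f")  # positive: men score higher on average
```

The function never sees the gender column, yet its output correlates with gender because the feature it does see acts as a proxy. Removing explicit attributes is therefore not sufficient to remove discrimination.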
<br />
<b>Binary gender</b>: When you design a database (or database schema) you need to declare the variable type in advance, and you also want to make the database interoperable with other databases. Gender is represented as a binary variable. The notion that gender is binary gets propagated through systems regardless of whether people actually map well to two gender classes. I notice a tendency among researchers to assume that gender is somehow a super-important variable contributing to their predictions just because it seems easy to collect and encode. We give importance to the data we have, and forget about other, perhaps more relevant data that are not in our database.<br />
<br />
<b>Everyone's impacted:</b> We tend to focus on women when we talk about gender inequity. This is because the examples of gender inequity that threaten life and limb tend to involve women, such as <a href="https://www.theguardian.com/lifeandstyle/2015/apr/30/fda-clinical-trials-gender-gap-epa-nih-institute-of-medicine-cardiovascular-disease">gender gaps in medical research</a>. Clearly action needs to be taken. However, it is important to remember that everyone is impacted by gender inequity. When a lopsided team designs a product, we should not be surprised when the product itself is also lopsided. As men get more involved in caretaking roles in society, they struggle against pressure to become "Supermom", i.e., fulfill all the stereotypical male roles, <b>and</b> at the same time excel at the female roles. We should be careful, while we are fixing one problem, not to fully ignore, or even create, another.<br />
<br />
I have put a copy of the book <a href="https://weaponsofmathdestructionbook.com/">Weapons of Math Destruction</a> in my mailbox
for you.
You might have read it already, but if not, it is essential reading for
your thesis.<br />
<br />
From the recommender system community in which I work, check out:<br />
<br />
Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda
Mehrpouyan, and Daniel Kluver. 2018. <a href="https://dl.acm.org/citation.cfm?id=3240373">Exploring author gender in book rating and recommendation</a>. In <i>Proceedings of the 12th ACM Conference on Recommender Systems</i> (RecSys '18). ACM, New York, NY, USA, 242-250.<br />
<br />
and also our own recent work, which has made me question the importance of gender for recommendation. <br />
<br />
Christopher Strucks, Manel Slokom, and Martha Larson, BlurM(or)e: Revisiting Gender Obfuscation in the User-Item Matrix. In Proceedings of the Workshop on Recommendation in Multistakeholder Environments (RMSE) Workshop at RecSys 2019. <br />
<a href="http://ceur-ws.org/Vol-2440/short2.pdf">http://ceur-ws.org/Vol-2440/short2.pdf</a> <br />
<br />
Hope that these comments help with your thesis.<br />
<br />
Best regards,<br />
Martha<br />
<br />
P. S. As I was about to hit the send button Sarah T. Roberts posted a thread on Twitter. I suggest that you read that, too.<br />
<a href="https://twitter.com/ubiquity75/status/1193596692752297984">https://twitter.com/ubiquity75/status/1193596692752297984 </a><br />
<br />marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-66311039710579101962019-11-10T21:59:00.000+01:002019-11-10T21:59:17.966+01:00The unescapable (im)perfection of data<a href="https://commons.wikimedia.org/wiki/File:Serpiente_alquimica.jpg" title="anonymous medieval illuminator; uploader Carlos adanero [Public domain], via Wikimedia Commons"><img alt="Serpiente alquimica" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/Serpiente_alquimica.jpg/512px-Serpiente_alquimica.jpg" width="300" /></a><br />
<br />
In data science, we often work with data collected from people. In the field of recommender system research, this data consists of ratings, likes, clicks, transactions, and potentially all sorts of other quantities that we can measure: dwell time on a webpage, or how long someone watches a video. Sometimes we get so caught up in creating our systems that we forget the underlying truth:<br />
<br />
Data is unescapably imperfect. <br />
<br />
Let's start to unpack this with a simple example. Think about a step counter. It's tempting to argue that this data is perfect. The step counter counts steps and that seems quite straightforward. However, if you try to use this information to draw conclusions, you
run into problems: How accurate is the device? Do the steps reflect a systematic failure to
exercise, or did the person just forget to wear the device? Were they
just feeling a little bit sick? Are all steps the same? What if the person was walking uphill? Why was the person wearing the step
counter? How were they reacting to wearing it? Did they do more steps
because they were wearing the counter? How were they reacting to the goal for which the data was to be used? Did they decide to artificially
increase the step count (by paying someone else to do steps for them)?<br />
<br />
In this simple example, we already see the gaps, and we see the circle: collecting data influences data collection. The collection of
data actually creates patterns that would not be there if the data
were not being collected. In short, we need more information to interpret the data, and ultimately the data folds back upon itself to
create patterns with no basis in reality. It is important to understand that this is not some exotic, rare state of data that can be safely ignored in day-to-day practice (like the fourth state of water). Let me continue until you are convinced that you cannot escape the imperfection of data.<br />
<br />
Imagine that you have worked very hard and have controlled the gaps in your data, and done everything to prevent feedback loops. You use this new-and-improved data to create a data-based system, and this
system makes marvelous predictions. But here's the problem: the minute that people start acting
on those predictions the original data becomes out of date. Your original data is no longer consistent with a world in which your data-based system also exists. You are stuck with a sort of Heisenberg's Uncertainty Principle: either you get a short stretch of data that is not useful
because it's not enough to be statistically representative of reality,
or a longer stretch of data, which is not useful because it encodes the
impact of the fact that you are collecting data, and making predictions on the basis of what you have collected. <br />
<br />
So basically, data eats its own tail like the Ouroboros (image above). It becomes itself. As science fictiony as that might sound, this issue
has practical implications that researchers and developers deal with (or
ignore) constantly. For example, in the area of recommender system
research in which I am active, we constantly need to deal with the fact that people are interacting with items on a platform, but the items are being presented to them by a recommender system. There is no reality not influenced by the system.<br />
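This loop can be made concrete with a deterministic toy model (the click counts and the popularity rule are invented for illustration): a popularity-based recommender always shows the most-clicked item, and users can only click what they are shown.

```python
# Suppose true user interest is nearly uniform over three items, but the
# starting click log is slightly uneven (e.g., due to presentation order).
clicks = [5, 4, 3]

def recommend(clicks):
    """Popularity recommender: show the most-clicked item."""
    return clicks.index(max(clicks))

for _ in range(100):
    shown = recommend(clicks)
    clicks[shown] += 1  # users can only click what they are shown

share_of_top_item = clicks[0] / sum(clicks)
```

After 100 rounds the log shows the first item with over 90% of the clicks, although nothing about the users changed. The log now describes the recommender at least as much as it describes the users.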
<br />
The other way to see it is that data is unescapably perfect. Whatever the gaps, whatever the nature of the feedback loops, data faithfully captures them. But if we take this perspective, we no longer have any way to relate data to an underlying reality. Perfection without a point.<br />
<br />
And so we are left with unescapable.marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-85581658668269455412018-04-14T21:13:00.004+02:002019-12-23T05:12:38.037+01:00Pixel Privacy: Protecting multimedia from large-scale automatic inferenceThis post introduces the Pixel Privacy project, and provides related links. This week's <a href="https://www.nytimes.com/2018/04/11/technology/facebook-privacy-hearings.html">Facebook congressional hearings</a> have made us more aware of how easily our data can be illicitly acquired and used in ways beyond our control or our knowledge. The discussions around Facebook have been focused on textual and behavioral information. However, if we think forward, we should realize that now is the time to also start worrying about the information contained in images and videos. The Pixel Privacy project aims to stay ahead of the curve by highlighting the issues and possible solutions that will make multimedia safer online, <i>before</i> multimedia privacy issues start to arise.<br />
<br />
The Pixel Privacy project is motivated by the fact that today's computer vision algorithms have super-human ability to "see" the contents of images and videos using large-scale pixel processing techniques. Many of us are aware that our smartphones are able to organize the images that we take by subject material. However, what most of us do not realize is that the same algorithms can infer sensitive information from our images and videos (such as location) that we ourselves do not see or do not notice. Even more concerning than automatic inference of sensitive information is large-scale inference. Large-scale processing of images and video could make it possible to identify users in particular victim categories (cf. cybercasing [1]).<br />
<br />
The aim of the Pixel Privacy project is to jump-start research into technology that alerts users to the information that they might be sharing unwittingly. Such technology would also put tools in the hands of users to modify photos in a way that protects them without ruining them. A unique aspect of Pixel Privacy is that it aims to make privacy natural and even fun for users (building on work in [2]).<br />
<br />
The Pixel Privacy project started with a 2-minute video:<br />
<br />
<iframe allow="autoplay; encrypted-media" allowfullscreen="" frameborder="0" height="157" src="https://www.youtube.com/embed/fEXEI-CI19g" width="280"></iframe><br />
<br />
The video was accompanied by a 2-page proposal. In the next round, I gave a 30-second pitch followed by rapid-fire QA. The result was winning one of the <a href="http://www.stw.nl/en/content/open-mind">2017 NWO TTW Open Mind Awards</a> (Dutch).<br />
<br />
Related links:<br />
<ul>
<li>The project was written up as "Change Perspective" feature on the website of Radboud University, my home institution: Big multimedia data: Balancing detection with protection (unfortunately, the article was deleted after a year or so).</li>
<li>The project also has been written up by Bard van de Weijer for Volkskrant in a piece with the title "Digital Privacy needs to become second nature". (In Dutch: "<a href="https://www.volkskrant.nl/media/digitale-privacy-moet-onze-tweede-natuur-worden%7Ea4544125/">Digitale privacy moet onze tweede natuur worden</a>")</li>
</ul>
<br />
References:<br />
<br />
[1] Gerald Friedland and Robin Sommer. 2010. Cybercasing the Joint: On the Privacy Implications of Geo-tagging. In Proceedings of the 5th USENIX Conference on Hot Topics in Security (HotSec’10). 1–8.<br />
<br />
[2] Jaeyoung Choi, Martha Larson, Xinchao Li, Kevin Li, Gerald Friedland, and Alan Hanjalic. 2017. The Geo-Privacy Bonus of Popular Photo Enhancements. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR '17). ACM, New York, NY, USA, 84-92.<br />
<br />
[3] Ádám Erdélyi, Thomas Winkler and Bernhard Rinner. 2013. <a href="http://ceur-ws.org/Vol-1043/mediaeval2013_submission_74.pdf">Serious Fun: Cartooning for Privacy Protection</a>, In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013.marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-15371625198507741022018-01-01T23:56:00.000+01:002018-01-02T00:16:53.711+01:002018: The year we embrace the information check habitThe new year dawns in the Netherlands. The breakfast conversation was about the <a href="http://nieuwscheckers.nl/">Newscheckers</a> site in Leiden and about the ongoing "<a href="https://www.beeldengeluid.nl/bezoek/agenda/nieuws-nonsens-demediamindfuck">News or Nonsense</a>" exhibition at the Netherlands Institute for Sound and Vision.<br />
<br />
Signs are pointing to 2018 being the year that we embrace the information check habit: without thinking about it, we do a double check of the trustworthiness of the factuality and the framing of any piece of information that we consume in our daily lives. If the information will influence us, if we will act upon it, we will finally have learned to automatically stop, look, and listen: the same sort of skills that we internalized when we learned to cross the street as youngsters.<br />
<br />
For me, 2018 is the year that I make peace with how costly information quality is. On factuality: I spend hours reviewing papers and checking sources. On framing: I devote a lot of time to looking for resources in which key concepts and processes are explained in ways that my students can easily understand. And too often I am prevented from working on factuality and framing by worrying about the consequences of missing something or making the wrong choices.<br />
<br />
It is costly in terms of time and effort just to choose words. I need words to convey to the students in my information science course that the world is dependent on their skills and their professional standards: anyone whose work involves responsibility for communication must devote time and effort to information quality and must take constant care to inform, rather than manipulate.<br />
<br />
What is the name for our era? I don't say "post-truth". An era can call itself "post-truth", but that's asking us to accept that it is fundamentally different from whatever came before---the "pre-post-truth" era. The moment we stop to reflect on how the evidence proves that we have shifted from truth to post-truth, we are engaging in truth seeking. Post-truth goes poof.<br />
<br />
I don't say "fake news" era. I grew up with the National Enquirer readily available at the supermarket check out counter, with its bright and interesting pictures of UFOs and celebrity divorces. That content wasn't there to contribute to building my mental model of reality, any more than Pacman. "Fake news" has always been there.<br />
<br />
My search for the right words continues. I am using the book <a href="https://en.wikipedia.org/wiki/A_Field_Guide_to_Lies">Weaponized Lies</a> by Daniel Levitin for the first time this year in order to teach critical thinking skills. Levitin uses words like "counterknowledge" and "misinformation". These are important terms, but they imply the existence of an intelligent adversary intentionally misleading us. It is important to defend against these forces. However, the idea that the problem is people putting effort into "weaponization" overlooks the less dramatic, and less easily identified, problem of reasoning from shaky, half-remembered information sources or using flawed logic to build arguments.<br />
<br />
Now at the end of the first day of 2018, I am staring at Weaponized Lies next to my keyboard, wishing there were shortcuts---that I didn't have to start from the bottom finding the words to talk about the importance of information quality, even before I start talking about information quality itself, and researching how to build safer more equitable information environments.<br />
<br />
There are no shortcuts. The only thing that we can hope for is that we can routinize information check. Make it a habit.<br />
<br />
I even stopped for a moment to dream about a rising demand for information quality creating new jobs. We need professionals who are able to help us monitor information without sliding into suppressing free speech and imposing censorship. This is the direction in which our knowledge society should grow.<br />
<br />
I thought I remembered reading an article online that discussed 2018 as the "Information Year". Now, for the life of me, I cannot find it. It takes so long to track down and keep track of sources. My first step in making peace with the cost of information quality: I end this blog post by admitting I have no proof for my thesis that 2018 is the year we embrace the information check habit. The title is instead an expression of hope that we can move in that direction.marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-30941463154518620612017-05-24T00:00:00.000+02:002017-05-29T10:20:33.739+02:00Multimedia Meets Machine (Learning): Understanding images vs. Image UnderstandingToday, I gave a talk at Radboud University's <a href="http://www.ru.nl/artificialintelligence/news-events/good-aifternoon-2016-2017/">Good AIfternoon symposium</a>, for Artificial Intelligence students. I covered several papers that I have written with different subsets of my collaborators [1, 2, 3]. The goal was to show students the difference in the way humans understand images, and in the type of understanding that can be achieved by computers applying visual content analysis, particularly concept detection.<br />
<br />
<b>Human Understanding of Images</b><br />
Consider the images below from [1]. The concept detection paradigm claims success if a computer algorithm can identify these images as depicting a woman wearing a turquoise blue sundress with water in the background. For bonus points, in one image the woman is wearing sunglasses.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0uPb6yCEoE5KbnwJr4KhcJ2KJD4Mm3I84TatvI9X6kNhyphenhyphenNbQ3g4RcnrHtuvkKjKKv4EI6EXmr85MnJvi2CCA0LC2n_aioo9RSeQEL7rK25u3zzbMBaaxi74F5exfeMEp2OsiqNA547lo/s1600/bluedresses.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="328" data-original-width="619" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0uPb6yCEoE5KbnwJr4KhcJ2KJD4Mm3I84TatvI9X6kNhyphenhyphenNbQ3g4RcnrHtuvkKjKKv4EI6EXmr85MnJvi2CCA0LC2n_aioo9RSeQEL7rK25u3zzbMBaaxi74F5exfeMEp2OsiqNA547lo/s320/bluedresses.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
A person looking at these images would not say that such a concept-based description of the images is wrong. In fact, if a person is presented with these pictures out of context, and asked what they depict, "A woman wearing a blue sundress at the beach" would be an unsurprising response. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
However, this response falls short of really characterizing the photos from the perspective of a human viewer. This shortcoming becomes clear by considering contexts of use. For example, if we needed to choose one of the two as a photo for selling a turquoise blue dress in a web shop, the right-hand photo is clearly the photo we want. The left-hand photo is clearly unsuited for the job. Concept-based descriptions of these images fail to fully capture user perspectives on images. Upon reflection, a person looking at these images would conclude that the concept-based description is not <i>wrong</i> per se, but that it seriously <i>misses the point </i>of the image.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
An often-heard argument is that you need to start somewhere and that concept-based description is a good place to start. However, we need to keep in mind that this starting point represents a built-in limitation on the ability of systems that use automatic image understanding (such as image retrieval systems) to serve users. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Think of it this way. Indexing images with a preset set of concepts is a bit like those parking garages that paint each floor a different color. If you remember the color, that color is effective at allowing you to find your car. However, the relationship of the color and your car is one of convenience. The parking-garage-floor color is an essential property of your car when you are looking for it in the garage, but outside of the garage, you wouldn't consider it an important property of your car at all.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In short, automatic image understanding underestimates the uniqueness of these images, although this uniqueness is of the essence for a human viewer.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Machine Image Understanding</b></div>
<div class="separator" style="clear: both; text-align: left;">
Consider the images below from [4]. A human viewer would see these as two different images.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuDbE04wBzDJ_Lji8Ae-yesBBtj585GAYIGsSv9-ow1fXOr7QnLRcq9Cv_ttb7X-M1O-gPj7mrvJQX_l49vNqbP7EILX1ACLuXQY3QS2Bmcy5-XsbDm2rgmGfZ48_crjiuWWzyOGYKq9Y/s1600/adad.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="526" data-original-width="886" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuDbE04wBzDJ_Lji8Ae-yesBBtj585GAYIGsSv9-ow1fXOr7QnLRcq9Cv_ttb7X-M1O-gPj7mrvJQX_l49vNqbP7EILX1ACLuXQY3QS2Bmcy5-XsbDm2rgmGfZ48_crjiuWWzyOGYKq9Y/s320/adad.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
If the geo-location of the right-hand image is known, geo-location estimation algorithms [3] can correctly predict the geo-location of the left-hand image. In this case, a machine learning algorithm "understands" something about an image that is not particularly evident to a casual human viewer. Humans are largely unaware that the geo-location of their images is "obvious" to a computer algorithm that has access to other images known to have been taken at the same place.</div>
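The simplest version of this idea can be sketched as nearest-neighbor matching: represent each image as a feature vector and copy the coordinates of the visually most similar reference image. The toy Python sketch below (with invented feature vectors and coordinates) only illustrates the principle; the actual algorithms in [3] are far more sophisticated:

```python
import math

# Toy reference collection: hypothetical visual feature vectors paired
# with the known (lat, lon) coordinates of the place they were taken.
reference = [
    ([0.9, 0.1, 0.3], (48.8584, 2.2945)),    # e.g., taken at the Eiffel Tower
    ([0.2, 0.8, 0.5], (40.6892, -74.0445)),  # e.g., taken at the Statue of Liberty
]

def predict_geolocation(query_features):
    """Predict coordinates by copying the location of the visually
    most similar reference image (1-nearest-neighbor matching)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, location = min(reference, key=lambda r: dist(r[0], query_features))
    return location

# A query image whose features resemble the first reference image
print(predict_geolocation([0.85, 0.15, 0.25]))
```

The point of the sketch is the source of the "understanding": the algorithm knows nothing about the scene itself, only that other images with known locations look similar.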
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In short, human understanding of images overestimates the uniqueness of these images, and visual content analysis algorithms understand more about them than people realize.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Moving forward</b></div>
<div class="separator" style="clear: both; text-align: left;">
Given the current state of the art in visual content analysis, "Multimedia Meets Machine" is perhaps a bit outdated, and we should be thinking in terms of titles like "Multimedia Has Already Met Machine".</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The key question moving forward is whether machine understanding of images supports the people who take and use those images, or whether it merely provides a little convenience at the larger cost of personal privacy.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both;">
[1] Michael Riegler, Martha Larson, Mathias Lux, and Christoph Kofler. 2014. How 'How' Reflects What's What: Content-based Exploitation of How Users Frame Social Images. In Proceedings of the 22nd ACM international conference on Multimedia (MM '14). </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
[2] Martha Larson, Christoph Kofler, and Alan Hanjalic. 2011. Reading between the tags to predict real-world size-class for visually depicted objects in images. In Proceedings of the 19th ACM international conference on Multimedia (MM '11).</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
[3] Xinchao Li, Alan Hanjalic, Martha Larson. Geo-distinctive Visual Element Matching for Location Estimation of Images, Under review. <a href="http://arxiv.org/pdf/1601.07884v1.pdf">http://arxiv.org/pdf/1601.07884v1.pdf</a></div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
[4] Jaeyoung Choi, Claudia Hauff, Olivier Van Laere and Bart Thomee. 2015. The Placing Task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop.</div>
<div>
<a href="http://ceur-ws.org/Vol-1436/Paper6.pdf">http://ceur-ws.org/Vol-1436/Paper6.pdf</a></div>
<br />
<br />marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-9867931284557953302017-04-22T00:00:00.000+02:002017-04-30T13:48:59.908+02:00March for Science: Einsteins at the Lake<a data-flickr-embed="true" href="https://www.flickr.com/photos/24400159@N05/15010937869" nbsp="" title="A view of the Great Lakes from space"><img alt="A view of the Great Lakes from space" height="180" src="https://c1.staticflickr.com/4/3836/15010937869_5c7319daa3_n.jpg" width="320" /></a><script async="" charset="utf-8" src="//embedr.flickr.com/assets/client-code.js"></script><br />
<br />
<i>May break at Radboud University (which happens to fall in April this year) sees me arriving in the US, just in time to participate in the <a href="http://milwaukeescience.org/blog/press-release-april-23-2017/">March for Science</a></i><span style="background-color: white; color: #222222; font-family: sans-serif; font-size: 14px;"><a href="http://milwaukeescience.org/blog/press-release-april-23-2017/">—</a></span><i><a href="http://milwaukeescience.org/blog/press-release-april-23-2017/">Milwaukee</a>, on the shores of Lake Michigan. The weather was gorgeous and the march route was beautiful, taking me past sites familiar from school field trips of my childhood. This blogpost contains photos and some reflections on what the march means. </i><br />
<br />
<b>Why march for science?</b><br />
<br />
Marching restores the natural balance between listening and reading (I'm at overdose levels these days) and expressing oneself. The thought expressed is not complicated: it is simply a statement of support for evidence-based policy making. The act of marching also serves to preserve our culture of freedom of expression, of open and informed criticism, and of citizens demanding that their values and interests be represented by their government.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMsSf2jr0AEp6emsTmbNZGb3ZcVI8t5AppFH1heauBCwQCT-ry6wHHpoieE3h-dcXxkc37Kmlx-NrGJdzUhh8q1o5VNkEwuHy3xwssSwTkg09xuR6ETVN4KLWJA1LEa51bvOKeLi7Ab_k/s1600/Bridge.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMsSf2jr0AEp6emsTmbNZGb3ZcVI8t5AppFH1heauBCwQCT-ry6wHHpoieE3h-dcXxkc37Kmlx-NrGJdzUhh8q1o5VNkEwuHy3xwssSwTkg09xuR6ETVN4KLWJA1LEa51bvOKeLi7Ab_k/s320/Bridge.JPG" width="320" /></a></div>
<br />
In Dutch, a scientist is a "Wetenschapper", literally, a "Creator of Knowledge". Marching is a concrete and publicly visible sign of the importance of the knowledge created by the scientific method. This knowledge is the bedrock of our well-being as a society. Think: energy, food, health, housing, sanitation, security, transport, and the technology underlying today's digital information creation and exchange. The knowledge that we create by the scientific method is knowledge that we cannot live without.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuoj_wCJvbCzZmmHQlQYv9v69BczgEfF337j-knNE974xnUdF1leD7Zj-8BxqloXjMrbWaHYfVsFEc098dAfyts0Mky-ZmSY31U4REFwQjp5JAnkvbmeQyMAWgn4tvmnl93sZDAe_ZmBw/s1600/PerformingArts.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuoj_wCJvbCzZmmHQlQYv9v69BczgEfF337j-knNE974xnUdF1leD7Zj-8BxqloXjMrbWaHYfVsFEc098dAfyts0Mky-ZmSY31U4REFwQjp5JAnkvbmeQyMAWgn4tvmnl93sZDAe_ZmBw/s320/PerformingArts.JPG" width="320" /></a></div>
<br />
Restoration is sorely needed in a world delivering a constant information deluge. There's news, but that news includes <a href="https://www.theguardian.com/media/2017/apr/19/crush-the-saboteurs-british-newspapers-react-to-general-election">news about news</a>. It is important to keep up: to read, track developments, form a position, and, on the basis of this position, vote. However, without actively working to keep the balance, too much reading becomes bookkeeping of who is on which side, and a tallying of points, wins, and losses for each.<br />
<br />
Relief comes from falling back on common ground, seeking out the non-partisan issues, and focusing on these. We are mechanics, potters, brewers, nurses, birdwatchers, cooks. We drive cars, fly in airplanes, surf the Web, do our laundry, and, upon occasion, fool around with the physics and chemistry around us, e.g., by putting <a href="https://youtu.be/LjbJELjLgZg">Mentos in Coke</a>. These daily activities all represent science in action.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipdJpBdqijMw5IlCS_iCkBkDn3HE00bb_jO-3uL07blMCo5VynYZlFrP8GSm-4IUZ259af50k8PT0HmBuKA9dKTk9h6WyvaPPXQ7THHJEubA94Mvq28ckJ9mvj7y9Td8pqw0xW_4pxLNw/s1600/unnamed.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipdJpBdqijMw5IlCS_iCkBkDn3HE00bb_jO-3uL07blMCo5VynYZlFrP8GSm-4IUZ259af50k8PT0HmBuKA9dKTk9h6WyvaPPXQ7THHJEubA94Mvq28ckJ9mvj7y9Td8pqw0xW_4pxLNw/s320/unnamed.jpg" width="320" /></a></div>
<br />
True to our Wisconsin roots, more than one person at the March for Science carried the sign, "No science, no beer". I thought about <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">Student's t-test</a>, which was developed at a brewery: beer is much closer to serious science than you might expect.<br />
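The connection is historical fact: W. S. Gosset published the t-test under the pseudonym "Student" while working as a brewer at Guinness, where it served to compare small samples of ingredients and batches. A toy Python sketch of the two-sample t statistic (the batch measurements below are invented):

```python
import statistics

def t_statistic(sample_a, sample_b):
    """Two-sample t statistic (equal-variance form), as used to compare
    the means of two small samples -- e.g., yields of two brewing batches."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Pooled sample variance across both groups
    var_pooled = (
        (na - 1) * statistics.variance(sample_a)
        + (nb - 1) * statistics.variance(sample_b)
    ) / (na + nb - 2)
    return (mean_a - mean_b) / (var_pooled * (1 / na + 1 / nb)) ** 0.5

# Hypothetical yields from two brewing batches
batch_1 = [5.1, 4.9, 5.3, 5.0]
batch_2 = [4.6, 4.5, 4.8, 4.7]
print(round(t_statistic(batch_1, batch_2), 2))  # a large t suggests a real difference
```

A large absolute t value relative to the t distribution's critical value indicates that the difference between the batch means is unlikely to be chance, which is exactly the question a brewer with only a handful of samples needs answered.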
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ6WYAuwfFzZrDWViQA8-bq00oBsR-FtA1xRaJDtIfqZGxm697pnv3ttA0S2yunL5YDWGdZE1eeE-jxDe1Fyhk706CnLfblWT-8_DH51bhHrvt3ltpWk-z22NcJTAyjWcCvnzIsd4ddq8/s1600/NoScienceNoBeer.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ6WYAuwfFzZrDWViQA8-bq00oBsR-FtA1xRaJDtIfqZGxm697pnv3ttA0S2yunL5YDWGdZE1eeE-jxDe1Fyhk706CnLfblWT-8_DH51bhHrvt3ltpWk-z22NcJTAyjWcCvnzIsd4ddq8/s320/NoScienceNoBeer.JPG" width="240" /></a></div>
<br />
The common ground is surprisingly sturdy. People, all of us, are constantly applying evidence-based approaches. We don't heat up tomato soup by putting a tin can directly in the microwave, we don't put airtight lids on our fishbowls, we water our plants and maybe even give them plant food, and we try to eat healthily ourselves. <br />
<br />
Seen from this perspective of common ground, which we understand to be common sense, we are not experiencing a crisis of denial. Rather, it is perhaps a crisis of connection: putting what we collectively know into action for the benefit of us all. On Monday, 21 August, all of North America will have a special opportunity to watch an eclipse of the sun. No one expects it to unfold other than exactly <a href="https://eclipse2017.nasa.gov/">as NASA has announced</a>. Surely, this certainty is something that can be productively built upon.<br />
<br />
Relief comes from also falling back on shared values. One that is deeply ingrained in me from my Wisconsin youth is avoidance of waste. Waste of human life is at the top of that list of waste we must seek to avoid. I have taught myself to read Nicholas Kristof's columns on women's health without falling into despair. <a href="https://www.nytimes.com/2017/04/22/opinion/sunday/trump-thinks-this-is-pro-life.html">His latest</a> is on the impact of the funding cuts of the current Republican Administration to women's health programs internationally. I have not seen what Kristof has seen in his travels, but I have seen enough beyond the borders of the US to realize that these cuts translate directly into suffering and death. The science to save lives is there. We are an affluent society: our pride should be that we devote resources to doing just that.<br />
<br />
Avoidable waste is also to be observed closer to home. There is broad consensus on the importance of the <a href="https://www.glri.us/">Great Lakes Restoration Initiative</a>, as <a href="http://www.chicagotribune.com/news/opinion/commentary/ct-great-lakes-restoration-initiative-trump-perspec-0314-20170313-story.html">discussed by the Chicago Tribune</a>. The Great Lakes Restoration Initiative has the purpose of protecting and restoring the Great Lakes, which face threat from pollution and invasive species. <a href="https://en.wikipedia.org/wiki/Great_Lakes">These lakes</a> contain 21% of the fresh water on the surface of the earth, measured by volume. Growing up, I wished they were not quite so deep, since it was cold as cold could be trying to swim in them. Today, the presence of that incomprehensibly large mass of water still remains with me. I feel it in the way that my stomach drops to read about planned funding cuts to an essential program preserving it. Many, many people across party lines have had a similar visceral reaction.<br />
<br />
<b>Who does the march's message reach?</b><br />
<br />
If the march is about expressing a message, who receives that message? One goal is that it is received by policy makers: the sheer bio-mass of science-minded citizens on the street is a flashing red light signaling that the course needs to be corrected. More tangibly for me, the march is about reaching young people: people in school who are on the point of deciding for an education in STEM and for a career in science.<br />
<br />
At the March for Science, I was enchanted by the many mini-Einsteins. My presence there is a signal to them: "You are clear sighted in your understanding, dear mini-Einsteins. You are right in your resolve. Stay steadfast in your studies and stay true to your vision. There are three thousand of us who turned out here today to show you that you are not alone."<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijH1dnFBTeFh5kzhbpP4uD1I0WeJUcCb6-Pmot1NTF28klqoJMwic0g9OIf-aASD70_Lr3UpcYEeKvkqwSIX-CGlq_iuPhWWRpgz7UiCnLQhm5wAitKvgJ3OMWIm0IC1WUuHxTclR5rIU/s1600/IMG_0285.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijH1dnFBTeFh5kzhbpP4uD1I0WeJUcCb6-Pmot1NTF28klqoJMwic0g9OIf-aASD70_Lr3UpcYEeKvkqwSIX-CGlq_iuPhWWRpgz7UiCnLQhm5wAitKvgJ3OMWIm0IC1WUuHxTclR5rIU/s200/IMG_0285.JPG" width="200" /></a></div>
<br />
<br />marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-44048233675461169492017-02-26T17:14:00.001+01:002017-02-26T21:34:03.659+01:00Shared-tasks for multimedia research: Bans, benchmarks, and being effective in 2017<i>Last week, I officially resigned from contributing as an organizer to <a href="http://trecvid.nist.gov/">TRECVid Video Retrieval Evaluation</a>, which is sponsored by <a href="https://www.nist.gov/">NIST</a>, a US government agency in Gaithersburg, Maryland. In 2016, I was part of the Video Hyperlinking task, and contributed by defining the year's relevance criteria, creating queries, and helping to design the crowdsourcing-based assessment. It has been a very difficult decision, so I would like to record here in this blogpost why I have made it. </i><br />
<i><br /></i>
<i>Ultimately, we make such decisions ourselves, and everyone navigates these difficult processes alone. However, it takes a lot of time and energy to search for the relevant information, and to weigh the considerations. For this reason, I think that for some it may be helpful to know more details about my own process.</i><br />
<br />
<b>Benchmarking Multimedia Technologies</b><br />
Since 2008, I have been involved in benchmarking new multimedia technology. Benchmarking is the process of systematically comparing technologies by assessing their performance on standardized tasks. The process makes it possible to quantify the degree to which one algorithm outperforms another. Quantification is necessary in order to understand if a new algorithm has succeeded in improving over the state of the art, defined by the performance of existing algorithms.<br />
<br />
The strength of benchmarking lies in the degree to which a benchmark succeeds in achieving open participation. If a new algorithm is compared to some, but not all, existing algorithms, the results of the benchmark reflect less clearly a true improvement over the state of the art.<br />
<br />
My emphasis in benchmarking is on tasks that focus on the human and social aspects of multimedia access and retrieval. In other words, I am interested in people producing and consuming video, images, and audio content in their daily lives, and in algorithms that give them usefulness and value back from these activities. It is difficult to pack these aspects into quantitative metrics, so I am also committed to research that develops new evaluation methodologies and new metrics.<br />
<br />
Due to this emphasis, it is not surprising that most of my contribution has been channeled through the <a href="http://multimediaeval.org/">MediaEval Benchmark for Multimedia Evaluation</a>. (I coordinate the MediaEval "umbrella", which synchronizes the otherwise autonomous tasks.) However, the strength of the benchmarking paradigm is weakened if a single benchmark, with a limited spectrum of topics, becomes all-dominant. Instead, we need to act to prevent a single effort from "taking over the market". We need to work towards ensuring that a broad range of different types of problems are investigated by the research community. Fostering breadth means offering not only multiple tasks, but multiple benchmarks. This year, I am again involved in MediaEval, but also, as last year, in contributing to the organization of the <a href="http://www.clef-newsreel.org/">NewsREEL</a> task at <a href="http://clef2017.clef-initiative.eu/">CLEF</a> (where my role is to contribute to design, documentation, and reporting).<br />
<br />
<b>Open Participation in Benchmarks</b><br />
Both MediaEval and CLEF are open participation benchmarks in three aspects:<br />
<ul>
<li>First, anyone can propose a task (there is an open call for tasks). CLEF chooses its tasks by multi-institutional committee, cf. <a href="http://clef2017.clef-initiative.eu/index.php?page=Pages/call_for_labs.html">2017 CLEF Call for Task Proposals</a>. MediaEval also chooses its tasks by multi-institutional committee. However, the committee checks only for viability. The ultimate choice lies in the hands of all community members, including organizers and participants, cf. <a href="http://multimediaeval.org/files/mediaeval2017_taskproposals.html">MediaEval 2017 Call for Task Proposals</a>. The goal of an open call for tasks is to promote innovation: constantly evolving tasks prevent the community from "locking in" on certain topics, and becoming satisfied with incremental progress.</li>
<li>Second, anyone can sign up to participate. Participants submit working notes papers, which go through a review process (emphasizing completeness, clarity, and technical soundness). MediaEval and CLEF both publish open access working notes proceedings.</li>
<li>Third, for both MediaEval and CLEF, workshop registration is open to anyone, and requires only the payment of a fee to cover costs. For MediaEval, the fee covers the costs of the workshop, and also of hosting the website and organizer teleconferences. People/organizations contribute time to cover the rest of workshop organization.</li>
</ul>
Like MediaEval and CLEF, TRECVid also pursues the mission of offering an open research venue. Historically, both TRECVid and CLEF grew from <a href="http://trec.nist.gov/">TREC</a> (also, of course, organized by NIST), so the commitment to the common cause is unsurprising in this sense. However, TRECVid does not offer open participation in all three of the above aspects. Specifically, there is no publicly circulated call for task proposals, and the workshop is closed. (The stated policy is that the workshop is only open to task participants, and "to selected government personnel from sponsoring agencies and data donors", cf. <a href="http://www-nlpir.nist.gov/projects/tv2017/tv17.call.html">TRECVid 2017 Call for Participation</a>.) Moreover, TRECVid is technically not able to welcome all participants: the US does not maintain diplomatic relations with Iran, and US Government employees cannot answer email from Iran. It is important to understand that this is a historical challenge, and is not new with the current US Republican Administration.<br />
<br />
<b>Defining Priorities and Making Decisions</b><br />
Considerations related to open participation made me hesitant to get deeply involved in TRECVid. However, over the years, I have been very open to exchange. TRECVid originally reached out to me to give an invited talk back in 2009, when MediaEval was still VideoCLEF. (There are some <a href="http://ngrams.blogspot.nl/2009/11/when-life-gives-you-lemons.html">musings on my blog from that trip</a>.) The idea was to learn from each other. We hope this year to reciprocate with a TRECVid speaker at CLEF/MediaEval.<br />
<br />
In 2016, I contributed to the <a href="http://www-nlpir.nist.gov/projects/tv2016/tv2016.html#lnk">Video Hyperlinking</a> organization, since the move of Video Hyperlinking from MediaEval to TRECVid represented a spread of the emphasis on the human aspects of multimedia retrieval, and it was important to me to support that explicitly.<br />
<br />
All in all, it has taken a lot of time to decide where to invest my resources in 2017 in order to most effectively support multimedia benchmarking efforts that provide venues that are open and therefore effective as benchmarks.<br />
<br />
With the new Republican Administration in the US, two considerations grew to dominate my decision-making process. The first is how to contribute to the movement whose goal is to demonstrate the relevance and importance of science to the public and to policy makers (<a href="https://www.forceforscience.org/">https://www.forceforscience.org</a>). TRECVid, by virtue of being a benchmark, is certainly at the forefront of this movement (just by doing the same thing it has done for years). We need to support our US-based colleagues in their efforts to be a force for science, and hope that they support us as well, if we land in a similar situation.<br />
<br />
The second is how to react to the <a href="https://www.theguardian.com/us-news/2017/feb/17/trump-updated-travel-ban-minimal-input-national-security">travel ban</a>, which would prevent scientists of certain countries from entering the US. The first-order effects of the travel ban have been constrained by court rulings. However, the future plans of the administration are uncertain, and there is a range of second-order effects that a court cannot undo, e.g., people self-selecting out of participation because they are worried about their visas being held up by additional processing steps (and granted, for example, only <i>after</i> the workshop has occurred). These secondary effects effectively prevent people from attending a US-based event even though technically they may be able to get a visa.<br />
<br />
We are not alone in our thinking, but we are guided by a large number of organizations who have issued a public statement on the importance of openness for science (<a href="http://www.icsu.org/news-centre/news/top-news/international-council-for-science-icsu-calls-on-the-government-of-the-united-states-to-rescind-the-executive-order-201cprotecting-the-nation-from-foreign-terrorist-entry-into-the-united-states201d">Statement of the International Council for Science</a>, <a href="https://www.aaas.org/news/aaas-ceo-responds-trump-immigration-and-visa-order">Statement of American Association for the Advancement of Science</a>) including professional organizations that we belong to (<a href="http://www.acm.org/about-acm/suspension-of-visas">Statement of the ACM</a>, <a href="http://www.sigmm.org/news/sigmm_statement_on_open_participation">Statement of ACM SIGMM</a>, <a href="http://www.ieee.org/about/bartleson_message_01_february_2017.html">Statement of IEEE</a>) and European universities (<a href="http://www.eua.be/activities-services/news/newsitem/2017/01/30/european-universities-call-for-immediate-rethinking-of-trump-s-executive-order">Statement of the European University Association</a>, <a href="https://www.nrc.nl/nieuws/2017/02/16/wetenschap-moet-vooral-vrij-blijven-6728319-a1546332">Combined statement of all the universities in the Netherlands</a>, <a href="http://www.ru.nl/nieuws-agenda/nieuws/vm/2017/januari-0/immigratiebeperkingen-vs-ongewenst/">Statement of Radboud University</a>).<br />
<br />
There is much power in making an open statement of values, more than one might think. However, we should avoid assuming that statements are enough and that the situation will go back to where it was before the current Republican Administration. In other words, the days in which we needed to dedicate relatively little time to protecting and upholding the values of openness in science are gone. Instead, we need to think explicitly about where our effort can best be dedicated in 2017.<br />
<br />
TREC/TRECVid celebrated their 25th anniversary in 2016. The event has been a constant through many changes of US administration, and it is heartening that the 2017 event will, in all probability, look from the inside at least pretty much like all other events over the past 25 years.<br />
<br />
However, 2017 is the first year where people will be in the streets, in the US and around the world, <i>marching for science</i>: <a href="https://www.marchforscience.com/satellite-marches">https://www.marchforscience.com</a>. The large-scale sense of urgency tells us that 2017 is not just business as usual. For this reason, it is important in 2017 to reexamine the idea that the US should be such a strong attractor within the map of scientific research in the world.<br />
<br />
On top of the merit and can-do attitude that attract people from around the world to US institutions, we as scientists (because we study systems and networks) know that another force is at play. Specifically, we know that US institutions enjoy <a href="https://en.wikipedia.org/wiki/Preferential_attachment">preferential attachment</a>, meaning that past success is a determiner of future success. This effect translates into the reality that new or small events (e.g., research topics or benchmarking workshops) need a lot of extra time and attention to establish or maintain themselves in the field. 2017 is the year in which we need to think carefully about the extent to which we want to contribute to this non-linear feedback loop that strengthens the pull towards US-based events, and the extent to which we want to build counterweights.<br />
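The dynamic is easy to simulate: if each newcomer links to an existing node chosen with probability proportional to that node's current degree, the early arrivals accumulate a disproportionate share of the connections. A toy Python sketch:

```python
import random

def preferential_attachment(n_nodes, seed=42):
    """Grow a network in which each new node attaches to one existing node
    chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    degrees = [1, 1]  # start with two connected nodes
    for _ in range(n_nodes - 2):
        # random.choices weights the pick by current degree: "rich get richer"
        target = rng.choices(range(len(degrees)), weights=degrees)[0]
        degrees[target] += 1
        degrees.append(1)  # the newcomer arrives with a single link
    return degrees

degrees = preferential_attachment(1000)
# The earliest, best-connected nodes end up far above the average degree
print(max(degrees), sum(degrees) / len(degrees))
```

Running the sketch shows a hub-dominated degree distribution rather than an even spread, which is exactly the "attractor" effect described above: a venue's past connections, not only its merit, pull in the next connection.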
<br />
I consciously use the word "counterweights" since I am referring to a balancing act. We stand in complete solidarity with our US-based colleagues; providing counterweights in no way detracts from that fact. For multimedia research, counterweights include region-based initiatives, and benchmarks that allow anyone to propose a task. A network of diverse benchmarks makes benchmarking as a whole stronger, and makes us internationally more robust.<br />
<br />
My personal decision is that time spent promoting and preserving diversity is, in 2017, a more effective way to achieve the larger goals of benchmarking, than time spent reinforcing the connection between benchmarking and Gaithersburg, Maryland. I was born in Maryland, outside of DC, but Maryland is not where I am needed now. TRECVid will be fine without extra help from Europe, but what can (and does) suffer is the availability to the research community of non-US-based benchmarks.<br />
<br />
<b>Recommendations to TRECVid</b><br />
The intention is for my resignation to be a positive decision <i>for</i> and not a negative decision <i>against</i>. Reasoning that my reflections on the topics are probably helpful to NIST, I distilled my thinking into a set of three recommendations. Interestingly, these recommendations are relatively independent of the situation in the US caused by the current Republican Administration:<br />
<ul>
<li>First, TRECVid is an open research venue. I recommend stating this explicitly on the website. An example is the <a href="http://www.acm.org/conferences">ACM Open Participation statement</a>. </li>
<li>Second, TRECVid is supported by NIST. I recommend a clearer statement on the website of the source and the distribution of the funding. People familiar with the benchmark know that NIST is the powerhouse behind its success, but this is not clear to newcomers. Critically, the cases in which defense funding supports TRECVid are currently not clear. This is important to people who personally, or whose institutions, have a commitment to pursue research for civil purposes only. For example, many German institutions have a <a href="https://de.wikipedia.org/wiki/Zivilklausel">Zivilklausel</a> by which they commit themselves to pursuing exclusively research for civilian purposes. Even if participation is nominally open, unclarity on defense funding can scare people away, and the benchmark is effectively not as open as it would otherwise aspire to be. (For completeness: at least one colleague assumed I received NIST funding for my work on Video Hyperlinking. I did not. The unclarity in the funding causes confusion.) </li>
<li>Third, attention should be devoted to the archival status of the proceedings. As a good next step, they should be indexed by mainstream search engines. Moving forward, attention should be paid to maintaining a historical record of TRECVid, should NIST at some point in the future no longer be able to support open participation/open access in the way it does now.</li>
</ul>
<div>
If you have read all the way to the end of this blog post, let me finish by thanking you: both for your dedication to open participation in scientific research, which is so essential to benchmarking, but also for taking the time to read about my personal struggle. It has been a long path.</div>
<div>
<br /></div>
<div>
Don't miss the March for Science on 22 April. Inspire and be inspired.</div>
<div>
<br /></div>
<div>
Amsterdam:<a href="https://marchforscience.nl/"> https://marchforscience.nl</a></div>
<div>
Or find another march around the world here: <a href="https://www.marchforscience.com/satellite-marches">https://www.marchforscience.com/satellite-marches</a></div>
marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-3305176965026173642017-02-08T21:17:00.001+01:002017-02-08T21:18:36.741+01:00Bytes and pixels meet the challenges of human media interpretation<i>Back in June, I gave a talk at the Communication Science Department here at Radboud University Nijmegen. Today, I presented a version of that talk to my colleagues in the Language and Speech Technology Research Meeting. The abstract is below together with the slides, which are on SlideShare. During the discussion it became clear that many problems in natural language processing and information retrieval face the issue of varying human interpretations. It is important to find ways to move forward, although it may not be possible to pack our challenges into neat classification or ranking problems with a single set of consensus ground truth labels. A way forward is to look to other disciplines for theories of how people understand and use media, and to let these inform what we design our systems to do and the ways that we measure success.</i><br />
<br />
Within computer science, "Multimedia" is a field of research that investigates how computers can support people in communication, information finding, and knowledge/opinion building. Multimedia content is defined broadly. It includes not only video, but also images accompanied by text and other information (for example, a geo-location). It can be professionally produced, or generated by users for online sharing. Computer scientists historically have a “love-hate” relationship with multimedia. They “love” it because of the richness of the data sources and the wealth of available data, which leads to interesting problems to tackle with machine learning. They “hate” it because multimedia is a diffuse and moving target: the interpretation of multimedia differs from person to person, and changes over time in the course of its use as a communication medium. This talk gives a view onto ongoing research in the area of multimedia information retrieval algorithms, which help people find multimedia. We look at a series of topics that reveal how pattern recognition, text processing, and crowdsourcing tools are used in multimedia research, and discuss both their limitations and their potential.<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="485" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/key/oVbaLoSBUaQEWb" style="border-width: 1px; border: 1px solid #CCC; margin-bottom: 5px; max-width: 100%;" width="595"> </iframe> <br />
<div style="margin-bottom: 5px;">
<strong> <a href="https://www.slideshare.net/maranlar/multimedia-information-retrieval-bytes-and-pixels-meet-the-challenges-of-human-media-interpretation" target="_blank" title="Multimedia Information Retrieval: Bytes and pixels meet the challenges of human media interpretation">Multimedia Information Retrieval: Bytes and pixels meet the challenges of human media interpretation</a> </strong> from <strong><a href="https://www.slideshare.net/maranlar" target="_blank">maranlar</a></strong> </div>
<h2 style="text-align: left;">Women's March on Washington: A shout heard around the world (2017-01-29)</h2><i>In <a href="http://ngrams.blogspot.com/2017/01/womens-march-on-washington-across-time.html">my previous post</a>, I wrote up some observations on the Women's March on Washington (WMW) and how the technology that allows us to produce and share multimedia adds dimensions to what we actually do when marching and by marching. I stated that what remains with me most clearly a week later is people's voices: people speaking and listening to each other in ways they hadn't before. </i><br />
<i><br /></i>
<i>In that post, I looked at people's voices from the point of view of the information that they convey. However, because of my interest in speech and speech technology, I also see people's voices purely and simply as an audio signal. This post contains some observations following on from the fact that each person's voice is actually a sound wave.</i><br />
<br />
Newly arrived at the WMW, we stood in the midst of a sea of people, wondering if we were actually going to see the stage. As we oriented ourselves, an enormous sound moved towards us over the crowd. It started from a way off, and moved closer and closer like a wave from far out in the ocean. It was an unfamiliar sound.<br />
<br />
As the wave passed over us, it became clear that shouting was creating the wave. When it reached us, we also shouted, and it moved on.<br />
<br />
My backstory: My main reason for marching at WMW was for health care, or at least that's where I started. I found myself quickly leaning towards the position of the person holding the sign saying, "Too many issues to fit on one sign". I don't join mono-gender initiatives: gender issues affect everyone and we only get to equality if we get to equality together. We must move on equal footing towards equal footing. The WMW was not mono-gender. About 10-20% of the crowd were men, but my estimate might be wrong, since putting the people around me into gender categories was one of the last things on my mind that day.<br />
<br />
After the shout wave had passed, it struck me: What I have just experienced was an acoustic event that has never before occurred on the surface of the planet. The shout wave was the sound of a woman-dominated group of hundreds of thousands of voices acting in coordination. No wonder that it had struck me as an unfamiliar sound.<br />
<br />
I can think of times in the history of the world in which a group of men would have created a shout wave, or even a 50/50 gender group. I can imagine what that would sound like, or perhaps I have even heard it before. However, this woman-dominated sound was fully new. We have collaboratively invented, as a species, the ability to generate a never-before-heard acoustic event. There was not one wave that day, but many.<br />
<br />
Later, I saw this acoustic event that I call a "women's shout wave" <a href="https://www.washingtonpost.com/local/womens-march-on-washington-a-sea-of-pink-hatted-protesters-vow-to-resist-donald-trump/2017/01/21/ae4def62-dfdf-11e6-acdf-14da832ae861_story.html?utm_term=.2b64b97a6789">referred to in the newspaper as a "rolling roar"</a>.<br />
<br />
If you know something about sound, you know that it has a physical reality. Sound is a mechanical wave caused by compression of the air: it can knock you off of your feet. The frequency of those waves is directly and necessarily related to the size of what initially starts pushing the air to create them. The smaller the physical source, the faster it vibrates and the higher the pitch of the sound. Our voices are created by vibrations in our larynx (voice box). On average, women have a smaller voice box than men. A woman-dominated crowd will produce a higher-pitched sound than a gender-balanced crowd or a man-dominated crowd. Throw in a few kids' voices (also small voice boxes) as with the WMW, and the resulting rolling roar is a powerful, yet sparkling, acoustic event that deserves to be compared to the sound of a band of angels, however you might choose to imagine that.<br />
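To make the size-to-pitch link concrete, here is a back-of-the-envelope sketch (the standard vibrating-string approximation, not a claim about the actual acoustics of the march; the numerical averages are commonly cited ballpark figures):

```latex
% Vibrating-string approximation of vocal-fold pitch:
%   f_0 : fundamental frequency, L : fold length,
%   T : tension, \mu : mass per unit length
f_0 = \frac{1}{2L}\sqrt{\frac{T}{\mu}}
% Shorter folds (smaller L) give a higher f_0. Commonly cited speaking
% averages are roughly 120 Hz for adult men and 210 Hz for adult women.
```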
<br />
In moments of philosophy, we often discuss the question of a tree falling in a forest: if no one hears it then did it really make a sound? The wave of women's voices at the WMW produced an acoustic event of a fully different nature. If you think about it, you cannot even ask this question about the women's shout wave. It is produced if and only if a crowd dominated by women comes together in one place and acts in coordination: The existence of this sound and the fact that it is also heard are one and the same.<br />
<br />
The wave of women's voices at the WMW truly produced a shout heard around the world. Reflecting a bit, you realize that this new sound is not the only new signal that was produced at the march: shout waves happened at the march, but were not the essence of the march. The essence was a new social signal. What this signal is will reveal itself in what we do next: it may not be a mechanical wave, but I have no doubt in its power to move things in our social/political/physical world.<br />
<br />
The discovery of the women's shout wave may or may not excite you as much as it excites me. But you don't have to share my enthusiasm of having participated in the acoustic history of humanity in order to agree: as long as we are able to come together and make this coordinated sound, we are headed in the right direction as a species.<br />
<br />
Tamika Mallory said at the WMW, "When you go home remember how you felt": <a href="https://youtu.be/dDc9Ochrifw?t=3h38m36s">https://youtu.be/dDc9Ochrifw?t=3h38m36s</a> Remembering the unique sound of the rolling roar has the ability to place us back in the moment: in recalling what we have heard, we can recall what we felt.<br />
<br />
But the rolling roar also reminds us of something very important about the movement: the decision to create the women's shout wave is the decision of every individual in the crowd at the moment the wave breaks over her. Every time it is your turn to contribute to the wave, you must check your alignment with the overall goals, and then you must shout your lungs out. You have a responsibility as an organizer to target a straight, true line, but also as a participant, an "organizee" as it were. Each person is individually responsible for periodically checking that we are still on track towards the right goals, and for reengaging with full force after every check.<br />
<br />
That takes a lot of energy, and a lot of strength. In the moments when I wonder where that strength will come from, I think of this sign: "Do not be afraid".<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT0mR6IhglFeJfm5lBFOO_xes1GglGzTJJBCzCLqA6Q57sCnE01dhh9gFdjRjYRGk66NuvfSMHzUvUPhtf5-3y0Vxh7b7rD48neEw9XXz3syiun5P-w4CE_j6l1EDwF9toNV17-kywbPQ/s1600/NoFear.JPG" imageanchor="1"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT0mR6IhglFeJfm5lBFOO_xes1GglGzTJJBCzCLqA6Q57sCnE01dhh9gFdjRjYRGk66NuvfSMHzUvUPhtf5-3y0Vxh7b7rD48neEw9XXz3syiun5P-w4CE_j6l1EDwF9toNV17-kywbPQ/s200/NoFear.JPG" width="200" /></a><h2 style="text-align: left;">Women's March on Washington: Across time and dimensions in space (2017-01-28)</h2><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj83JizEn0ER820quxoxXoAisK_ddpxDkRKjPojj1T9W-EpYMl59SRhWbObI4wd-TzeJk6VW3X4BNLSAFWPTzPJlbkF6wz5oYiePLk89dELcs6Hx2YY5eVtwd_v-usiXgnH8DABmbdbugo/s1600/Freedom.JPG" imageanchor="1"><img border="0" height="237" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj83JizEn0ER820quxoxXoAisK_ddpxDkRKjPojj1T9W-EpYMl59SRhWbObI4wd-TzeJk6VW3X4BNLSAFWPTzPJlbkF6wz5oYiePLk89dELcs6Hx2YY5eVtwd_v-usiXgnH8DABmbdbugo/s400/Freedom.JPG" width="400" /></a><br />
<br />
<i>It's been one week since the <a href="https://www.womensmarch.com/">Women's March on Washington</a> (WMW). I was curious to see what would still grip me most strongly now after one week. Filtering through so many topics, so many impressions, so much emotion, what remains with me most clearly is: people's voices.</i><br />
<i><br /></i>
<i>People are speaking up, speaking out, and, for the first time that I can remember, speaking with unanimous confidence that speaking can and does indeed bring about change. The most obvious case is the power of every citizen to call their government representatives in order to communicate their opinion. But speaking also has the power to open people's minds to fundamentally different perspectives: speaking with each other can lead to the understanding that the world is a single, immense interconnected system. There is and can be no difference between standing up for ourselves and standing up for other people. This understanding builds confidence, lends authority to our individual voices, and allows us to comprehend our own potential at our cores: what we can make happen by all pulling in a common direction is more than we have previously dared to hope or imagine.</i><br />
<br />
This blog post contains observations about the WMW seen through the lens of <i>information, </i>which is my field of research. Specifically, I focus on information as it is captured and shared in multimedia: audio, video, and images.<br />
<br />
In Ch. 2 of their <a href="http://www.cambridge.org/us/academic/subjects/computer-science/computer-graphics-image-processing-and-robotics/multimedia-computing">Multimedia Computing book</a>, Friedland and Jain discuss communication-related inventions over the course of civilization. Their discussion makes the point that the era we live in is an era of media recordings, digital media, and the Internet. These inventions provide us with three invariances, which offer historically unique opportunities for communication: invariance of time, invariance of space, and invariance of addressee.<br />
<br />
"Invariance" in this case refers to the stability of communication—if a message is invariant, it is not subject to loss or decay. Specifically, these three invariances mean that a message communicated in the past is also available in the future, that a message presented at one place, can be also presented at another, and that a message communicated to one person can be communicated to anyone.<br />
<br />
How do these invariances make marching in today's times different than ever before? When our grandparents' generation made signs and took to the street, their message would only reach the people who were present on that street and in that moment. Anything beyond that was dependent on newspaper and radio coverage.<br />
<br />
In 2017, the WMW took place in full realization that the march was not limited to that moment, to that space, or to the specific people who were physically in Washington DC. Most of the intended addressees of the march were somewhere completely different. In the case of all three invariances, I observed behavior that reflected people's consciousness of these information invariances and the need to use them.<br />
<br />
<b>Invariance of time: Signs as photo opportunities</b><br />
Cameras, both conventional and on mobile phones, were everywhere at the WMW. I watched, amazed, as people took pictures of each other holding their signs. Slowly, at the march, it dawned on me that my idea of not bringing a camera to a demonstration was outdated. It became apparent to me that people made their signs with the intention of having other people take pictures of them. Some people who had particularly interesting or novel signs were standing at the side of the street. Apparently they were standing there so that people would come up, chat, and then pose with the sign for pictures.<br />
<br />
A consciousness pervaded the march: a sign is not merely a physical object, but rather a message broadcast outward without a predetermined limit. Your face in a picture next to a sign anchors it to you as a person in a way that comments on the Web are never anchored. The march came together around <a href="https://www.womensmarch.com/principles/">unity principles</a>. Motivated by these principles, a "selfie" taken with a march sign becomes an "unselfie": an act of selflessness in support of rights and of people without the opportunity to stand up for their own rights.<br />
<br />
<b>Invariance of space: Everywhere at once</b><br />
There was a feeling of connection at the WMW to people who would have liked to have been there, but who weren't able to make the trip to Washington DC. I saw at least one marcher who had listed the names of the people who supported her on her sign. While marching, I felt strongly connected to the people in our circle of family and friends who had stayed at home and watched the media coverage and/or attended other marches.<br />
<br />
During the WMW, the mobile phone network clearly suffered overload. Through much of the duration of the march, it was not possible to access the Web via the mobile phone network, to call, or even to send or receive a text. For this reason, we realized that the people who were<i> not at the march</i> actually understood what was going on better than the marchers could, although we were <i>actually at the march</i>. The feeling that being in the middle of the march was not necessarily the best way of getting an overall understanding of what was going on enhanced the impression that the march was not happening in a particular physical space, but in fact everywhere at once.<br />
<br />
It was only later that evening that I came to fully understand that there had been <a href="https://www.washingtonpost.com/local/womens-march-on-washington-a-sea-of-pink-hatted-protesters-vow-to-resist-donald-trump/2017/01/21/ae4def62-dfdf-11e6-acdf-14da832ae861_story.html?utm_term=.49b8126502f8">marches of this magnitude in so many places around the US and the world</a>. Today, it's a week later and I spent some time browsing march pictures on Flickr. March pictures are all simply march pictures, and whether they were taken at the WMW or at a sister march has since faded into the background.<br />
<br />
<b>Invariance of addressee: No one missed out on anything</b><br />
My experience of the march was people, people, and more people. I know Washington DC well, but at times, I felt completely out of sight of anything familiar. During the time that the speakers were speaking we saw no stage: we just had faith that somewhere in the core of the masses the speakers were speaking on schedule. At one point, someone close to me in the crowd said, "Madonna's here!" People seemed excited to hear that, but everyone had realized by then that it was counterproductive to try to get to the stage.<br />
<br />
There was a consciousness that no one was missing out on anything. Not being able to see any of the speakers was not a disappointment: we could all just shrug, "Oh, well, I'll catch the speeches on YouTube later". (I spent some time doing that today.) Thinking about what it was like in the middle of the march: I've never experienced a moment with such a clear sense of shared awareness that that moment would be lived and relived afterwards. For those with similar scifi habits, I'll say it's the closest I've ever come to experiencing the feeling of travel across time and relative dimensions in space: the <a href="http://www.thedoctorwhosite.co.uk/tardis/">TARDIS</a>.<br />
<br />
<b>Past, present, and future</b><br />
The weeks leading up to the WMW were already filled with an appreciation of what past marchers had to teach us, and my thoughts frequently turned to the 1963 <a href="https://en.wikipedia.org/wiki/March_on_Washington_for_Jobs_and_Freedom">March on Washington for Jobs and Freedom</a>. Awareness of the contributions of those in whose footsteps we follow is perhaps the most dramatic impact of information communicated across time (54 years ago), space (around the world), and audience (I was not even born then).<br />
<br />
I never thought too much about 1963 as a child, but I also never thought too much about fire drills. When the alarm goes off, you calmly and peacefully leave the school. When things become unbearable, you calmly and peacefully go to march in Washington. These are the procedures and the practices that keep us—all of us—safe and keep our efforts to build a just society moving forward. Images, audio recordings, and videos hold the practice before our mind's eye: yes, it does happen, it has happened, and since it needs to happen, it will happen again.<h2 style="text-align: left;">Recommender system failure as a business model: Repellent ad? Pay for premium! (2017-01-26)</h2><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk_jZ68bH9Jw9xwz1Mos4vLxOa1nn0ER-yW-UghPZUaxh7b0Vwsz1dtwYbS791gkdKt16vMnqMCAQcoMqTBDnuufWNw9NvSt5hnfYd8QQ6whJnoUNRLBVjE52s0ItfaGDNgtqnsM_ONQ0/s1600/JustSubscribe.jpg" imageanchor="1"><img border="0" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk_jZ68bH9Jw9xwz1Mos4vLxOa1nn0ER-yW-UghPZUaxh7b0Vwsz1dtwYbS791gkdKt16vMnqMCAQcoMqTBDnuufWNw9NvSt5hnfYd8QQ6whJnoUNRLBVjE52s0ItfaGDNgtqnsM_ONQ0/s320/JustSubscribe.jpg" width="320" /></a><br />
<br />
While writing <a href="http://ngrams.blogspot.nl/2017/01/down-rabbit-hole-greetings-from-state.html">my last post</a>, I spent a lot of time worrying about whether we really understand the forces at play with so much of our information world driven by business models based on clicks. My underlying assumption was that these forces all perpetuate the dependence of the production and flow of information in today's world on advertising. Today, I was reminded of the importance of thinking out of the box, and never assuming anything: there might be exceptions.<br />
<br />
Here's what's happening: YouTube incentivized me to subscribe to <a href="https://www.youtube.com/red">YouTube Red</a> by showing me an ad that raised the hair on the back of my neck, and then giving me a pop up window asking "Want to remove ads?" (screenshot above).<br />
<br />
Specifically what happened: the pre-roll ad for my video was from Urban Carry Holsters:<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="180" mozallowfullscreen="" src="https://player.vimeo.com/video/188238910?color=ff9933&title=0&byline=0&portrait=0" webkitallowfullscreen="" width="320"></iframe><br />
<a href="https://vimeo.com/188238910">G2 Overview</a> from <a href="https://vimeo.com/user10864503">David Foster</a> on <a href="https://vimeo.com/">Vimeo</a>.<br />
<br />
and while my video played I had Urban Carry Holsters videos suggested at the upper right hand of my page:<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgedgHZbeROyEOtrZfhaQLOmxmWdgRX-7Zpbeu7PhmIZI-kzSSEmW81vWc9ivhy-nUI5gSfEfbqXRLT8qrVWydTiFDO2FZ4eLTMFEdRBB7QSua-9T4nxfQnEkLY7DT_GfnoUydmtndDBNc/s1600/urbancarry.jpg" imageanchor="1"><img border="0" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgedgHZbeROyEOtrZfhaQLOmxmWdgRX-7Zpbeu7PhmIZI-kzSSEmW81vWc9ivhy-nUI5gSfEfbqXRLT8qrVWydTiFDO2FZ4eLTMFEdRBB7QSua-9T4nxfQnEkLY7DT_GfnoUydmtndDBNc/s320/urbancarry.jpg" width="320" /></a><br />
After watching this for a while in horrified fascination, YouTube opened a pop-up:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRwgUUIF4hOT87G2Q2GgQMqZMy6_mwomP6mAMPMV1PfUXZ1ajwnASpVv5nKO90SWzGCYyVqQAFGrnPomOlGdzJ2avwQYc2UzrxgOimwcVHUepKrF64iH-NH24iQAOBc1Vb5bGTMdQNk-I/s1600/wanttoremoveads.jpg" imageanchor="1"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRwgUUIF4hOT87G2Q2GgQMqZMy6_mwomP6mAMPMV1PfUXZ1ajwnASpVv5nKO90SWzGCYyVqQAFGrnPomOlGdzJ2avwQYc2UzrxgOimwcVHUepKrF64iH-NH24iQAOBc1Vb5bGTMdQNk-I/s320/wanttoremoveads.jpg" width="320" /></a><br />
The "try it" button might as well have been labeled: "Get me out of here!"<br />
<br />
Pretty brilliant, really. What I am assuming is happening (i.e., "may be could be happening") is that the recommender system algorithm is optimized to increase not only the number of ad clicks, but also the number of YouTube Red subscriptions.<br />
<br />
Of course, I am a proponent of recommender systems that are not designed to fulfill a single target [1]. The target could be ill-designed, and the world is also just not that simple.<br />
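Purely as an illustration of what optimizing for more than one target could look like (I have no knowledge of YouTube's actual system; every weight and signal name below is invented), here is a toy multi-objective ranking sketch:

```python
# Toy sketch of a multi-objective ad ranker. Instead of ordering candidate
# ads by predicted click probability alone, each ad is scored by a weighted
# blend of two predicted signals. All weights and signal names are invented.

def blended_score(p_click, p_subscribe, w_click=0.7, w_subscribe=0.3):
    """Blend two predicted probabilities into a single ranking score."""
    return w_click * p_click + w_subscribe * p_subscribe

def rank_ads(candidates):
    """Order candidate ads (dicts of predicted signals) by blended score."""
    return sorted(
        candidates,
        key=lambda ad: blended_score(ad["p_click"], ad["p_subscribe"]),
        reverse=True,
    )

ads = [
    {"name": "ad_a", "p_click": 0.10, "p_subscribe": 0.01},
    # A "repellent" ad: few clicks predicted, but likely to push the
    # viewer toward an ad-free subscription.
    {"name": "ad_b", "p_click": 0.04, "p_subscribe": 0.20},
]
ranked = rank_ads(ads)
# With these made-up numbers, ad_b outranks ad_a even though it draws
# fewer clicks, because the subscription signal tips the balance.
```

The point of the sketch is only that once a second objective enters the blend, an ad that repels clicks can still be the "optimal" ad to show.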
<br />
However, I am of two minds about what YouTube just did to me as a user. First, when we talk about gun violence in the US, we talk about deaths and casualties. The discussion of the psychological wear and tear is often in the shadows. If my heartbeat rises with an ad like this one, then I can't even imagine what parents must go through who send their kids out the door in the morning to school with the constant worry of stray bullets and guns in irresponsible hands. Ads like these just contribute to the second-order harm that our lack of a real gun solution inflicts on society. YouTube's recommender should know enough about me to protect me from the psychological wear and tear (which results in wasted time).<br />
<br />
Second, maybe YouTube should not be protecting me, but exposing me to more. (Yes, I <i>am</i> of two minds, and the second is completely opposite.) If recommender systems recommend advertisements that are personalized to be repellent for users, it could be a force that drives subscriptions at a large scale. If enough other people react like me, we will soon be on the road to being able to fund the production and distribution of information based on quality and trust, funded by subscriptions, rather than on clicks.<br />
<br />
There is a chance that this ad is not a complete recommender system misfire. The Urban Carry Holster ad was not actually an utter mismatch for my tastes. They show that the holster was designed on the basis of a "user study", and I have certainly purchased a number of high quality real leather handbags in my day. It's the "detail" of putting the gun inside of it that freaked me out.<br />
<br />
So maybe it is a recommender system failure, or maybe it is the most important thing that recommenders have done for our online information ecosystem in years. Whichever of these points of view ends up winning, it is something worth thinking about.<br />
<br />
My only concern is the manipulation aspect: in order not to destroy trust with YouTube, I would appreciate knowing that the ads are optimized to increase YouTube Red subscriptions, and I am indeed being nudged.<br />
<br />
[1] A. Said, D. Tikk, K. Stumpf, Y. Shi, M. Larson, P. Cremonesi. 2012. <a href="http://ceur-ws.org/Vol-910/paper4.pdf">Recommender Systems Evaluation: A 3D Benchmark</a>. ACM RecSys 2012 Workshop on Recommendation Utility Evaluation: Beyond RMSE, Dublin, Ireland.<h2 style="text-align: left;">Down the Rabbit Hole: Greetings from a state of extreme information overload (2017-01-08)</h2>This morning, I innocently checked the news. Then I disappeared down the rabbit hole. One click following another, driven by the idea that around the next bend I would arrive at some kind of lasting understanding that would outlive today.<br />
<br />
When I realized I was in full information reading free fall, I started writing this blog post, just to record what was happening.<br />
<br />
To reconstruct the beginning of the experience, I asked myself what was the lead story on The Guardian when I opened it this morning.<br />
<br />
Do I really remember what happened two hours ago? First, I thought no. Then, I remembered it was something about the shooting in Fort Lauderdale. But what? The shooter was unhappy in some way. Let me go back to check, but whoops in the meantime, there is a fully different lead story...I can't go back to where I was...maybe Fort Lauderdale was not so important after all.<br />
<br />
Actually, no, I don't want to be reading about Fort Lauderdale. I land in Florida airports rather frequently and I don't need to be creating anxiety. Shouldn't be reading that one.<br />
<br />
Spent some time trying to get back to see the same "first page" that I saw this morning...clock ticking. It appeared not to be possible.<br />
<br />
What kind of insight will I arrive at that will outlive today?<br />
<br />
This morning became this afternoon as I dove into a certain column with the headline: "Moral panic over fake news hides the real enemy – the digital giants"<br />
<br />
<a href="https://www.theguardian.com/commentisfree/2017/jan/08/blaming-fake-news-not-the-answer-democracy-crisis">https://www.theguardian.com/commentisfree/2017/jan/08/blaming-fake-news-not-the-answer-democracy-crisis</a><br />
<br />
Hmmm. What exactly is "moral panic"?<br />
<br />
I read this:<br />
<br />
<a href="https://www.psychologytoday.com/blog/wicked-deeds/201507/moral-panic-who-benefits-public-fear">https://www.psychologytoday.com/blog/wicked-deeds/201507/moral-panic-who-benefits-public-fear</a><br />
<br />
Interesting. We learn:<br />
<br />
<i>"Moral panic has been defined as a situation in which public fears and state interventions greatly exceed the objective threat posed to society by a particular individual or group who is/are claimed to be responsible for creating the threat in the first place."</i><br />
<br />
However, that gets us nowhere on what "Moral panic over fake news hides the real enemy – the digital giants" is going to actually tell us. If there is a fear, it is related to the fact that we have no way of estimating an objective threat, and so by this definition it can't be moral panic.<br />
<br />
OK. Title doesn't make sense. Let's click anyway. Maybe this article will allow me to move forward on one of my more dominant streams of thought these days: the discourse on news and news reading behavior seems to assume that people have an unlimited amount of time and attention resources to consume news in a given day. How do we achieve a healthy and balanced news diet, if we don't have countless hours to spend?<br />
<br />
This stream of thought has led me to ask whether the time that we spend worrying about "fake news" should be spent thinking about something else. And the related questions: "What is that something?" and "Is the problem with fake news actually not that it is fake but that it is simply consuming time that we should be spending doing other things?"<br />
<br />
So I click. Falling, falling. The piece is interesting, but not what I expected.<br />
<br />
Yet I am reading ideas that I don't recall encountering before in such a form. I keep reading. Second to last paragraph is:<br />
<br />
<i>"The only solution to the problem of fake news that neither misdiagnoses the problem nor overpowers the elites is to completely rethink the fundamentals of digital capitalism. We need to make online advertising – and its destructive click-and-share drive – less central to how we live, work and communicate. At the same time, we need to delegate more decision-making power to citizens – rather than the easily corruptible experts and venal corporations."</i><br />
<br />
But how much does the author really know about the forces at play within the larger context that gives rise to online advertising? If there is going to be a "rethinking" there need to be "rethinkers" who are positioned to make changes. This piece seems to be implying that those "rethinkers" exist: but can they exert the required influence?<br />
<br />
OK. I could fall forever. I am just going to dig a bit more deeply into this <i>one</i> article, and then I am going to stop and do something else.<br />
<br />
Let's start with remembering: what exactly does "venal" mean again? Looked that up. "Open to bribery". Right. OK. <br />
<br />
To understand who the author might consider to be the "rethinkers", let's have a look at where the author is coming from, specifically, what he might know about neuroscience and psychology, i.e., <a href="https://www.scientificamerican.com/article/are-we-addicted-to-inform/">information addiction</a> and <a href="https://www.psychologytoday.com/blog/science-choice/201504/what-is-confirmation-bias">confirmation bias</a>, information literacy, and the science of complex systems. I started out by looking at the profile page of the author, here:<br />
<br />
<a href="https://www.theguardian.com/profile/evgeny-morozov">https://www.theguardian.com/profile/evgeny-morozov</a><br />
<br />
which links to his blog here:<br />
<br />
<a href="http://neteffect.foreignpolicy.com/">http://neteffect.foreignpolicy.com</a><br />
<br />
Which doesn't give me a blog, but rather a portal:<br />
<br />
<div id="container">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8gYXPU0eDISVwQ7SgZ3O1vfJFmqRw1v5A1hVbnGViTpqLvN6iLlG2v5ZDrdusgdxtdT6lZy_7sgiNtPgWwinpEUrWWyNeGos0SYUL72AgE02J5XSh6wc3Bu0zVrItaAhcusQxV6Pjf2o/s1600/welcome.png" imageanchor="1"><img border="0" height="143" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8gYXPU0eDISVwQ7SgZ3O1vfJFmqRw1v5A1hVbnGViTpqLvN6iLlG2v5ZDrdusgdxtdT6lZy_7sgiNtPgWwinpEUrWWyNeGos0SYUL72AgE02J5XSh6wc3Bu0zVrItaAhcusQxV6Pjf2o/s200/welcome.png" width="200" /></a></div>
<div id="container">
<br /></div>
<div id="container">
what is going on?</div>
<div id="container">
<br /></div>
<div id="container">
I decide to read all of the comments to see if anyone else had this problem.<br />
<br />
Whew. Lots of opinions there. Pretty interesting discussion. One comment states "We have norm of unexamined adoption". That's an interesting observation: How did those norms get formed in the first place? If we can figure that out, then we can maybe take some action there.<br />
<br />
At least two people commenting are pointing to the need for helping people develop critical thinking skills and the ability to verify information. That's another of my streams of thought lately: how to promote the practice of evaluating information sources, for example, with the <a href="https://prezi.com/f3rblq5m8-ei/ili-evaluating-information-resources/">CAARP test</a>.<br />
<br />
No one seems to be bothered by the broken link in the author profile of the author of the piece. Usually a broken link would point to a poorly maintained, and potentially less authoritative source. But this is The Guardian! Maybe I am seeing things?<br />
<br />
Then I spend some time on The Guardian website trying to figure out where to report a broken link. Lots of opportunities for suggesting corrections to content, and for securely passing information to The Guardian. Good to see. However, none for just saying, "Hey, the link is bad".<br />
<br />
OK. Time to take action. I posted the comment:<br />
<br />
<i>"Does anyone else find that the link to Morozov's Net Effect blog at the top of his Guardian profile page (https://www.theguardian.com/profile/evgeny-morozov) doesn't seem to really lead to the blog? It seems like The Guardian made a mistake, and that the link should be directing us here: https://foreignpolicy.com/category/net-effect Uncertainty about this link is hindering me in digging into the wider context of this piece."</i><br />
<br />
Time passes.<br />
<br />
Worrying that that comment will be interpreted as being negative about the piece. I'm not negative, just trying to get to the bottom of what the lasting message is for me.<br />
<br />
Time passes.<br />
<br />
I'm spending time on trying to understand why research on dopamine and information seeking seems to have fallen silent in the mainstream press after 2012, and on wondering why there is not a good website to explain complex systems. We need to rely on Wikipedia for so many of the related concepts like "preferential attachment" and "emergence".<br />
<br />
Why in the world does my Morozov piece feature one picture of Putin at the top and one picture of Trump in the middle? It is not about either of them. I don't think Morozov chose those.<br />
<br />
I am still falling...with also a feeling of having been sucked in.<br />
<br />
Time passes.<br />
<br />
This is about <i>one</i> piece that I read in the newspaper! I'm trying to form an opinion about <i>one single</i> opinion piece. What if I had tried to read the other ones as well? What if I were doing any serious fact checking?<br />
<br />
Greetings from a state of extreme information overload.<br />
<br />
Time passes.<br />
<br />
Is the conclusion that the limits of our time will ultimately always win? That we will drown in a state of information overload because it requires an afternoon to evaluate a single opinion piece?<br />
<br />
I am not so sure. In this case, I am planning to take action on my conclusions regarding the article and the things that the people are saying in the comments. As an information retrieval researcher, I see a crisis in information quality as a crisis at the core of our research field. As an instructor of a freshman information science course, I need to be able to describe best practices in information consumption behavior.<br />
<br />
There is a lot riding on this one article for me.<br />
<br />
In that respect, it is not wasted time.<br />
<br />
I arrive at the bottom with a loud bump.<br />
<br />
So the conclusion is, yes, our time is limited. We can't spend an entire afternoon examining everything that we read. The most important information is the information that we take action on. We need to seek out that information, and evaluate the heck out of it.<br />
<br />
If we are not planning to take action, read, but leave the information in suspended animation. For example, the article on Fort Lauderdale. Or: there is now an article about the Mob on the front page of the New York Times. I choose not to subject these to scrutiny, but neither will I take any action (including sharing those articles) on the basis of what I read.<br />
<br />
Looking at the length of this post, another obvious conclusion is that people should set aside more time for finding and consuming information. The information available online initially looks "free", but really we need to also count the price of our time. Information without verification is useless.<br />
<br />
Setting aside time requires asking the question, "What did we lose because we didn't choose to do something else instead?" In short, how can we more tightly link reading the news to tradeoffs and to tangible value?<br />
<br />
Now what about this little bottle?<br />
<br />
<a href="https://commons.wikimedia.org/wiki/File%3AAlice_drink_me.jpg" title="John Tenniel [Public domain], via Wikimedia Commons"><img alt="Alice drink me" height="200" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Alice_drink_me.jpg/256px-Alice_drink_me.jpg" width="109" /></a></div>
marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-25453936014739298192016-12-16T16:04:00.000+01:002016-12-17T04:16:36.546+01:00Advanced Bullshit Detection for your protection: Wise words on reading the news<i>There's been lots of news on the news in the news recently. Just like we try to take care of our bodies by eating healthy foods, we must take care of our societies by carefully consuming quality news. The importance of a high-quality, balanced media diet is independent of your political convictions. </i><br />
<i><br /></i>
<i>But how do we keep our news reading habits healthy? What do we do? This morning I came across a great video explaining four steps that everyone can take in order to achieve a healthier news diet. The video is an interview with <a href="http://www.jfki.fu-berlin.de/faculty/politicalscience/team/visitingfellows/knuepfer/index.html">Curd Knüpfer</a> at the <a href="http://www.fu-berlin.de/">Freie Universität Berlin</a> published by <a href="http://www.zeit.de/digital/internet/2016-12/fake-news-facebook-koalition-gesetz-bussgeld-gegendarstellung">Die Zeit with this article</a>. Since the video is in German, I provide my own translation of the four steps here. I tried to make the translation as accessible as possible. </i><br />
<i><br /></i>
<i>I call it "Advanced Bullshit Detection": these four steps are what we need to be spending our time doing in order to protect ourselves while reading the news.</i><br />
<br />
As a media consumer, you have to develop an attitude towards yourself so that you see yourself as someone who chooses news cautiously. There are guidelines that you can follow, and that you can make into habits:<br />
<br />
1. <b>Ask yourself questions about your own emotional reaction</b><br />
When I see an article or a news item that makes me particularly angry, one to which I have a very strong emotional reaction, one that makes me nervous or fearful, then I should stop and think. I should stop and ask "Wait a minute, what is this information actually saying, and why is it having this effect on me?"<br />
<br />
2. <b>Check the quality standards</b><br />
Develop an eye for what good, meaningful journalism is, and how news articles are created. This means that one can look at any form of journalistic reporting (it doesn't matter where it is from or its political direction). You can say "It's better if an article on a particular topic cites more than one source" or if more than one source is cited, "It is better that more than one perspective is represented."<br />
<br />
3. <b>Pay attention to the sources of the information</b><br />
Of course you need to pay attention to the sources. Can I, for example, trust someone who works for the Freie Universität Berlin? And, if I think that I can't: Why can't I? It also works the other way around. You can say, "Hey, there's someone who knows this topic relatively well!" It doesn't mean that you take everything they say at face value, but at least you can trace back where the person is from and who is paying them, etc.<br />
<br />
4. <b>Balance your media diet</b><br />
Balancing your media diet is a luxury that we have because we live in a world in which media is digital. It is relatively easy for us to access a large number of different sources of news, and we should also take advantage of the diversity available to us.<br />
<br />
<i>Thank you <a href="http://www.jfki.fu-berlin.de/faculty/politicalscience/team/visitingfellows/knuepfer/index.html">Curd Knüpfer</a> for these wise words (did my best not to lose anything in translation).</i>marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-24576628934322129422016-11-13T02:13:00.001+01:002016-11-16T15:06:35.984+01:00Big Data as Fast Data<i>Last Thursday I was at a "sounding board" meeting of the <a href="https://www.knaw.nl/nl/adviezen/adviesraden-en-adviescommissies/organisaties/14973/akademiaorganisation_membersview">Big Data Commission</a> of the <a href="https://en.wikipedia.org/wiki/Royal_Netherlands_Academy_of_Arts_and_Sciences">Royal Netherlands Academy of Arts and Sciences</a>. This post highlights some points that I have continued to reflect upon since the meeting. </i><br />
<br />
<a href="https://en.wikipedia.org/wiki/Big_data">According to Wikipedia</a>, "Big Data" are data sets that are too large and too complex for traditional data processing systems to work with them. Interestingly, the people who characterize "Big Data" in terms of volume, variety and velocity, often underemphasize, as the Wikipedia definition does, the aspect of velocity. Here, I argue it is important not to forget that Big Data is also Fast Data.<br />
<br />
<b>Fast Streams and Big Challenges</b><br />
Because I work in the area of recommender systems, I quite naturally conceptualize problems in terms of a data stream rather than a data set. The task a stream-based recommender system addresses is the following: there is a stream of incoming events and the goal is to make predictions on the future of the stream. There are two issues that differentiate stream-based views of data from set-based views.<br />
<br />
First: the temporal ordering in the stream means that ordinary cross-validation cannot be applied. A form of A/B testing must be used in order to evaluate the quality of predictions. Online A/B testing has implications for the replicability of experiments.<br />
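The contrast with cross-validation can be made concrete with a "test-then-train" (prequential) evaluation loop, a standard way to evaluate stream-based predictors while respecting temporal order. The toy stream and majority-class model below are invented purely for illustration:

```python
# Minimal prequential ("test-then-train") evaluation sketch.
# Unlike cross-validation, events are processed strictly in arrival order:
# each event is first used for testing, and only afterwards for training.

def prequential_eval(stream, model):
    """stream: iterable of (features, label) pairs in temporal order."""
    errors = 0
    total = 0
    for features, label in stream:
        prediction = model.predict(features)   # test on the event first...
        errors += int(prediction != label)
        model.update(features, label)          # ...then train on it
        total += 1
    return errors / total if total else 0.0

class MajorityClassModel:
    """Toy online model: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, features):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, features, label):
        self.counts[label] = self.counts.get(label, 0) + 1

stream = [({}, "a"), ({}, "a"), ({}, "b"), ({}, "a")]
print(prequential_eval(stream, MajorityClassModel()))  # → 0.5
```

Because each event is tested before the model has seen it, the evaluation never leaks future information into the past, which is exactly what a temporally shuffled cross-validation split would do.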
<br />
Second: at any given moment, you are making two intertwining predictions. One is the prediction of the future of the stream. The other is how much, if any, of the past is actually relevant in predicting the future. There are two reasons why the information in the past stream may not be relevant to the future: external and internal factors.<br />
<br />
External factors are challenging because you may not know they are happening. A colleague doing medical research recently told me that when deductibles go up people delay going to the doctor, and suddenly the patients that are visiting the doctor have different problems, simply because they delayed their visit. Confounding variables of course exist for set-based data. However, if you are testing stream-based prediction online, you can't simply turn back the clock and start investigating confounding variables: it's already water under the bridge. As much as you may be recording, you cannot replay all of reality as it happened.<br />
<br />
Internal factors are even tougher. Events occurring in the data stream influence the stream itself. A common example is the process by which a video goes viral on the Web. In this case, we have a stream of events consisting of viewers watching the video. Because people like to watch videos that are popular (or are simply curious about what everyone else is watching) events in the past actually serve to create the future, yielding an exponentially growing number of views. These factors can be understood as feedback loops. Another important issue, which occurs in recommender systems, is that the future of the stream is influenced by the predictions that you make. In a recommender system, these predictions are shown to users in the form of recommended items, and the users create new events by interacting with these items. The medical researcher is stuck with this effect: she cannot decide not to cure patients, just because it will create a sudden shift in the statistical properties of her data stream.<br />
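The feedback-loop effect can be sketched with a toy simulation in which the probability of the next view grows with the number of past views. The base rate and feedback strength below are arbitrary illustrative values, not estimates of any real platform:

```python
import random

# Toy feedback-loop simulation: the probability that the next viewer watches
# the video grows with the number of past views ("popularity begets popularity").
# base_rate and feedback are invented illustrative parameters.

def simulate_views(steps, base_rate=0.01, feedback=0.001, seed=42):
    random.seed(seed)
    views = 0
    history = []
    for _ in range(steps):
        p = min(1.0, base_rate + feedback * views)  # past events shape the future
        if random.random() < p:
            views += 1
        history.append(views)
    return history

history = simulate_views(2000)
# Growth is slow while views are few; as views accumulate, p rises
# and growth accelerates, mimicking a video "going viral".
print(history[100], history[-1])
```

The key property is that each event changes the distribution generating later events, so statistics estimated on an early window of the stream do not describe the later stream.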
<br />
<b>Time to Learn to Do it Right</b><br />
In short, you are trying to predict and also to predict whether you can predict. We still call it "Big Data", but clearly we are at a place where the assumption that data greed pays off ("There's no data like more data") breaks down. Instead, we start to consider the price of Big Data failure ("The bigger they are, the harder they fall").<br />
<br />
In a recent article <a href="https://www.wired.com/2016/11/trumps-win-isnt-death-data-flawed-along/">Trump's Win Isn't the Death of Data---It Was Flawed All Along,</a> Wired concluded that "...the data used to predict the outcome of one of the most important events in recent history was flawed." But if you think about it: of all the preposterous statements made during the campaign, no one proposed that the actual election be cancelled since Big Data could predict its outcome. There are purposes that Big Data can fulfill, and purposes for which it is not appropriate.<br />
<br />
The Law of Large Numbers forms the basis for reliably repeatable predictions. For this reason, it is clear that Big Data is not dead. The situation is perhaps exactly the opposite: Big Data has just been born. We have reasons to believe in its enormous usefulness, but ultimately its usefulness will depend on the availability of people with the background to support it.<br />
<br />
There is a division between people with a classic statistics and machine learning background who know how to predict (who may even have the technical expertise to do it at large scale) and people who, on top of a classical background, have the skills to approach the question of when it even makes sense to be predicting. Only the latter are qualified to pursue big data.<br />
<br />
The difference is perhaps a bit like the difference between snorkeling and scuba diving. Both are forms of underwater exploration, and many people don't necessarily realize that there is a difference. However, if you can snorkel, you are still a long way from being able to scuba dive. For scuba diving, you need additional training, more equipment, and a firm grasp of principles that are not necessarily intuitive, such as the physiological effects of depth and the wisdom of redundancy. There is a lot to be achieved on a scuba dive that can't be accomplished by mere snorkeling, but the diver needs resources to invest, and above all needs the time to learn to do it right.<br />
<br />
<b>No Fast Track to Big Data</b><br />
These considerations lead to the realization that although Big Data streams may be in and of themselves incredibly quickly changing, the overall process of making Big Data useful is, in fact, very slow. Working in Big Data requires an enormous amount of training going beyond a traditional data processing background.<br />
<br />
Gaining the expertise needed for Big Data also requires understanding of domains that lie outside of traditional math and computer science fields. All work in Big Data areas must start from a solid ethical and legal foundation. Civil engineers are in some cases able to lift a building to add a foundation. With Big Data, this possibility is excluded.<br />
<br />
To illustrate this point, it is worth returning to consider the idea of replacing the election with a group of data scientists carrying out Big Data calculations. It is perhaps an extreme example, but it is one that makes clear that ethical and legal considerations must come before Big Data. The election must remain the election because on its own a Big Data calculation has no way of achieving the social trust necessary to ensure continuity of the government. For this we need a cohesive society and we need the law. Unless Big Data starts from ethics and from legal considerations, we risk wasting time and effort developing a large number of algorithms that are solving the wrong problems.<br />
<br />
Training data scientists while ignoring the ethical and legal implications of Big Data is a short cut that is tempting in the short run, but can do nothing but harm us in the long run.<br />
<br />
<b>Big Data as Slow Science</b><br />
The amount of time and effort needed to make Big Data work might lead us to expect that Big Data should yield some sort of Big Bang, a real scientific revolution. In fact, however, it is the principles of the same centuries-old <a href="http://plato.stanford.edu/entries/scientific-method">scientific method</a> that we return to in order to define Big Data experiments. In short, Big Data is a natural development of existing practices. Some have even argued that data-driven science pre-dated the digital age, e.g., this essay entitled <a href="http://serc.carleton.edu/earthandmind/posts/4thpardigm.html">Is the Fourth Paradigm Really New?</a><br />
<br />
However, it would also be wrong to characterize Big Data as business as usual. A more apt characterization is as follows: Before the Big Data age, scientific research proceeded along the conventional path: researchers would formulate their hypothesis, design their experiment, and then, as the final step, collect the data. Now, the path starts with the data, which inspires and informs the hypothesis. The experimental design must compensate for the fact that the data was "found" rather than collected.<br />
<br />
Given this state of affairs, it is easy to slip into the impression that Big Data is "fast" in the sense that it speeds up the process of scientific discovery. After all, the data collection process, which in the past could take years, can be carried out quickly. If the workflow is implemented, a new hypothesis could be investigated in a matter of hours. However, it is important to consider how the speed of the experiment itself influences the way in which we formulate hypotheses. Because there is little cost to running an experiment, there is little incentive to put a great deal of careful thought and consideration into which hypotheses we are testing.<br />
<br />
A good hypothesis is one that is motivated by a sense of scientific curiosity and/or societal need, and that has been informed by ample amounts of real-world experience. If there is negligible additional cost to running an additional experiment, we need to find our motivation for formulating good hypotheses elsewhere. The price of thoughtlessly investigating hypotheses merely because they can be formulated given a specific data collection is high. Quick and dirty experiments lead to mistaking spurious correlations for effects, and yield insights that fall short of generalizing to meaningful real-world phenomena, let alone use cases.<br />
<br />
In sum, we should remember the "V" of velocity. Big Data is not just data sets, it's also data streams, which makes Big Data also Fast Data. Taking a look at data streams makes it easier to see the ways in which Big Data can go wrong, and why it requires special training, tools, and techniques.<br />
<br />
Volume, variety, and velocity have been extended by some to include other "Vs" such as Veracity and Value. Here, I would like to propose "Vigilance". For Big Data to be successful we need to slow down: train people with a broad range of expertise, connect people to work in multi-skilled teams, and give them the time and the resources needed in order to do Big Data right. In the end, the value of Big Data is the new insights that it reveals, and not the speed at which it reveals them.marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-20006332590307145112016-10-18T00:00:00.000+02:002017-03-26T12:19:07.392+02:00The Societal Impact of Multimedia Research: ACM MM 2016 Brave New Ideas Track (Papers and Slides)<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy-06c5zVEG8UNAHgq68So1GyrBP6h4nEThQBgfNj9u2H3_lPyBxam_Th66qjHVL28_E8Yz5z78OKfTzkFazVnh2mqhzJcX4cy0tmGjP0o-fxb8T8y0QGs0_BCkSQWkJwimRCflZIqOzg/s1600/BNI_slide.jpg" imageanchor="1"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy-06c5zVEG8UNAHgq68So1GyrBP6h4nEThQBgfNj9u2H3_lPyBxam_Th66qjHVL28_E8Yz5z78OKfTzkFazVnh2mqhzJcX4cy0tmGjP0o-fxb8T8y0QGs0_BCkSQWkJwimRCflZIqOzg/s400/BNI_slide.jpg" width="400" /></a><br />
<div>
At ACM Multimedia 2016, the Brave New Ideas Track was devoted to the theme "Societal Impact of Multimedia Research". In the <a href="http://www.acmmm.org/2016/?page_id=239">Call for Papers</a>, we challenged authors to be brave in pursuing topics that have a direct impact on people's lives. </div>
<div>
<br /></div>
<div>
What is brave about multimedia research with societal impact? The answer is simple: it takes <i>much more time</i>. To pursue work with direct societal impact it is necessary to work together with other disciplines, create new data resources, and develop new evaluation methodologies that demonstrate success with respect to socially relevant criteria. </div>
<div>
<br /></div>
<div>
It also takes <i>time</i> for researchers to reach the insight that important new scientific questions arise from directly attempting to create solutions for societal problems. For example, concerns about privacy are currently motivating researchers to turn away from pursuit of "data greedy" algorithms to study how to get more out of less data.</div>
<div>
<br /></div>
<div>
We recommend reading the papers of the track to understand other interesting scientific problems that have been opened up by researchers who have the courage to work in these high societal-impact areas:</div>
<div>
<br /></div>
<div>
<div>
Mengfan Tang, Siripen Pongpaichet, and Ramesh Jain. 2016. Research Challenges in Developing Multimedia Systems for Managing Emergency Situations. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 938-947. [<a href="http://dl.acm.org/citation.cfm?doid=2964284.2976761">ACM DL link</a>][<a href="http://www.slideshare.net/jain49/multimedia-rescue-161018">slides</a>]</div>
<div>
<br /></div>
<div>
Andrea Castelletti, Roman Fedorov, Piero Fraternali, and Matteo Giuliani. 2016. Multimedia on the Mountaintop: Using Public Snow Images to Improve Water Systems Operation. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 948-957. [<a href="http://dl.acm.org/citation.cfm?doid=2964284.2976759">ACM DL link</a>][<a href="https://re.public.polimi.it/retrieve/handle/11311/1002401/150190/multimedia-mountaintop-public.pdf">paper</a>][<a href="http://www.slideshare.net/fraterna/multimedia-on-the-mountaintop-presentation-at-acm-mm2016">slides</a>]</div>
<div>
<br /></div>
<div>
Alexis Joly, Hervé Goëau, Julien Champ, Samuel Dufour-Kowalski, Henning Müller, and Pierre Bonnet. 2016. Crowdsourcing Biodiversity Monitoring: How Sharing your Photo Stream can Sustain our Planet. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 958-967. [<a href="http://dl.acm.org/citation.cfm?doid=2964284.2976762">ACM DL link</a>][<a href="https://hal-lirmm.ccsd.cnrs.fr/hal-01373762/document">paper</a>][<a href="http://www.slideshare.net/maranlar/crowdsourcing-biodiversity-monitoring-how-sharing-your-photo-stream-can-sustain-our-planet">slides</a>] See also: the <a href="https://play.google.com/store/apps/details?id=org.plantnet&hl=en">Pl@ntNet App</a>.</div>
<div>
<br /></div>
<div>
Michael Riegler, Mathias Lux, Carsten Gridwodz, Concetto Spampinato, Thomas de Lange, Sigrun L. Eskeland, Konstantin Pogorelov, Wallapak Tavanapong, Peter T. Schmidt, Cathal Gurrin, Dag Johansen, Håvard Johansen, and Pål Halvorsen. 2016. Multimedia and Medicine: Teammates for Better Disease Detection and Survival. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 968-977. [<a href="http://dl.acm.org/citation.cfm?doid=2964284.2976760">ACM DL link</a>][<a href="http://home.ifi.uio.no/paalh/publications/files/acmmm2016-eir.pdf">paper</a>][<a href="http://www.slideshare.net/maranlar/multimedia-and-medicine-teammates-for-better-disease-detection-and-survival">slides</a>]</div>
</div>
<div>
<br /></div>
<div>
<!--StartFragment-->
<br />
<div style="direction: ltr; margin-bottom: 0pt; margin-left: 0in; margin-top: 0pt; unicode-bidi: embed; word-break: normal;">
Contact: ACM MM 2016 BNI Chairs: Martha Larson (TU Delft and Radboud University Nijmegen) and Hari Sundaram (University of Illinois)</div>
</div>
marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-78853151070802254992016-08-16T11:10:00.002+02:002016-08-16T11:45:51.222+02:00Music as Technology: Sad song of missed opportunities for music to do what music does wellLast week my colleagues Andrew Demetriou and Cynthia Liem presented a paper at ISMIR 2016 (the 17th International Society for Music Information Retrieval Conference) entitled "Go with the Flow: When Listeners Use Music as Technology" [1]. The idea of this paper is that listeners use music as a tool that is directed to accomplishing a task.<br />
<br />
In the paper, we point to the phenomenon of people using music to put themselves into a flow state: "listeners make a conscious decision to expose themselves to the experience of music to alter their internal state in order to achieve a goal that they have set for themselves." With this paper, we want to encourage the development of music information retrieval technology, including recommender systems, that supports listeners in finding the music that they need in order to support their goals.<br />
<br />
It is a chicken and egg problem. Unless systems are there that support users in finding music that allows them to reach their goals, it is hard to study the phenomenon at large scale. Unless we understand the phenomenon, it is hard to develop these systems. The first leap remains one of, well, basically, faith: faith that, given the evidence we already have on hand, we should push for music information retrieval that recognizes the positive potential of music for allowing people to best use their brains.<br />
<br />
We try to keep the focus on the positive potential, but the dark underside of the situation is what happens if our world continues on, with the mainstream being unaware of the effect of music on the brain. The wrong music can prevent certain brain states, as easily as the right music can promote them. When music is not in the control of an individual (such as in a restaurant, cafe, or public place) serious thought is needed about what music is playing and how to play it. Otherwise the music is putting people in a brain state that is neither productive, nor even pleasant.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9EcA0NZJZN-6eWcdbwRabccIEQhz9cvu2XYkuUllD_4jGAyv2ELOGPV9L2cL_r7JsKLHnH2lmXDEKYN09rXuyjm2NCO-LKlr1wQgLSKM2xgYEEVQJkQZs3a_51I9vadGofWsMONYkiEI/s1600/stop_the_noise_39215.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="138" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9EcA0NZJZN-6eWcdbwRabccIEQhz9cvu2XYkuUllD_4jGAyv2ELOGPV9L2cL_r7JsKLHnH2lmXDEKYN09rXuyjm2NCO-LKlr1wQgLSKM2xgYEEVQJkQZs3a_51I9vadGofWsMONYkiEI/s200/stop_the_noise_39215.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Stop the noise by Surko</td></tr>
</tbody></table>
In Chicago, I met a man cleaning tables in a restaurant.<br />
<br />
The chain had obviously put a lot of time and money into the decor.<br />
<br />
The music was a loud mix of alternating genres. Understanding people speaking was a strain.<br />
<br />
When I asked the man about it he got emotional.<br />
<br />
"I'm teaching myself guitar," he said. "There are a couple of
good country pieces mixed in, but I know them by heart. The whole thing
plays over and over again."<br />
<br />
"I could tell them what they really should be playing!"<br />
<br />
We look to a future in which the people making the music decisions about places like this restaurant care about their sound atmosphere as much as they care about the visual impression of their decor. The music should not only be geared to customers who stop by for lunch, but also to employees, who spend their days listening to it. The experience (and arguably also mental health) of people spending time in the restaurant could be enhanced if the music released them from the grind of repetition. Music may not always allow people to achieve flow state, but it can support them in enjoying being where they are, as much as possible.<br />
<br />
Music decision makers should sit up and realize that music matters: music doesn't happen "out there" somewhere, but rather happens inside every person who is listening to it. The leap of faith is not a large one, it just requires listening to people who know that music is important, and asking them how to make things better.<br />
<br />
[1] Demetriou, A., Larson, M., and Liem, C. Go with the Flow: When Listeners Use Music as Technology, ISMIR 2016.<br />
<a href="https://wp.nyu.edu/ismir2016/wp-content/uploads/sites/2294/2016/07/068_Paper.pdf">https://wp.nyu.edu/ismir2016/wp-content/uploads/sites/2294/2016/07/068_Paper.pdf</a>marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-88795877258170559962016-07-28T00:22:00.002+02:002016-07-29T18:51:08.468+02:00How to use the word "subjective" in multimedia content analysis <div class="MsoNormal">
Multimedia content analysis is devoted to the automatic
processing of video, image, audio, and text content with the purpose of describing
it, or otherwise associating it with information that will make it findable, and
also useful, to users. Previously, I have urged multimedia content analysis
researchers to avoid the word “subjective” and instead formulate their insights
in terms of inter-annotator agreement with respect to the data that they are
using and the protocol that they give to the annotators who are providing
the target labels. Since we don’t seem to be inclined to stop using the word “subjective”
soon, it makes sense to formulate some guidelines on how to use it "safely".<br />
<i><br /></i>
<i>Best practice for the use of the word “subjective”: When the
word "subjective" is used, it should first be defined.</i></div>
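<div class="MsoNormal">
As a concrete alternative to simply calling a labeling task "subjective", inter-annotator agreement can be reported, for example with Cohen's kappa. Here is a minimal sketch; the two annotators and their labels are invented for illustration:

```python
from collections import Counter

# Minimal Cohen's kappa for two annotators labeling the same items:
# observed agreement corrected for the agreement expected by chance.

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each label's marginal frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators deciding whether eight images show "violence".
a = ["yes", "yes", "no", "no", "no", "yes", "no", "no"]
b = ["yes", "no", "no", "no", "no", "yes", "no", "yes"]
print(round(cohens_kappa(a, b), 3))  # → 0.467
```

A kappa near 0 means agreement is no better than chance; values closer to 1 indicate that the annotation protocol yields reproducible labels.</div>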
<div class="MsoNormal">
<br />
The word "subjective" has different definitions. It’s
not particularly productive to fix any one way of using it as “the only right way”. Instead, when using the word "subjective" you should simply declare
which definition you are using, and you will avoid a lot of unproductive
confusion. You do not want to risk that you use “subjective” in one sense, and
your reader/listener interprets it in another sense.</div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvpWHYrNuW88v7fUj_kGL44yedcwH981gBA80yvgc0xpXnmBqvRYhNtwJ3i_-Npb4g-D1kySfrAI8MxpCiN83XwmdeNLoKWU96yyG2uKFUlLIdVj_0yajuf0_rPP2XVOJzFyDkC52jr1s/s1600/Subjective_MerriamWebster.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvpWHYrNuW88v7fUj_kGL44yedcwH981gBA80yvgc0xpXnmBqvRYhNtwJ3i_-Npb4g-D1kySfrAI8MxpCiN83XwmdeNLoKWU96yyG2uKFUlLIdVj_0yajuf0_rPP2XVOJzFyDkC52jr1s/s320/Subjective_MerriamWebster.jpg" width="320" /></a></div>
<div class="MsoNormal">
<br />
We can gain further understanding of why it is important to
"define well before use" by examining the <a href="http://www.merriam-webster.com/dictionary/subjective">dictionary entry for “subjective” provided by Merriam-Webster</a>. Here, you can see the many meanings that “subjective” can take on. I haven’t observed any issues caused by definitions 1 or 2. Multimedia content analysis research
is generally not interested in these definitions. Where we get into trouble is with
3-5, so I will focus on these.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Let’s start with definition 4c:<span style="mso-spacerun: yes;"> </span><i>“arising out of or identified by means of one's perception of one's own
states and processes”</i> This definition of subjective is related to the
conceptualization of a situation as being exclusively determined by the point of view of the
“subject”, i.e., the person who is undergoing the experience of perceiving something. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Such a conceptualization, in the case of certain situations, is standard, and when we communicate
with each other, we don’t even think about the fact that we assume it. Let’s take a closer look at how this conceptualization works. When we use language, we rely on an unspoken agreement
that certain phenomena (for example, the emotion that music evokes in a person) are subjective.
Specifically, the agreement means that the way in which we understand the world
gives all listeners the power to determine what they feel when listening to music (i.e., induced emotion) for themselves.<br />
<br />
Simply stated: if someone says, “This music makes me so happy”, it is nonsensical for me to
assert, “No, it doesn’t”. I might say this to tease someone, but it is clear that I am not using language in a standard way. An emotion felt while listening to music can only be asserted by the subject, and I, who
am not in the subject’s mind, do not have the power to originate a meaningful statement on the matter. It is not a trivial point: Without this shared understanding, the convention/assumption of subjectivity behind "This music makes me happy", the function of language would break down and we would have failed to communicate.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Here’s where things can go wrong for a researcher working in the area of multimedia content analysis. Imagine you are collecting multimedia content labels from a group of annotators who are judging the content,
and, at the end of the experiment, you declare, “The results show that the
phenomenon we are studying is subjective”. Readers who are using definition 4c of subjective will find this conclusion invalid. The reason is that under this definition, “subjective”
is something that is established ahead of time by convention: it cannot be
determined experimentally. (Full disclosure: for me this is the preferred definition of "subjective", because it is the most literal interpretation. The word "subjective" contains the word "subject". I also prefer it since it ensures the sanctity of the private world of the individual, and the right of the individual to an independent voice.)</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Moving to 4b: <i>“arising from conditions within the brain or
sense organs and not directly caused by external stimuli”</i> This definition is
not so interesting for multimedia researchers: we study multimedia content, which is an
external stimulus. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Now, we go on to definition 4a (1): <i>“peculiar to a particular individual: personal &lt;subjective judgments&gt;”</i>
This definition of subjective is related to the idea that each individual has
their own unique view. (<a href="http://www.merriam-webster.com/dictionary/peculiar">Merriam-Webster's Definition 1 of "peculiar"</a> is <i>"characteristic of only one person, group, or thing"</i>.) Under this definition, if something is "subjective", it means that everyone disagrees with everyone else. This
definition is also not so interesting for multimedia researchers: if everyone
has their own completely different interpretation, then we are
lost: we cannot hope to build algorithms that generalize over the different
meanings that people find in multimedia. Until the field of multimedia starts working
extensively on systems used only by a single person, this definition of
subjective is probably not one that will be used often.<br />
<br />
Note that the field of recommender systems strives to develop personalized algorithms, and uses evaluation methodologies that assess whether personalized predictions are successful. However, even recommender systems rely on the fact that people are similar to each other. In a world populated exclusively with utterly unique individuals, collaborative filtering algorithms would necessarily fail.</div>
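The dependence of collaborative filtering on users resembling each other can be seen directly in how it works: a neighbor is weighted by the similarity of their ratings, for example cosine similarity over co-rated items. A toy sketch (all users, items, and ratings are invented for illustration):

```python
import math

# Invented user-item ratings.
ratings = {
    "ann":  {"song1": 5, "song2": 3, "song3": 4},
    "bob":  {"song1": 4, "song2": 3, "song3": 5},
    "carl": {"song1": 1, "song2": 5, "song3": 2},
}

def cosine_sim(u, v):
    # Similarity computed only over the items both users rated.
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

print(cosine_sim(ratings["ann"], ratings["bob"]))   # high: similar taste
print(cosine_sim(ratings["ann"], ratings["carl"]))  # lower: dissimilar taste
```

If every user's ratings were "peculiar" in the 4a (1) sense, no neighbor would ever score high, and the neighborhood on which the prediction is based would be empty or meaningless.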
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
More helpful is definition 4a (2):<span style="mso-spacerun: yes;"> </span><i>“modified or affected by personal views, experience, or
background”</i> This definition of "subjective" is often implicitly assumed in multimedia content analysis. People’s
interpretations are affected by what they know, the opinions they hold, and the
life experience that they have had. These factors can lead to there being a
multitude of different interpretations that apply to certain multimedia
content. However, in contrast to the situation above with definition 4a (1), we
are not assuming that everyone has their own “peculiar” interpretations. It makes
sense for us to try to create systems that generalize or predict meaning, only
in the case that we are not dealing with exclusively unique interpretations. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
We can see 4a (2) as closely related to 3b:<i> “relating to or
being experience or knowledge as conditioned by personal mental characteristics
or states”</i></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
With both of these definitions, 4a (2) and 3b, we can reasonably have hope
that we can find islands of consistency in the perceptions of users of
multimedia (and in the labels of our annotators). Within these islands we can make stable inferences that will be
useful to users.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Let’s check again if, under these definitions, you can make a statement in your paper,
“The results show that the phenomenon we are studying is subjective”. This time
you can. But in order to do so, you need to have an experiment that shows that the background of
the users is what is causing your classifier not to give you stable
predictions. Otherwise, it might be the case that your classifier just has not
been well designed or trained.<br />
<br />
You also need to provide evidence that the protocol that your annotators are using to make judgements is not unduly steering people to diverse interpretations. Your protocol should put people reasonably on the same page, and then ask them for judgements, at all times being careful not to ask "leading" questions, cf. [1, 2]. For some research work, you might not be using a protocol. Many tasks involve "found" labels such as tags. In this case, you need to state the assumptions that you are making concerning the original labeling context, including the reasons for which the labels were assigned.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
With any definition of subjective, it is important to
strictly avoid arguing along these lines: “This phenomenon is subjective, and
therefore it is not important and we should not be studying it.” </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Scientifically, there is no a priori reason to prioritize the “objective” over
the “subjective” if we use definitions 4a (2) and 3b. It is true that we tend to study phenomena with high
inter-annotator agreement since these are easier to get a handle on. However, at the same
time we remain aware that this tendency steers us dangerously close to the famous
story of <a href="http://quoteinvestigator.com/2013/04/11/better-light/">Nasreddin Hodja who looks for his ring outside, since it is too dark inside where he lost it</a>. In short, define “subjective”, but never use it as an excuse for failure or avoidance.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To drive that particular point home: The message is "Keep up your guard". Your problem should arise from the needs
of users. Practically speaking, the problem you choose will be influenced by
your ability to access the resources needed to study it, including carrying out
a well designed, conclusive experiment. It will not, however, be influenced by your personal decision that something is "subjective".</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Next, we turn to definition 3a: <i>“characteristic of or belonging to
reality as perceived rather than as independent of mind.”</i> Using this definition
is dangerous. It forces you to take a position on the difference between effects
that are real, and effects that are imagined. As scientists, we determine this
difference experimentally. We do not presume it. Unless we are undertaking
experiments directed at establishing this difference, it makes sense to steer clear of
this definition. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Finally, definition 5: <i>“lacking in reality or substance”</i> The same
comment applies as in the case of definition 3a. We cannot a priori say whether patterns that can be found in multimedia content lack reality or substance.<span style="mso-spacerun: yes;"> </span>If we don’t find evidence for the
reality of some phenomenon in our data, it simply means that there is no
evidence for its reality in our data. Lack of observation does not disprove
existence. We must guard ourselves against jumping to conclusions. Again, this is a definition to be avoided, unless you are actually directly
investigating the nature of reality. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
As researchers in the area of multimedia content analysis,
we must carefully keep ourselves from creating our own realities: the reality we assume must be the reality (possibly multiple realities) of the users that we serve<span style="font-family: "cambria"; font-size: 16px;">—</span>all of them. The fact that we do not necessarily understand this reality fully, or have the type of information or data that would capture it in its complexity, richness, and continuous, rapid evolution, is a challenge that we face. This challenge is inherent to the types of algorithms and technologies that we design and develop.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Other posts in this blog on subjectivity:<br />
<br />
<a href="http://ngrams.blogspot.com/2011/08/subjectivity-vs-objectivity-in.html">http://ngrams.blogspot.com/2011/08/subjectivity-vs-objectivity-in.html</a><br />
<a href="http://ngrams.blogspot.com/2011/11/affect-and-concepts-for-multimedia.html">http://ngrams.blogspot.com/2011/11/affect-and-concepts-for-multimedia.html</a><br />
<a href="http://ngrams.blogspot.com/2016/02/mediaeval-2015-insights-from-last-years.html">http://ngrams.blogspot.com/2016/02/mediaeval-2015-insights-from-last-years.html</a></div>
<div class="MsoNormal">
<br />
<span style="font-size: x-small;">[1] Larson, M., Melenhorst, M., Menéndez, M. and Peng Xu. Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content. In: Ionescu, B. et al. Fusion in Computer Vision – Understanding Complex Visual Content, Springer, pp. 229-269, 2014.</span><br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;">[2] M. Riegler, V. R. Gaddam, M. Larson, R. Eg, P. Halvorsen and C. Griwodz, "Crowdsourcing as self-fulfilling prophecy: Influence of discarding workers in subjective assessment tasks," 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), Bucharest, 2016, pp. 1-6.</span><br />
</div>
marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-45081695887145386012016-06-18T22:31:00.004+02:002016-06-18T23:13:23.611+02:00Multimedia analysis out of the box: new applications and domains<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMUrDXVj8c_U6mOG99km15lNLAxa3nl_6E44nJez8oxgFlpJHNYSiOOfmYg2IW3XJj9rjriZ4vuJJ40ebFTKQ6IRjDHvALDR0zYJER4IlVj8I3YJ6yarF3MNXqkophpzQpXxwWWsAda9o/s1600/8065541239_c462e661f8_b.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMUrDXVj8c_U6mOG99km15lNLAxa3nl_6E44nJez8oxgFlpJHNYSiOOfmYg2IW3XJj9rjriZ4vuJJ40ebFTKQ6IRjDHvALDR0zYJER4IlVj8I3YJ6yarF3MNXqkophpzQpXxwWWsAda9o/s200/8065541239_c462e661f8_b.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Flickr: Tom Magliery</td></tr>
</tbody></table>
This blogpost summarizes the panel on the third day of the 14th International Workshop on Content-based Multimedia Indexing <a href="http://cbmi2016.upb.ro/">CBMI 2016</a>. It includes both statements by the panelists and comments coming from the audience. I was the panel moderator, and was also taking notes as people were speaking (any error in reproducing what people said here is strictly my own).<br />
<br />
The panel was structured into three rounds roughly related to the past, present, and future of multimedia analysis research. Each round had an “opener” that the panelists were asked to respond to, and then continued in free form, with the audience also contributing.<br />
<br />
<b>First round:</b> The panelists were asked to discuss, “A past vision (that you have had during the last 20 years) for a multimedia analysis application that came to be.”<br />
<br />
The early work of <a href="http://grouplens.org/">GroupLens</a> started a user revolution. It was great to have recommender systems break onto the scene. Their introduction shifted the focus of the community of researchers, also those studying information/multimedia access, from pure computation to involving users. This shift was possible because computers could collect user interactions, providing researchers with large sets of interactions to work with. Recommender systems introduced the key idea that users can benefit from other users, and this idea has come into its own.<br />
<br />
Historically, multimedia indexing started with spoken content indexing. (This statement carried the “footnote” that the panelists and the panel moderator all have a speech background.) In the past years, we have seen the maturation of speech and language technology. Now we are on the brink of systems that index all spoken information in multimedia. (But let’s keep breathing in the meantime.)<br />
<br />
The panel noticed that it is easier to name past visions that still have not completely come to be. Examples were:<br />
<br />
<i>First person video</i>: In the late 1990’s, video life logging started. The goal was to summarize daily life, and to aid memory and remembering. Privacy is a real stumbling block for this vision. However, now we are seeing first person cameras like GoPro: so perhaps video life logging is here, but it is not exactly what we thought it was going to be.<br />
<br />
<i>Users</i>: Ten years ago we were developing algorithms for applications, but there was a sense that they would never be put to use. The field of multimedia analysis is now more user centered, although not yet completely so: we are on our way. Sometimes it’s not gaining 5% MAP that makes a product usable. Instead, we need to think along different lines.<br />
<br />
<i>Education</i>: The panel was in agreement that we have yet to see multimedia reach its potential as a tool for education. This could and should be the century of education!<br />
<br />
In the early 1990’s multimedia retrieval and spoken content retrieval were intended to support education. Today, we see that education is still mainly about books. MOOCs and online learning resources are growing in popularity, but we are still waiting for multimedia indexing to really contribute to education at large scales.<br />
<br />
We used to have the vision that kids should be able to play with information and to communicate with each other as part of studying and learning: These types of applications were fun. What happened to this kind of work? It is a shame that this hasn’t really been put to mainstream use: Is this the responsibility of the multimedia people?<br />
<br />
Well, yes. We are all teachers in a way: Why don’t we eat our own dogfood? Looking at this conference, our presentations are all text-heavy sets of PowerPoint slides!<br />
<br />
Why is willingness of teachers and journalists to use multimedia tools so low? Do we need to wait until everyone in the world becomes tech friendly to have our research put to use? <br />
<br />
Maybe we just don’t have the tools necessary to allow multimedia indexing to come into its own in support of education. We need the tools in order to engage teachers.<br />
<br />
We don’t have the time to do education related research. You can’t just do a 10 minute experiment with data from 30 people: people are complex, kids are complex! We haven’t been willing to take the time to work with teachers: we haven’t had funding for a 5-10 year sustained effort in this area. But it’s a worthwhile goal.<br />
<br />
We need to understand the nature of education. There is a relationship between student and teacher: it is a human relationship. A machine might not be able to motivate the student.<br />
<br />
This observation about student behavior stands in contrast to the success of video games in motivating kids. Games appear to motivate kids more than their parents are able to do. However, today’s games are too simplistic to be an education tool. They don’t reflect real breadth.<br />
<br />
Final note of the first round: It seems that multimedia analysis researchers don’t talk about “killer applications” anymore. The way we see our success is more diffuse, and maybe that is also OK.<br />
<br />
<b>Second round</b>. Panel members were asked to discuss “A current (widely-held) vision for a multimedia analysis application that is doomed.”<br />
<br />
Our panelists jumped on the opportunity to be controversial.<br />
<br />
<i>Is lifelogging doomed?</i><br />
Multimedia researchers of course love the huge amounts of data that life logging delivers. But do people really want their lives to be logged? Why would I want all of those pictures? Are we just recording without a real application?<br />
<br />
When we are healthy and in good shape we have perhaps no reason to record our lives. But when we become older, or are in a situation where we need to manage an illness, things change. In this case, lifelogging applications are tremendously interesting. For elderly people living alone, they can be a real help: although they do not replace human company.<br />
<br />
Why don’t we see this technology being widely used? The problem is not the market. The problem is that we are not marketing or business people: we need someone else to put this technology on the market. The process for doing so is a mess! We develop nice applications, but we need to move on, and the business development never gets done.<br />
<br />
<i>Is virtual reality doomed?</i><br />
We are not in a virtual space having a virtual conference. We are here. Virtual meeting rooms have not come to be and video conferencing fatigue is real. Virtual reality works great in games. Perhaps also in demonstrating things. But in general, augmented reality appears to be the more promising path.<br />
<br />
<i>Is multimedia analysis of broadcast television doomed? </i><br />
Analysis of news, sports, movies, in fact, any produced content is over. If someone can produce the content, they can also dedicate the effort to annotate it.<br />
<br />
A less extreme version of that position is probably, however, more appropriate. When we carry out multimedia research, often produced content is the only content we have. Not every content producer has the resources to create annotations. Finally (as noted by the moderator) some types of annotations are against the business interests of people producing multimedia content: Do film producers really want audiences to have a fine-grained breakdown of the violence in a film?<br />
<br />
The panel agreed that analysis of produced content is very important for knowledge extraction and summaries of large, heterogeneous collections. You can extract knowledge and facts: for example, the president needs a 20 minute summary.<br />
<br />
Professionals, or specific applications, often need detailed summaries: there would be value in summarization to study, for example, the soccer moves of a certain player for practice or for strategy purposes.<br />
<br />
Personal content often needs summarization: parents like highlights of school games or performances that feature their own children.<br />
<br />
<i>Are standards doomed? </i><br />
Standards make sense for compression and communication, but standards have been pushed too far. Many researchers identify with this situation: You barely know what you’re doing and you make a standard for it. However, the activity that takes place around the production of standards gives rise to new ideas. The fact that descriptors were encoded in MPEG7 gave rise to a lot of further work on descriptors.<br />
<br />
Perhaps a more direct way of achieving the same effects is via reference implementations and toolkits. OpenCV is effectively, although not formally, a standard. These kinds of efforts are very important.<br />
<br />
<b>Third round: </b>Panel members were asked for “A future vision for a multimedia analysis application that we should strive for.”<br />
<br />
The opening comment was interesting and unexpected: As an early-career researcher in multimedia one is drawn to problems that one likes, and that attract and hold one’s attention. However, as a late-career researcher, one looks back and starts to regret not having considered the contribution that one’s career was making to society.<br />
<br />
<i>Multimedia for medicine</i>: Young multimedia researchers should consider “joining the doctors”: the field of medicine needs us.<br />
<br />
<i>Human rights</i>: Another area with enormous potential social impact is multimedia for human rights. We need algorithms that will allow us to find evidence of violations: examples are the analysis of aerial photos to search for hidden destruction and the reconstruction of events using social media.<br />
<br />
We need (footnote by moderator) technology that is able to verify the extent to which multimedia reflects the reality that it claims to capture: and, in particular, identify multimedia created with the intent to deceive.<br />
<br />
<i>Low quality content is key: </i>Interestingly, some of the most highly socially relevant applications for multimedia involve processing some of the worst images. Multimedia researchers need to be brave enough to venture into areas where content is poor quality, difficult to obtain, and (footnote by moderator) where evaluation of success is highly challenging.<br />
<br />
<i>User intent: </i>Multimedia information retrieval has recently experienced the “intent revolution”: the change from focusing on the nature of the items that users are trying to find, to the tasks that users are trying to achieve. Supporting people in their daily lives is not as obviously socially relevant as education, medical or human rights applications. However, it has an important contribution to make.<br />
<br />
<i>Affective computing: </i>We look forward to multimedia systems that support us in the emotional aspects of communicating with multimedia: sharing and mutual remembering. Humans are social creatures (isolation causes us to suffer). Shared experiences allow us to build relationships, share values, and keep the connections needed for social and psychological well-being. Regretfully, current research on affect and sentiment simplifies the emotional aspects of multimedia to the extent that it may be “trivial”. We need to work towards understanding both multimedia and the mind: a key question is: What pieces need to come together in order for someone to experience the reproduction of a memory or an experience?<br />
<br />
<i>Hardware and energy consumption: </i>We should not forget that multimedia analysis is possible because of the devices that capture, store and process multimedia. We are ever dependent on hardware. Processing of multimedia costs energy: and future work should also keep energy efficiency in mind.<br />
<br />
<b>Closing comments:</b><br />
When we study multimedia, we study communicating with multimedia. Moving forward it is important to keep the human in human communication.<br />
<br />
Is there an end to multimedia? Can we foresee that it might be replaced by something completely different?<br />
<br />
We see multimedia as an “everlasting field” encompassing applications that have not yet been invented. However, we should continue to call it “multimedia”, because continuity of what we call it will allow us to build on the past.<br />
<br />
Currently, we see more and more other communities doing multimedia: examples are the computer vision community and the speech and language processing community. Having a distinct identity will allow the other fields to avoid reinventing the wheel.<br />
<br />
We saw during the first round of the panel that looking back over the past 20 years, we did not do so well in formulating predictions which came true: the technologies that we anticipated have not achieved mainstream uptake (with a few notable exceptions). It’s not dramatic to be wrong in our predictions. However: it is important that we learn from our mistakes.<br />
<br />
In general, we do not expect all early-career multimedia researchers to connect to socially relevant applications by “joining the doctors”. But it is good to have a larger vision. When you are writing a paper, embed your ideas within an overall picture of their potential. Embrace the larger meaning of your work and imbue multimedia research with a sense of mission.<br />
<br />
A big thank you to our panelists and to the members of the audience who contributed to the discussion.<br />
<br />
Panelists:<br />
Guillaume Gravier, IRISA, France<br />
Alexander Hauptmann, Carnegie Mellon University, USA<br />
Bernard Merialdo, EURECOM, France<br />
<br />
Audience contributors:<br />
Jenny Benois Pineau, University of Bordeaux, France<br />
Bogdan Ionescu, University Politehnica of Bucharest, Romania<br />
Georges Quénot, LIG, France<br />
Stéphane Marchand Maillet, University of Geneva, Switzerland<br />
Mathias Lux, Klagenfurt University, Austriamarlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-30255240663275799032016-04-21T22:14:00.001+02:002016-04-21T22:14:10.290+02:00Horizons: Multimedia Technologies that Protect Privacy<i>The Survey on Future Media for the new H2020 Work Programme gave me 500 characters each to answer a series of critical questions. I’m listing questions and my answers below. I'm taking this as my chance to pull out all the stops: extreme caution meets idealism. Did I use my characters wisely?</i><br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">Describe which area the new research and innovation work programme of H2020 should look at when addressing the future of Media.</span><br />
<br />
Non-Obvious Relationship Awareness (NORA) is a set of data mining techniques that find relationships between people and events in data that no one would expect to exist. European Citizens sharing images or videos online have no way of knowing what sorts of information they are revealing about themselves. We need innovative research on media processing techniques that protect people's privacy by warning them when they are sharing information, and that obfuscate media, making it safe for sharing.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">What difference would projects in the area you propose make for Europe's society and citizens?</span><br />
<br />
Projects in this area would contribute to safeguarding the fundamental right of European citizens to privacy and protection of personal data. Today, privacy protection focuses on protecting "obvious" personal information. This protection means nothing when personal information is obtainable in "non-obvious" form. European citizens need tools to understand the dangers of sharing media in cyberspace, and tools that can support them in making informed decisions and protecting themselves.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">What are the main technological and Media ecosystem related breakthroughs to achieve the foreseen scenario?</span><br />
<br />
The Media ecosystem in question is the whole of cyberspace. The breakthrough that we need is techniques to predict the impact of data that we have not seen entering the system. We need techniques that are able to obfuscate images and videos in ways that defeat sophisticated machine learning algorithms, such as deep learning techniques. These technologies must be designed from the beginning in a way that is understandable and acceptable to the general population: protection only works if used.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">What kind of technology(ies) will be involved?</span><br />
<br />
Technologies involved are image, text, audio, and video processing algorithms. These algorithms will re-synthesize users' multimedia content so that it still fulfills its intended function, but with a reduced risk of leaking private information. Technology must go beyond big data to be aware of hypothetical future data. Yet unheard of: technology capable of protecting users' privacy against inference of non-obvious relations must be understandable by the people who it is intended to serve.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">Describe your vision on the future of Media in 5 years' time?</span><br />
<br />
People will begin to worry about large companies claiming to own (and attempting to sell back) digital versions of their past selves, forgotten on distant servers. The realization will grow that it is not enough to have a device that takes amazing images and videos; you also need a device that allows you to save and enjoy those images in years to come. An understanding will emerge that a rich digital media chronicle of one's own life contributes to health, happiness, and wellbeing.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">Describe your vision on the future of Media in 10 years' time?</span><br />
<br />
Social images circling the globe will give people unprecedented insight into the human condition. People living in both developed and developing countries will rebel at anyone in the human race living under conditions of constant fear and threat of constant hunger. The world will change. If protecting privacy means that people need to stop sharing images and videos altogether, the opportunity to fulfill this idealistic vision is missed. The future of Media is bright, but only if it can be kept safe.<br />
<br />
<i>At the end of the day, multimedia is about making the world healthy, happy, and complete. At the end of this exercise I have concluded that the horizon stretches even further than 2020.</i>marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-86257872409004394092016-04-03T14:32:00.000+02:002016-04-03T14:37:52.897+02:00Starting to RUN<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV-ElEd1GOqLf3bztbNeJzSmuyisGaLmzJfnonzl2u26jAam8tfpIrwTefuTC41hisXeg3YikEG2rRQKDgE87ZF5SgAO51aolBSVLRRygJV1neSX4F8y9WhMRga8vOMM24a6gRVTv1pyM/s1600/IMG_20160401_133257.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV-ElEd1GOqLf3bztbNeJzSmuyisGaLmzJfnonzl2u26jAam8tfpIrwTefuTC41hisXeg3YikEG2rRQKDgE87ZF5SgAO51aolBSVLRRygJV1neSX4F8y9WhMRga8vOMM24a6gRVTv1pyM/s200/IMG_20160401_133257.jpg" width="200" /></a></div>
<span style="font-family: inherit;">Thank you for the emails, tweets, and texts about my new appointment at <a href="http://www.ru.nl/english/">Radboud University Nijmegen</a>. I'm happy that other people realize what a special day it was for me, and share my excitement about new opportunities and new challenges. I appreciate the warm reception at Radboud University. The "Welcome!" was unmistakable: it was actually written on my whiteboard when I walked into my office in the Center for Language Studies for the first time.</span><br />
<span style="font-family: inherit;"><br /></span>
<span id="docs-internal-guid-389928c6-dbd4-0977-ad34-286b999e78ad"><span style="vertical-align: baseline;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;">My appointment is as "Professor of Multimedia Information Technology" at the </span></span><a href="http://www.ru.nl/science/" style="font-family: inherit; white-space: pre-wrap;">Faculty of Science</a><span style="font-family: inherit;"><span style="white-space: pre-wrap;">, </span></span><a href="http://www.ru.nl/icis/" style="font-family: inherit; white-space: pre-wrap;">Institute for Computing and Information Sciences</a><span style="font-family: inherit;"><span style="white-space: pre-wrap;"> (iCIS). It involves a double affiliation (50/50)</span></span> between iCIS and the <a href="http://www.ru.nl/facultyofarts/" style="font-family: inherit; white-space: pre-wrap;">Faculty of Arts</a><span style="font-family: inherit;"><span style="white-space: pre-wrap;">, </span></span><a href="http://www.ru.nl/cls/" style="font-family: inherit; white-space: pre-wrap;">Centre for Language Studies</a><span style="font-family: inherit;"><span style="white-space: pre-wrap;"> (CLS). In this way, it brings together my background (pre-1990 in Math and EE; 1990-2000 in Formal Linguistics; and since 2000 in Computer Science, i.e., audio-visual search engines). It is a natural extension of this background that I will be working to bridge the research occurring on information access between the two faculties.</span></span></span></span><br />
<span style="vertical-align: baseline;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;"><br /></span></span></span>
<a href="http://www.ru.nl/english/news-agenda/vm/2016/maart/martha-larson-appointed-professor-multimedia/" style="font-family: inherit; white-space: pre-wrap;">A press release about my appointment</a><span style="font-family: inherit;"><span style="white-space: pre-wrap;"> appeared on 31 March on the Radboud University homepage. I was very happy about the publicity for the <a href="http://www.multimediaeval.org/">MediaEval Multimedia Evaluation Benchmark</a>. MediaEval is an initiative aimed at driving the development of new multimedia access technologies by offering shared tasks to the community. Instead of being centrally organized, it is grassroots in nature. My role is that of the bass player in a band, who helps to link the different parts together and keep the music moving forward on tempo. The success of the benchmark comes from the dedication and efforts of the task organizers and the participants. (MediaEval is offering a great lineup of tasks in 2016, and signup is now open on the <a href="http://www.multimediaeval.org/mediaeval2016/">MediaEval 2016 website</a>. The MediaEval 2016 workshop will be held 20-21 October 2016, right after <a href="http://www.acmmm.org/2016/">ACM Multimedia 2016</a> in Amsterdam.)</span></span><br />
<span style="font-family: inherit;"><span style="white-space: pre-wrap;"><br /></span></span>
<span style="font-family: inherit; white-space: pre-wrap;">Starting January 2017, Radboud University will be my main university (4 days per week), but I will maintain an affiliation with <a href="http://www.tudelft.nl/">Delft University of Technology</a> (1 day a week).</span><br />
<span style="font-family: inherit;"><span style="white-space: pre-wrap;"><br /></span></span>
<span style="font-family: inherit;"><span style="white-space: pre-wrap;">Currently, my main affiliation remains the <a href="http://mmc.tudelft.nl/">Multimedia Computing Group</a> at Delft University of Technology. However, I am at Radboud University Nijmegen for two days a week to get started at CLS. My first act is to teach</span></span> <a href="http://studiegids.science.ru.nl/2015/arts/course/37169/">Intelligent Information Tools</a>, a course for first and second year undergraduate students in Communication and Information Science. The students learn about the nature of information, the structure of the internet, how search, recommendation, and other information tools work, and also how to think critically about these tools.<br />
<br />
At TU Delft I continue teaching and pursuing my research. The main focus of my research at this time is recommender systems, within the context of the EC FP7 project <a href="http://crowdrec.eu/">CrowdRec</a> "Fusion of active information for next generation recommender systems". It is a privilege to serve the CrowdRec consortium as the scientific coordinator. Current highlights are: the <a href="http://www.clef-newsreel.org/">NewsREEL news recommendation challenge</a> at CLEF 2016, the <a href="http://2016.recsyschallenge.com/">ACM RecSys 2016 job recommendation challenge</a>, and the <a href="http://dlrs-workshop.org/">Workshop on Deep Learning for Recommender Systems</a>, also at ACM RecSys 2016. I look forward to a successful conclusion of the project in September 2016, and also to future collaborations.<br />
<br />
More than seven years ago, I wrote <a href="http://ngrams.blogspot.nl/2008/11/kill-your-blog.html">the first post on this blog</a>. I had read an article advising <i>kill your blog</i> as an answer to blogposts getting lost in a sea of mainstream information. My post points out that it is strange to suggest that bloggers must change, without mentioning the role or responsibility of search engines.<br />
<br />
Now, I am more convinced than ever of the value of information within small circles. Search needs to support exploitation of that value. The readership of this blog is intended to be future versions of myself, and also a limited number of people interested in a deep dive into reflections on various search-related topics. As I move to a new university, and the number of people I teach or collaborate with grows, I would like to remember that. I'll probably have less time to write blog posts, but I have decided to wait a few more years before moving away from occasional blogging.<br />
<br />
Creating information is a way in which we help ourselves think. Intense conversations also refine thought. But the model in which everyone talks to everyone about everything does not always make sense. Instead, we need room for reflection with a relatively small set of individuals. Search should support that. <br />
<br />
What's blocking the road? Maybe we feel that small-scale search is already a success because Google now displays calendar events in our search results. Maybe facing the personal is somehow more laborious or painful. In any case, we are currently far from understanding the aggregated impact of thousands of local dialogues, or from evaluating the success of small search that helps us exchange ideas with our past selves and our closest colleagues. The future holds no lack of challenges.<br />
<br />
<br />marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-88501442993651511822016-03-05T22:33:00.001+01:002016-03-06T14:18:30.196+01:00A Non Neural Network algorithm with "superhuman" ability to determine the location of almost any image <i>Martha Larson and Xinchao Li</i><br />
<i><br /></i>
<i>We would like to complement the MIT Technology Review headline <a href="https://www.technologyreview.com/s/600889/google-unveils-neural-network-with-superhuman-ability-to-determine-the-location-of-almost/">Google Unveils Neural Network with “Superhuman” Ability to Determine the Location of Almost Any Image</a> with information about NNN (Non Neural Network) approaches with similar properties.</i><br />
<i><br /></i>
<i>This blogpost provides a comparison between the DVEM (Distinctive Visual Element Matching) approach, introduced by our recent arXiv manuscript (currently under review): </i><br />
<br />
Xinchao Li, Martha A. Larson, Alan Hanjalic Geo-distinctive Visual Element Matching for Location Estimation of Images (Submitted on 28 Jan 2016) (<a href="http://arxiv.org/abs/1601.07884">http://arxiv.org/abs/1601.07884</a>)<br />
<br />
<i>and the PlaNet approach, introduced by the arXiv manuscript covered in the MIT Technology Review article:</i><br />
<br />
Tobias Weyand, Ilya Kostrikov, James Philbin PlaNet—Photo Geolocation with Convolutional Neural Networks (Submitted on 17 Feb 2016) (<a href="http://arxiv.org/abs/1602.05314">http://arxiv.org/abs/1602.05314</a>)<br />
<br />
<i>We also include, at the end, a bit of history on the problem of automatically "determining the location of images", which is also known as geo-location prediction, geo-location estimation as in [3], or, colloquially, "placing" after [4].</i><br />
<br />
<span style="font-family: inherit;">Our DVEM approach is a search-based approach to the prediction of the geo-location of an image. Search-based approaches consider the target image (the image whose geo-coordinates are to be predicted) as a query. They then carry out content-based image search (i.e., query-by-image) on a large training set of images labeled with geo-coordinates (referred to as the "background collection"). Finally, they process the search results in order to make a prediction of the geo-coordinates of the target image. The most basic algorithm, Visual Nearest Neighbor (VisNN), simply adopts the geo-coordinates of the image at the top of the search results list as the geo-coordinates of the target image. Our DVEM algorithm uses local image features for retrieval, and then creates geo-clusters in the list of image search results. It adopts the top-ranked cluster, using a method that we previously introduced [5, 6]. The special magic of our DVEM approach is the way that it reranks the clusters in the results list: it validates the visual match at the cluster level (rather than at the level of an individual image) using a geometric verification technique for object/scene matching that we previously proposed in [7], and it leverages the occurrence of visual elements that are discriminative for specific locations.</span><br />
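To make the search-based pipeline concrete, here is a minimal sketch of our own (illustrative code, not the actual DVEM implementation): retrieved images are grouped into coarse geo-cells, the cluster with the strongest aggregate visual-match evidence wins, and its score-weighted centroid becomes the prediction. The function name, the one-degree cell size, and the aggregation by summed similarity are simplifying assumptions.

```python
from collections import defaultdict

def predict_location(search_results, cell_size=1.0):
    """Sketch of search-based geo-prediction with geo-clustering.

    search_results: list of (similarity_score, lat, lon) tuples for the
    top-ranked images retrieved from the geo-labeled background
    collection, best match first.
    """
    clusters = defaultdict(list)
    for score, lat, lon in search_results:
        # Quantize coordinates into coarse geo-cells to form clusters.
        cell = (round(lat / cell_size), round(lon / cell_size))
        clusters[cell].append((score, lat, lon))
    # Rank clusters by aggregated visual-match evidence; DVEM goes
    # further and validates the visual match at the cluster level.
    best = max(clusters.values(), key=lambda c: sum(s for s, _, _ in c))
    # Predict the score-weighted centroid of the winning cluster.
    total = sum(s for s, _, _ in best)
    lat = sum(s * la for s, la, _ in best) / total
    lon = sum(s * lo for s, _, lo in best) / total
    return lat, lon
```

In this sketch, a single low-scoring match at a distant location cannot outvote a cluster of consistent matches, which is the basic advantage of cluster-level over image-level (VisNN-style) prediction.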
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">The PlaNet approach divides the surface of the globe into cells with an algorithm that adapts to the number of images in its training set that are labeled with geo-coordinates for each location, i.e., a location that has more photos will be divided into finer cells. Each cell is considered a class, and these classes are used to train a CNN classifier.</span><br />
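The adaptive partitioning can be sketched as a quadtree-style recursion (an illustrative reconstruction, not the algorithm from the PlaNet paper; the `max_photos` threshold and `max_depth` limit are our own assumed parameters):

```python
def partition(photos, lat_min=-90.0, lat_max=90.0,
              lon_min=-180.0, lon_max=180.0,
              max_photos=2, max_depth=10):
    """Quadtree-style sketch of adaptive cell construction.

    photos: list of (lat, lon). Returns a list of leaf cells, each a
    (bounds, photos_in_cell) pair with bounds = (lat_min, lat_max,
    lon_min, lon_max); denser regions end up with finer cells.
    """
    # Keep only the photos inside this cell (half-open intervals, so
    # every photo lands in exactly one subcell).
    inside = [(la, lo) for la, lo in photos
              if lat_min <= la < lat_max and lon_min <= lo < lon_max]
    bounds = (lat_min, lat_max, lon_min, lon_max)
    if len(inside) <= max_photos or max_depth == 0:
        return [(bounds, inside)]  # sparse enough: stop splitting
    # Too many photos: split into four finer cells and recurse.
    lat_mid = (lat_min + lat_max) / 2
    lon_mid = (lon_min + lon_max) / 2
    cells = []
    for la0, la1 in ((lat_min, lat_mid), (lat_mid, lat_max)):
        for lo0, lo1 in ((lon_min, lon_mid), (lon_mid, lon_max)):
            cells += partition(inside, la0, la1, lo0, lo1,
                               max_photos, max_depth - 1)
    return cells
```

Each leaf cell then serves as one class label for the classifier, so photo-dense regions receive many fine-grained classes while sparse regions receive a few coarse ones.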
<br />
Further comparison of the way the algorithms were trained and tested in the two papers:<br />
<br />
<table style="width: 100%;"><tbody>
<tr><td width="20%"><br /></td><td><b>DVEM</b></td><td width="40%"><b>PlaNet</b></td></tr>
<tr><td width="20%"><b>Training set size</b></td><td>5M images train, 2K validation</td><td>91M train, 34M validation</td></tr>
<tr><td width="20%"><b>Training set selection</b></td><td>CC Flickr images with geo-locations, (MediaEval 2015 Placing Task)</td><td>Web images with Exif geolocations</td></tr>
<tr><td width="20%"><b>Training time</b></td><td>1 hour on 1,500 cores for 5M photos for indexing and feature extraction</td><td>2.5 months on 200 CPU cores</td></tr>
<tr><td width="20%"><b>Test set size</b></td><td>ca. 1M images</td><td>2.3M images</td></tr>
<tr><td width="20%"><b>Test set selection</b></td><td>CC Flickr images (MediaEval 2015)</td><td>Flickr images with 1-5 tags</td></tr>
<tr><td width="20%"><b>Train/test de-duplication</b></td><td>train/test sets mutually exclusive wrt uploading user</td><td>CNN trained on near-duplicate images</td></tr>
<tr><td width="20%"><b>Data set availability</b></td><td><a href="https://aws.amazon.com/public-data-sets/multimedia-commons/">via MM Commons on AWS</a></td><td><i>not specified</i></td></tr>
<tr><td width="20%"><b>Model size</b></td><td>100GB for 5M images</td><td>377MB</td></tr>
<tr><td width="20%"><b>Baselines</b></td><td>GVR [6], MediaEval 2015 </td><td>IM2GPS [8]</td></tr>
</tbody></table>
<br />
From this table, we see that the training and test data for the two algorithms are different, and for this reason, we cannot directly compare the accuracy measured for the two approaches. However, the numbers at the 1 km level (i.e., street level) suggest that DVEM and PlaNet are playing in the same ballpark. PlaNet reports correct predictions for 3.6% of the images on its 2.3M image test set, and 8.4% on the IM2GPS data set (237 images). Our DVEM approach achieves around 8% correct predictions on our 1M image test set, and is surprisingly robust to the exact choice of parameters. DVEM gains 12% relative performance over VisNN, and 5% over our own previous GVR. Note that [6] provides evidence that GVR outperforms IM2GPS [8]. PlaNet also reports that it outperforms IM2GPS, but the numbers are not directly comparable because 14x less training data is used.<br />
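For reference, the street-level numbers correspond to an accuracy-at-radius metric: the fraction of test images whose predicted coordinates fall within 1 km great-circle distance of the ground truth. A minimal sketch (function names are our own; the haversine formula with a mean Earth radius of 6371 km is the standard computation):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_radius(predictions, ground_truth, radius_km=1.0):
    """Fraction of predictions within radius_km of the true location."""
    hits = sum(haversine_km(*p, *t) <= radius_km
               for p, t in zip(predictions, ground_truth))
    return hits / len(predictions)
```

Evaluating at several radii (1 km, 10 km, 100 km, and so on) gives the multi-level picture of performance that both papers report.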
<span style="font-family: inherit;"><span style="font-size: small;"><br /></span></span>
<span style="font-size: small;">The downside of search-based approaches is prediction time, as pointed out by the PlaNet authors in their discussion of IM2GPS. DVEM requires 88 hours on a Hadoop-based cluster containing 1,500 cores to make predictions for 1M images. For applications requiring offline prediction, this may be fine; however, we assume that online geo-prediction is also important. We point out that with enough memory or an efficient index compression method, we would not need Hadoop, and we would be able to do the prediction on a single core with about 2s per query. Further, the question of how runtime scales is closely related to the question of the number of images that are actually needed in the background collection. Our DVEM approach uses 18x less training data than the PlaNet algorithm: if we are indeed in the same ballpark, this result calls into question the assumption that prediction accuracy will not saturate after a certain number of training images.</span><br />
<br />
We mention a couple of reasons for which DVEM might ultimately turn out to outperform PlaNet. First, the PlaNet authors point out that the discretization hurts accuracy in some cases. DVEM, in contrast, creates candidate locations "on the fly". As such, DVEM has the ability to make a geo-prediction at an arbitrarily small geo-resolution.<br />
<br />
Second, the test set used to test DVEM is possibly more challenging than the PlaNet test set because it does not eliminate images without tags. We assume that the presence of a tag is at least a weak indicator of care on the part of the user. A careless user might also engage in careless photography, producing images that are low quality and/or are not framed to clearly depict their subject matter. A test set containing images taken by relatively more careful users could be expected to yield a higher accuracy.<br />
<br />
Third, we assume that when near duplicates were eliminated from the PlaNet test/training set, these were near duplicates from the same location. Eliminating images that are very close visual matches with other locations would, of course, artificially simplify the problem. However, it may also turn out that the elimination artificially makes the problem more difficult. In real life, a lot of people simply do take the same picture, for example, of the Leaning Tower of Pisa. A priori, it is not clear how near duplicates should be eliminated to ensure that the testing setup maximally resembles an operational setting.<br />
<br />
The PlaNet paper was a pleasure to read, the name "PlaNet" is truly cool, and we are enthused about the small size of the resulting model. We are interested in the fact that PlaNet produces a probability distribution over the whole world, although we also remark that DVEM is capable of producing top-N location predictions. We also liked the idea of exploiting sequence information, but think that considering temporal neighborhoods rather than temporal sequences might also be helpful. Extending DVEM with either temporal sequences or neighborhoods would be straightforward.<br />
<br />
We hope that the PlaNet authors will run their approach using the <a href="https://aws.amazon.com/public-data-sets/multimedia-commons/">MediaEval 2015 Placing Task data set</a> so that we are able to directly compare the results. In any case, they will want to revisit their assertion that "...previous approaches only recognize landmarks or perform approximate matching using global image descriptors" in the light of the MediaEval 2015 Placing Task results, including our DVEM algorithm.<br />
<br />
We would like to point out that work on algorithms able to predict the location of almost any image has been ongoing in full public visibility for a number of years. (Although given our field, we also enjoy the delicious jolt of a headline beginning "Google unveils...") The starting point can be seen as <a href="http://www.cs.cornell.edu/~crandall/photomap/">Mapping the World's Photos</a> [9] in 2009. The MediaEval Multimedia Evaluation benchmark has been developing solutions to the problem since 2010, as chronicled in [10]. The most recent contribution was the MediaEval 2015 Placing task [11], cf. the contributions that use visual approaches to the task [12,13]. The MediaEval 2015 data set is part of the larger, publicly available YFCC100M data set, part of <a href="https://aws.amazon.com/public-data-sets/multimedia-commons/">Multimedia Commons</a>, and recently featured in Communications of the ACM [14]. <a href="http://www.multimediaeval.org/mediaeval2016/">MediaEval 2016</a> will offer a further edition of the Placing Task, which is open to participation for any research team who signs up.<br />
<br />
We close by returning to comment on the importance of NNN (Non Neural Network) approaches. This example of the strength of DVEM vs. PlaNet demonstrates that there is reason for the research community to retain a balance in its engagement with NN and NNN approaches. One appealing aspect of NNN approaches, and, in particular, of search-based geo-location prediction, is the relative transparency of how the data is connected to the prediction. It may sound like science fiction from today's perspective, but one could imagine a future in which the person who took an image would receive a micro fee every time their image was used for the purpose of predicting geo-location metadata for someone else. Such a system would encourage people to take images that are useful for geo-location, and move us forward as a whole.<br />
<br />
<i>We would like to thank the organizers of the <a href="http://www.multimediaeval.org/">MediaEval</a> Placing task for making the data set available for our research. Also a big thanks to <a href="https://www.surf.nl/en/about-surf/subsidiaries/surfsara/">SURFsara</a> for the HPC infrastructure without which our work would not be possible.</i><br />
<span style="font-size: x-small;"><span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">[1] Xinchao Li, Martha A. Larson, Alan Hanjalic Geo-distinctive Visual Element Matching for Location Estimation of Images (Submitted on 28 Jan 2016) (<a href="http://arxiv.org/abs/1601.07884">http://arxiv.org/abs/1601.07884</a>)</span></span><br />
<span style="font-family: inherit; font-size: x-small;">[2] Tobias Weyand, Ilya Kostrikov, James Philbin PlaNet—Photo Geolocation with Convolutional Neural Networks (Submitted on 17 Feb 2016) (<a href="http://arxiv.org/abs/1602.05314">http://arxiv.org/abs/1602.05314</a>)</span><br />
<span style="font-size: x-small;"><span style="font-family: inherit;">[3] </span>Jaeyoung Choi and Gerald Friedland. 2015. <i>Multimodal Location Estimation of Videos and Images</i>. Springer Publishing Company, Springer.</span><br />
<div>
<span style="font-size: x-small;"><span style="font-family: inherit;">[4] </span>P. Serdyukov, V. Murdock, R. van Zwol. 2009. Placing Flickr photos on a map. In <i>Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval</i> (SIGIR '09), ACM, New York, pp. 484–491.</span></div>
<span style="font-size: x-small;"><span style="font-family: inherit;">[<span style="font-family: inherit;">5</span>] </span>Xinchao Li, Martha Larson, and Alan Hanjalic. 2013. Geo-visual ranking for location prediction of social images. In <i>Proceedings of the 3rd ACM conference on International conference on multimedia retrieval</i> (ICMR '13). ACM, New York, NY, USA, 81-88. </span><br />
<span style="font-family: inherit; font-size: x-small;">[<span style="font-family: inherit;">6</span>] <span style="font-family: inherit;"><span style="background-color: white;">Xinchao Li, Martha Larson, and Alan Hanjalic. </span></span>Global-Scale Location Prediction for Social Images Using Geo-Visual Ranking, in <i>IEEE Transactions on Multimedia</i>, vol. 17, no. 5, pp. 674-686, May 2015.</span><br />
<span style="font-family: inherit; font-size: x-small;"><span style="font-size: x-small;">[<span style="font-family: inherit;">7</span>] Xinchao
Li, Martha Larson, Alan Hanjalic. 2015. Pairwise Geometric Matching for Large-scale Object Retrieval. In <i>Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition</i> (CVPR '15), pp.
5153-5161.</span> </span><br />
<span style="font-family: inherit; font-size: x-small;"><span style="font-size: x-small;"><span style="font-family: inherit; font-size: x-small;">[<span style="font-family: inherit;">8</span>] J. Hays and A. A. Efros, "IM2GPS: estimating geographic information from a single image," <i>Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on</i>, Anchorage, AK, 2008, pp. 1-8.</span></span> </span><br />
<span style="font-family: inherit; font-size: x-small;"><span style="font-family: inherit;">[<span style="font-family: inherit;">9</span>] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. 2009. Mapping the world's photos. In </span><i style="font-family: inherit;">Proceedings of the 18th international conference on World wide web</i><span style="font-family: inherit;"> (WWW '09), ACM, New York, 761-770.</span></span><br />
<span style="font-family: inherit; font-size: x-small;"><span style="font-family: inherit;"><span style="font-family: inherit;">[<span style="font-family: inherit;">10</span>] </span></span>Martha Larson, Pascal Kelm, Adam Rae, Claudia Hauff, Bart Thomee, Michele Trevisiol, Jaeyoung Choi, Olivier Van Laere, Steven Schockaert, Gareth J.F. Jones, Pavel Serdyukov, Vanessa Murdock, Gerald Friedland. 2015. The Benchmark as a Research Catalyst: Charting the Progress of Geo-prediction for Social Multimedia. In [3].<span style="font-family: inherit;"> </span></span><br />
<span style="font-family: inherit; font-size: x-small;"><span style="font-family: inherit;">[1<span style="font-family: inherit;">1</span>] </span>Jaeyoung Choi, Claudia Hauff, Olivier Van Laere, Bart Thomee. The Placing Task at MediaEval 2015. In <i>Proceedings of the MediaEval 2015 Workshop</i>, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online <a href="http://ceur-ws.org/Vol-1436/Paper6.pdf">ceur-ws.org/Vol-1436/Paper6.pdf</a></span><br />
<span style="font-family: inherit; font-size: x-small;">[1<span style="font-family: inherit;">2</span>] Lin Tzy Li, Javier A.V. Muñoz, Jurandy Almeida, Rodrigo T. Calumby, Otávio A. B. Penatti, Ícaro C. Dourado, Keiller Nogueira, Pedro R. Mendes Júnior, Luís A. M. Pereira, Daniel C. G. Pedronette, Jefersson A. dos Santos, Marcos A. Gonçalves, Ricardo da S. Torres. RECOD @ Placing Task of MediaEval 2015. In <i>Proceedings of the MediaEval 2015 Workshop</i>, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online <a href="http://ceur-ws.org/Vol-1436/Paper49.pdf">ceur-ws.org/Vol-1436/Paper49.pdf</a></span><br />
<span style="font-family: inherit; font-size: x-small;">[1<span style="font-family: inherit;">3</span>] Giorgos Kordopatis-Zilos, Adrian Popescu, Symeon Papadopoulos, Yiannis Kompatsiaris. CERTH/CEA LIST at MediaEval Placing Task 2015. In <i>Proceedings of the MediaEval 2015 Workshop</i>, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online <a href="http://ceur-ws.org/Vol-1436/Paper58.pdf">ceur-ws.org/Vol-1436/Paper58.pdf</a></span><br />
<span style="font-family: inherit; font-size: x-small;">[1<span style="font-family: inherit;">4</span>] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li. <a href="http://cacm.acm.org/magazines/2016/2/197425-yfcc100m/fulltext">YFCC100M: The New Data in Multimedia Research</a>. Communications of the ACM, Vol. 59 No. 2, Pages 64-73.</span><br />
<div>
<br /></div>
marlarhttp://www.blogger.com/profile/11034479200143623575noreply@blogger.comtag:blogger.com,1999:blog-3971309788806133616.post-53318394615101001872016-02-07T20:37:00.001+01:002016-02-07T20:48:20.701+01:00MediaEval 2015: Insights from last year's experiences in multimedia benchmarking<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDTSSS-tdtjMIDCZhnEDLLcxVan76vAw_o0mFh7F-TICPv97hn4gWHamQI3rOZbkaKHc1FhRvu2XO8hPpxvdBMkQ0IwzwRY4mqDLf7GjiNP8wNQ-3gHzw54lqR4Suckdoli_sGQd5YYVM/s1600/IMG_20150913_194951.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDTSSS-tdtjMIDCZhnEDLLcxVan76vAw_o0mFh7F-TICPv97hn4gWHamQI3rOZbkaKHc1FhRvu2XO8hPpxvdBMkQ0IwzwRY4mqDLf7GjiNP8wNQ-3gHzw54lqR4Suckdoli_sGQd5YYVM/s320/IMG_20150913_194951.jpg" width="240" /></a></div>
<br />
This blogpost is a list of bullet points concerning MediaEval 2015. It represents the "meta-themes" of MediaEval that I perceived to be the strongest during the MediaEval 2015 season, which culminated with the <a href="http://www.multimediaeval.org/mediaeval2015/">MediaEval 2015 Workshop</a> in Wurzen, Germany (14-15 September 2015). I'm putting them here, so we can look back later and see how they are developing.<br />
<ol>
<li><span style="text-indent: -18pt;">How not to
re-invent the wheel? Providing task participants with reading lists of related
work and with baseline implementations helps ensure that it is as easy as
possible for them to develop algorithms that extend the state of the art.</span></li>
<li><span style="text-indent: -18pt;">Reproducibility
and replication: How can we encourage participants to share information about
their approaches so that their results can be reproduced or replicated? How can
we emphasize the importance of reproduction and replication and at the same
time push for innovation, and forward movement in the state of the art (and
avoid re-inventing the wheel as just mentioned)? One answer that arose this
year was to reinforce student participation. Students should feel welcome at
the workshop, even if they “just” reproduced an existing workflow.</span></li>
<li><span style="font-family: "symbol"; text-indent: -18pt;"><span style="font-family: "times new roman"; font-size: 7pt; line-height: normal;"> </span></span><span style="text-indent: -18pt;">Development
of evaluation metrics for new tasks: Innovating a new task may involve a
developing a new evaluation metric. All tasks face the challenges of ensuring
that they are using an evaluation metric that faithfully reflects usefulness to
users within an evaluation scenario.</span></li>
<li><span style="text-indent: -18pt;">How to
make optimal use of leaderboards in evaluation: Participants should be able to
check on their progress over the course of the benchmark, and aspire to
ever-greater heights. However, it is important that leaderboards not discourage
participants from submitting final runs to the benchmark. It is possible that
an innovative new approach does very badly on the leaderboard, but is still
valuable.</span></li>
<li><span style="text-indent: -18pt;">Understanding
the relationship between the conceptual formulation of the task, and the
dataset that is chosen for use in the task: Are the two compatible? Are there
assumptions that we are making about the dataset that do not hold? How can we
keep task participants on track: solving the conceptual formulation from the
task, and not leveraging some incidental aspect of the dataset?</span></li>
<li><span style="text-indent: -18pt;">Disruption:
Tasks are encouraged to innovate from year to year. However, 2015 was the first
year that organizers started planning far ahead for “disruption” that would
take the task to the next level in the next year.</span></li>
<li><span style="text-indent: -18pt;">Using
crowdsourcing for evaluation: How to make sure that everyone is aware of and
applies best practices? How to ensure that the crowd is reflective of the type
of users in the use scenario of the task?</span></li>
<li><span style="text-indent: -18pt;">Engineering:
Task organization involves an enormous amount of time and dedication to
engineering work. We continuously seek ways to structure organizer teams and to
recruit new organizers and task auxiliaries to make sure that no one feels that
their scientific output suffered in a year in which they spent time handling the
engineering aspects of MediaEval task organization.</span></li>
<li><span style="text-indent: -18pt;">Defining
tasks and writing task descriptions: We repeatedly see that the process of
defining a new task and of writing its task description must involve a large number
of people. If people with a lot of multimedia benchmarking experience
contribute, they can help to make sure that the task definition is well
grounded in the existing literature. If people with very little experience in
multimedia benchmarking contribute, they can help to make sure that the task
definition is understandable even to new participants. We try to write task
descriptions such that a master's student planning to write a thesis on a
multimedia-related topic would easily understand what is required for the
task.</span></li>
</ol>
<br />
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<span style="text-align: start;">In order to round this off to a nice "10" points let me mention another issue that is constantly on my mind, namely, the way that the multimedia community treats the word "subjective".</span></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<span style="text-align: start;"><br /></span></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<span style="text-align: start;">"Subjective" refers to something that one feels oneself as a subject (and that cannot be directly felt by another person---pain is the classic example). In MediaEval tasks, such as Violent Scene Detection, we would like to respect the fact that people are entitled to their own opinions about what falls under a concept. Note that people can communicate very well concerning acts of violence without all having an exactly identical idea of what constitutes "violence". Because the concept "works" despite the existence of personal perspectives, we can consider the task "subjective". </span></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<span style="text-align: start;"><br /></span></div>
<div class="MsoNormal" style="text-align: start;">
So often researchers reason in the sequence, "This task is subjective, therefore it is difficult for automatic multimedia analysis algorithms to address". That reasoning simply does not follow. Consider this example: Classifying a noise source as painful is the ultimate "subjective task". You as a subject are the only one who knows that you are in pain. However: Create a device that signals "pain" when noise levels reach 100 decibels, and you have a solution to the task. Easy as pie. "Subjective" tasks are not <i>inherently</i> difficult. </div>
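The 100-decibel device from the example fits in a few lines of code. This is a minimal illustrative sketch only; the function name and threshold constant are mine, not part of any MediaEval system:

```python
# A deliberately trivial solver for a "subjective" task: whether a noise
# is painful is subjective, yet a one-line threshold rule solves it.
# The 100 dB cutoff mirrors the example in the text (illustrative value).
PAIN_THRESHOLD_DB = 100.0

def signals_pain(noise_level_db: float) -> bool:
    """Signal "pain" when the noise level reaches the pain threshold."""
    return noise_level_db >= PAIN_THRESHOLD_DB

print(signals_pain(85.0))   # ordinary street noise: prints False
print(signals_pain(120.0))  # jet engine at close range: prints True
```

The point of the sketch is that the subjectivity of the label has no bearing on the complexity of the solution.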
<div class="MsoNormal" style="text-align: start;">
<br /></div>
<div class="MsoNormal" style="text-align: start;">
Instead: <i>whether a task is difficult to address with automatic methods depends on the stability of content-based features across different target labels. </i></div>
<div class="MsoNormal" style="text-align: start;">
<br /></div>
<div class="MsoNormal" style="text-align: start;">
The whole point of machine learning is to generalize not only across obvious cases, but also across cases in which no stability of features is apparent to a human observer. If we stuck to tasks that "looked" easy to a researcher browsing through the data, we might as well (exaggerating a bit for effect) handcraft rule-based recognizers. So my point 10 is to figure out a way to keep researchers from dismissing tasks without a second thought, scared off merely because the tasks are "subjective". Multimedia research needs to tackle "subjective" tasks in order to remain relevant to the real-world needs of users---once you understand subjectivity, you start to realize that it is actually all over the place.</div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<br /></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
In 2014, we
noticed that the discussion of such themes was becoming more systematic, and
that members of the MediaEval community were interested in having a venue in
which they could publish their thoughts. For this reason, in 2015, we
added a <i>MediaEval Letters</i> section to the MediaEval Working Notes Proceedings, dedicated to short considerations of themes related to
the MediaEval workshop. The <i>Letter</i> format allows researchers to publish their
thoughts while those thoughts are still developing, even before they are mature enough to
appear in a mainstream venue.</div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<br /></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<span style="mso-ansi-language: EN-US;">The concept of
MediaEval Letters was described in the following paper, in the 2015 MediaEval
Working Notes Proceedings:</span></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<br /></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<span style="mso-ansi-language: EN-US;">Larson, M.,
Jones, G.J.F., Ionescu, B., Soleymani, M., Gravier, G. Recording and Analyzing
Benchmarking Results: The Aims of the MediaEval Working Notes Papers. Proceedings
of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015,
CEUR-WS.org, online <a href="http://ceur-ws.org/Vol-1436/Paper90.pdf">http://ceur-ws.org/Vol-1436/Paper90.pdf</a></span></div>
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<br /></div>
<br />
<div class="MsoNormal" style="mso-layout-grid-align: none; tab-stops: 35.45pt; text-align: justify; text-autospace: none;">
<span style="mso-ansi-language: EN-US;">Look for MediaEval Letters to be continued in 2016.</span></div>