N-grams

Saturday, March 11, 2023

Generative Pottery Transmogrifier: An Allegory

The new machine

The boy shifted his attention to the large screen. He had seen the word “pottery” in the opening title and wanted to follow the news report. At school, he was taking a class in the pottery studio. He liked to make things from clay.

On the screen, people were gathered around a large, shiny box on wheels. The box radiated innovation. It was cube-like, but streamlined—like it was meant to move, like it was capable of guiding itself to its own destination.

From one side extended a flexible silver tube with a nozzle. The nozzle now hung down gracefully, but it was clear that it was designed to produce wonders.

One person punched a sequence of buttons on a remote control panel. The Machine rolled forward slightly and its nozzle lifted.

After a pause it gave a delicate quiver and the nozzle launched an object made of clay. The form traversed a neat arc and landed on a nearby table. What was it?

The camera zoomed in and the boy could see that the form was actually a coffee mug, a beautiful mug. Such a mug would have taken him several hours to make in the studio, if he even could achieve the curves, the symmetry. It was produced by the Machine in a flash, and it seemed perfect.

The nozzle of the Machine was in motion again, raising and firing another object. Another flawless landing on the table. This one was a coffee cup with two handles. It was fascinating to put two handles on a coffee cup. What could the second handle be used for?

The people in the news report were still intently examining the first mug, passing it from hand to hand and pointing out aspects of its lines and surfaces to each other.

The boy noticed that they seemed unaware that the Machine continued to shoot out objects. A group of clay forms appeared on the table and continued to grow. Mug after creative mug: two handles, no handles, three handles, five handles. Then, it produced a cross between a mug and an elegant flower vase, and, then, three ashtrays joined like a clover. It would be too much for the people in the news report to ever have time to look at each piece.

They didn’t seem interested in the other objects, but remained focused on the first mug and were now demonstrating something they had discovered. The mug had no bottom. One raised it to the camera and peered through it. The boy could see an eye staring from the screen at him where the bottom of the mug should have been. The others nodded with approval.

Why were they so happy with a mug without a bottom?

The Machine enters the studio

The boy sat at his workstation in the pottery studio. The other students were also at their stations, and they were chatting among themselves.

The boy was impatient for the teacher to arrive and the class to start. He had an idea for what he wanted to make from clay today. Lately, his thoughts had been full of dogs. He wanted to make a water dish for a dog. It would be for a dog to drink from, but also shaped like a dog—with a head, feet and tail coming from the sides. He had never seen anyone make a dog dish in the studio, and certainly not one that actually looked like a dog.

The door opened and the pottery teacher entered the room. She held the door wide and peered expectantly into the hall. The students fell silent, their gaze following the teacher’s.

There was a low musical whir and through the door a silver tube with a nozzle became visible and it soon became evident that the tube was part of a large, shiny box. The Machine rolled into the room. It was the Machine that the boy had seen on the news.

The school’s studio housed a few electric potter's wheels, but had never seen such advanced technology as the Machine. Its shiny metal surface, its futuristic shape, attracted admiration and fascination.

“Students,” the pottery teacher announced, “today we begin a new form of learning. Our school has put out a huge sum for this Machine and has also found the budget to pay the monthly subscription fee. The Machine will help you learn pottery.”

She glanced around the room. The students were listening closely.

“You will no longer need to learn to hand-build or throw pots. Coil technique, slab technique, mold technique are no longer necessary for you to make a high quality pot. Instead, the Machine will produce an initial piece of pottery for you that is already at the level of an experienced potter, and you will learn by perfecting it. You will learn to produce masterly pieces by starting with high-level work.”

The teacher punched some buttons on a remote control panel. She was focused on the exact sequence of buttons, and seemed to tremble with excitement.

The Machine started to roll around the room. At each workstation, the Machine paused, trembled, and the nozzle lifted. Out of the nozzle flew a clay form, which arced through the air and landed precisely in the middle of the workstation. Each student’s eyes widened with wonder as they each received their own form to work on.

The Machine halted at the workstation of the boy, who watched it intently. The others seemed to trust that the clay form would land on their workstation and not hit them in the face. He wasn’t sure how they were so certain, but he braced himself not to recoil as the Machine quivered and the nozzle lifted and fired in his direction.

With a loud thud, his form arrived on the workstation precisely in front of him. It was a teapot. He observed it with enchantment. It was round and squat, but avoided looking heavy. Its spout was lifted proudly. He peered into the hole at the top to reassure himself it had a bottom. It did. The teapot seemed perfect.

As he studied the teapot, he recalled that he had wanted to make a dog dish. He would have to flatten the beautiful teapot that the Machine had created for him to make a totally different form. He tried to picture how he would completely remold the clay. Tentatively he pinched at the side of the pot—but he couldn’t remember his idea for the dog dish clearly anymore. When he looked at the clay form, all he could see was a teapot.

The teacher said that they would produce masterly pieces. Maybe the dog dish had not been a good idea. Maybe what he had really wanted all along was a teapot. Perhaps a teapot with ears.

The boy set to work and spent the class period pinching out a jaunty pair of dog ears, one on each side of the teapot. He turned his work on his workstation and regarded it from every angle. He certainly could not have produced such a piece from scratch. Would the teacher consider it masterly?

The pottery teacher walked by his desk, and the boy looked up.

“Lovely piece. The ears give the teapot a sense of added lightness. You did a good job.”

The teacher started to walk towards the next student, but then looked back.

“Don’t forget to deposit your pot in the bin of the Machine as you leave the studio. Just open the lid of the box and throw it in. The Machine gives pottery to us, but it is only because we give pottery back to the Machine.”

The Machine leaves the studio

The boy walked into the studio and closed the door behind him. The other students were all sitting quietly at their workstations. It was too quiet. He was missing the musical whirring of the Machine. The large, shiny box was no longer standing in its corner.

He sat down at his workstation just as the door opened again. The students expected to see their pottery teacher, but in walked the geology teacher. The teacher greeted the class and then pointed to three of the students.

“There are some pottery supplies in the hall, please bring them in.”

The students looked at each other in surprise, but quickly scrambled out of their seats and through the door. After a moment, they returned lugging a large, rough wooden box.

“The Machine, as you see, is not with us anymore. Your pottery teacher has asked me to take over the lesson for the day.”

The boy looked around and saw disappointment in the faces of the other students. The box looked like it had been standing in the garage of the geology teacher for twenty years. The outside was streaked with reddish brown. What could they learn from this box?

The teacher motioned the students to move the box to the corner where the Machine had stood when it was not producing clay objects.

“Today,” he announced, “I am going to teach you how to make pottery with wild clay.”

A murmur traveled through the studio. What was “wild clay”?

“I gathered the clay from the bank of a stream that runs behind my house. In this box you will find a bucket of clay soaking in water for each of you. Before you can use it, you need to mix it to make a smooth slip and then sieve it. You’ll find your sieve in your bucket. Then we will let it dry to a consistency with which you can work.”

The students crowded around the box, removed the lid and distributed the buckets. Soon they were all back at their workstations mixing and sieving.

“Here are cloth pillowcases,” the teacher said, moving from workstation to workstation, “take one each for the next phase.”

The boy poured his sieved liquidy clay into his pillowcase, wrapping it around the bucket handle so that it would hang into the bucket and drip. He positioned the bucket in the middle of the workstation. It would be there waiting for him for next week’s class. He gently patted its side, sealing the promise.

The geology teacher passed by his desk. “Looking good. What do you think?”

The boy didn’t reply, but continued to gaze at the bucket. Finally, he looked up and asked, “What happened to the Machine?”

“Oh, the Company that makes the Generative Pottery Transmogrifier is facing some challenges,” the teacher responded. “Their Machine works by processing thousands and thousands of pots. The Machine needs many, many pots so it can mix and match potters’ styles and produce new forms.” The boy knew the basics of how the Machine works from the news, but he hadn’t yet thought deeply about what it meant.

“The Machine spits out clay objects that are delightful and sometimes even dazzling,” the teacher elaborated. “However, it cannot create a pot from formless clay. It needs to start with pots that have been created by potters. All these pots get thrown into its bin, and the machine mixes and matches their functions and their shapes. Plus the Machine consumes whatever you yourself produce. You also feed the Machine by throwing your own work into its bin.

“In the beginning the Company managed to get all of these pots for free,” the teacher went on. The boy listened closely. “But this is changing. Potters from far and wide have joined forces. They are pointing out that they were not being paid for the time and effort they had invested in making the pots used to feed the Machine. They had not intended their work to be used in this way.”

The boy lightly bit his lower lip. He thought he could understand the rage and frustration of the potters.

“They pressured the Company to publicly acknowledge that the Machine would not function without potters who make pots,” said the teacher. “The general public grew disenamored with the Machine. Respected museums and galleries released reports on how the Machine was consuming rare pottery from Africa, from the precolonial Americas, from neolithic Europe and Asia. The Company was forced to act.”

The teacher sighed. “The Company is currently raising the price of the subscription for the Machine as it tries to respond to the public outcry. Our school can’t afford the Machine any more after the latest price increase. Your pottery teacher is with the school principal at this moment trying to get a refund for the original Machine. Don’t worry, she’ll be back for the next class.”

The boy considered what the teacher had said. After a moment, he remarked, “We’re lucky there’s a stream running behind your house.”

“Yes. I like my stream. There’s enough clay there to make anything you like. We certainly won’t miss the Machine.” He paused before adding, “And there are enough potters around the globe to make whatever we want and need. The world wouldn’t miss the Machine either.”

“I want to be a potter when I grow up,” said the boy.

The boy saw encouragement in the smile the teacher gave him before moving on to the next student.

Standing up to move to the studio sink, the boy realized it would take him longer than usual to get his hands clean. There was nothing to be done about his clothes. He would simply go to his next class still splotched with reddish brown. He didn’t mind.

He glanced at the rough wooden box that held the wild clay supplies standing in the corner where it had replaced the Machine. He couldn’t imagine a group of people on a news report gathered excitedly around this wooden box, like they had gathered around the silver, streamlined, musically purring Machine.

He hoped that the wild clay would remain part of pottery class. While he mixed and sieved the wild clay slip, he understood where pottery comes from. He felt a connection to other potters who had done the same in the past, for thousands of years. Preparing the wild clay raised images in his mind of the pieces he could make in the future.

The boy shook his head. Why would a Machine make a mug without a bottom, even if a potter was waiting to fix it?

If he wanted to re-form an already formed teapot, the boy reflected, he could still do that without the Machine. One of the other students could make a teapot and he would add the dog ears. He smiled to himself. Working together in this way, they would be very sure that the finished piece was theirs and theirs to keep.

Friday, October 2, 2020

Why should recommender system researchers care about platform policy?

In this post, I reflect on why recommender system researchers should care about platform policy. These reflections are based on a talk I gave last week at the Workshop on Online Misinformation- and Harm-Aware Recommender Systems (OHARS 2020) at ACM RecSys 2020, which was entitled, "Moderation Meets Recommendation: Perspectives on the Role of Policies in Harm-Aware Recommender Ecosystems."

Every online platform has a policy that specifies what is and what is not allowed on the platform. Platform users are informed of the policy via platform guidelines. All major platforms have guidelines, e.g., Facebook Community Standards, Twitter Rules and Policies, Instagram Community Guidelines. Amazon's guidelines are sprawling and a bit more difficult to locate, but can be found at pages like Amazon Product Guidelines and Amazon restricted products.

Policy is important because it is the language in which the platform and users communicate about what constitutes harm and needs to be kept off the platform. Communicating via policy, which is expressed in everyday language, ensures that everyone can contribute to the discussion of what is and is not appropriate. Communication via technical language or computer code would exclude people from the discussion. The language of policy is what offers the possibility (which should be used more often) for us to reach consensus on what is appropriate. It also acts as a measuring stick to make specific judgements in specific cases, which is necessary in order to enforce that consensus completely and consistently.

Policy is closer to recommender system research than we realize

On the front lines of enforcing platform policy are platform moderators. Moderation is human adjudication of content on the basis of policy. Moderators keep inappropriate content off the platform. (Read more about moderation in Sarah T. Roberts' Behind the Screen and Tarleton Gillespie's Custodians of the Internet.)

Historically, there has been a separation between moderators and the online platforms that they patrol. Moderators are often contractors, rather than regular employees. It is easy to develop the habit of placing both responsibility for policy enforcement and the blame for enforcement failure outside of the platform (which would also make it distant from the recommender algorithms). An example of such distancing occurred this summer, when Facebook failed to remove a post that encouraged people with guns to come to Kenosha in the wake of the shooting of Jacob Blake. The Washington Post reported that Zuckerberg said: "The contractors, the reviewers who the initial complaints were funneled to, didn’t, basically, didn’t pick this up." He refers to "the contractors", implicitly holding moderators at arm's length from Facebook. It is important that we as recommender system researchers resist absorbing this historic separation between "them" and "us".

First, recommender system researchers, as computer scientists, live by the wisdom of GIGO (Garbage In, Garbage Out). In order to produce harm-free lists of recommended items, we need an underlying item collection that does not contain harmful items. This is achieved via policy, with the help of moderators enforcing that policy.

Second, recommender systems are systems. Recommender system research understands them not only as systems, but as ecosystems, encompassing both human and machine components. When we think of the human component of recommender systems we generally think of users. However, moderators are also a part of the larger ecosystem, and we should include them and their important work in our research.

Connecting recommendation and moderation opens new directions for research

Currently, most of the interest in moderation has been around how to combine human judgement and machine learning in order to quickly, and at large scale, decide what needs to be removed from the platform. At the end of the talk at the workshop, I introduced a case study of a system that can translate the nuanced judgments of moderators into automatic classifiers. I discussed the potential of these classifiers for helping platforms to keep up with the fast change of content and quickly evolving policy. The work has not yet been published, and is currently still in preparation (I hope to be able to add a reference here at some later point).

However, not all policy enforcement involves removal. Some examples of how platform policy interacts with ranking are mentioned in the recent Wired article YouTube's Plot to Silence Conspiracy Theories. It is worth noting that, even if downranking can be largely automated, it is important to keep human eyes in the loop to ensure that the algorithms are having their intended effects. We should strive to understand how this collaboration can be designed to be most effective.

Finally, I will mention that, together with Manel Slokom, I have previously proposed the concept of hypotargeting for recommender systems (hyporec), a recommender system algorithm that produces a constrained number of recommended lists (or groups, sets, sequences). Such an algorithm would make it easier to enforce platform policy not only for individual items, but also for associations between items (which are created when the recommender produces a group, list or stream of recommendations).

In order to understand the argument for hypotargeting, consider the following observation: There is a difference between a situation in which I view one conspiracy book online as an individual book, and a situation in which I view one book online and am immediately offered a discount to purchase a set of three books promoting the same conspiracy.

The difference lies in the impact that the recommender has on the user. Associations of items can be easily interpreted as "a trail of crumbs" leading the user to assume broader supporting evidence for an idea than is actually justified. If the recommender produced a constrained number of sets, it would be easier to review them manually, and to make the subtle judgement of whether it is appropriate to be incentivizing purchase of these items.
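To make the reviewability argument concrete, here is a minimal sketch in Python. It is not the hyporec algorithm itself. It only illustrates the idea that a small, fixed pool of lists can be checked against policy before any user sees them. The pool `CANDIDATE_LISTS`, the policy `BLOCKED_ASSOCIATIONS`, the item IDs, and the overlap-based matching are all hypothetical, and the automatic `review` function merely stands in for the human judgement discussed above.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical pool: the recommender may only ever serve one of these lists.
CANDIDATE_LISTS: Dict[str, List[str]] = {
    "gardening": ["g1", "g2", "g3"],
    "cooking": ["c1", "c2", "c3"],
    "conspiracy_bundle": ["x1", "x2", "x3"],
}

# Hypothetical policy: sets of items that must not be offered together.
BLOCKED_ASSOCIATIONS: List[FrozenSet[str]] = [frozenset({"x1", "x2", "x3"})]


def review(lists: Dict[str, List[str]],
           blocked: List[FrozenSet[str]]) -> Dict[str, List[str]]:
    """Keep only lists that contain no blocked association.

    A stand-in for manual review: because the pool is small,
    every list can be inspected before it is served.
    """
    return {
        name: items
        for name, items in lists.items()
        if not any(assoc <= set(items) for assoc in blocked)
    }


def recommend(history: Set[str], approved: Dict[str, List[str]]) -> List[str]:
    """Serve the approved list that overlaps most with the user's history."""
    best = max(approved, key=lambda name: len(history & set(approved[name])))
    return approved[best]


approved = review(CANDIDATE_LISTS, BLOCKED_ASSOCIATIONS)
print(sorted(approved))                 # names of the lists that passed review
print(recommend({"c1", "x1"}, approved))
```

In this sketch, a user who has viewed the individual item `x1` can still be served a list, but never the bundle that associates `x1` with `x2` and `x3`. The design choice is that policy is applied to whole lists, which is only tractable because the number of lists is constrained.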

Ultimately, these ideas open new possibilities for policy as well: e-commerce sites should be transparent not only about which items they remove, but also about the items they prevent from occurring together in lists, groups, or streams.

There are no silver-bullet solutions to the problem of harm caused by recommender systems. However, it does seem like there is a great deal of potential in researching algorithms that can be steered by humans in order to enforce policy.

Monday, August 10, 2020

Three Laws of Robotic Language

This post is a draft on which I am currently eliciting feedback. Changes may be made in the future.

Artificial Intelligence that can produce language is improving in leaps and bounds (cf. the recent GPT-3 as reported on, e.g., in The Economist). However, it is still early enough to think seriously about how we should guide the development of language AI in order to maintain influence over the large-scale, long-term effects of automatic language generation. Asimov’s Three Laws of Robotics have inspired AI research towards conscious design choices during the early stages of new AI technologies. Parallel to these laws, this post proposes Three Laws of Robotic Language. We understand robotic language as language (written, spoken, or signed) that was generated partially or entirely by an automatic system. Because such a system can be seen as a machine engaging in a conventionally human activity, we refer to it as a language robot. These laws are intended to support researchers developing AI for natural language generation. The laws are formulated to help lay a solid foundation for what is to come by inspiring careful reflection about what we need to get right from the beginning, and the mistakes we need to avoid.

The Three Laws of Robotic Language

First Law: A language robot must declare its identity.

Second Law: A language robot’s identity must be easy to verify.

Third Law: A language robot’s identity must be difficult to counterfeit.

Practical benefits

Adopting these Three Laws would support desirable practical properties of robotic language as its use becomes more widespread:

People (readers, consumers) will be able to identify content as robotic language (as opposed to language produced by other people) without relying on sophisticated technology.
People will be able to confirm the source of the content without relying on sophisticated technology.
Entities (organizations, companies) that generate high-quality, reliable robotic language can be sure that consumers can recognize and trust their content.
Entities that generate robotic language can more easily ensure that they don’t unwittingly train their language generation systems on previously generated robotic language.

Like the Three Laws of Robotics, these laws depend on adoption by the people and organizations that develop and control technology. For many, the practical properties delivered by the laws will be convincing enough. For others, it will be important to understand the link between these laws and the nature of human language, which is explained next.

Moving robotic language towards human language

Currently, the success of robotic language is judged by its ability to fool a reader into mistaking it for language generated by a human. This criterion seems sensible for judging individual sentences, paragraphs or documents. Adopting this criterion implies that we, effectively, regard human language as the generation and exchange of sequences of words and that we consider the aim of language robots to be approximating these sequences. However, if we look at the larger picture of how people actually use language, we see that language goes beyond word sequences. What interests us here is how language conveys the connection between the creator (i.e., who is speaking or writing) and the language content that they create (i.e., what is spoken or written). The Three Laws of Robotic Language state that when language robots generate language content, information about the creator must be inextricable from that content. Adding the criterion of creator-content inextricability should not be considered a nice-to-have functionality that can optionally be added to language robots at some future point. Rather, this feature must be planned from the beginning, before language robots establish themselves as a major source of the language content that we consume.

For some, the idea that the connection between creator and content is an important part of language is surprising. It is not, however, radically new, but rather an observation, perhaps so obvious that it is easily overlooked. Think about speaking to a baby or an animal: they react to the you-ness of your voice, although they might not understand your words. Our voices identify us as individuals. On top of that, when we hear a voice we may not recognize the specific person speaking, but we still hear something about them. Speech is produced by the human body, and is given its form by our mouths and nasal cavities. Our voices identify something about us, e.g., how big we might be. The Three Laws of Robotic Language are, at their root, a proposal to give language robots a “sound of voice” that would carry information about the origin of the language content that they produce. Language robots must identify themselves, or at least reveal enough about themselves so that it is clear (without the need for sophisticated technology) that they are robots.

In order to better grasp why the inextricability of creator from content is a fundamental characteristic of human language, it helps to look back in time. Throughout most of the history of language, speech could not exist independently of a speaker (and sign language could not exist independently of a signer). It was impossible to decouple the words and the source of the words. It is only with the rise of written language that we have the option of breaking the content-creator association, allowing language content to float free of the person who produced it. Most recently, speech synthesis or sign synthesis can also disassociate the speaker from what is spoken. This possibility of content-without-creator now feels so natural to us that it is hard to imagine that it was not originally a property of human language. However, the age of speech-only language was tens of thousands of years (possibly more) longer than the current era of written language. It may seem strange from the perspective of today, but the original state of human language is one of inextricability: speech could not exist without a speaker.

In short, we know that language works well with inextricability: that’s the way in which human language was originally developed and used. For this reason, the Three Laws of Robotic Language should not be considered an unnatural imposition, but rather a gentle requirement that language robots behave in a way that is closer to the original state of language.

An important design choice

It is important to note that when humans use language they creatively manipulate the connection between who is creating and what is created. We imitate others’ voices. We quote other people. We love the places where we can yell and hear our voices echoed back. Once written language introduced the possibility of extricating the creator from language content, we started to take advantage of the option of hiding our identities: we use pen names and we write anonymous messages. The Three Laws of Robotic Language constrain the ability of language robots to engage in these kinds of activities. For example, the laws prevent them from generating anonymous content or producing imitations that are impossible to detect.

At first consideration, it seems that the Three Laws of Robotic Language represent an unnecessary hindrance or constraint. However, human language is characterized by strong constraints. On further thought, it becomes clear that language robots need to be subject to some form of constraint if they are to interact productively with naturally constrained humans over the long run in large-scale information spaces.

The constraints on human language are human mortality and limited physical strength. When we focus on a small, local scale, thinking about individual texts and short periods of time, we risk overlooking these constraints. However, they are there and their effect is important.

First, think about human mortality: A given person can produce and consume only so many words in their lifetime. Our deaths represent a hard limit, and force us to choose, over the course of our existence, what we say and what we don’t say, what we listen to and read, and what we don’t. A language robot needs shockingly little time to generate the same amount of language content that a human would produce (or could consume) in a lifetime.

Second, think about human physical strength. Language is the means by which humans as a species have pooled their physical strength. Language allows us to engage in coordinated action towards a common goal. We use language to convince other people to adopt our opinions or follow our plans. The power of our language to convince is limited by our physical ability to act consistently with our opinions or to contribute to carrying out our plans. People speaking empty words put themselves at risk of ostracization or physical harm. A language robot can generate language that is finely tuned to be convincing, and is unconstrained by the need to follow up words with action. Language robots risk nothing.

Considering again Asimov’s Three Laws of Robotics, human mortality and limited physical strength are what make the laws necessary in the face of robots with superior strength and stamina. The laws level the playing field, so to say. The Three Laws of Robotic Language serve a similar function. They do not protect humans as directly as Asimov’s laws. However, they make the actions of language robots traceable, which provides a lever that allows humans to maintain influence on the large-scale, long-term impact of robotic language on our information sphere.

At this point, we don't know enough to predict this influence exactly. What is clear, however, is that we need some kind of constraint. It is also clear, as argued above, that the Three Laws of Robotic Language are consistent with a functioning form of human language, which is actually its original form. Further, we know that the laws have some already-obvious advantages. Recall from above the desirable practical properties: inextricability delivers convenience. For example, following the Three Laws of Robotic Language will prevent AI researchers from inadvertently training language robots on automatically generated text, which causes feedback loops (resulting, possibly, in systems drifting away from human-interpretable syntax and semantics). Further, as we struggle to gain control of malicious bots and disinformation online, it would be helpful if the language robots with honorable intent would declare themselves. Inextricability would make it easier to build a case against ill-intentioned actors.

The Three Laws of Robotic Language are not a silver-bullet solution, but rather a well-informed design choice. Currently, AI researchers have defaulted to the extricability of creator from content. The Three Laws will already be a success if they inspire AI researchers to pause and consider whether inextricability, rather than extricability, should be considered the default choice for systems that automatically generate natural language (text, speech and sign).

An example

Let’s consider a language robot that generates text sentences. We will call this language robot DP-bot, because it declares its identity by upholding the double prime (DP) rule with every sentence that it produces. The language robot can generate the sentence:

We adore language.

The double prime rule states that a prime number of letters must occur a prime number of times in a sentence. In other words, the number of distinct letters whose occurrence count is prime must itself be prime. The rule is upheld by this sentence since ‘e’, ‘a’, ‘g’ (3 and only 3 letters; 3 being a prime number) each occur in the sentence a prime number of times (3, 3, and 2 times respectively; 2 and 3 being prime numbers).

This sentence expresses the same sentiment:

We love language.

The sentence, however, does not respect the double prime rule. ‘e’, ‘a’, ‘g’ all occur a prime number of times (3, 2, and 2 times respectively), but ‘l’ also occurs 2 times. This means that 4 letters occur a prime number of times (4 not being a prime number).

At first consideration, it may seem that DP-bot is a bit too constrained in the semantics that it can express, since the match in meaning between the two sentences is approximate. However, if sentences get longer, or if the rule is defined to apply at a higher level (e.g., at the paragraph level rather than the sentence level), it will be easier to encode semantics into a text that respects the double prime rule without burdensome constraints.

DP-bot upholds the First Law of Robotic Language in that all language content generated by DP-bot respects the double prime rule and is thus identifiable as having been generated by DP-bot. DP-bot upholds the Second Law because it is easy to validate that a sentence respects the double prime rule. The only knowledge that is needed for validation is the natural language sentence that states the double prime rule, i.e., “a prime number of letters must occur a prime number of times in a sentence”. DP-bot does not do very well with the Third Law, since it is easy to create a sentence that respects the double prime rule, thereby counterfeiting DP-bot language. Even manually constructing a sentence that complies with the double prime rule is not difficult. Currently, we are working on formulating rules that are more sophisticated than the double prime rule and that require a large amount of computational power or specialized training data in order to embed them into natural language sentences.

Note that the language robot DP-bot produces text that encodes a mark, but that this mark is not a watermark. Let’s call it a sourcemark, since it marks a language robot as having been the source of the text. A watermark is also a pattern that is embedded into content, like text or an image. Its purpose is to identify ownership. A watermark is designed to be robust to change. For example, if a text is paraphrased or excerpted, the mark should still remain. A sourcemark, however, is meant to identify the original text and associate it with a creator (the source). A small change in text might compromise the meaning, e.g., We do not adore language. A creator can no longer claim responsibility for text once it has changed, and should not be identified with the changed text. Unlike a watermark, a sourcemark must disappear when the text has been changed.

Note that the double prime rule has nothing to do with encryption. Prime numbers are used because they are a relatively small set of numbers that are easy to describe. If the rule can be expressed in a single sentence, “a prime number of letters must occur a prime number of times in a sentence”, then it is easy to confirm the rule without any sophisticated technology, such as a machine learning classifier or a key (with enough patience it can be done without even using a computer). If we used a form of encryption, the ability to verify the identity of a language robot would be restricted to the subset of people who have the appropriate technology (requiring software installation and maintenance, computation, passing of keys).
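To illustrate how little is needed to confirm the rule, here is a minimal sketch of a double prime checker in Python. It is not DP-bot's implementation. The function names are hypothetical, and the sketch assumes that upper and lower case count as the same letter and that only alphabetic characters count as letters; the rule as stated above does not specify either point.

```python
from collections import Counter


def is_prime(n):
    """Return True if n is a prime number (trial division)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def letters_with_prime_counts(sentence):
    """Map each letter that occurs a prime number of times to its count.

    Assumption: case is ignored and non-letters (spaces, punctuation) are skipped.
    """
    counts = Counter(ch for ch in sentence.lower() if ch.isalpha())
    return {letter: n for letter, n in counts.items() if is_prime(n)}


def respects_double_prime(sentence):
    """True if a prime number of letters occur a prime number of times."""
    return is_prime(len(letters_with_prime_counts(sentence)))


for sentence in (
    "We adore language.",
    "We love language.",
    "We do not adore language.",
):
    print(sentence, letters_with_prime_counts(sentence), respects_double_prime(sentence))
```

Under these assumptions, only the first sentence passes. The second has 4 letters with prime counts, and the edited third sentence has 6 (‘e’, ‘d’, ‘o’, ‘n’, ‘a’, ‘g’), so the mark disappears once the text is changed.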

Following the Three Laws of Robotic Language means designing language robots that embed sourcemarks in all the content that they generate. Here, we have presented a simple (and not yet completely successful) example of a sourcemark. We expect that any number of sourcemarks could be developed. An interesting overall property is that even if we do not have knowledge of the presence of a sourcemark, carrying out some simple statistics could reveal the difference between marked and unmarked language content. This signal would reflect a “suspected” language robot, and trigger deeper investigation. As further sourcemarks are developed, desirable properties of marks going beyond the Three Laws of Robotic Language can be innovated.

Monday, November 11, 2019

Reflections on Discrimination by Data-based Systems

A student wrote to me to ask whether he could interview me about discrimination in text mining and classification systems. He is working on his bachelor thesis, and plans to concentrate on gender discrimination. I wrote him back with an informal entry into the topic, and posted it here, since it may be of more general interest.

Dear Student,

Discrimination in IR, classification, or text mining systems is caused by the mismatch between what is assumed to be represented by data and what is helpful, healthy and fair for people and society.

Why do we have this mismatch and why is it so hard to fix?

Data is never a perfect snapshot of a person or a person's life. There is no single "correct" interpretation inherent in data. Worse, data creates its own reality. Let's break it down.

Data keeps us stuck in the past. Data-based systems make the assumption that predictions made for use in the future can be meaningfully based on what has happened in the past. With physical science, we don't mind being stuck in the past. A ballistic trajectory or a chemical reaction can indeed be predicted by historical data. With data science, when we build systems based on data collected from people, shaking off the past is a problem. Past discrimination perpetuates itself, since it gets built into predictions for the future. Skew in how datapoints are collected also gets built into predictions. Those predictions in turn get encoded into the data and the cycle continues.

In short, the expression "it's not rocket science" takes on a whole new interpretation. Data science really is not rocket science, and we should stop expecting it to resemble physical science in its predictive power.

Inequity is exacerbated by information echo chambers. In information environments, we have what are known as rich-get-richer effects, i.e., videos with many views gain more views. This means that small initial tendencies are reinforced. Again, the data creates its own reality. There is a difference between data collected in online environments and data collected via a formal poll.
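To make the reinforcement concrete, here is a small toy simulation in Python. It is only a sketch under a strong assumption that I am adding for illustration: each new view goes to a video with probability proportional to the views that video already has. The function name and the parameter values are hypothetical, and real platforms are of course more complicated.

```python
import random


def simulate_views(n_videos=5, n_views=10_000, seed=0):
    """Toy rich-get-richer process: popular videos attract more new views."""
    rng = random.Random(seed)
    views = [1] * n_videos  # all videos start out identical, with one view each
    for _ in range(n_views):
        # Assumption: the chance of being watched is proportional to current views.
        chosen = rng.choices(range(n_videos), weights=views, k=1)[0]
        views[chosen] += 1
    return views


views = simulate_views()
total = sum(views)
for index, count in enumerate(views):
    print(f"video {index}: {count / total:.1%} of all views")
```

The videos are identical at the start, so any large differences in the final shares come only from early chance differences that the process then amplifies. Trying different values of `seed` shows that which video ends up ahead is arbitrary.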

Other important issues:

"Proxy" discrimination: for example, when families move they tend to follow the employment opportunities of the father and not the mother. The trend can be related to the father often earning more because he tends to be just a bit older (more work experience) and also tends to have spent less time on pregnancy and kid care. This means that the mother's CV will be full of non-progressive job changes (i.e., gaps or changes that didn't represent career advancement), and gets down-ranked by a job candidate ranking function. The job ranking function generalizes across the board over non-progressive CVs, and does not differentiate between the reasons that the person was not getting promoted. In this case, this non-progressiveness is a proxy for gender, and down-ranking candidates with non-progressive CVs leads to reinforcing gender inequity. Proxy discrimination means that it is not possible to address discrimination by looking only at explicit information; implicit information also matters.

Binary gender: When you design a database (or database schema) you need to declare the variable type in advance, and you also want to make the database interoperable with other databases. Gender is represented as a binary variable. The notion that gender is binary gets propagated through systems regardless of whether people actually map well to two gender classes. I notice a tendency among researchers to assume that gender is somehow a super-important variable contributing to their predictions just because it seems easy to collect and encode. We give importance to the data we have, and forget about other, perhaps more relevant data, that are not in our database.

Everyone's impacted: We tend to focus on women when we talk about gender inequity. This is because the examples of gender inequity that threaten life and limb tend to involve women, such as gender gaps in medical research. Clearly action needs to be taken. However, it is important to remember that everyone is impacted by gender inequity. When a lopsided team designs a product, we should not be surprised when the product itself is also lopsided. As men get more involved in caretaking roles in society, they struggle against pressure to become "Supermom", i.e., fulfill all the stereotypical male roles, and at the same time excel at the female roles. We should be careful, while we are fixing one problem, not to fully ignore, or even create, another.

I have put a copy of the book Weapons of Math Destruction in my mailbox for you. You might have read it already, but if not, it is essential reading for your thesis.

From the recommender system community in which I work, check out:

Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring author gender in book rating and recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, New York, NY, USA, 242-250.

and also our own recent work, which has made me question the importance of gender for recommendation.

Christopher Strucks, Manel Slokom, and Martha Larson. BlurM(or)e: Revisiting Gender Obfuscation in the User-Item Matrix. In Proceedings of the Workshop on Recommendation in Multistakeholder Environments (RMSE) at RecSys 2019.
http://ceur-ws.org/Vol-2440/short2.pdf

Hope that these comments help with your thesis.

Best regards,
Martha

P. S. As I was about to hit the send button Sarah T. Roberts posted a thread on Twitter. I suggest that you read that, too.
https://twitter.com/ubiquity75/status/1193596692752297984

Sunday, November 10, 2019

The unescapable (im)perfection of data

In data science, we often work with data collected from people. In the field of recommender system research, this data consists of ratings, likes, clicks, transactions and potentially all sorts of other quantities that we can measure: dwell time on a webpage, or how long someone watches a video. Sometimes we get so caught up in creating our systems that we forget the underlying truth:

Data is unescapably imperfect.

Let's start to unpack this with a simple example. Think about a step counter. It's tempting to argue that this data is perfect. The step counter counts steps and that seems quite straightforward. However, if you try to use this information to draw conclusions, you run into problems: How accurate is the device? Do the steps reflect a systematic failure to exercise, or did the person just forget to wear the device? Were they just feeling a little bit sick? Are all steps the same? What if the person was walking uphill? Why was the person wearing the step counter? How were they reacting to wearing it? Did they do more steps because they were wearing the counter? How were they reacting to the goal for which the data was to be used? Did they decide to artificially increase the step count (by paying someone else to do steps for them)?

In this simple example, we already see the gaps, and we see the circle: collecting data influences data collection. The collection of data actually creates patterns that would not be there if the data were not being collected. In short, we need more information to interpret the data, and ultimately the data folds back upon itself to create patterns with no basis in reality. It is important to understand that this is not some exotic rare state of data safely ignored in day-to-day practice (like the fourth state of water). Let me continue until you are convinced that you cannot escape the imperfection of data.

Imagine that you have worked very hard and have controlled the gaps in your data, and done everything to prevent feedback loops. You use this new-and-improved data to create a data-based system, and this system makes marvelous predictions. But here's the problem: the minute that people start acting on those predictions, the original data becomes out of date. Your original data is no longer consistent with a world in which your data-based system also exists. You are stuck with a sort of Heisenberg's Uncertainty Principle: either you get a short stretch of data that is not useful because it's not enough to be statistically representative of reality, or a longer stretch of data, which is not useful because it encodes the impact of the fact that you are collecting data, and making predictions on the basis of what you have collected.

So basically, data eats its own tail like the Ouroboros (image above). It becomes itself. As science fictiony as that might sound, this issue has practical implications that researchers and developers deal with (or ignore) constantly. For example, in the area of recommender system research in which I am active, we constantly need to deal with the fact that people are interacting with items on a platform, but the items are being presented to them by a recommender system. There is no reality not influenced by the system.

The other way to see it is that data is unescapably perfect. Whatever the gaps, whatever the nature of the feedback loops, data faithfully captures them. But if we take this perspective, we no longer have any way to relate data to an underlying reality. Perfection without a point.

And so we are left with unescapable.

Saturday, April 14, 2018

Pixel Privacy: Protecting multimedia from large-scale automatic inference

This post introduces the Pixel Privacy project, and provides related links. This week's Facebook congressional hearings have made us more aware of how easily our data can be illicitly acquired and used in ways beyond our control or our knowledge. The discussions around Facebook have been focused on textual and behavioral information. However, if we think forward, we should realize that now is the time to also start worrying about the information contained in images and videos. The Pixel Privacy project aims to stay ahead of the curve by highlighting the issues and possible solutions that will make multimedia safer online, before multimedia privacy issues start to arise.

The Pixel Privacy project is motivated by the fact that today's computer vision algorithms have super-human ability to "see" the contents of images and videos using large-scale pixel processing techniques. Many of us are aware that our smartphones are able to organize the images that we take by subject material. However, what most of us do not realize is that the same algorithms can infer sensitive information from our images and videos (such as location) that we ourselves do not see or do not notice. Even more concerning than automatic inference of sensitive information is large-scale inference. Large-scale processing of images and video could make it possible to identify users in particular victim categories (cf. cybercasing [1]).

The aim of the Pixel Privacy project is to jump-start research into technology that alerts users to the information that they might be sharing unwittingly. Such technology would also put tools in the hands of users to modify photos in a way that protects them without ruining them. A unique aspect of Pixel Privacy is that it aims to make privacy natural and even fun for users (building on work in [2]).

The Pixel Privacy project started with a 2 minute video:

The video was accompanied by a 2 page proposal. In the next round, I gave a 30 second pitch followed by rapid-fire Q&A. The result was winning one of the 2017 NWO TTW Open Mind Awards (Dutch).

Related links:

The project was written up as a "Change Perspective" feature on the website of Radboud University, my home institution: Big multimedia data: Balancing detection with protection (unfortunately, the article was deleted after a year or so).
The project also has been written up by Bard van de Weijer for Volkskrant in a piece with the title "Digital Privacy needs to become second nature". (In Dutch: "Digitale privacy moet onze tweede natuur worden")

References:

[1] Gerald Friedland and Robin Sommer. 2010. Cybercasing the Joint: On the Privacy Implications of Geo-tagging. In Proceedings of the 5th USENIX Conference on Hot Topics in Security (HotSec’10). 1–8.

[2] Jaeyoung Choi, Martha Larson, Xinchao Li, Kevin Li, Gerald Friedland, and Alan Hanjalic. 2017. The Geo-Privacy Bonus of Popular Photo Enhancements. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR '17). ACM, New York, NY, USA, 84-92.

[3] Ádám Erdélyi, Thomas Winkler, and Bernhard Rinner. 2013. Serious Fun: Cartooning for Privacy Protection. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013.

Monday, January 1, 2018

2018: The year we embrace the information check habit

The new year dawns in the Netherlands. The breakfast conversation was about the Newscheckers site in Leiden and about the ongoing "News or Nonsense" exhibition at the Netherlands Institute for Sound and Vision.

Signs are pointing to 2018 being the year that we embrace the information check habit: without thinking about it, we double-check the trustworthiness, in both factuality and framing, of any piece of information that we consume in our daily lives. If the information will influence us, if we will act upon it, we will finally have learned to automatically stop, look, and listen: the same sort of skills that we internalized when we learned to cross the street as youngsters.

For me, 2018 is the year that I make peace with how costly information quality is. On factuality: I spend hours reviewing papers and checking sources. On framing: I devote a lot of time to looking for resources in which key concepts and processes are explained in ways that my students can easily understand. And too often I am prevented from working on factuality and framing by worrying about the consequences of missing something or making the wrong choices.

It is costly in terms of time and effort just to choose words. I need words to convey to the students in my information science course that the world is dependent on their skills and their professional standards: anyone whose work involves responsibility for communication must devote time and effort to information quality and must take constant care to inform, rather than manipulate.

What is the name for our era? I don't say "post-truth". An era can call itself "post-truth", but that's asking us to accept that it is fundamentally different from whatever came before---the "pre-post-truth" era. The moment we stop to reflect on how the evidence proves that we have shifted from truth to post-truth, we are engaging in truth seeking. Post-truth goes poof.

I don't say "fake news" era. I grew up with the National Enquirer readily available at the supermarket check out counter, with its bright and interesting pictures of UFOs and celebrity divorces. That content wasn't there to contribute to building my mental model of reality, any more than Pacman. "Fake news" has always been there.

My search for the right words continues. I am using the book Weaponized Lies by Daniel Levitin for the first time this year in order to teach critical thinking skills. Levitin uses words like "counterknowledge" and "misinformation". These are important terms, but they imply the existence of an intelligent adversary intentionally misleading us. It is important to defend against these forces. However, the idea that the problem is people putting effort into "weaponization" overlooks the less dramatic, and less easily identified, problem of reasoning from shaky, half-remembered information sources or using flawed logic to build arguments.

Now at the end of the first day of 2018, I am staring at Weaponized Lies next to my keyboard, wishing there were shortcuts---that I didn't have to start from the bottom, finding the words to talk about the importance of information quality, even before I start talking about information quality itself and researching how to build safer, more equitable information environments.

There are no shortcuts. The only thing that we can hope for is that we can routinize information check. Make it a habit.

I even stopped for a moment to dream about a rising demand for information quality creating new jobs. We need professionals who are able to help us monitor information without sliding into suppressing free speech and imposing censorship. This is the direction in which our knowledge society should grow.

I thought I remembered reading an article online that discussed 2018 as the "Information Year". Now, for the life of me, I cannot find it. It takes so long to track down and keep track of sources. My first step in making peace with the cost of information quality: I end this blog post by admitting I have no proof for my thesis that 2018 is the year we embrace the information check habit. The title is instead an expression of hope that we can move in that direction.