Sunday, February 7, 2016

MediaEval 2015: Insights from last year's experiences in multimedia benchmarking



This blogpost is a list of bullet points concerning MediaEval 2015. It represents the "meta-themes" of MediaEval that I perceived to be the strongest during the MediaEval 2015 season, which culminated with the MediaEval 2015 Workshop in Wurzen, Germany (14-15 September 2015). I'm putting them here, so that we can look back later and see how they are developing.
  1. How not to re-invent the wheel? Providing task participants with reading lists of related work and with baseline implementations helps ensure that it is as easy as possible for them to develop algorithms that extend the state of the art.
  2. Reproducibility and replication: How can we encourage participants to share information about their approaches so that their results can be reproduced or replicated? How can we emphasize the importance of reproduction and replication and at the same time push for innovation, and forward movement in the state of the art (and avoid re-inventing the wheel as just mentioned)? One answer that arose this year was to reinforce student participation. Students should feel welcome at the workshop, even if they “just” reproduced an existing workflow.
  3. Development of evaluation metrics for new tasks: Innovating a new task may involve developing a new evaluation metric. All tasks face the challenge of ensuring that they are using an evaluation metric that faithfully reflects usefulness to users within an evaluation scenario.
  4. How to make optimal use of leaderboards in evaluation: Participants should be able to check on their progress over the course of the benchmark, and aspire to ever-greater heights. However, it is important that leaderboards not discourage participants from submitting final runs to the benchmark. It is possible that an innovative new approach does very badly on the leaderboard, but is still valuable.
  5. Understanding the relationship between the conceptual formulation of the task, and the dataset that is chosen for use in the task: Are the two compatible? Are there assumptions that we are making about the dataset that do not hold? How can we keep task participants on track: solving the conceptual formulation from the task, and not leveraging some incidental aspect of the dataset?
  6. Disruption: Tasks are encouraged to innovate from year to year. However, 2015 was the first year that organizers started planning far ahead for “disruption” that would take the task to the next level in the next year.
  7. Using crowdsourcing for evaluation: How to make sure that everyone is aware of and applies best practices? How to ensure that the crowd is reflective of the type of users in the use scenario of the task?
  8. Engineering: Task organization involves an enormous amount of time and dedication to engineering work. We continuously seek ways to structure organizer teams and to recruit new organizers and task auxiliaries to make sure that no one feels that their scientific output suffered in a year where they spent time handling the engineering aspects of MediaEval task organization.
  9. Defining tasks and writing task descriptions: We repeatedly see that the process of defining a new task and of writing task descriptions must involve a large number of people. If people with a lot of multimedia benchmarking experience contribute, they can help to make sure that the task definition is well grounded in the existing literature. If people with very little experience in multimedia benchmarking contribute, they can help to make sure that the task definition is understandable even to new participants. We try to write task descriptions such that a master's student planning to write a thesis on a multimedia-related topic would easily understand what was required for the task.

In order to round this off to a nice "10" points let me mention another issue that is constantly on my mind, namely, the way that the multimedia community treats the word "subjective".

"Subjective" is something that one feels oneself as a subject (and cannot be directly felt by another person---pain is the classic example). In MediaEval tasks, such as Violent Scene Detection, we would like to respect the fact that people are entitled to their own opinions about what constitutes a concept. Note that people can communicate very well concerning acts of violence, without all having an exactly identical idea of what constitutes "violence". Because the concept "works" in the face of the existence of person perspectives, we can consider the task "subjective". 

So often researchers reason in the sequence, "This task is subjective, therefore it is difficult for automatic multimedia analysis algorithms to address". That reasoning simply does not follow. Consider this example: Classifying a noise source as painful is the ultimate "subjective task". You as a subject are the only one who knows that you are in pain. However: Create a device that signals "pain" when noise levels reach 100 decibels, and you have a solution to the task. Easy as pie. "Subjective" tasks are not inherently difficult. 

Instead: whether a task is difficult to address with automatic methods depends on the stability of content-based features across different target labels. 

The whole point of machine learning is to generalize across not only obvious cases, but also across cases in which no stability of features is apparent to a human observer. If we stuck to tasks that "looked" easy to a researcher browsing through the data, (exaggerating a bit for effect) we might as well handcraft rule-based recognizers. So my point 10 is to try to figure out a way to keep researchers from being scared off from tasks just because they are "subjective", without giving the matter a second thought. Multimedia research needs to tackle "subjective" tasks in order to make sure that it remains relevant to the real-world needs of users---once you understand subjectivity, you start to realize that it is actually all over the place.

In 2014, we noticed that the discussion of such themes was becoming more systematic, and that members of the MediaEval community were interested in having a venue in which they could publish their thoughts. For this reason, in 2015, we added a MediaEval Letters section to the MediaEval Working Notes Proceedings dedicated to short considerations of themes related to the MediaEval workshop. The Letter format allows researchers to publish their thoughts while they are still developing, even before they are mature enough to appear in a mainstream venue.

The concept of MediaEval Letters was described in the following paper, in the 2015 MediaEval Working Notes Proceedings:

Larson, M., Jones, G.J.F., Ionescu, B., Soleymani, M., Gravier, G. Recording and Analyzing Benchmarking Results: The Aims of the MediaEval Working Notes Papers. Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online http://ceur-ws.org/Vol-1436/Paper90.pdf


Look for MediaEval Letters to be continued in 2016.

Monday, December 21, 2015

Features, machine learning, and explanations

Selected Topics in Multimedia Computing is a master-level seminar taught at the TU Delft. This post is my answer to the question from one of this year's students, asked in the context of our discussion of the survey paper that he wrote for the seminar on emotion recognition for music. I wrote an extended answer, since questions related to this one come up often, and my answer might be helpful for other students as well. 

Here is the text of the question from the student's email:


I also have a remark about something you said yesterday during our Skype meeting. You said that the premise of machine learning is that we don't exactly know why the features we use give an accurate result (for some definition of accuracy :)), hence that sentence about features not making a lot of sense could be nuanced, but does that mean it is accepted in the machine learning field that we don't explain why our methods work? I think admitting that some of the methods that have been developed simply cannot be explained, or at least won't be for some time, would undermine the point of the survey. Also, in that case, it would seem we should at least have a good explanation for why we are unable to explain why the unexplained methods work so well (cf. Deutsch, The beginning of infinity, particularly ch. 3).

Your survey tackles the question of missing explanations. In light of your topic, yes, I do agree that it would undermine your argument to identify methods that cannot be explained as valuable in moving the field forward.

My comment deals specifically with this sentence: "For some of the features, it seems hard to imagine how they could have a significant influence on the classification, and the results achieved do not outrank those of other approaches by much."

I'll start off by saying that a great deal more important than this specific sentence is a larger point that you are making in the survey, namely that meaningful research requires paying attention to whether what you think you are doing and what you are actually doing are aligned. You point out the importance of the "horse" metaphor of:

Sturm, B.L., "A Simple Method to Determine if a Music Information Retrieval System is a “Horse”," in Multimedia, IEEE Transactions on , vol.16, no.6, pp.1636-1644, Oct. 2014.

I couldn't agree more on that.

But here, let's think about the specific sentence above. My point is that it would help the reader to express what you are thinking here more fully. If you put two thoughts into one sentence like this, the reader will jump to the conclusion that one explains the other. You want to avoid assuming (or implying that you assume) that the disappointing results could have been anticipated by choosing features that a priori could be "imagined" to have significant influence.

(Note that there are interpretations of this sentence — i.e., if you read "and" as a logical conjunction — that do not imply this. As computer scientists, we are used to reading code, and these interpretations are, I have the impression, relatively more natural to us than to other audiences. So it is safer to assume that in general your readers will not spend a lot of time picking the correct interpretation of "and", and need more help from your side as the author :-))

As a recap: I said that Machine Learning wouldn't be so useful if humans could look at a problem, and tell which features should be used in order to yield the best performance.  I don't want to go as far as claiming it is the premise, in fact, I rather hope I didn't actually use the word "premise" at all.

Machine learning often starts with a feature engineering step, in which you apply an algorithm to select the features that will be used, e.g., by your classifier. After this step, you can "see" which features were selected. So it's not the case that you have no idea why machine learning works.
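As a minimal sketch of what such a feature-selection step can look like (pure Python, with invented feature names; a real pipeline would typically use a library such as scikit-learn): score each feature by the absolute value of its correlation with the labels, then keep the top k. After this step, the selected features are there for you to inspect.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def select_features(X, y, names, k=2):
    """Univariate feature selection: X is a list of samples (each a
    list of feature values), y the labels. Returns the names of the
    k features whose values correlate most strongly with y."""
    columns = list(zip(*X))
    scores = {name: abs(pearson(col, y)) for name, col in zip(names, columns)}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Because the scores are kept per named feature, you can inspect exactly which features the step selected and how strongly each one correlated with the labels.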

My point is that you need to be careful about limiting your input to feature selection a priori. If you assume you yourself can predict which features will work, you will miss something. When you use deep learning, you don't necessarily do feature selection, but you do have the possibility of inspecting the hidden layers of the neural network, and these can shed some light on why it works.

This is not to say that human intelligence should not be leveraged for feature engineering. Practically, you do need to make design decisions to limit the possible number of choices that you are considering. Well-motivated choices will land you with a system that is probably also "better", along the lines of Deutsch's thinking (I say "probably" because I have not read the book that you cite in detail.)

In any case: careful choices of features are necessary to prevent you from developing a classifier that works well on the data set that you are working on because there is an unknown "leak" between a feature and the ground truth, i.e., for some reason one (or more) of the features is correlated with the ground truth. If you have such a leak, your methods will not generalize, i.e., their performance will not transfer to the case of unseen data. The established method for preventing this problem is ensuring that you carry out your feature engineering step on separate data (i.e., separate from both your training and your test sets). A more radical approach that can help when you are operationalizing a system is to discard features that work "suspiciously" well. A good dose of common sense is very helpful, but note that you should not try to replace good methodology and feature engineering with human intelligence (which I mention for completeness, and not because I think you had any intention in this direction).
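The separate-data discipline described here can be sketched as follows (a minimal illustration; the split proportions are arbitrary choices, not recommendations):

```python
import random

def three_way_split(samples, seed=0, frac_fe=0.2, frac_test=0.2):
    """Reserve one slice of the data for feature engineering and one
    for testing, and train on the rest, so that feature choices are
    never tuned on data later used to train or evaluate the model."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_fe, n_test = int(n * frac_fe), int(n * frac_test)
    feature_eng = shuffled[:n_fe]
    test = shuffled[n_fe:n_fe + n_test]
    train = shuffled[n_fe + n_test:]
    return feature_eng, train, test
```

Any feature that works "suspiciously" well on the feature-engineering slice can then be questioned before it ever touches the training or test data.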

It is worth pointing out that there are plenty of problems out there that can indeed be successfully addressed by a classifier that is based on rules that were defined by humans, "by hand". If you are trying to solve such a problem, you shouldn't opt for an approach that would require more data or computational resources, merely for the sake of using a completely data-driven algorithm. The issue is that it is not necessarily easy to know whether or not you are trying to solve such a problem.

In sum, I am trying to highlight a couple of points that we tend to forget sometimes when we are using machine learning techniques: You should resist the temptation to look at a problem and declare it unsolvable because you can't "see" any features that seem like they would work. A second related temptation that you should resist is using sub-optimal features because you make your own assumptions about what the best features must be a priori.

A few further words on my own perspective:

There are algorithms that are explicitly designed to make predictions and create human-interpretable explanations simultaneously. This is a very important goal for intelligent systems that are being used by users who don't have the technical training to understand what is going on "under the hood."

Personally, I hold the rather radical position that we should aspire to creating algorithms that are effective, but yet so simple that they can be understood by anyone who uses their output. The classic online shopping recommender "People who bought this item also bought ...." is an example that hopefully convinces you such a goal is not completely impossible. A major hindrance is that we may need to sacrifice some of the fun and satisfaction we derive from cool math.
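The co-occurrence logic behind "people who bought this item also bought ..." fits in a few lines; this sketch (with invented purchase baskets) illustrates why such a recommender can be understood by anyone who uses its output: the co-purchase counts are the explanation.

```python
from collections import Counter

def also_bought(item, baskets, top_n=3):
    """Count how often other items co-occur with `item` in the same
    basket, and recommend the most frequent ones. The counts
    themselves double as a human-readable explanation."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(i for i in basket if i != item)
    return [i for i, _ in counts.most_common(top_n)]
```

No cool math is sacrificed here because there was none to begin with, and that is exactly the point.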

Stepping back yet further:

Underlying the entire problem is the danger that the data you use to train your learner has properties that you have overlooked or did not anticipate, and that your resulting algorithm gives you, well, let me just come out and call it "B.S." without your realizing it. High profile cases get caught (http://edition.cnn.com/2015/07/02/tech/google-image-recognition-gorillas-tag/). However, these cases should also prompt us to ask the question: Which machine learning flaws go completely unnoticed?

Before you start even thinking about features, you need to explain the problem that you are trying to address, and also explain how and why you chose the data that you will use to address it.

Luckily, it's these sorts of challenges that make our field so fascinating.

Friday, October 30, 2015

Compressing a complicated situation: Making videos that support search

Mohammad Soleymani asked me to make a video for ASM 2015, the Workshop on Affect and Sentiment in Multimedia, held at ACM Multimedia 2015 on 30 October in Brisbane, Australia. He wanted to present during the workshop the view of different disciplines outside of computer science on sentiment. Since my PhD is in linguistics (semantics/syntax interface), I was the natural choice for "The Linguist". The result of my efforts was a video entitled "A Linguist Talks about Sentiment".

In this post, I briefly discuss my thoughts upon making this video. I wanted to give viewers with no linguistics background the confidence they needed in order to attempt to understand semantics, as it is studied by linguists, and leverage these insights in their work. Ultimately, such a "support for search" video has the goal of addressing people starting with the background that they already have, and giving them just enough knowledge in order to support them in searching for more information themselves.


Mohammad gave me four minutes of time for the video, and I pushed it beyond the limit: the final video runs over six minutes. I realized that what I needed to do was not to convey all possible relevant information, but rather to show where a complicated situation is hiding behind something that might at first glance seem simple. The effect of the video is to convince the viewer that it's worth searching for more information on linguistics, and to give just enough terminology to support that search.

My original motivation to make the video was a strong position that I hold: at one level, I agree that anyone can study anything that they want. However, without a precisely formulated, well-grounded definition of what we are studying, we are in danger of leaving vast areas of the problem unexplored, and of prematurely declaring the challenge a solved problem.

After making this video, I realized that one can consider "support for search" videos a new genre of videos. These videos allow people to get their foot in the door, and provide the basis for search. A good support-for-search video needs to address specific viewer segments "where they are", i.e., given their current background knowledge. It must simplify the material without putting people off on completely the wrong track. Finally, it must admit this simplification, so that viewers realize that there is more to the story.

When I re-watched my video after the workshop, I found a couple of places that made me cringe. Would a fellow linguist accept these simplifications, or did they distort too much? I make no mention of the large range of different psychological verbs, or of the difference between syntactic and semantic roles. I put a lot of emphasis on the places where I myself have noticed that people fail to understand. On the whole, the goal is that the video allows the viewer to go from a state of being overwhelmed by a complex topic to having the handholds necessary in order to formulate information needs and support search.

Are "support for search" videos a new genre on the rise? If this "genre" is indeed a trend, it is a welcome one.

Recently browsing, I hit on a video entitled "Iraq Explained -- ISIS, Syria and War":


This video assumes that the viewer brings a particular (small) amount of background knowledge, and focuses on introducing the major players and a broad sketch of the dynamics. There are points in the video where I find myself saying "Well, OK" (the choice of music gives the impression of things happening with a sense of purpose that I do not remember at the time). This video is clearly a "support for search" video, since it ends with the following information:

"We did our best to compress a complicated situation in a very short video, so please don’t be mad that we had to simplify things. There is sooo much stuff that we couldn’t talk about in the detail it deserved…But if you didn’t know much about the situation this video hopefully helps making researching easier."

On the whole, the video succeeds in the goal of giving viewers the information that they need to start sorting out a complicated situation. It gives them a picture of what they don't yet understand, so that they can then start looking for more information.

Tuesday, September 22, 2015

CrowdRec 2015 Workshop Panel Discussion: Crowdsourcing in Recommender Systems

The CrowdRec 2015 Workshop on Crowdsourcing and Human Computation for Recommender Systems was held this past Saturday at ACM RecSys 2015 in Vienna, Austria. The workshop ended with a fish bowl panel on the topic of the biggest challenge facing the successful use of crowdsourcing for recommender systems. I asked the panelists to take a position as to the nature of this challenge: was it related to algorithms, engineering, or ethics? During the panel, the audience took crowdsourced notes about the panel on TitanPad.

After the workshop I received two comments that particularly stuck in my mind. One was that I should have told people that if they contributed to the TitanPad notes, I would write a blogpost summarizing them. I was happy that at least someone thought that a blog post would be a good idea. (I hadn't considered that having one's thoughts summarized in my blog would be a motivational factor.) The other comment was that the panel question was not controversial enough to spark a good discussion.

In response to these two comments, here is a summary/interpretation of what was recorded in the crowdsourced notes about the panel.
  • The biggest challenge of crowdsourcing is to design a product in which crowdsourcing adds value for the user. Crowdsourcing should not be pursued unless it makes a clear contribution.
  • The way that the crowd uses a crowdsourcing platform, or a system that integrates crowdsourcing is essential. Here, engagement of the crowd is key, so that they are "in tune" with the goals of the platform, and make a productive contribution.
  • The biggest challenge is the principle of KYC. Here, instead of Know Your Client, this is Know Your Crowd. There are many individual and cultural differences between crowdmembers that need to be taken into account.
  • The problem facing many systems is not the large amount of data, but that the data is unpredictably structured and inhomogeneous, making it difficult to ask the crowd to actually do something with it.
  • With human contributors in the crowd, people become afraid of collusion attacks that go against the original, or presumed, intent of a platform. A huge space for discussion (which was not pursued during the panel) opens about who has the right to decide what the "right" and "wrong" way to use a platform is.
  • Crowdwork can be considered people paying with their time: We need to carefully think about what they receive in return.
  • With the exception of this last comment, it seemed that most people on the panel found it difficult to say something meaningful about ethics in the short time that was available for the discussion.
In general, we noticed that there are still multiple definitions of crowdsourcing at play in the community. In the introduction to the workshop, I pointed out that we are most interested in definitions of crowdsourcing where crowdwork occurs in response to a task that was explicitly formulated. In other words, collecting data that was created for another purpose, rather than in response to a task-asker, is not crowdsourcing in the sense of CrowdRec. It's not uninteresting to consider recommender systems that leverage user comments collected from the web. However, we feel that such systems fall under the heading of "social" rather than "crowd", and reserve a special space for "crowd" recommender systems, which involve active elicitation of information. It seems that it is difficult to be productively controversial if we need to delineate the topic at the start of every conversation.

At this point, it seems that we are seeing more recommender systems that involve taggers and curators. Amazon Mechanical Turk, of course, came into being as an in-house system to improve product recommendations, cf. Wikipedia. However, it seems that recommender systems that actively leverage the input of the crowd still need to come into their own.

See also:

Martha Larson, Domonkos Tikk, and Roberto Turrin. 2015. Overview of ACM RecSys CrowdRec 2015 Workshop: Crowdsourcing and Human Computation for Recommender Systems. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys '15). ACM, New York, NY, USA, 341-342.

Saturday, August 22, 2015

Choosing movies to watch on an airplane? Compensate for context

For some people, an airplane is the perfect place to catch up on their movies-to-watch list. For these people there is no difference between sitting on a plane and sitting on the couch in their living room.

If you are one of these people, you are lucky.

If not, then you may want to take a few moments to think about what kind of a movie you should be watching on an airplane.

These are our two main insights on how you should make this choice:
  • Watch a movie that uses a lot of closeups (or relatively little visual detail), is well lit, and moves relatively slowly so that you can enjoy it on a small screen.
  • Watch something that is going to engage you. Remember that the environment and the disruptions on an airplane might affect your ability to focus, and, in this way, disrupt your ability to experience an empathetic relationship with the characters. In other words, unless the plot and characters really draw you in, the movie might not "work" in the way it is intended.
For an accessible introduction to how movies manipulate your brain, see this article from the Wired Cinema Science series:
http://www.wired.com/2014/08/how-movies-manipulate-your-brain
At the perceptual level, your brain needs to be able to exercise its ability to "stitch things together to make sense". It's plausible that this "stitching" also has to be able to take place at an emotional level. Certain kinds of distractors can be expected to simply get in the way of that happening as effectively as it is meant to.

When we began to study what kinds of movies people watch on planes, we used these two insights as a point of departure. We started with these insights after having made some informal observations about the nature of distractors on an airplane, which are illustrated by this video.



At the end of the video, we formulate the following initial list of distractors, which impact what you might want to watch on an airplane.
  1. Engine noise
  2. Announcements
  3. Turbulence
  4. Small screen 
  5. Glare on screen
  6. Inflight service
  7. Fellow travelers
  8. Kid next to you
In short, when choosing a movie to view on the airplane, you should pick a movie that can "compensate for context", meaning that you can enjoy it despite the distractors inherent in the situation aboard an airplane.

We are looking to expand this list of distractors as our research moves forward.

Our ultimate goal (still a long way off) is to build a recommender system that can automatically "watch" movies for you ahead of time. The system would be able to suggest movies that you would like, but above and beyond that the suggested movies would be prescreened to be suitable for watching on an airplane. Such a system would help you to quickly decide what to turn on at 30,000 feet, without worrying that half way through you will realize that it might not have been a good choice.

Since we started this research, I have been paying more and more attention to the experiences that I have with movies on a plane. Here are two.

  • On a recent domestic flight: The woman next to me started to watch Penny Dreadful. She turned it off about ten minutes in. I then also tried to watch it. I really like the show, but it's meant to be dark, gory, and mysterious. These three qualities turned into poorly visible, disconcerting, and confusing at 10,000 feet. This is what I am trying to capture in the video above.
  • On a recent Transatlantic flight: The man sitting next to me turned on his monitor and started The Color Purple, as if it were on his watch list. It's probably on most people's movies-to-watch list, so this seems like a safe choice. However, I was trying to review a paper, and was subject to over two hours of unavoidable glimpses of violence on a screen a few feet from my own. It's the emotional impact of these scenes that makes it a great movie. By the same token, you might not want to be watching it on a plane, especially if you are not going to experience the entire emotional arc. (The movie should be watched with full focus, from beginning to end.)

Until now, the work on context-aware movie recommender systems that I have encountered has recommended movies for situations that are part of what is considered to be "normal" daily life, e.g., watching movies during the week vs. on the weekend, or watching movies with your kids vs. with your spouse. We need more recommender system work that will allow us to get to movies that are suitable for less ideal, more unpleasant, perhaps less frequent situations. Why waste a good movie by watching it in the wrong context? And why suffer any more than necessary while on an airplane?

The work is being carried out within the context of the MediaEval Benchmarking Initiative for Multimedia Evaluation, see Context of Experience 2015. It owes a lot to the CrowdRec project, which pushes us to understand how we can make recommenders better by asking people explicitly to contribute their input.

Wednesday, July 29, 2015

Google Scholar: Sexist or simply statistical?

This is the first of what I intend to be a short series of posts related to my experience with Google Scholar, and to a phenomenon that I call "algorithmic nameism", an unfortunate danger of big data. Here, I describe what appears to be happening, on Google Scholar, to references to certain journal papers that I have co-authored, and briefly discuss the reasons for which we should be concerned, not for me personally, but in general.


I am currently an Assistant Professor in computer science at Delft University of Technology, in Delft, Netherlands, and at this moment also a visiting researcher at the International Computer Science Institute in Berkeley, California. Like many in my profession, I maintain a Google Scholar profile, and rely on the service as a way of communicating my publication activities to my peers, and also of keeping up with important developments in my field. Recently, I clicked the "View and Apply Updates" link (pictured above) in my Google Scholar profile, to see if Google Scholar had picked up on several recent publications, and was quite surprised by what I discovered.

For those perhaps not familiar with Google Scholar, a word of background information. On the update page for your profile, Google Scholar supplies a list of suggested edits to the publication references in your profile. As profile owner, you can then choose to accept or discard them individually. 

In the list, I was very surprised to find the following suggested edit for a recent publication on which I am co-author:

In short, Google Scholar is suggesting that my first name "Martha" be changed to "Matt". 

It is not an isolated case. Currently, in my suggested edits list, there are suggestions to change my name to "Matt" for a total of four papers that I have co-authored, all in IEEE Transactions on Multimedia. 

Part of my specialization within the field of computer science is information retrieval. For this reason, I have insight into the probable reasons for which this might be happening, even without direct knowledge of the Google Scholar edit suggestion algorithm. The short story is that Google Scholar appears to be using a big data analytics algorithm to predict errors, and suggest edits. But it is clearly a case in which "big data" is "wrong data". I plan to delve into more detail in a future post.

Here, I would just like to state why we should be so concerned:

Suggested edits on the "View and Apply Updates" page find their way into the Google Scholar ranking function, and affect whether or not certain publications are found when people search Google Scholar. I have not, to my knowledge, ever clicked the "Edit article" link that would accept the suggestion to change my name to Matt in the reference to one of my publications. However, the Google Scholar search algorithm has apparently already integrated information about "Matt Larson".

Currently, if you go to Google Scholar and query "predicting failing queries matt larson", my paper comes up as the number-one result.


However, if you query "predicting failing queries martha larson", this paper can only be found in the sixth position on the second page (It is the bottom reference in this screenshot of the second page. I have put a red box around Page 2.)


Different people say different things about the importance of having a result on the first page of search results. However, you don't have to be literally of the first-page school (i.e., you don't have to believe "When looking something up on Google, if it's not on the first page of search results then it doesn't exist and my journey ends there.") to imagine that my paper would be more readily found if my name were Matt. (For brevity's sake, I will just zip past the irony of discussing what is basically a failed query for a paper that is actually about failing queries.)

I myself search for people (and their publications) on Google Scholar for a range of reasons. For example, last year I was appointed as an Associate Editor of IEEE Transactions on Multimedia. I search for people on Google Scholar in order to see if they have published the papers that would qualify them to review for the journal.

At the moment, my own IEEE Transactions papers seem to be pushed down in the ranking because Google Scholar is confused about whether my name should actually be "Matt". In general, however, Google Scholar does a good job. I don't research lung cancer (second result on Page 2, as shown above), but otherwise it can be seen from the results list above that Google Scholar generally "knows" that I am me. My profile page does not have any of the shortcomings of the search results, as far as I am aware.

I am someone with an established career, with tenure and relatively many publications. I have no problem weathering the Matt/Martha mix-up.

However: Imagine someone who was at the fragile beginning of her career!

Having IEEE Transactions publications appearing low in her results (compared to her equally well-published colleagues) could make the difference between being invited to review or not. Or, goodness forbid, a potential employer is browsing her publications to determine whether she is qualified for a job, and misses key publications.

I'll conclude with what is without doubt an unexpected statement: It would be somehow positive if the Matt/Martha mix-up were a case of sexism. If it were an "anti-woman" filter programmed into Google Scholar, the solution would be simple. The person/team responsible could be fired, and we could all get on with other things. However: With extremely high probability there are no explicitly "anti-woman" designs here. Although the example above looks for all the world like sexism, at its root it is most probably not. The odds are that the algorithm behind the suggestions to edit "Martha" to "Matt" has no knowledge of my gender whatsoever, and the discrimination is therefore not directly gender-based.

The Matt/Martha mix-up is actually more worrisome if it is not sexism. The more likely case is that this is a new kind of "ism" that has the danger of going totally under the radar. It is not related specifically to gender, but rather to cases that are statistically unusual, given a particular data collection. It is the kind of big data mistake that can potentially disadvantage anyone if big data is applied in the wrong way.

Whether sexist or "simply statistical", we need to take it seriously.

An immediate action that we can take is to realize that we should not trust Google Scholar blindly.

Sunday, July 19, 2015

Teaching the First Steps in Data Science: Don't Simplify Out the Essentials

Teachers of Data Science are faced with the challenge of initiating students into a new way of thinking about the world. In my case, I teach multimedia analysis, which combines elements of speech and language technology, information retrieval and computer vision. Students of Data Science learn that data mining and analysis techniques can lead to knowledge and understanding that could not be gained from conventional observation, which is limited in its scope and ability to yield unanticipated insights.

When you stand in front of an audience that is being introduced to data science for the first time, it is very tempting to play the magician. You set up the expectations of what "should be" possible, and then blow them away with a cool algorithm that does the seemingly impossible. Your audience will go home and feel that they got a lot of bang for their buck---they have witnessed a rabbit being pulled from a hat.

However: will they be better data scientists as a result?

In fact, if you produce a rabbit from a hat, your audience has not been educated at all; they have been entertained. Worst case, they have been un-educated, since the success of the rabbit trick involves misdirection of attention away from the essentials.

My position is that when teaching the first steps in data science, it is important not to simplify out the essentials. Here, two points are key:

First, students must learn to judge the worth of algorithms in terms of the real-world applications that they enable. With this I do not mean to say that all science must be applied science. Rather, the point is that data science does not exist in a vacuum. Instead, the data originally came from somewhere. It is connected to something that happened in the real-world. Ultimately, the analysis of the data scientist must be relevant to that "somewhere", be it a physical phenomenon or a group of people.

Second, students must learn the limitations of the algorithms. Understanding an algorithm means also understanding what it cannot be used for, where it necessarily breaks down.

At a magic show, it would be ridiculous if a magician announced that his magic trick is oriented towards the real-world application of creating a rabbit for rabbit soup. And no magician would display alternative hats from which no rabbit could possibly be pulled. And yet, as data science teachers, this is precisely what we need to do. It is essential that our students know exactly what an algorithm is attempting to accomplish, and the conditions that cause failure.

Yesterday was the final day of the Multimedia Information Retrieval Workshop at CCRMA at Stanford, and Steve Tjoa gave a live demo of a simple music identification algorithm. It struck me as a great example of how to teach data science. As workshop participants we saw that the algorithm is tightly connected to reality (it was identifying excerpts of pieces that he played right there in the classroom on his violin), and his demo showed its limitations (it did not always work).
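To give a flavor of the kind of algorithm involved, here is a minimal sketch of music identification by matching hashed n-grams of a dominant-pitch sequence. This is my own illustrative toy, not Steve's demo code, and the "database" of pitch sequences is entirely made up:

```python
# Toy music identification: fingerprint a piece as the set of hashed
# overlapping n-grams of its dominant-pitch sequence, then identify a
# query by fingerprint overlap. Illustrative only.

def fingerprint(pitches, n=3):
    """Hash overlapping n-grams of a dominant-pitch sequence (in Hz)."""
    return {hash(tuple(pitches[i:i + n])) for i in range(len(pitches) - n + 1)}

# Hypothetical "database" of known pieces (pitch sequences are invented).
database = {
    "Piece A": [660, 660, 587, 523, 587, 660, 784, 660],
    "Piece B": [880, 831, 880, 988, 880, 784, 740, 784],
}
index = {title: fingerprint(seq) for title, seq in database.items()}

def identify(query_pitches):
    """Return the piece whose fingerprint overlaps the query most,
    or None if nothing matches -- the algorithm's built-in failure mode."""
    q = fingerprint(query_pitches)
    best, best_overlap = None, 0
    for title, prints in index.items():
        overlap = len(q & prints)
        if overlap > best_overlap:
            best, best_overlap = title, overlap
    return best

print(identify([587, 523, 587, 660]))   # excerpt of Piece A: identified
print(identify([440, 494, 523, 587]))   # unknown excerpt: None
```

The nice pedagogical property, in the spirit of the demo, is that the failure conditions are visible in the code itself: a query that shares no n-gram with the database, or pitches distorted by noisy estimation, simply returns None.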

This exposition did not simplify out the essentials. Students experiencing such a live demo learn the algorithm, but they also learn how to use it and how to extend it.

We were blown away not so much by the cool algorithm, but by the fact that we really grasped what was going on.

Experiences like this are solid first steps for data science students, and will lead to great places.


Postscript:

That evening, one of my colleagues asked me if I still wrote on my blog. No, I said, I had a bit of writer's block. I had been trying to write a post on Jeff Dean's keynote at ACM RecSys 2014, "Large Scale Machine Learning for Predictive Tasks", and failing miserably. The keynote troubled me, and I was attempting to formulate a post that could constructively explain why. Ten months passed.

With the example of Steve's live demo, it became clear what my main problem with the keynote was. It contained nothing that I could demonstrate was literally wrong. It was simply a huge missed opportunity.

Since ACM RecSys is a recommender system conference, many people in the room were thinking about natural language processing and computer vision problems for the first time. The keynote did not connect its algorithms to the source of the data and possible applications. Afterwards, the audience was none the wiser concerning the limitations of the algorithms it discussed.

I suppose some would try to convince me that when listening to a keynote (as opposed to a lecture) I need to stop being a teacher, and go into magic-watching mode, meaning that I would suspend my disbelief. "That sort of makes sense, it looks pretty good," Dean said to wrap up his exposition of paragraph vectors of Wikipedia articles.

https://www.youtube.com/watch?v=Zuwf6WXgffQ&feature=youtu.be&t=4m20s

If you watch at the deep link, you see that he would like to convince us that we should be happy because the algorithm has landed music articles far away from computer science. 

In the end, I can only hope that the YouTube video of the keynote is no one's first steps in data science.

Independently of a particular application, landing music far away from computer science is also just not my kind of magic trick.