Wednesday, July 29, 2015

Google Scholar: Sexist or simply statistical?

This is the first of what I intend to be a short series of posts related to my experience with Google Scholar, and to a phenomenon that I call "algorithmic nameism", an unfortunate danger of big data. Here, I describe what appears to be happening on Google Scholar to references to certain journal papers that I have co-authored, and briefly discuss why we should be concerned: not for me personally, but in general.

I am currently Assistant Professor in computer science at Delft University of Technology, in Delft, Netherlands, and at this moment also a visiting researcher at the International Computer Science Institute in Berkeley, California. Like many in my profession, I maintain a Google Scholar profile, and rely on the service as a way of communicating my publication activities to my peers and of keeping up with important developments in my field. Recently, I clicked the "View and Apply Updates" link (pictured above) in my Google Scholar profile, to see if Google Scholar had picked up on several recent publications, and was quite surprised by what I discovered.

For those perhaps not familiar with Google Scholar, a word of background information. On the update page for your profile, Google Scholar supplies a list of suggested edits to the publication references in your profile. As profile owner, you can then choose to accept or discard them individually. 

In the list, I was very surprised to find the following suggested edit for a recent publication on which I am co-author:

In short, Google Scholar is suggesting that my first name "Martha" be changed to "Matt". 

It is not an isolated case. Currently, in my suggested edits list, there are suggestions to change my name to "Matt" for a total of four papers that I have co-authored, all in IEEE Transactions on Multimedia. 

Part of my specialization within the field of computer science is information retrieval. For this reason, I have some insight into why this is probably happening, even without direct knowledge of the Google Scholar edit suggestion algorithm. The short story is that Google Scholar appears to be using a big data analytics algorithm to predict errors and suggest edits. But it is clearly a case in which "big data" is "wrong data". I plan to delve into more detail in a future post.
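To make concrete how such a mistake can arise, here is a toy sketch of a purely frequency-based "correction" rule. To be clear: this is my own guess at the flavor of algorithm involved, not Google's actual code, and the corpus counts are invented for illustration.

```python
from collections import Counter

# Hypothetical corpus statistics: first names observed alongside the
# surname "Larson" in a fictional citation collection. In this toy
# corpus, "Matt" happens to be far more frequent than "Martha".
observed = Counter({"Matt": 120, "M.": 30, "Martha": 4})

def suggest_edit(first_name, counts, threshold=10):
    """Suggest replacing a rare name with the most frequent variant
    sharing the same initial. Note that the rule consults only
    frequencies: it has no notion of gender, and no notion of which
    spelling is actually correct."""
    if counts[first_name] >= threshold:
        return first_name  # common enough; leave it alone
    variants = [n for n in counts if n != first_name and n[0] == first_name[0]]
    if not variants:
        return first_name
    best = max(variants, key=lambda n: counts[n])
    return best if counts[best] > counts[first_name] else first_name

print(suggest_edit("Martha", observed))  # the rare-but-correct name loses: "Matt"
print(suggest_edit("Matt", observed))    # the frequent name is kept: "Matt"
```

The point of the sketch is that a rule like this never looks at gender; it simply penalizes the statistically unusual form, which is exactly the kind of mistake I am calling "algorithmic nameism".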

Here, I would just like to state why we should be so concerned:

Suggested edits on the "View and Apply Updates" page find their way into the Google Scholar ranking function, and affect whether or not certain publications are found when people search Google Scholar. I have not, to my knowledge, ever clicked the "Edit article" link that would accept the suggestion to change my name to Matt in the reference to one of my publications. However, the Google Scholar search algorithm has apparently already integrated information about "Matt Larson".

Currently, if you go to Google Scholar and query "predicting failing queries matt larson", my paper comes up as the number-one top-ranked result.

However, if you query "predicting failing queries martha larson", this paper can only be found in the sixth position on the second page. (It is the bottom reference in this screenshot of the second page; I have put a red box around Page 2.)

Different people say different things about the importance of having a result on the first page of search results. However, you don't have to be literally of the first-page school (i.e., you don't have to believe "When looking something up on Google, if it's not on the first page of search results then it doesn't exist and my journey ends there.") to see that my paper would be more readily found if my name were Matt. (For brevity's sake, I will just zip past the irony of discussing what is basically a failed query for a paper that is actually about failing queries.)

I myself search for people (and their publications) on Google Scholar for a range of reasons. For example, last year I was appointed as an Associate Editor of IEEE Transactions on Multimedia. I search for people on Google Scholar in order to see if they have published the papers that would qualify them to review for the journal.

At the moment, my own IEEE Transactions papers seem to be pushed down in the ranking because Google Scholar is confused about whether my name should actually be "Matt". In general, however, Google Scholar does a good job. I don't research lung cancer (second result on Page 2, as shown above), but otherwise it can be seen from the results list above that Google Scholar generally "knows" that I am me. As far as I am aware, my profile page does not suffer from any of the shortcomings of the search results.

I am someone with an established career, with tenure and relatively many publications. I can easily weather the Matt/Martha mix-up.

However: Imagine someone who was at the fragile beginning of her career!

Having IEEE Transactions publications appear low in her results (compared to her equally well-published colleagues) could make the difference between being invited to review or not. Or, goodness forbid, a potential employer browses her publications to determine whether she is qualified for a job, and misses key publications.

I'll conclude with what is without doubt an unexpected statement: it would be somehow positive if the Matt/Martha mix-up were a case of sexism. If it were an "anti-woman" filter programmed into Google Scholar, the solution would be simple. The person/team responsible could be fired, and we could all get on with other things. However: with extremely high probability there are no explicitly "anti-woman" designs here. Although the example above looks for all the world like sexism, at its root it is most probably not. The odds are that the algorithm behind the suggestions to edit "Martha" to "Matt" has no knowledge of my gender whatsoever, and the discrimination is therefore not directly gender based.

The Matt/Martha mix-up is actually more worrisome if it is not sexism. The more likely case is that this is a new kind of "ism" that is in danger of going totally under the radar. It is not related specifically to gender, but rather to cases that are statistically unusual, given a particular data collection. It is the kind of big data mistake that can potentially disadvantage anyone if big data is applied in the wrong way.

Whether sexist or "simply statistical", we need to take it seriously.

An immediate action that we can take is to realize that we should not trust Google Scholar blindly.

Sunday, July 19, 2015

Teaching the First Steps in Data Science: Don't Simplify Out the Essentials

Teachers of Data Science are faced with the challenge of initiating students into a new way of thinking about the world. In my case, I teach multimedia analysis, which combines elements of speech and language technology, information retrieval and computer vision. Students of Data Science learn that data mining and analysis techniques can lead to knowledge and understanding that could not be gained from conventional observation, which is limited in its scope and ability to yield unanticipated insights.

When you stand in front of an audience that is being introduced to data science for the first time, it is very tempting to play the magician. You set up the expectations of what "should be" possible, and then blow them away with a cool algorithm that does the seemingly impossible. Your audience will go home and feel that they got a lot of bang for their buck---they have witnessed a rabbit being pulled from a hat.

However: will they be better data scientists as a result?

In fact, if you produce a rabbit from a hat, your audience has not been educated at all; they have been entertained. Worst case, they have been un-educated, since the success of the rabbit trick involves misdirection of attention away from the essentials.

My position is that when teaching the first steps in data science, it is important not to simplify out the essentials. Here, two points are key:

First, students must learn to judge the worth of algorithms in terms of the real-world applications that they enable. With this I do not mean to say that all science must be applied science. Rather, the point is that data science does not exist in a vacuum. Instead, the data originally came from somewhere. It is connected to something that happened in the real-world. Ultimately, the analysis of the data scientist must be relevant to that "somewhere", be it a physical phenomenon or a group of people.

Second, students must learn the limitations of the algorithms. Understanding an algorithm means also understanding what it cannot be used for, where it necessarily breaks down.

At a magic show, it would be ridiculous if a magician announced that his magic trick is oriented towards the real-world application of creating a rabbit for rabbit soup. And no magician would display alternative hats from which no rabbit could possibly be pulled. And yet, as data science teachers, this is precisely what we need to do. It is essential that our students know exactly what an algorithm is attempting to accomplish, and the conditions that cause failure.

Yesterday was the final day of the Multimedia Information Retrieval Workshop at CCRMA at Stanford, and Steve Tjoa gave a live demo of a simple music identification algorithm. It struck me as a great example of how to teach data science. As workshop participants we saw that the algorithm is tightly connected to reality (it was identifying excerpts of pieces that he played right there in the classroom on his violin), and his demo showed its limitations (it did not always work).
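For readers curious what identification-by-matching involves, here is a toy sketch in the spirit of such a demo (it is not Steve's actual code, and the melodies and note representation are simplified for illustration): fingerprint each known piece by hashing its note trigrams, then identify an excerpt by fingerprint overlap.

```python
# Toy music identification: fingerprint known melodies by hashing
# their note trigrams, then match an excerpt against the index.

def fingerprints(notes, n=3):
    """Return the set of hashed n-grams of a note sequence."""
    return {hash(tuple(notes[i:i + n])) for i in range(len(notes) - n + 1)}

database = {
    "Ode to Joy":    ["E", "E", "F", "G", "G", "F", "E", "D", "C", "C", "D", "E"],
    "Frere Jacques": ["C", "D", "E", "C", "C", "D", "E", "C", "E", "F", "G"],
}
index = {title: fingerprints(seq) for title, seq in database.items()}

def identify(excerpt):
    """Return the best-matching title, or None if nothing overlaps."""
    probe = fingerprints(excerpt)
    scores = {title: len(probe & fp) for title, fp in index.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(identify(["G", "F", "E", "D", "C"]))      # an excerpt of Ode to Joy
print(identify(["A", "B", "A", "B", "A"]))      # unknown tune: no match, None
```

Even this toy version makes the pedagogical point: it succeeds on a literal excerpt, fails cleanly on an unknown tune, and (because the hashes are exact) would also break under transposition or tempo change; exactly the kind of limitation that a live demo makes visible.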

This exposition did not simplify out the essentials. Students experiencing such a live demo learn the algorithm, but they also learn how to use and how to extend it.

We were blown away not so much by the cool algorithm, but by the fact that we really grasped what was going on.

Experiences like this are solid first steps for data science students, and will lead to great places.


That evening, one of my colleagues asked me if I still wrote on my blog. No, I said, I had a bit of writer's block. I had been trying to write a post on Jeff Dean's keynote at ACM RecSys 2014, "Large Scale Machine Learning for Predictive Tasks", and failing miserably. The keynote troubled me, and I was attempting to formulate a post that could constructively explain why. Ten months passed.

With the example of Steve's live demo, it became clear what my main problem with the keynote was. It contained nothing that I could demonstrate was literally wrong. It was simply a huge missed opportunity.

Since ACM RecSys is a recommender system conference, many people in the room were thinking about natural language processing and computer vision problems for the first time. The keynote did not connect its algorithms to the source of the data and possible applications. Afterwards, the audience was none the wiser concerning the limitations of the algorithms it discussed.

I suppose some would try to convince me that when listening to a keynote (as opposed to a lecture) I need to stop being a teacher, and go into magic-watching mode, meaning that I would suspend my disbelief. "That sort of makes sense, it looks pretty good" Dean said to wrap up his exposition of paragraph vectors of Wikipedia articles.

If you watch at the deep link, you see that he would like to convince us that we should be happy because the algorithm has landed music articles far away from computer science.

In the end, I can only hope that the YouTube video of the keynote is no one's first steps in data science.

Independently of a particular application, landing music far away from computer science is also just not my kind of magic trick.