Wednesday, July 29, 2015

Google Scholar: Sexist or simply statistical?

This is the first of what I intend to be a short series of posts related to my experience with Google Scholar, and to a phenomenon that I call "algorithmic nameism", an unfortunate danger of big data. Here, I describe what appears to be happening on Google Scholar to references to certain journal papers that I have co-authored, and briefly discuss why we should be concerned, not for me personally, but in general.

I am currently an Assistant Professor in computer science at Delft University of Technology in Delft, the Netherlands, and at this moment also a visiting researcher at the International Computer Science Institute in Berkeley, California. Like many in my profession, I maintain a Google Scholar profile, and rely on the service as a way of communicating my publication activities to my peers, and of keeping up with important developments in my field. Recently, I clicked the "View and Apply Updates" link (pictured above) in my Google Scholar profile to see if Google Scholar had picked up on several recent publications, and was quite surprised by what I discovered.

For those perhaps not familiar with Google Scholar, a word of background information. On the update page for your profile, Google Scholar supplies a list of suggested edits to the publication references in your profile. As profile owner, you can then choose to accept or discard them individually. 

In the list, I was very surprised to find the following suggested edit for a recent publication on which I am a co-author:

In short, Google Scholar is suggesting that my first name "Martha" be changed to "Matt". 

It is not an isolated case. Currently, in my suggested edits list, there are suggestions to change my name to "Matt" for a total of four papers that I have co-authored, all in IEEE Transactions on Multimedia. 

Part of my specialization within the field of computer science is information retrieval. For this reason, I have insight into the probable reasons why this might be happening, even without direct knowledge of the Google Scholar edit suggestion algorithm. The short story is that Google Scholar appears to be using a big data analytics algorithm to predict errors and suggest edits. But it is clearly a case in which "big data" is "wrong data". I plan to delve into more detail in a future post.
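To make concrete how a purely statistical, gender-blind algorithm could produce such a suggestion, here is a minimal sketch. This is my own hypothetical construction, not Google's actual algorithm, and the corpus counts and names are invented: a naive "corrector" that replaces an author name with the most frequent name in the corpus sharing the same surname and first initial.

```python
from collections import Counter

# Hypothetical illustration only: an invented citation corpus in which one
# "M. Larson" is far more frequent than another. The counts are made up.
corpus_authors = (
    ["Matt Larson"] * 120      # many occurrences of a statistically common name
    + ["Martha Larson"] * 15   # fewer occurrences of a rarer one
)

def suggest_name(observed: str, corpus: list) -> str:
    """Suggest the most frequent corpus name that shares the observed
    name's surname and first initial. Frequency wins over correctness:
    a statistically unusual (but correct) name gets "fixed" away."""
    first, last = observed.split()
    candidates = Counter(
        name for name in corpus
        if name.split()[1] == last and name.split()[0][0] == first[0]
    )
    best, _count = candidates.most_common(1)[0]
    return best

print(suggest_name("Martha Larson", corpus_authors))  # prints "Matt Larson"
```

The point of the sketch is that no notion of gender appears anywhere in the code; the "discrimination" falls out of raw frequency statistics, which is exactly why this kind of error can go unnoticed.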

Here, I would just like to state why we should be so concerned:

Suggested edits on the "View and Apply Updates" page find their way into the Google Scholar ranking function, and affect whether or not certain publications are found when people search Google Scholar. I have not, to my knowledge, ever clicked the "Edit article" link that would accept the suggestion to change my name to Matt in the reference to one of my publications. However, the Google Scholar search algorithm has apparently already integrated information about "Matt Larson".

Currently, if you go to Google Scholar and query "predicting failing queries matt larson", my paper comes up as the number-one, top-ranked result.

However, if you query "predicting failing queries martha larson", this paper can only be found in the sixth position on the second page. (It is the bottom reference in this screenshot of the second page; I have put a red box around Page 2.)

Different people say different things about the importance of having a result on the first page of search results. However, you don't have to be literally of the first-page school (i.e., you don't have to believe "When looking something up on Google, if it's not on the first page of search results then it doesn't exist and my journey ends there.") to see that my paper would be more readily found if my name were Matt. (For brevity's sake, I will just zip past the irony of discussing what is basically a failed query for a paper that is actually about failing queries.)

I myself search for people (and their publications) on Google Scholar for a range of reasons. For example, last year I was appointed as an Associate Editor of IEEE Transactions on Multimedia. I search for people on Google Scholar in order to see if they have published the papers that would qualify them to review for the journal.

At the moment, my own IEEE Transactions papers seem to be pushed down in the ranking because Google Scholar is confused about whether my name should actually be "Matt". In general, however, Google Scholar does a good job. I don't research lung cancer (the second result on Page 2, as shown above), but otherwise it can be seen from the results list above that Google Scholar generally "knows" that I am me. My profile page does not have any of the shortcomings of the search results, as far as I am aware.

I am someone with an established career, with tenure and relatively many publications. I have no problem weathering the Matt/Martha mix-up.

However: Imagine someone who was at the fragile beginning of her career!

Having IEEE Transactions publications appear low in her results (compared to those of her equally well-published colleagues) could make the difference between being invited to review or not. Or, goodness forbid, a potential employer browses her publications to determine whether she is qualified for a job, and misses key publications.

I'll conclude with what is without doubt an unexpected statement: It would be somehow positive if the Matt/Martha mix-up were a case of sexism. If it were an "anti-woman" filter programmed into Google Scholar, the solution would be simple. The person/team responsible could be fired, and we could all get on with other things. However: With extremely high probability there are no explicitly "anti-woman" designs here. Although the example above looks for all the world like sexism, at its root it is most probably not. The odds are that the algorithm behind the suggestions to edit "Martha" to "Matt" has no knowledge of my gender whatsoever, and the discrimination is therefore not directly gender based.

The Matt/Martha mix-up is actually more worrisome if it is not sexism. The more likely case is that this is a new kind of "ism" that has the danger of going totally under the radar. It is not related specifically to gender, but rather to cases that are statistically unusual, given a particular data collection. It is the kind of big data mistake that can potentially disadvantage anyone if big data is applied in the wrong way.

Whether sexist or "simply statistical", we need to take it seriously.

An immediate action that we can take is to realize that we should not trust Google Scholar blindly.