Monday, December 21, 2015

Features, machine learning, and explanations

Selected Topics in Multimedia Computing is a master-level seminar taught at TU Delft. This post is my answer to a question from one of this year's students, asked in the context of our discussion of the survey paper on emotion recognition for music that he wrote for the seminar. I wrote an extended answer, since related questions come up often and my answer might be helpful for other students as well.

Here is the text of the question from the student's email:


I also have a remark about something you said yesterday during our Skype meeting. You said that the premise of machine learning is that we don't exactly know why the features we use give an accurate result (for some definition of accuracy :)), hence that sentence about features not making a lot of sense could be nuanced, but does that mean it is accepted in the machine learning field that we don't explain why our methods work? I think admitting that some of the methods that have been developed simply cannot be explained, or at least won't be for some time, would undermine the point of the survey. Also, in that case, it would seem we should at least have a good explanation for why we are unable to explain why the unexplained methods work so well (cf. Deutsch, The beginning of infinity, particularly ch. 3).

Your survey tackles the question of missing explanations. In light of your topic, yes, I agree that it would undermine your argument to hold up methods that cannot be explained as valuable for moving the field forward.

My comment deals specifically with this sentence: "For some of the features, it seems hard to imagine how they could have a significant influence on the classification, and the results achieved do not outrank those of other approaches by much."

I'll start off by saying that a great deal more important than this specific sentence is a larger point that you are making in the survey, namely that meaningful research requires paying attention to whether what you think you are doing and what you are actually doing are aligned. You point out the importance of the "horse" metaphor of:

Sturm, B. L., "A Simple Method to Determine if a Music Information Retrieval System is a “Horse”," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1636–1644, Oct. 2014.

I couldn't agree more on that.

But here, let's think about the specific sentence above. My point is that it would help the reader if you expressed what you are thinking here more fully. If you put two thoughts into one sentence like this, the reader will jump to the conclusion that one explains the other. You want to avoid assuming (or implying that you assume) that the disappointing results could have been anticipated by choosing features that a priori could be "imagined" to have significant influence.

(Note that there are interpretations of this sentence — i.e., if you read "and" as a logical conjunction — that do not imply this. As computer scientists, we are used to reading code, and these interpretations are, I have the impression, relatively more natural to us than to other audiences. So it is safer to assume that in general your readers will not spend a lot of time picking the correct interpretation of "and", and need more help from your side as the author :-))

As a recap: I said that machine learning wouldn't be so useful if humans could look at a problem and tell which features should be used in order to yield the best performance. I don't want to go so far as to claim that this is the premise; in fact, I rather hope I didn't actually use the word "premise" at all.

A machine learning pipeline typically starts with a feature engineering step in which you apply an algorithm to select the features that will be used, e.g., by your classifier. After this step, you can "see" which features were selected. So it's not the case that you have no idea why machine learning works.
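
As a minimal sketch of what I mean (assuming scikit-learn; the feature names and data below are made up purely for illustration), a feature selection step lets you inspect afterwards which features survived:

```python
# Minimal sketch of a feature selection step, assuming scikit-learn.
# The feature names and data are hypothetical, for illustration only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

feature_names = ["tempo", "spectral_centroid", "zero_crossing_rate", "loudness"]
X = np.random.rand(200, len(feature_names))   # toy feature matrix
y = np.random.randint(0, 2, size=200)         # toy binary labels

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# After this step you can "see" which features were selected.
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("Selected features:", selected)
```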

My point is that you need to be careful about limiting the input to feature selection a priori. If you assume you yourself can predict which features will work, you will miss something. When you use deep learning, you don't necessarily do feature selection, but you do have the possibility of inspecting the hidden layers of the neural network, and these can shed some light on why it works.
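
For example, here is a sketch (assuming scikit-learn's MLPClassifier and toy data) of peeking at the weights of the first hidden layer after training:

```python
# Sketch: inspecting the first hidden layer of a small neural network.
# Toy data; in practice you would use real features and labels.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(300, 10)
y = np.random.randint(0, 2, size=300)

net = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=0).fit(X, y)

# coefs_[0] holds the weights from the 10 inputs to the 5 hidden units.
# Large absolute weights hint at which inputs a hidden unit attends to.
first_layer_weights = net.coefs_[0]
print("Input-to-hidden weight matrix shape:", first_layer_weights.shape)
print("Mean absolute weight per input feature:",
      np.abs(first_layer_weights).mean(axis=1))
```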

This is not to say that human intelligence should not be leveraged for feature engineering. Practically, you do need to make design decisions to limit the number of choices that you are considering. Well-motivated choices will land you with a system that is probably also "better" along the lines of Deutsch's thinking (I say "probably" because I have not read the book that you cite in detail).

In any case: careful choices of features are necessary to prevent you from developing a classifier that works well on the data set you are working on only because there is an unknown "leak" between a feature and the ground truth, i.e., for some reason one (or more) of the features is correlated with the ground truth. If you have such a leak, your methods will not generalize, i.e., their performance will not transfer to unseen data. The established way to prevent this problem is to carry out your feature engineering step on separate data (i.e., separate from both your training and your test sets). A more radical approach that can help when you are operationalizing a system is to discard features that work "suspiciously" well. A good dose of common sense is very helpful, but note that you should not try to replace good methodology and feature engineering with human intelligence (which I mention for completeness, not because I think you had any intention in this direction).
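
A minimal sketch of that discipline (synthetic data, scikit-learn assumed): carve off a separate slice of data, fit the feature selector only on that slice, and keep both your training and test sets untouched by it.

```python
# Sketch: keeping feature engineering separate from training and testing.
# All data here is synthetic; the split sizes are arbitrary choices.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X = np.random.rand(600, 20)
y = np.random.randint(0, 2, size=600)

# First split off a feature-engineering set, then split the rest into train/test.
X_fe, X_rest, y_fe, y_rest = train_test_split(X, y, test_size=0.67, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Feature selection is fitted only on the feature-engineering set ...
selector = SelectKBest(score_func=f_classif, k=5).fit(X_fe, y_fe)

# ... and merely applied to the training and test sets.
clf = LogisticRegression().fit(selector.transform(X_train), y_train)
print("Test accuracy:", clf.score(selector.transform(X_test), y_test))
```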

It is worth pointing out that there are plenty of problems out there that can indeed be successfully addressed by a classifier that is based on rules defined by humans, "by hand". If you are trying to solve such a problem, you shouldn't opt for an approach that would require more data or computational resources merely for the sake of using a completely data-driven algorithm. The issue is that it is not necessarily easy to know whether or not you are trying to solve such a problem.
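
For instance, a hand-crafted "classifier" for such a problem can be a few lines of code with no training data at all (the rule and threshold below are purely hypothetical):

```python
# Sketch of a hand-written, rule-based classifier.
# The feature and the threshold are hypothetical, chosen only for illustration.
def is_probably_loud_track(features):
    """Classify a track as 'loud' using a single hand-picked rule."""
    return features.get("rms_energy", 0.0) > 0.8

print(is_probably_loud_track({"rms_energy": 0.93}))  # True
print(is_probably_loud_track({"rms_energy": 0.40}))  # False
```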

In sum, I am trying to highlight a couple of points that we sometimes tend to forget when we are using machine learning techniques: you should resist the temptation to look at a problem and declare it unsolvable because you can't "see" any features that seem like they would work. A second, related temptation that you should resist is using sub-optimal features because you made your own a priori assumptions about what the best features must be.

A few further words on my own perspective:

There are algorithms that are explicitly designed to make predictions and create human-interpretable explanations simultaneously. This is a very important goal for intelligent systems used by people who don't have the technical training to understand what is going on "under the hood."
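
One classic instance is a shallow decision tree: the same structure that makes the prediction can be read back as human-interpretable rules. Below is a minimal sketch (assuming scikit-learn; the feature names and data are made up for illustration):

```python
# Sketch: a shallow decision tree whose prediction rules can be read directly.
# Feature names and data are toy examples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["tempo", "mode_major", "loudness"]
X = np.random.rand(200, len(feature_names))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)   # synthetic "label"

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned model doubles as its own explanation.
print(export_text(tree, feature_names=feature_names))
```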

Personally, I hold the rather radical position that we should aspire to creating algorithms that are effective, yet so simple that they can be understood by anyone who uses their output. The classic online shopping recommender "People who bought this item also bought ..." is an example that hopefully convinces you such a goal is not completely impossible. A major hindrance is that we may need to sacrifice some of the fun and satisfaction we derive from cool math.
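
To illustrate how simple that recommender can be, here is a sketch in plain Python with made-up purchase data: count which items co-occur in the same baskets and recommend the most frequent co-purchases.

```python
# Sketch of a "people who bought this also bought ..." recommender
# based on nothing more than co-purchase counts. Data is made up.
from collections import Counter, defaultdict

baskets = [
    {"headphones", "dac", "cable"},
    {"headphones", "cable"},
    {"headphones", "amp"},
    {"dac", "amp"},
]

co_counts = defaultdict(Counter)
for basket in baskets:
    for item in basket:
        for other in basket:
            if other != item:
                co_counts[item][other] += 1

def also_bought(item, n=2):
    """Items most often bought together with `item`."""
    return [other for other, _ in co_counts[item].most_common(n)]

print(also_bought("headphones"))  # 'cable' comes first; ties follow in arbitrary order
```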

Stepping back yet further:

Underlying the entire problem is the danger that the data you use to train your learner has properties that you have overlooked or did not anticipate, and that your resulting algorithm gives you, well, let me just come out and call it "B.S." without your realizing it. High-profile cases get caught (http://edition.cnn.com/2015/07/02/tech/google-image-recognition-gorillas-tag/). However, these cases should also prompt us to ask the question: which machine learning flaws go completely unnoticed?

Before you even start thinking about features, you need to explain the problem that you are trying to address, and also explain how and why you chose the data that you will use to address it.

Luckily, it's these sorts of challenges that make our field so fascinating.