Last Thursday I was at a "sounding board" meeting of the Big Data Commission of the Royal Netherlands Academy of Arts and Sciences. This post highlights some points that I have continued to reflect upon since the meeting.
According to Wikipedia, "Big Data" are data sets that are too large and too complex for traditional data processing systems to handle. Interestingly, people who characterize "Big Data" in terms of volume, variety and velocity often underemphasize the aspect of velocity, as the Wikipedia definition does. Here, I argue it is important not to forget that Big Data is also Fast Data.
Fast Streams and Big Challenges
Because I work in the area of recommender systems, I quite naturally conceptualize problems in terms of a data stream rather than a data set. The task a stream-based recommender system addresses is the following: there is a stream of incoming events and the goal is to make predictions on the future of the stream. There are two issues that differentiate stream-based views of data from set-based views.
First: the temporal ordering in the stream means that ordinary cross-validation cannot be applied. A form of A/B testing must be used in order to evaluate the quality of predictions. Online A/B testing has implications for the replicability of experiments.
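To make the ordering constraint concrete, here is a minimal sketch of a "test-then-train" loop, in which every prediction is scored using only events that came earlier in the stream. It is an offline stand-in that respects temporal order, not the online A/B test itself. The function name `prequential_accuracy`, the most-popular-item predictor, and the toy stream are all assumptions made for illustration.

```python
from collections import Counter


def prequential_accuracy(events):
    """Score each event using only strictly earlier events, then learn from it."""
    counts = Counter()  # hypothetical model: remembers how often each item was seen
    hits = 0
    scored = 0
    for item in events:
        if counts:  # the very first event has no past to predict from
            predicted = counts.most_common(1)[0][0]  # guess the most popular item so far
            hits += predicted == item
            scored += 1
        counts[item] += 1  # update the model only after the event has been scored
    return hits / scored if scored else 0.0


# Toy stream in which the popular item changes halfway through.
stream = ["a", "a", "b", "a", "b", "b", "b", "b"]
accuracy = prequential_accuracy(stream)
print(f"accuracy using only the past: {accuracy:.2f}")
```

Shuffling the stream before splitting it, as ordinary cross-validation does, would let the predictor see events from the future, which is exactly what the temporal ordering rules out.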
Second: at any given moment, you are making two intertwining predictions. One is the prediction of the future of the stream. The other is how much, if any, of the past is actually relevant in predicting the future. There are two reasons why the information in the past stream may not be relevant to the future: external and internal factors.
External factors are challenging because you may not know they are happening. A colleague doing medical research recently told me that when deductibles go up people delay going to the doctor, and suddenly the patients that are visiting the doctor have different problems, simply because they delayed their visit. Confounding variables of course exist for set-based data. However, if you are testing stream-based prediction online, you can't simply turn back the clock and start investigating confounding variables: it's already water under the bridge. As much as you may be recording, you cannot replay all of reality as it happened.
Internal factors are even tougher. Events occurring in the data stream influence the stream itself. A common example is the process by which a video goes viral on the Web. In this case, we have a stream of events consisting of viewers watching the video. Because people like to watch videos that are popular (or are simply curious about what everyone else is watching) events in the past actually serve to create the future, yielding an exponentially growing number of views. These factors can be understood as feedback loops. Another important issue, which occurs in recommender systems, is that the future of the stream is influenced by the predictions that you make. In a recommender system, these predictions are shown to users in the form of recommended items, and the users create new events by interacting with these items. The medical researcher is stuck with this effect: she cannot decide not to cure patients, just because it will create a sudden shift in the statistical properties of her data stream.
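The viral-video feedback loop can be sketched with a toy model in which the number of new views at each step is proportional to the views accumulated so far. This is an illustrative assumption, not a model of any real platform: the function `simulate_views` and the parameters `initial_views` and `share_rate` are hypothetical.

```python
def simulate_views(initial_views, share_rate, steps):
    """Toy feedback loop: new views at each step are proportional to the total so far."""
    totals = [initial_views]
    for _ in range(steps):
        new_views = totals[-1] * share_rate  # past popularity creates future views
        totals.append(totals[-1] + new_views)
    return totals


history = simulate_views(initial_views=100.0, share_rate=0.5, steps=5)

# A constant ratio between consecutive totals is the signature of exponential growth.
ratios = [later / earlier for earlier, later in zip(history, history[1:])]
print(history)
print(ratios)
```

Because each total depends on the one before it, the past of this stream is not merely evidence about its future: it produces that future.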
Time to Learn to Do It Right
In short, you are trying to predict and also to predict whether you can predict. We still call it "Big Data", but clearly we are at a place where the assumption that data greed pays off ("There's no data like more data") breaks down. Instead, we start to consider the price of Big Data failure ("The bigger they are, the harder they fall").
In a recent article, "Trump's Win Isn't the Death of Data---It Was Flawed All Along", Wired concluded that "...the data used to predict the outcome of one of the most important events in recent history was flawed." But if you think about it: of all the preposterous statements made during the campaign, no one proposed that the actual election be cancelled since Big Data could predict its outcome. There are purposes that Big Data can fulfill, and purposes for which it is not appropriate.
The Law of Large Numbers forms the basis for reliably repeatable predictions. For this reason, it is clear that Big Data is not dead. The situation is perhaps exactly the opposite: Big Data has just been born. We have reasons to believe in its enormous usefulness, but ultimately its usefulness will depend on the availability of people with the background to support it.
There is a division between people with a classic statistics and machine learning background who know how to predict (who may even have the technical expertise to do it at large scale) and people who, on top of a classical background, have the skills to approach the question of when it even makes sense to predict. Only the latter are qualified to pursue Big Data.
The difference is perhaps a bit like the difference between snorkeling and scuba diving. Both are forms of underwater exploration, and many people don't necessarily realize that there is a difference. However, if you can snorkel, you are still a long way from being able to scuba dive. For scuba diving, you need additional training, more equipment, and a firm grasp of principles that are not necessarily intuitive, such as the physiological effects of depth and the wisdom of redundancy. There is a lot to be achieved on a scuba dive that can't be accomplished by mere snorkeling, but the diver needs resources to invest, and above all needs to have the time to learn to do it right.
No Fast Track to Big Data
These considerations lead to the realization that although Big Data streams may be in and of themselves incredibly quickly changing, the overall process of making Big Data useful is, in fact, very slow. Working in Big Data requires an enormous amount of training going beyond a traditional data processing background.
Gaining the expertise needed for Big Data also requires understanding of domains that lie outside of traditional math and computer science fields. All work in Big Data areas must start from a solid ethical and legal foundation. Civil engineers are in some cases able to lift a building to add a foundation. With Big Data, this possibility is excluded.
To illustrate this point, it is worth returning to the idea of replacing the election with a group of data scientists carrying out Big Data calculations. It is perhaps an extreme example, but it is one that makes clear that ethical and legal considerations must come before Big Data. The election must remain the election because, on its own, a Big Data calculation has no way of achieving the social trust necessary to ensure continuity of the government. For this we need a cohesive society and we need the law. Unless Big Data starts from ethics and from legal considerations, we risk wasting time and effort developing a large number of algorithms that are solving the wrong problems.
Training data scientists while ignoring the ethical and legal implications of Big Data is a short cut that is tempting in the short run, but can do nothing but harm us in the long run.
Big Data as Slow Science
The amount of time and effort needed to make Big Data work might lead us to expect that Big Data should yield some sort of Big Bang, a real scientific revolution. In fact, however, it is the principles of the same centuries-old scientific method that we return to in order to define Big Data experiments. In short, Big Data is a natural development of existing practices. Some have even argued that data-driven science pre-dated the digital age, e.g., this essay entitled "Is the Fourth Paradigm Really New?"
However, it would also be wrong to characterize Big Data as business as usual. A more apt characterization is as follows: before the Big Data age, scientific research proceeded along the conventional path: researchers would formulate their hypothesis, design their experiment, and then as the final step collect the data. Now, the path starts with the data, which inspires and informs the hypothesis. The experimental design must compensate for the fact that the data was "found" rather than collected.
Given this state of affairs, it is easy to slip into the impression that Big Data is "fast" in the sense that it speeds up the process of scientific discovery. After all, the data collection process, which in the past could take years, can be carried out quickly. Once the workflow is implemented, a new hypothesis can be investigated in a matter of hours. However, it is important to consider how the speed of the experiment itself influences the way in which we formulate hypotheses. Because there is little cost to running an experiment, there is little incentive to put a great deal of careful thought and consideration into which hypotheses we are testing.
A good hypothesis is one that is motivated by a sense of scientific curiosity and/or societal need, and that has been informed by ample amounts of real-world experience. If there is negligible additional cost to running an additional experiment, we need to find our motivation for formulating good hypotheses elsewhere. The price of thoughtlessly investigating hypotheses merely because they can be formulated given a specific data collection is high. Quick and dirty experiments lead to mistaking spurious correlations for effects, and yield insights that fall short of generalizing to meaningful real-world phenomena, let alone use cases.
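A small simulation shows how cheap experiments produce spurious correlations. In this sketch, assuming pure Gaussian noise for both the target and every candidate feature, no feature has any real relationship to the target, yet testing many of them turns up a seemingly strong correlation by chance alone. The names `pearson` and `chance_correlations`, the seed, and the sizes (200 hypotheses, 20 samples) are illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chance_correlations(num_hypotheses, num_samples, rng):
    """Correlate a noise target with many unrelated noise features."""
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    strengths = []
    for _ in range(num_hypotheses):
        feature = [rng.gauss(0, 1) for _ in range(num_samples)]  # unrelated by construction
        strengths.append(abs(pearson(feature, target)))
    return strengths


rng = random.Random(0)  # fixed seed so the run is repeatable
strengths = chance_correlations(num_hypotheses=200, num_samples=20, rng=rng)
first_only = strengths[0]     # what a single, pre-chosen hypothesis would show
best_of_all = max(strengths)  # what you "find" after trying every hypothesis
print(f"one hypothesis: {first_only:.2f}, best of 200: {best_of_all:.2f}")
```

The strongest of the 200 correlations is never weaker than the single pre-chosen one and is typically far stronger, even though every one of them is noise. Reporting only the winner is how a quick and dirty experiment mistakes a spurious correlation for an effect.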
In sum, we should remember the "V" of velocity. Big Data is not just data sets, it's also data streams, which makes Big Data also Fast Data. Taking a look at data streams makes it easier to see the ways in which Big Data can go wrong, and why it requires special training, tools, and techniques.
Volume, variety, and velocity have been extended by some to include other "Vs" such as Veracity and Value. Here, I would like to propose "Vigilance". For Big Data to be successful we need to slow down: train people with a broad range of expertise, connect people to work in multi-skilled teams, and give them the time and the resources needed in order to do Big Data right. In the end, the value of Big Data is the new insights that it reveals, and not the speed at which it reveals them.