Like many data scientists, I have followed a non-linear career path: after studying mathematics and physics, I entered the ‘survey world’ in 2007. My first role in surveys was as a sampling statistician and, with time, I became a ‘complete’ survey methodologist. The survey methodologist’s primary concern is to collect data through surveys (and correct them post-survey) in a way that minimises bias and increases precision. So my background makes me very ‘aware’ of data quality. A common misconception that survey methodologists and data scientists have to fight against is that a large amount of data means ‘good’ data. But quantity does not guarantee quality.
The aim of this blog post is to remind everyone who wants to hear (or, better, read) it not to trust their data blindly. Before developing analyses and predictive models, a good data scientist investigates the biases that the data may contain.
UNDERSTAND WHERE YOUR DATA COMES FROM
The source of your data is important as it can lead to all sorts of biases.
The data that you use for training your algorithm might only represent a (non-random) subset of your population or ‘data space’ of interest. This is called representation bias and can have different sources. The first important source is non-observation or non-coverage bias, meaning that a part of the population or data space had no chance of making it into your dataset. The other one is more specific to social sciences or human-based datasets and is called non-response, meaning you (or the data collector) tried to observe the person but they refused to collaborate.
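A small simulation can make non-coverage concrete. The sketch below is purely illustrative: the population, the age range and the rule that nobody over 65 can be reached are all assumptions made up for this example, not figures from any real survey. It compares a small random sample with a much larger sample drawn only from the covered part of the population.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 100,000 ages drawn uniformly between 18 and 90.
population = [rng.uniform(18, 90) for _ in range(100_000)]

# Assumed non-coverage: the collection channel never reaches anyone over 65.
covered = [age for age in population if age <= 65]

# A small sample drawn at random from the whole population...
small_random_sample = rng.sample(population, 500)
# ...versus a sample 100 times larger, drawn only from the covered part.
large_biased_sample = rng.sample(covered, 50_000)

true_mean = statistics.mean(population)
error_small_random = abs(statistics.mean(small_random_sample) - true_mean)
error_large_biased = abs(statistics.mean(large_biased_sample) - true_mean)

print(f"True mean age:                 {true_mean:.1f}")
print(f"Error of small random sample:  {error_small_random:.1f}")
print(f"Error of large biased sample:  {error_large_biased:.1f}")
```

Under these assumptions the large sample is systematically too young, and collecting more of it would not help: the error comes from who can be observed, not from how many observations there are.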
A few examples are worth a thousand words, so I am going to illustrate representation bias, and how it can lead to wrong conclusions, with an example.
The current COVID-19 pandemic provides the world with a lot of data on a daily basis: the number of new contaminations, new hospitalisations or deaths due to the virus for most countries.
The following table is an extract from a table published on the World Health Organisation website on May 13, 2020 (https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200513-covid-19-sitrep-114.pdf?sfvrsn=17ebbbe_4).
The temptation is always strong to compare countries with one another, to see who is doing ‘better’.
If we compare Belgium and its neighbouring countries, the Netherlands and France, we obtain the following rates:
|Country||Total population||Contamination rate||Death rate|
|Belgium||11 589 623||0.4%||0.07%|
|France||65 273 511||0.2%||0.04%|
|The Netherlands||17 134 872||0.2%||0.03%|
Belgium seems to be doing the worst, having the highest contamination and death rates (its death rate is more than twice that of the Netherlands!).
However, digging a little further, comparing the number of contaminations to the number of tested people (https://www.worldometers.info/coronavirus/#countries), the contamination rates are: 8% in Belgium, 9% in France and 15% in the Netherlands.
This gives another picture! We have to keep in mind how the tests are done and amongst which part of the population. A more targeted testing strategy may lead to a larger contamination rate when compared to the number of tested people. This also demonstrates how data can be manipulated to lead to different conclusions.
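The switch in ranking comes purely from changing the denominator. The sketch below uses only the rounded percentages quoted above; the dictionary and function names are mine. Dividing cases per inhabitant by cases per tested person gives the implied share of the population that was tested. Because the inputs are heavily rounded, treat the results as orders of magnitude, not as official testing figures.

```python
# Percentages quoted in the text (rounded, so everything below is approximate).
cases_per_population = {"Belgium": 0.4, "France": 0.2, "The Netherlands": 0.2}
cases_per_tested = {"Belgium": 8.0, "France": 9.0, "The Netherlands": 15.0}


def highest(rates):
    """Return the country with the highest rate in a {country: rate} mapping."""
    return max(rates, key=rates.get)


# (cases / population) / (cases / tested) = tested / population
implied_share_tested = {
    country: cases_per_population[country] / cases_per_tested[country]
    for country in cases_per_population
}

print("Highest rate per inhabitant:   ", highest(cases_per_population))
print("Highest rate per tested person:", highest(cases_per_tested))
for country, share in implied_share_tested.items():
    print(f"{country}: roughly {share:.1%} of the population tested")
```

With these inputs, Belgium has the highest rate per inhabitant but also appears to have tested a larger share of its population than its neighbours, which is exactly the kind of difference in observation that makes the raw comparison unreliable.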
Let’s have a look at the death rate. Without making a political statement, the higher death rate in Belgium is probably, at least partly, explained by the reporting strategy.
Death rates have relatively predictable seasonal patterns over time under "normal" conditions (e.g. no pandemics, no wars or other cataclysmic events). When something like the COVID-19 pandemic occurs, an anomalous spike in the death rate compared to the baseline (usual pattern) is observed in the data. If the COVID-19 deaths are reported properly, they should account for a significant share of the difference between the baseline and the observed death counts.
This is where another difference between Belgium and other countries lies. In Belgium, every death suspected to be due to COVID-19 is reported as a ‘COVID-19’ death, whilst in many other countries only confirmed cases are reported. In the graphs published daily in the Economist (https://www.economist.com/graphic-detail/2020/04/16/tracking-covid-19-excess-deaths-across-countries) to follow excess deaths, one can observe that the excess mortality during the pandemic is fully explained by COVID-19 cases in Belgium and in France. This is not the case in all countries.
Admittedly the excess mortality rate in March and April 2020 in Belgium is very large compared to other countries in the world, but as I am not an epidemiologist I will not attempt to explain this observation.
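The excess-mortality reasoning above boils down to a simple calculation. The sketch below uses made-up weekly figures for two hypothetical countries (they are not taken from the Economist or any other source) to show how the share of excess deaths explained by reported COVID-19 deaths depends on the reporting strategy.

```python
def excess_deaths(observed, baseline):
    """Deaths observed above the usual seasonal baseline."""
    return observed - baseline


def share_explained(reported_covid_deaths, observed, baseline):
    """Fraction of the excess mortality accounted for by reported COVID-19 deaths."""
    excess = excess_deaths(observed, baseline)
    if excess <= 0:
        raise ValueError("no excess mortality to explain")
    return reported_covid_deaths / excess


# Hypothetical weekly figures: same baseline and same observed deaths,
# but country A reports suspected cases and country B only confirmed ones.
country_a = share_explained(reported_covid_deaths=950, observed=3000, baseline=2000)
country_b = share_explained(reported_covid_deaths=400, observed=3000, baseline=2000)

print(f"Country A: {country_a:.0%} of excess deaths explained")
print(f"Country B: {country_b:.0%} of excess deaths explained")
```

In this toy setting both countries experience the same excess mortality, yet country A would look much worse in a table of reported COVID-19 deaths.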
DON’T INFER CAUSALITY FROM OBSERVATIONAL DATA
Data can be observational or experimental.
Observational data typically comes from surveys or opinion polls; the well-known Iris dataset from Fisher (Source: Fisher, Ronald A., and Michael Marshall. "Iris data set." RA Fisher, UC Irvine Machine Learning Repository 440 (1936): 87) is another example. We can describe links between variables in such datasets but never infer causality, because of unobserved factors and because of the chicken-and-egg problem (which came first?). By causality, I mean a relationship between cause and effect, which by definition is not symmetric. If I bang my little toe, I will feel pain; it is not the pain that causes me to bang my little toe.
To infer causality we need experimental data, in which subjects are randomly assigned to different interventions. In this case, we can infer the effect of the intervention. Typically, clinical trials deliver experimental data.
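To see why random assignment matters, here is a minimal simulation under assumptions I chose for illustration: a hidden factor drives the outcome, and the exposure has no effect on the outcome at all. In the "observational" setting the hidden factor also drives the exposure; in the "experimental" setting the exposure is assigned at random.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n = 5_000

# Unobserved factor that drives the outcome (simulated, not real data).
hidden = [rng.gauss(0, 1) for _ in range(n)]
# The outcome depends on the hidden factor only, never on the exposure.
outcome = [2 * z + rng.gauss(0, 0.5) for z in hidden]

# Observational setting: the hidden factor also influences the exposure.
observed_exposure = [z + rng.gauss(0, 0.5) for z in hidden]
# Experimental setting: the exposure is assigned at random.
assigned_exposure = [rng.gauss(0, 1) for _ in range(n)]

corr_observational = pearson(observed_exposure, outcome)
corr_experimental = pearson(assigned_exposure, outcome)

print(f"Observational correlation: {corr_observational:.2f}")
print(f"Experimental correlation:  {corr_experimental:.2f}")
```

The observational data shows a strong association even though the exposure does nothing; randomisation breaks the link with the hidden factor, and the association disappears.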
One of the big dangers of wrongly inferring causality is the ecological fallacy: a formal fallacy in the interpretation of statistical data that occurs when inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong (https://en.wikipedia.org/wiki/Ecological_fallacy).
A well-known example of the ecological fallacy, called Simpson's paradox, can be illustrated with the admission figures for the fall of 1973 at the University of California, Berkeley; see the table below (Source: https://en.wikipedia.org/wiki/Simpson's_paradox). The aggregated numbers at the university level suggested that men had an easier time being admitted to Berkeley than women.
However, when examining the individual departments (table below), it appears that six out of 85 departments were significantly biased against men, whereas four were significantly biased against women.
This case is a typical example of an unobserved factor that influences the conclusion drawn from your data.
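The mechanism behind Simpson's paradox is easy to reproduce. The sketch below uses two hypothetical departments with numbers I invented for illustration; they are not the Berkeley figures. Women have the higher admission rate in each department, yet the lower rate overall, because most of them apply to the more selective department.

```python
# Hypothetical admissions: (applicants, admitted) per department and gender.
admissions = {
    "Department A": {"men": (800, 480), "women": (100, 70)},
    "Department B": {"men": (200, 40), "women": (900, 225)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction."""
    return admitted / applicants


def overall_rate(gender):
    """Admission rate for one gender, aggregated over all departments."""
    applicants = sum(dept[gender][0] for dept in admissions.values())
    admitted = sum(dept[gender][1] for dept in admissions.values())
    return rate(applicants, admitted)


for name, dept in admissions.items():
    per_gender = {gender: f"{rate(*counts):.0%}" for gender, counts in dept.items()}
    print(name, per_gender)

print("Overall", {gender: f"{overall_rate(gender):.1%}" for gender in ("men", "women")})
```

Here the department applied to plays the role of the unobserved factor: once it is taken into account, the direction of the apparent bias reverses.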
So, don’t infer causality from observational data.
ALWAYS PLOT YOUR DATA!
Statistical theory offers data scientists a range of ‘tools’ to describe data, even large amounts of it, with a limited number of parameters. Think, for instance, of means or medians, standard deviations, or even correlations. These statistical parameters are very useful as they help us to ‘summarise’ the information contained in the data, but for the summary to be a ‘good’ one, the data has to fulfil some assumptions (ideas we have about how the data is structured), which is easy to forget! This is particularly dangerous if your data has many dimensions, that is, if you want to summarise multiple variables, for example height, weight and bone density in different populations.
To illustrate how summarising your data in a few statistical parameters can be misleading, let us take the famous Anscombe's quartet (Source: https://en.wikipedia.org/wiki/Anscombe's_quartet). The quartet consists of four datasets of eleven (x, y) pairs each, which share the same statistical parameters, listed below.
|Property||Value||Accuracy|
|Mean of x||9||exact|
|Sample variance of x||11||exact|
|Mean of y||7.50||to 2 decimal places|
|Sample variance of y||4.125||±0.003|
|Correlation between x and y||0.816||to 3 decimal places|
|Linear regression line||y = 3.00 + 0.500x||to 2 and 3 decimal places, respectively|
|Coefficient of determination of the linear regression||0.67||to 2 decimal places|
Plotting the data, however, reveals very different relationships between x and y! From roughly linear, through linear with an outlier influencing the regression coefficient and quadratic, to no relationship at all, these datasets behave differently even though they are described by the same statistical parameter values; see the graph below.
I am not saying that looking at means and variances is a bad idea, but keep in the back of your mind that you are implicitly assuming that your variable is normally distributed. A small plot - such as a box plot or histogram for one variable, or a good old scatter plot when considering two variables - is often enough to remind you of this easy pitfall.
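The quartet is small enough to check by hand. The sketch below recomputes the summary statistics for the four datasets, using the values as they are commonly published (Anscombe, 1973), and draws a rough text scatter plot of each. The `ascii_scatter` helper is a stand-in of my own to keep the example free of dependencies; in practice you would use your usual plotting library.

```python
import math
import statistics

# Anscombe's quartet as commonly published; datasets I to III share the same x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Summary statistics and least-squares line for one dataset."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": statistics.variance(xs),
        "mean_y": mean_y,
        "var_y": statistics.variance(ys),
        "corr": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


def ascii_scatter(xs, ys, width=40, height=12):
    """Very rough text scatter plot: one '*' per (x, y) pair, y increasing upwards."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return "\n".join("".join(line).rstrip() for line in grid)


for name, (xs, ys) in quartet.items():
    stats = summarise(xs, ys)
    print(f"Dataset {name}: " + ", ".join(f"{key}={value:.3f}" for key, value in stats.items()))
    print(ascii_scatter(xs, ys))
    print()
```

The printed statistics are nearly identical across the four datasets, while even these crude plots look nothing alike.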
TO SUM UP
These are some of the common traps that are easy to fall into as you confront your data. The list is of course not exhaustive; other things to watch out for are:
- The curse of dimensionality
- Missing data
- Time dependence, distribution shift
- Lack of objective truth, mistakes in your data
- And many more that I am forgetting...
I hope I made you aware that it is important to consider representation biases when analysing data. Another common cause of bias in data, which I did not elaborate on to keep the blog post at a reasonable length, is measurement error. Examples of sources of measurement error are people not answering a questionnaire truthfully, ‘manual’ labelling errors or wrong calibration of the measurement tool.
So when working with data, keep representation and measurement biases in mind! Also don’t infer causality when your data has not been collected in a suitable way for it, and take the time to do a bunch of plots.
When you start a data science project, be curious and critical. You can trust your data but not blindly.
I realise this statement can be controversial. Some people may argue that with a very simple data structure (with no unobserved factors) this can be done, although I personally would always avoid it.