# Dealing with your crappy data

Let’s confront a real problem in Big Data. Inside your data warehouse lurk errors that could potentially render your data as useless as a thousand disks packed with random numbers. With all the hype in the industry around storage, transfer, data access, point-and-click analysis software, etc., more emphasis should be placed on detecting, evaluating, and mitigating data errors.

If you are thinking, “my data contains no errors”, you are living in denial. Unless you are storing output produced by a closed-form mathematical expression or a random number generator, you should recognize that any sensor or data collector — whether an industrial process temperature sensor, a bar code reader, a social media sentiment analyzer, or a human recording a customer transaction — generates some error along with the true value.

In its basic form, the data residing in your data warehouse, the “measured value”, is a combination of the true value plus some unknown error component:

### Measured Value = True Value + Unknown Error

This expression is valid for continuous variables, where differences can be quantified (like position, account balance, or number of miles between customer and retail store). It is also valid for categorical variables: discrete values that are either ordered (like income level or age category) or unordered (like product category or name of sales associate).

Now, if we knew the error for each measured value, we could simply subtract it and restore the true value. But, unfortunately, that is usually not possible (otherwise, we would just store the true value right away), so we have to characterize the error as an uncertainty:

### True Value = Measured Value ± Uncertainty

Clearly, if the error is significant, it can hide the true value, resulting in analytics that are useless and potentially misleading. One of the fundamental responsibilities of a data scientist is to characterize the uncertainty in the data due to the error, and to know (or estimate) when the error could contribute to faulty analysis, incorrect conclusions, and ultimately bad decisions.

Leeds General Infirmary in England shut down its children’s heart surgery ward for eleven days in March 2013 as errors and omissions in patient data led the hospital’s directors to incorrectly conclude that the child mortality rate was almost twice the national average. We assess that had these data errors been properly detected and mitigated, the results of the analysis would have been accurate, the decision to shut down the ward would not have been made, and more patients could have been treated. Source: http://www.bbc.com/news/health-22076206

#### So, what are the types of errors in my data?

Data errors occur in all shapes and sizes, and uncertainty analysis should consider both the type of error and its relative magnitude. Many methods of error mitigation are available, but each is effective only for specific types of error, and could even amplify the error if applied carelessly. Let’s decompose the unknown error into random errors and systematic errors, because the ways of handling them are very different:

Measured Value = True Value + Random Error + Systematic Error
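To make the decomposition concrete, here is a minimal simulation sketch. The true value, noise spread, and offset are invented for illustration; in real data none of them would be known in advance.

```python
import random
import statistics

def measure(true_value, noise_sd, bias, rng):
    """One simulated reading: true value + random error + systematic error."""
    return true_value + rng.gauss(0.0, noise_sd) + bias

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_VALUE = 100.0        # hypothetical true temperature
NOISE_SD = 2.0            # assumed spread of the random error
BIAS = 1.5                # assumed constant sensor offset

readings = [measure(TRUE_VALUE, NOISE_SD, BIAS, rng) for _ in range(10_000)]
mean_reading = statistics.fmean(readings)
spread = statistics.stdev(readings)

# Averaging shrinks the random error, but the mean settles near
# TRUE_VALUE + BIAS rather than TRUE_VALUE: the offset does not average out.
print(f"mean reading: {mean_reading:.2f}")
print(f"spread:       {spread:.2f}")
```

With many readings, the spread recovers the assumed noise level while the mean stays shifted by the offset, which is why the two components have to be handled differently.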

#### Noise — the Random Errors

Suppose a sales associate repeatedly types in customers’ names, phone numbers, and addresses, and every once in a while makes a typo: a replacement, deletion, or insertion error. Any previous typo has no influence on the next one; these typos are assumed to be independent of each other. Or suppose you are recording temperature at various times and places in an industrial plant, where each sensor reports a variation about an average, and the average is an accurate representation of the true temperature for a time and location. In both examples, the errors have no sequential pattern and are entirely random, or noisy.

The nice thing about random errors is that they are readily handled by general-purpose statistical methods that you can find in mainstream statistical software packages. Noise can be estimated and smoothed out with simple averages and standard deviations, or with more advanced filtering or prediction techniques, such as Kalman filters or spectral analysis (for example, methods based on Fourier or wavelet transforms). Other noise sources, such as the random typos above, can be readily fixed using statistical parsers or spelling and grammar checking techniques.
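As a sketch of the simplest option mentioned above, the following applies a centered moving average to a noisy signal. The slowly varying "temperature" signal, the noise level, and the window size are all assumptions chosen for illustration; a Kalman filter or spectral method would replace `moving_average` in a more demanding setting.

```python
import math
import random

def moving_average(values, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the truth."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))

rng = random.Random(0)
# Hypothetical slowly varying true temperature, plus independent noise.
truth = [20.0 + 5.0 * math.sin(2 * math.pi * i / 500) for i in range(1000)]
noisy = [t + rng.gauss(0.0, 1.0) for t in truth]
smoothed = moving_average(noisy, window=15)

raw_rms = rms_error(noisy, truth)
smooth_rms = rms_error(smoothed, truth)
print(f"RMS error before smoothing: {raw_rms:.3f}")
print(f"RMS error after smoothing:  {smooth_rms:.3f}")
```

The window is a trade-off: a wider window removes more noise but also flattens genuine short-lived changes in the signal.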

#### Bias — the Systematic Errors

Just like a wristwatch that is running ahead by three minutes, systematic errors maintain consistent patterns from measurement to measurement. An important distinction between random errors and systematic errors is that while random errors can be handled statistically, systematic errors cannot (Taylor, 1997).

Unfortunately, biases – fixed or drifting offset errors – tend to work their way into most measurements from all types of data collectors, from uncalibrated sensors to inadequately sampled survey populations. In fact, the only real way to detect or eliminate bias is to compare with some form of truth reference. For example, one of our customers presented us with data sets of physical measurements from heavy land machinery, where each machine reported large quantities of data with different biases. We quantified bias characteristics by comparing the machinery measurements against a validated baseline for a limited subset of the data. Then, by applying relaxation algorithms, we were able to minimize the bias errors relative to the machines.
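The comparison against a truth reference can be sketched as follows. This is not the relaxation approach described above; it is a much simpler stand-in that assumes each machine has a fixed offset and estimates it as the mean difference from a validated baseline on a small subset. Machine names, bias values, and the subset size are hypothetical.

```python
import random
import statistics

def estimate_offset(measured, reference):
    """Mean difference between readings and a trusted reference (fixed offset only)."""
    return statistics.fmean([m - r for m, r in zip(measured, reference)])

rng = random.Random(7)
true_values = [rng.uniform(50.0, 150.0) for _ in range(2000)]
assumed_bias = {"machine_a": 3.0, "machine_b": -1.2}  # hypothetical; unknown in practice
BASELINE_N = 200  # only this many readings have a validated baseline

estimated = {}
residual_bias = {}
for name, bias in assumed_bias.items():
    readings = [t + bias + rng.gauss(0.0, 0.5) for t in true_values]
    # Estimate the offset on the validated subset only.
    offset = estimate_offset(readings[:BASELINE_N], true_values[:BASELINE_N])
    corrected = [r - offset for r in readings]
    estimated[name] = offset
    # Mean error left after correction, checked on the readings not used for the estimate.
    residual_bias[name] = statistics.fmean(
        [c - t for c, t in zip(corrected[BASELINE_N:], true_values[BASELINE_N:])]
    )
    print(f"{name}: estimated offset {offset:+.2f}, residual bias {residual_bias[name]:+.3f}")
```

A drifting offset would need a time-dependent model rather than a single mean, and the estimate is only as good as the baseline it is compared against.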

Suppose we survey a consumer group about product preferences from urban communities in one large metropolitan area. Even with very large sample sizes, are these results meaningful for rural populations? Are they representative of other cities throughout the US? If we gathered even more responses from the same city, would our data be a better approximation for the US? The answer is most likely *no* in each case, because the data is biased toward the preferences of the sampled population.

Sampling bias also appears in a form we see regularly, one governed by the Nyquist criterion. Suppose we want to compute the slopes along a particular road from GPS elevation measurements recorded at regular quarter-mile intervals. The obvious problem with this approach is that any hills whose tops are separated by less than half a mile will be aliased: the slopes will appear much smaller than they are in reality. (This is the same type of problem that causes the wagon wheels to appear to rotate backwards in old western movies.)
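The road example can be reproduced with a small sketch. The road profile is an assumption: a pure sinusoid with hilltops 0.3 miles apart and 30 feet of amplitude, which is closer together than the half-mile limit implied by quarter-mile sampling.

```python
import math

WAVELENGTH_MI = 0.3    # hypothetical hilltop-to-hilltop spacing
AMPLITUDE_FT = 30.0    # hypothetical hill height above the mean grade

def elevation_ft(x_mi):
    """Assumed road profile: a pure sinusoid."""
    return AMPLITUDE_FT * math.sin(2 * math.pi * x_mi / WAVELENGTH_MI)

def max_abs_slope(step_mi, length_mi=6.0):
    """Largest |rise / run| in feet per mile between consecutive samples."""
    n = int(round(length_mi / step_mi))
    xs = [i * step_mi for i in range(n + 1)]
    zs = [elevation_ft(x) for x in xs]
    return max(abs(z1 - z0) / step_mi for z0, z1 in zip(zs, zs[1:]))

true_max = AMPLITUDE_FT * 2 * math.pi / WAVELENGTH_MI  # analytic maximum slope
coarse = max_abs_slope(0.25)   # quarter-mile GPS samples
fine = max_abs_slope(0.01)     # dense reference sampling

print(f"true maximum slope:    {true_max:.0f} ft/mile")
print(f"quarter-mile sampling: {coarse:.0f} ft/mile")
print(f"dense sampling:        {fine:.0f} ft/mile")
```

No amount of additional quarter-mile data fixes this; the only remedies are denser sampling or an independent reference for the terrain.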

We regularly see such biases working themselves undetected into analytics that could lead to bad decisions. In our experience, detecting and mitigating bias is much more challenging than dealing with random error, because it requires an intimate knowledge of the domain and because standard statistical methods are not generally applicable.

### How to make your crappy data useful

Now that we have described how error can dramatically reduce the utility of your data, what should one do to mitigate its bad influence on analytics?

First, know your data and quantify its uncertainty. Understand the conditions and environment under which the data is collected; then, for a representative part of your data, find a trustworthy baseline to compare against. Using the baseline to “reverse-engineer” the errors, quantify the random and the systematic errors separately, realizing that the mitigation techniques will be quite different. Describe the spread of the random error with common measures, such as the standard deviation; describe the systematic fixed or varying offset errors as means and slopes over sequential segments in the data. It is especially important to characterize the uncertainty when merging in new data sources, to ensure that the new data doesn’t amplify the errors significantly.
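A minimal sketch of this characterization step, assuming a baseline is available for every reading in the sample: the residuals (measured minus baseline) are split into sequential segments, and each segment is summarized by its mean offset, its drift slope from a least-squares line, and the standard deviation of what remains. The function names, the segment length, and the synthetic drift are illustrative choices.

```python
import random
import statistics

def fit_line(ys):
    """Least-squares intercept and slope of ys against index 0..n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(ys)
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def characterize(measured, baseline, segment_len):
    """Per segment: mean offset, drift slope, and spread of the detrended residual."""
    residuals = [m - b for m, b in zip(measured, baseline)]
    report = []
    for start in range(0, len(residuals), segment_len):
        seg = residuals[start:start + segment_len]
        intercept, slope = fit_line(seg)
        detrended = [r - (intercept + slope * i) for i, r in enumerate(seg)]
        report.append({
            "mean_offset": statistics.fmean(seg),   # systematic: fixed part
            "slope": slope,                         # systematic: drift per sample
            "noise_sd": statistics.stdev(detrended),  # random part
        })
    return report

rng = random.Random(3)
baseline = [rng.uniform(0.0, 100.0) for _ in range(1000)]
# Synthetic errors: offset 2.0, drift 0.01 per sample, noise with sd 0.5.
measured = [b + 2.0 + 0.01 * i + rng.gauss(0.0, 0.5) for i, b in enumerate(baseline)]

report = characterize(measured, baseline, segment_len=500)
for k, seg in enumerate(report):
    print(k, {name: round(value, 3) for name, value in seg.items()})
```

Each segment needs at least a few points for the slope and spread to be meaningful, so the segment length should be chosen with the expected drift rate in mind.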

Second, understand how error can affect your analysis. We frequently use a simulation for a sensitivity analysis, starting with an error-free condition and progressively increasing the random and systematic errors until we detect a significant reduction in performance. Suppose we have a model that predicts whether an automobile service customer will return to the service bay, based in part on the distance the customer lives from the service bay. We can then insert different error conditions on the distance variable and empirically determine when the model fails to predict customer behavior reliably.
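The sensitivity analysis might be sketched like this. The "model" here is a deliberately trivial stand-in, a hypothetical rule that customers within 10 miles return, and the error levels are arbitrary; a real study would substitute the actual model and realistic error ranges.

```python
import random

THRESHOLD_MI = 10.0   # hypothetical decision rule: customers within 10 miles return

def predict_returns(distance_mi):
    """Stand-in model: predicts a return visit from distance alone."""
    return distance_mi < THRESHOLD_MI

def accuracy(true_distances, noise_sd, bias, rng):
    """Share of customers classified the same as with error-free distances."""
    hits = 0
    for d in true_distances:
        measured = d + rng.gauss(0.0, noise_sd) + bias
        hits += predict_returns(measured) == predict_returns(d)
    return hits / len(true_distances)

rng = random.Random(1)
distances = [rng.uniform(0.0, 30.0) for _ in range(5000)]

# Start error-free, then increase random error and systematic error in turn.
conditions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (0.0, 2.0), (0.0, 5.0)]
results = {}
for noise_sd, bias in conditions:
    results[(noise_sd, bias)] = accuracy(distances, noise_sd, bias, rng)
    print(f"noise sd {noise_sd:>4} mi, bias {bias:>4} mi -> "
          f"accuracy {results[(noise_sd, bias)]:.3f}")
```

Sweeping the two error types separately shows which one the model is more sensitive to, and therefore where mitigation effort pays off first.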

Third, apply error mitigation as a preprocessing stage. In our experience, many analytic tools, such as classifiers, can perform better when the random error is smoothed out. Unmitigated biases can propagate inconsistent data features into downstream analytics, so it is useful to first determine the regions in the data that are potentially affected by bias. If the high-bias regions can be identified and the bias cannot be mitigated, those regions can be excluded. Detection and mitigation of bias must be specifically tailored to the type of data and the method of collection.

#### So, how crappy is your data?

Do you know if the errors are affecting your results, and providing potentially flawed “insights”? Are you tracking the noise or the signal? Is your data so corrupted by error that any advanced analytics lead to contradictory conclusions? If so, you may need to refocus your corporate data strategy on more enhanced error characterization and mitigation techniques.
