Guide to Bad Data
I recently discovered the Quartz Guide to Bad Data, and I wish I had found it long ago. I’ve been sharing it with my teams to help them spot issues with fishy and broken datasets. There are lots of useful indicators and gotchas here.
This guide was written to help Quartz staff identify problems with data they report on. The most up-to-date version of this guide can always be found on GitHub.
An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
As a reporter your world is full of data. And those data are full of problems. This guide presents thorough descriptions and possible solutions to many of the kinds of problems that you will encounter when working with data.
Most of these problems can be solved. Some of them can’t be solved and that means you should not use the data. Others can’t be solved, but with precautions you can continue using the data. In order to allow for these ambiguities, this guide is organized by who is best equipped to solve the problem: you, your source, an expert, etc. In the description of each problem you may also find suggestions for what to do if that person can’t help you.
You cannot possibly review every dataset you encounter for all of these problems. If you try to do that you will never get anything published. However, by familiarizing yourself with the kinds of issues you are likely to encounter, you will have a better chance of identifying an issue before it causes you to make a mistake.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
A few of my favorites
Values are missing
Beware blank or “null” values in any dataset unless you are certain you know what they mean. If the data are annual, was the value for that year never collected? If it is a survey, did a respondent refuse to answer the question?
Any time you’re working with data that has missing values you should ask yourself: “Do I know what the absence of this value means?” If the answer is no, you should ask your source.
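As a quick way to surface these gaps before you go back to your source, here is a minimal Python sketch; the set of null-like tokens is my assumption and should be tuned to your dataset:

```python
import csv
import io

# Tokens that commonly stand in for "missing" (an assumption; tune per dataset).
NULL_TOKENS = {"", "null", "na", "n/a", "none", "-"}

def count_missing(rows):
    """Count null-like values per column in an iterable of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if str(value).strip().lower() in NULL_TOKENS:
                counts[column] = counts.get(column, 0) + 1
    return counts

# Example: a survey-style CSV with a blank and an explicit "null".
raw = "year,respondents\n2019,1200\n2020,\n2021,null\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(count_missing(rows))  # {'respondents': 2}
```

Knowing *which* columns are gap-heavy tells you which questions to put to your source first.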
Zeros replace missing values
Worse than a missing value is when an arbitrary value is used instead. This can be the result of a human not thinking through the implications, or of an automated process that simply doesn’t know how to handle null values. In any case, if you see zeros in a series of numbers you should ask yourself if those values are really the number `0` or if they instead mean “nothing”. (`-1` is also sometimes used this way.) If you aren’t sure, ask your source.
The same caution should be exercised for non-numerical values where a `0` may be represented in another way. For example, a false `0` value for a date is often displayed as `1970-01-01T00:00:00Z` or `1969-12-31T23:59:59Z`, which is the Unix epoch for timestamps (or the second just before it). A false `0` for a location might be represented as `0°00’00.0″N+0°00’00.0″E` or simply `0°N 0°E`, a point in the Atlantic Ocean just south of Ghana often referred to as Null Island.
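Both kinds of false zeros can be flagged mechanically. A minimal Python sketch, where the one-second tolerance and helper names are illustrative assumptions:

```python
from datetime import datetime, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def is_suspect_timestamp(ts):
    """Flag timestamps at, or within one second of, the Unix epoch."""
    return abs((ts - UNIX_EPOCH).total_seconds()) <= 1

def is_null_island(lat, lon):
    """Flag coordinates at exactly 0°N 0°E ("Null Island")."""
    return lat == 0.0 and lon == 0.0

print(is_suspect_timestamp(datetime(1969, 12, 31, 23, 59, 59, tzinfo=timezone.utc)))  # True
print(is_null_island(0.0, 0.0))    # True
print(is_null_island(5.55, -0.2))  # False
```

A hit from either check is a prompt to ask your source, not proof the row is wrong.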
Suspicious values are present
If you see any of these values in your data, treat them with an abundance of caution:
- `99999` (or any other long sequence of 9’s)
- `00000` (or any other sequence of 0’s)
- `0°00’00.0″N+0°00’00.0″E` or simply `0°N 0°E`
- US zip code `12345` (Schenectady, New York)
- US zip code `90210` (Beverly Hills, CA)
Each of these values is an indication of a particular error made by either a human or a computer. If you see them, ensure they actually mean what you think they mean!
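A rough screen for these sentinels might look like the following Python sketch; the value list mirrors the bullets above and should be extended for your data:

```python
def looks_suspicious(value):
    """Heuristic screen for common sentinel values (assumed list; extend it)."""
    s = str(value).strip()
    if len(s) >= 4 and set(s) in ({"9"}, {"0"}):
        return True  # long runs such as 99999 or 00000
    return s in {"12345", "90210"}  # placeholder US zip codes

for v in ["99999", "00000", "12345", "90210", "10118"]:
    print(v, looks_suspicious(v))
```

A flagged value is not automatically wrong: `90210` is a real zip code. The flag just tells you where to look.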
Spreadsheet has 65536 rows
The maximum number of rows an old-fashioned Excel spreadsheet was allowed to have was 65,536. If you receive a dataset with exactly that number of rows you have almost certainly been given truncated data. Go back and ask for the rest. Newer versions of Excel allow for 1,048,576 rows, so it’s less likely you’ll be working with data that hits the limit.
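This is trivial to check mechanically once you can count the rows you received. A small Python sketch, using the two documented Excel sheet-size limits:

```python
# Historical sheet-size limits; a row count exactly at one suggests truncation.
EXCEL_LIMITS = {65_536: "Excel 97-2003 (.xls)", 1_048_576: "Excel 2007+ (.xlsx)"}

def truncation_warning(row_count):
    """Return a warning string if the row count equals a known Excel limit."""
    if row_count in EXCEL_LIMITS:
        return (f"{row_count} rows is the {EXCEL_LIMITS[row_count]} limit; "
                "the data may be truncated")
    return None

print(truncation_warning(65_536))
print(truncation_warning(1_000))  # None
```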
Spreadsheet has dates in 1900, 1904, 1969, or 1970
For reasons that are now obscure, Excel’s default date from which it counts all other dates is `January 1st, 1900`, *unless* you’re using Excel on a Mac, in which case it’s `January 1st, 1904`. There are a variety of ways in which data in Excel can be entered or calculated incorrectly and end up as one of these two dates. If you spot them in your data, it’s probably an issue.
Many databases and applications will generate a date of `1970-01-01T00:00:00Z` or `1969-12-31T23:59:59Z`, which is the Unix epoch for timestamps (or the second just before it). In other words, this is what happens when a system tries to display an empty value or a `0` value as a date.
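If you ever need to decode raw Excel serial day numbers yourself, the 1900/1904 split matters: the same serial maps to dates about four years apart. A simplified Python sketch (it ignores the off-by-one quirk for serials below 60 in the 1900 system):

```python
from datetime import date, timedelta

def excel_serial_to_date(serial, mac_1904=False):
    """Convert an Excel serial day number to a calendar date.

    Day 0 is 1899-12-30 in the 1900 system (absorbing Excel's phantom
    1900-02-29) and 1904-01-01 in the Mac 1904 system. Serials below 60
    in the 1900 system are off by one; this sketch ignores that.
    """
    epoch = date(1904, 1, 1) if mac_1904 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)

print(excel_serial_to_date(44_000))                 # 2020-06-18 (1900 system)
print(excel_serial_to_date(44_000, mac_1904=True))  # 2024-06-19 (1904 system)
```

If every date in a column lands almost exactly four years off from what you expect, the file was probably written with the other epoch.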
Text has been converted to numbers
Not all numerals are numbers. For instance, the US Census Bureau uses “FIPS codes” to identify every place in the United States. These codes are of various lengths and are numeric. However, they are *not* numbers. `037` is the FIPS code for Los Angeles County. It is not the number `37`. The numerals `37` are, however, a valid FIPS code: for North Carolina. Excel and other spreadsheets will often make the mistake of assuming numerals are numbers and stripping the leading zeros. This can cause all kinds of problems if you try to convert it to another file format or merge it with another dataset. Watch out for data where this has happened before it was given to you.
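To see the damage concretely, here is a small Python sketch: casting the code to a number silently destroys the leading zero, and once that has happened you can only re-pad it if you know the intended width (an assumption you must confirm):

```python
import csv
import io

raw = "fips,name\n037,Los Angeles County\n"
row = next(csv.DictReader(io.StringIO(raw)))

# Wrong: casting to int silently destroys the leading zero.
print(int(row["fips"]))  # 37 — now indistinguishable from North Carolina's code

# Repair is only possible if you know the intended width (3 digits here).
print(str(37).zfill(3))  # 037
```

The safer path is to never let the conversion happen: keep identifier columns as strings when you load the file.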
Data assert unrealistic precision
Outside of hard science, few things are routinely measured with more than two decimal places of accuracy. If a dataset lands on your desk that purports to show a factory’s emissions to the 7th decimal place that is a dead giveaway that it was estimated from other values. That in and of itself may not be a problem, but it’s important to be transparent about estimates. They are often wrong.
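A quick way to spot this is to count fractional digits; a column full of identical, long fractional parts often betrays derived estimates rather than measurements. A Python sketch (it inspects the text form, so read the values in as strings):

```python
def decimal_places(value):
    """Count digits after the decimal point in a value's text form."""
    s = str(value)
    return len(s.split(".")[1]) if "." in s else 0

# Illustrative made-up readings: uniform 7-digit fractions are a red flag.
readings = ["12.0482913", "9.6031741", "15.1176250"]
print([decimal_places(r) for r in readings])  # [7, 7, 7]
```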
Benford’s Law fails
Benford’s Law is a theory which states that small digits (1, 2, 3) appear at the beginning of numbers much more frequently than large digits (7, 8, 9). In theory Benford’s Law can be used to detect anomalies in accounting practices or election results, though in practice it can easily be misapplied. If you suspect a dataset has been created or modified to deceive, Benford’s Law is an excellent first test, but you should always verify your results with an expert before concluding your data have been manipulated.
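If you want to run the test yourself, Benford’s predicted share of leading digit *d* is log10(1 + 1/d). A minimal Python sketch, where the digit-extraction heuristic is my simplification (it skips zero values):

```python
from collections import Counter
from math import log10

def benford_expected():
    """Benford's predicted share of each leading digit d: log10(1 + 1/d)."""
    return {d: log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_distribution(values):
    """Observed share of leading digits 1-9 (skips zeros; simplified)."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# About 30% of naturally occurring numbers should lead with a 1. Compare the
# observed distribution to the expectation and treat large gaps as a prompt
# to investigate, never as proof of manipulation on its own.
print(round(benford_expected()[1], 3))  # 0.301
```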
Too good to be true
There is no global dataset of public opinion. Nobody knows the exact number of people living in Siberia. Crime statistics aren’t comparable across borders. The US government is not going to tell you how much fissile material it keeps on hand.
Beware any data that purport to represent something that you could not possibly know. It’s not data. It’s somebody’s estimate and it’s probably wrong. Then again… it could be a story, so ask an expert to check it out.