If there’s one thing I’ve learned in my decades as a data professional, it’s this:
Things aren’t what you think they are.
Data's ultimate purpose is to drive decisions. But our data isn’t as reliable or accurate as we want to believe. This leads to a most undesirable result:
Bad data means bad decisions
As a data professional, part of our mission is to make data "good enough" for use by others. We spend time scrubbing and cleaning data to make it consumable by other teams.
But we didn’t go to school to become a data janitor.
And yet, here we are. We understand that all data is dirty, but some data is useful.
The origins of dirty data
The main reason why data is dirty and often unreliable is simple: human intervention.
If not for humans, then data would be clean, perfect, and clear. If you’ve ever played the game “Telephone
,” then you understand how humans are awful at relaying even the simplest of data. But it gets worse when you realize that data today is collected from machines that are programmed by humans. There are assumptions being made by both the programmer and the person using the tool. These assumptions lead to data quality issues.
Here are but a few examples to consider:
Your standards change
– Standards change all the time. For example, scientific standards have been updated over the years in an effort to improve precision. In November 2018, the standard definition of a kilogram changed
. This means that any system using the old standard is producing wrong calculations. Tools are subject to rounding errors when the standards change. This leads to bad calculations and dirty data.
Your data collections fail
– Our collection tools and methods can collect the wrong data, or no data. Or worse, they could have issues with unit conversion. You might see (ms) and assume milliseconds, but it could be microseconds. Just because a collection is automated doesn’t mean it can be trusted. Humans are still touching the data.
Your data sets are incomplete
– You could have a rather large dataset and think, “Jackpot.” But large datasets are often incomplete. They have missing attributes or were put together by someone scraping websites. While the internet provides everyone the ability to access data at any time and for any need, it does not guarantee that the data is valid.
Time series collections lack context
– Time series collections are all the rage these days. In our effort to be DevOps-y, we stream logs to achieve observability
and perform analytics for insights. The problem is that this streaming data often lacks context for what is being measured. Often, the data being measured is changing. The simplest example is retail sales tied to seasons. You need context with your data. And SysAdmins know that measuring CPU by itself doesn’t have enough context—you need to collect additional metrics to tell the whole story.
All of the above can lead to the following:
– A single event is recorded and entered into your dataset twice.
– Fields that should contain values don’t.
– Information not entered correctly or not maintained.
– Typos, transpositions, variations in spelling, or formatting (say hello to my little friend Unicode!)
– Data entered in the wrong field.
By now you should understand that it is difficult, nay impossible, to determine if data is ever clean.
Methods to clean your dirty data
Here's a handful of techniques that you should consider when working with data. Remember, all data is dirty; you won't be able to make it perfect. Your focus should be making it "good enough" to pass along to the next person.
The first thing you should do when working with a dataset is to examine the data. Ask yourself, "Does this data make sense?
Then, before you do anything else, make a copy or backup of your data before you begin to make the smallest change
. I cannot stress this enough.
OK, so we've examined the data to see if it makes sense, and we have a copy. Here are a few data cleaning techniques.
Identify and remove duplicate data
– Tools such as Excel and PowerBI make this easy. Of course, you'll need to know if the data is duplicated, or two independent observations. For relational databases, we often use primary keys
as a way to enforce this uniqueness of records. But such constraints aren't available for every system that is logging data.
Remove data that doesn't fit
– Data entered that doesn't help you answer the question you are asking.
Identify and fix issues with spelling, etc.
– There are lots of ways to manipulate strings to help get your data formatted and looking pretty. For example, you could use the TRIM function to remove spaces from the text in a column, then sort the data and look for things like capitalization and spelling. There are also regional terms, like calling a sugary beverage "pop" or "soda."
– Set a standard for the data. If the data is a number, make sure it is a number. Often times you will see “three” instead of a 3, or a blank instead of a 0. If the data attribute is categorical, make sure the entries that apply for that category.
– But only when it makes sense to do so! If the outlier was due to poor collection, then it could be safe to remove. Hammond’s Law states
that “Ninety percent of the time, the next measurement will fall outside the 90% confidence interval.” Be mindful that outliers are innocent until proven guilty.
Fix missing data
– This gets... tricky. You have two options here. Either you remove the record, or you update the missing value. Yes, this is how we get faux null values. For categorical data, I suggest you set the data to the word “missing.” For numerical data, set the value to 0, or to the average of the field. I avoid using faux nulls for any data, unless it makes sense to note the absence of information collected. Your mileage may vary.
We all fall into the same trap: we never have time to do it right in the first place, but somehow think there will be time to fix it later.
When it comes to data, it’s a never-ending process of curating, consuming, and cleaning. And no matter how much you scrub that data, it will never be clean… but it can be good enough.
Life’s dirty. So is your data. Get used to it.