Everyone who works with data has probably heard the term “dirty data”. Dirty data undermines the integrity of a dataset. Common characteristics of dirty data include incomplete, inaccurate, inconsistent, and duplicated records.
Incomplete data means some essential features are empty. For example, suppose your task is to predict house prices, and the “area of the house” is critical to a good prediction but is missing for many rows. Filling that gap will be challenging, and your model is unlikely to perform well without it.
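To make this concrete, here is a minimal sketch in pandas of how you might first quantify the missing values and then patch them. The column names and figures are hypothetical, and median imputation is just one simple (and lossy) strategy among many:

```python
import pandas as pd

# Hypothetical house-price data; "area" is missing for some rows.
houses = pd.DataFrame({
    "price": [350_000, 410_000, 290_000, 520_000],
    "area":  [120.0, None, 95.0, None],   # square metres; two values missing
    "rooms": [3, 4, 2, 5],
})

# Count missing values per column to see how incomplete the data is.
missing = houses.isna().sum()

# One simple strategy: impute missing areas with the column median.
houses["area"] = houses["area"].fillna(houses["area"].median())
```

Whether imputation is acceptable depends on the feature: for something as critical as area, dropping the rows or going back to the data source may be safer than inventing values.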
Inaccurate and inconsistent data is technically valid but wrong in context — for example, when an employee changed their address and the record was never updated, or when several copies of the data exist and the data scientist received an outdated version.
Duplicate data is a common problem. Let me share a story that happened to me while I was working at an e-commerce company. By design, when a visitor clicked a “collect voucher” button, the website sent a response to the server. This allowed us to measure the number of users who had collected vouchers.
The site ran well until one day something changed without my knowledge. A frontend developer added a second response, fired when a voucher was collected successfully. The rationale was that some vouchers could be out of stock, so they wanted to distinguish visitors who merely clicked the button from those who actually collected a voucher.
At that point, two responses were sent to the same log table. Looking at my reporting tool, the number of vouchers collected seemed to have doubled overnight! As I had deployed a model the day before, I assumed that my new model was just that impressive. I remember giving a mental standing ovation to my little model, but later, I realised it was just double counting.
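The fix was to stop counting raw rows and count distinct collection events instead. A minimal sketch, with hypothetical column and event names standing in for the real log schema:

```python
import pandas as pd

# Hypothetical log table: after the frontend change, each successful
# collection wrote two rows (a "click" response and a "collected" response).
logs = pd.DataFrame({
    "visitor_id": [1, 1, 2, 2, 3],
    "voucher_id": ["A", "A", "A", "A", "B"],
    "event":      ["click", "collected", "click", "collected", "click"],
})

# A naive row count double-counts every successful collection.
naive_count = len(logs)

# Counting one row per visitor/voucher pair removes the duplication.
dedup_count = logs.drop_duplicates(subset=["visitor_id", "voucher_id"]).shape[0]
```

The deeper lesson is less about pandas than about process: a metric that changes overnight usually signals a change in the logging pipeline, not in the model.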
Also, in my last five years as a data scientist, much of the data I've received has come from manual entry by corporate staff. It arrives in Excel spreadsheets, and much of it is inaccurate, incomplete, and inconsistent.
Whether the data comes from manual human input or machine logs, data wrangling is a large part of real-world work, and data scientists have to deal with it. For supervised learning to work, we need reliable, correctly labelled data — you can't build a predictive model without it. But nobody likes labelling data.
Many describe this as the 80/20 rule. Data scientists spend only 20 percent of their time on building models and the other 80 percent gathering, analysing, cleaning, and reorganising data. Dirty data is the most time-consuming aspect of the typical data scientist’s work.
It's worth stressing that data cleaning is essential; messy data won't produce good results. You've probably heard the phrase “garbage in, garbage out”.
Data scientists do make discoveries while swimming in data, but before data scientists can start training any models, they must first become data janitors. The data needs cleaning, the data needs labelling.