Data Cleaning Series: An Introduction to Data Cleaning

By Ashmika Gouchwal &Himanshi Sharma

Data cleaning bridges the gap between data capture and data analysis. Even though data cleaning remains skill that is rarely taught, and almost always picked up on the job, its importance in research studies is unquestionable. When data informs policies that affect lives, it is imperative that that data be reliable, accurate, and precise. “Data cleaners” work precisely on this. Here, we try to give you an insight into why data cleaning is essential, what it is, how we go about cleaning data and a few tips and tricks that we picked up on the way. 


Why data cleaning is important


Imagine trying to cook something delicious with spoiled ingredients. That is how data analysis would be with a dataset, which is unclean. If we had a nickel for every time that we hear of policies being based on unreliable data, we would not need to work a day of our lives. As development practitioners, we understand that the stakes are high when policies are informed by shoddy data. Dirty data would churn out erroneous results, compromise the effectiveness of policies and programmes, cause wastage of resources. Data cleaning can avoid this chain of events, ensure that policies have a real impact and that lives do change.


What is data cleaning?

To get reliable and accurate results from a dataset, that data set must be of the best quality, because as they say – “garbage in; garbage out”. Data cleaning is the process of making sure that the quality of your data is top-notch. This may mean several things, depending on the specific dataset you are dealing with. However, there are a few general guidelines that may be followed.


Data cleaning essentially starts from identifying the issues that your dataset may be suffering with. For instance, if you are collecting information on the reproductive health of adolescent girls, you would not want your dataset throwing up information on the reproductive health of women in their thirties. To streamline this discovery of errors, something we learned early on from various resources was this: The data must be relevant, valid, accurate, complete, consistent, and uniform. These terms are illustrated below with relevant examples.


Relevance: Make sure the dataset meets the purpose of the study. A study concerning the impact of skill development programme on girls renders data collected on its effects on boys irrelevant. 


Validity: The entries in each cell must be valid, and adhere to constraints imposed. For example, when recording the height of a respondent, the entry must not be negative or an outlier (for example 50 feet). Similarly, age, number of assets, number of family members, etc. must not be negative. Text fields must not have numbers and vice-versa. Make sure to figure out the validity constraints of each of the columns of your dataset, and check for any invalid entries. For example, some questions may be mandatory, and the recorded response must not be empty. Another validity constraint could be on the range of responses that can be entered (gender can only be male or female, age may be constrained to 18-65 years, etc.)


Accuracy: The data collected must be as close to the true value as possible. This could be as simple as looking at phone numbers. See if any of them start with an area code of a whole other region, check to see if it is something like 9999999999. This exercise could be a little bit more complicated too. Say a survey asks the number of children for a female over 14 years of age. For a 14-year-old girl, the record says she has four children. This information is potentially inaccurate and necessitates investigation into whether there was an error at the time of data entry or in the respondent’s understanding of the question.


Completeness: While missing information is a common malady that plagues data, it is crucial to find out what is missing. Missing information may lead to misleading results. For example in a study of contraception practices, the prevalence of sexually transmitted diseases is a vital variable. If this variable has a high number of “Refused to answer” or “Don’t Knows”, the study will not be able to communicate much. In such cases, the best practice is to go back to the field and re-interview the respondents, if that is possible. The best way is to reach out to the respective respondent and interview them again. Moreover, do check that the number of variables and the number of observations in the dataset is correct.


Consistency: Look out for inconsistent entries. For example, it seems a bit fishy if a respondent’s age is 65 years, but the day of his/her marriage is two years ago. Also, it is essential to check that all the skips in the questionnaire are correctly coded.


Uniformity: The data for each variable, that is, in each column, must be of the same unit. This is to say that if aquestion records the age of a child in days, the age of each child in the dataset must be in days - none in years or months. If Panchayat is a variable in your dataset, make sure you are using standardised names for them, and other such variables. You must translate all text fields to the same language, and change the case of all text fields to match.


These data cleaning checks are generic and can be further customised for any dataset. However, before anything else, it is crucial that you are familiar with the survey tool, inside and out. Understand its purpose, its flow, the skips, the number of variables, the units of variables, etc. Once that is done, the data cleaning protocol will become much easier, and may seem to be developing on its own! This is to say that once you are thorough with your survey tool, you will be able to intuitively know what to look for in a dataset while cleaning, instead of having to refer to a list of checks. We will discuss a few specific ways of performing data cleaning in the second article of the series.



 

 

CARPE DATUM

Subscribe to our newsletter

Top