Data Cleaning Series: A Guide to Data Cleaning

By Ashmika Gouchwal & Himanshi Sharma

Data cleaning is about understanding the data at hand as well as the expanse of the study under consideration. Data cleaning is a simple exercise when done using proper methods and planning. It is vital to start from the basics and build your way up.

 

Things to Remember

 

The first and foremost thing to keep in mind when working with multiple datasets or multiple copies of the same dataset is the name assignment on files. It is easy to get swamped by the sea of complicated and huge master databases. The approach that we follow is to note down not only the date of creation of the file but also the number of data points contained in it. This is especially useful in case the data needs to be split up for any reason. For more clarity, save your files in dated folders to keep track of your daily work.

 

It is also imperative to keep a tab on the number of observations in the database. Hence a rule of thumb to be followed when dealing with data is that count is of utmost importance! (Also always subtract 1 (for the first row with variable names) from the count of observations in a single column generated in excel, unless you want to spend 20 minutes trying to find the missing data!)

 

Every beginner in the world of data cleaning wonders what tool would be the best for data cleaning. From experience, we realised that Stata, R, and Excel are capable of performing the basic checks discussed in this article. Ultimately, the choice of the tool depends on how comfortable you are with it and how accessible it is. 

 

The aforementioned points should be kept in mind while dealing with any kind of data and can make the data cleaning exercise more efficient and straightforward. 

 

Things to look out for 

 

Almost all primary datasets have a unique identifier attached to each observation. This can either be a unique ID or the name of the respondent or another related variable. These are key variables for the examination and analysis of the data since the information that we want to understand is contained at the unit data level. However, duplicity is an issue faced when dealing with household level data. The duplication signifies either multiple visits to the same household or input of the wrong ID for different households. 

 

A two-step approach should be followed to make corrections for duplicates: 

 

Step 1: Identification 

We need to first identify the duplicate values in the database. The unique identifier is the key variable to be used for this purpose. Finding duplicates of numeric or alpha-numeric IDs can be done using simple commands on STATA (the duplicates command) or in Excel (highlight duplicates function). It is possible that a revisit has been made to the same household due to lack of availability on the first visit (a consent no will be recorded for such a survey). In this case, this input is not a duplicate and may be controlled for during the analysis. 

 

Using the respondent name as an identifier comes with some caveats. An obvious issue is that two or more people can have the same name. In this case, the survey entries should be compared to ascertain if duplicate values have been recorded or not. It is advisable to compare more than one variable to check for duplicity. Key variables to be compared are personal details like address, age, education level, marital status, and so on that are furnished by the respondent. 

 

Step 2: Rectification 

 

Having identified the duplicate values in the database, a decision needs to be taken to keep one of the multiple recordings. Complete surveys containing information about the vital parameters for the study should always have precedence over the alternative entries or incomplete entries.

 

After completing the aforementioned steps, the new dataset will contain unique observations, and any further cleaning of the database has to be carried out after removing the duplicate values.

 

An efficient way to study the dataset is to observe it column-wise. It is imperative to have knowledge of which question of the survey tool the variable represents, and any relevant validity constraints.  

 

The next thing to look out for is typing errors in the dataset. These can exist in entry fields for names, addresses or numeric entries for multiple choice questions. For example, a “don’t know” response can be coded as “999” but the response entry may contain “99” or “9” instead. Skimming through the observations in the filter set for the particular column in Excel is an easy approach to spot typing errors in the dataset. Another approach is using the tabulate command in STATA. This command will generate a table that will list out all the recorded entries and the corresponding frequencies of a particular variable. Typing errors may be spotted in this list. 

 

Another issue that can come up is erroneous negative numeric data entries. They can be identified by using the methodology delineated above for typing errors. For example, calculated fields such as total spending or earning can have negative numbers that must be flagged. These fields are automatically calculated from responses given in the survey. Say, we ask the respondent the number of days they worked in the last month, and their average daily earnings. The survey platform automatically calculates total earnings by multiplying the number of days worked with average daily earnings for each respondent. However, sometimes a respondent may not remember or may not want to answer these questions. In such cases, if “Do not remember” has been coded in as -777, the calculated field for total earning will have an erroneous. This has been illustrated below.

 

Respondent

Number of Days of Work 

Average Daily Earning

Total Earnings

Status

A

21

500

10500


B

15

-777

-11655

Flag

C

-777

-777

603729

Flag

 

In a survey, there are cases wherein personal opinions are recorded. They can correspond to perceptions about an issue or just reasons for non-availability or refusal. These opinions will, most of the time, be recorded in the local language of the respondent or will be approximate translations posted by the enumerator. The appropriate method to deal with such inconsistencies is to take a note of the target users of the dataset and then use appropriate translations for the same. I recommend writing the translated answers in another new column next to the original entry to maintain the authenticity of the data collection exercise. To quote an example, the entry “pair mein dard” may be translated to “pain in legs” (in another column) for the question asking what diseases the respondent is currently suffering from.

 

There is a very thin line between data cleaning and data analysis. While one may perceive replacements to be a function that is performed by a data cleaner, the reality is that a data cleaner ensures that the data is consistent and of good quality and is in a ready to use state for the analysis team. Replacement for missing data or outlier values in the dataset are functions that are performed in tandem with the analysis of the dataset. This ensures that the replacements are suitable for the purpose of the study. 

 

Recommendations for STATA users

 

Users of STATA know how easy it is to perform basic checks on the dataset. The commands tabulate, summarise and duplicate, when combined with conditions come in handy for any kind of database. To illustrate, out of 505 respondents, a few consented to the survey and a few did not. In order to see the number of respondents who consented to the survey divided between males and females, the following command tabulate may be used. Here, 1 for consent corresponds to “yes”.

 

The summarise command is helpful when you want to look at descriptive statistics (average, range, standard deviation etc.) for a numeric variable such as rainfall, age, income, cholesterol level and so on. This command also detects outlier entries in the variable. 

 

The duplicate command can be used to list and tag duplicate entries for any variable. The tagging exercise involves the generation of a new variable that takes the value 1 if the observation has one duplicate, 2 if the observation has two duplicates and so on, and takes the value 0 if the observation has no duplicates. The generation of this variable is beneficial for identifying, ordering and studying the duplicate values in the dataset.

 

To list duplicates for a variable use: duplicates list variabe_name

To tag duplicates for a variable use: duplicates tag variable_name, generate(new_variable)

 

Use the generate command to create dummies wherever possible. Dummy variables can be useful when one wants to apply multiple conditions on one or more variables. For example, we want to understand the newspaper reading habits of males who are over 25 years of age, with higher education, who live in state A. We will start by generating a dummy variable to identify these respondents in the dataset by using the following set of commands. For gender, 1 corresponds to male, and for education (edu), 4 corresponds to higher education. 

 

generate a = 1 if  gender == 1 & state == “A” & edu == 4 & age >25

tabulate newspaper_var if a == 1

 

The first step tags the observations for which all of the conditions are satisfied. The second step lists out the responses of the variable for the identified group of individuals. When carrying out your analysis, we recommend using the two-step approach of identification and rectification listed out for duplicate values, as it is vital to examine the nature of errors in the dataset before proceeding with the rectification exercise. 

 

Automating the cleaning process by creating do-files that can be replicated for a small section of the master database can make our lives a lot easier, and the data cleaning exercise more fun. Remember that writing STATA commands is like writing sentences but in STATA’s language. It is advisable to keep your commands as simple and your do file as explanatory as possible.

 

Conclusion

 

Notwithstanding how exciting one may find data cleaning to be, the best way to clean a dataset is to minimise the possibility of receiving incorrect, irrelevant or missing data. As an agency that collects data from the ground, we make sure to make our surveys as foolproof as possible, and we train the enumerators to collect quality data. Moreover, the data cleaning exercise complements data collection and monitoring. For instance, for a survey that would span a few months, initial sets of data received from the field can shed light on where the data is subpar and also let us know the kind and the extent of errors the enumerators are making. Such monitoring will allow for early detection and speedy action to amend further data collection. 

 

With an ever-growing dependence on data for policy-making, there is an immediate requirement to standardise the protocols for cleaning and maintaining databases. This series is a small step in that direction.

 

 Ashmika Gouchwal is a Quantitative Researcher at Outline India. Himanshi Sharma is a Research Associate at Outline India. 

CARPE DATUM

Subscribe to our newsletter

Top