Data cleaning bridges the gap between data capture and data analysis. Even though data cleaning remains a skill that is rarely taught and almost always picked up on the job, its importance in research studies is unquestionable. When data informs policies that affect lives, it is imperative that the data be reliable, accurate, and precise. This is exactly what “data cleaners” work on. Here, we try to give you an insight into why data cleaning is essential, what it is, how we go about cleaning data, and a few tips and tricks that we picked up along the way.
Why data cleaning is important
Imagine trying to cook something delicious with spoiled ingredients. That is what data analysis is like with an unclean dataset. If we had a nickel for every time we heard of policies being based on unreliable data, we would not need to work a day in our lives. As development practitioners, we understand that the stakes are high when policies are informed by shoddy data. Dirty data churns out erroneous results, compromises the effectiveness of policies and programmes, and wastes resources. Data cleaning can prevent this chain of events and ensure that policies have a real impact and that lives do change.
What is data cleaning?
To get reliable and accurate results from a dataset, that dataset must be of the best quality, because, as they say, “garbage in, garbage out”. Data cleaning is the process of making sure that the quality of your data is top-notch. This may mean several things, depending on the specific dataset you are dealing with. However, there are a few general guidelines that may be followed.
Data cleaning essentially starts with identifying the issues that your dataset may be suffering from. For instance, if you are collecting information on the reproductive health of adolescent girls, you would not want your dataset throwing up information on the reproductive health of women in their thirties. To streamline this discovery of errors, something we learned early on from various resources was this: the data must be relevant, valid, accurate, complete, consistent, and uniform. These terms are illustrated below with relevant examples.
Relevance: Make sure the dataset meets the purpose of the study. In a study concerning the impact of a skill development programme on girls, data collected on its effects on boys is irrelevant.
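As a rough sketch, an out-of-scope check like this can be scripted once the data sits in a tabular tool such as pandas. The file name and the gender column below are placeholders for illustration, not from any real dataset.

```python
import pandas as pd

# Hypothetical survey export; file and column names are assumptions
df = pd.read_csv("survey.csv")

# In a girls-only programme study, records for boys are out of scope
out_of_scope = df[df["gender"] != "Female"]
print(f"{len(out_of_scope)} out-of-scope records to review or drop")
df = df[df["gender"] == "Female"]
```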
Validity: The entries in each cell must be valid and adhere to the constraints imposed. For example, when recording the height of a respondent, the entry must not be negative or an outlier (for example, 50 feet). Similarly, age, number of assets, number of family members, etc. must not be negative. Text fields must not contain numbers and vice versa. Make sure to figure out the validity constraints of each of the columns of your dataset and check for any invalid entries. For example, some questions may be mandatory, and the recorded response must not be empty. Another validity constraint could be on the range of responses that can be entered (gender can only be male or female, age may be constrained to 18-65 years, etc.).
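To make this concrete, here is a minimal sketch of how such validity checks might look in pandas; the column names and ranges are illustrative assumptions, so adapt them to your own survey tool.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Each rule below encodes one validity constraint (assumed column names)
invalid_height = df[(df["height_ft"] <= 0) | (df["height_ft"] > 8)]  # no negatives or 50-foot respondents
invalid_age = df[(df["age"] < 18) | (df["age"] > 65)]                # allowed range 18-65
missing_mandatory = df[df["consent"].isna()]                         # mandatory question left blank
invalid_gender = df[~df["gender"].isin(["Male", "Female"])]          # outside the allowed responses

for label, frame in [("height", invalid_height), ("age", invalid_age),
                     ("mandatory consent", missing_mandatory), ("gender", invalid_gender)]:
    print(f"{label}: {len(frame)} invalid entries")
```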
Accuracy: The data collected must be as close to the true value as possible. This could be as simple as looking at phone numbers: see if any of them start with the area code of an entirely different region, or check whether an entry is something like 9999999999. The exercise could be a little more involved, too. Say a survey asks every female respondent aged 14 or above for her number of children, and for a 14-year-old girl, the record says she has four children. This information is potentially inaccurate and calls for an investigation into whether there was an error at the time of data entry or in the respondent’s understanding of the question.
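Both kinds of accuracy checks - the obvious placeholder phone number and the implausible combination of answers - can be flagged with a few lines of code. Again, the column names here are assumptions made for illustration.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Phone numbers that are just one digit repeated ten times, e.g. 9999999999
phone = df["phone"].astype(str)
placeholder_numbers = df[phone.str.fullmatch(r"(\d)\1{9}")]

# A 14-year-old recorded with four children deserves a second look
implausible = df[(df["age"] == 14) & (df["num_children"] >= 4)]

print(len(placeholder_numbers), "placeholder-looking phone numbers")
print(len(implausible), "records flagged for follow-up with the field team")
```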
Completeness: While missing information is a common malady that plagues data, it is crucial to find out what is missing. Missing information may lead to misleading results. For example, in a study of contraception practices, the prevalence of sexually transmitted diseases is a vital variable. If this variable has a high number of “Refused to answer” or “Don’t know” responses, the study will not be able to communicate much. In such cases, the best practice, wherever possible, is to go back to the field and re-interview the respective respondents. Moreover, do check that the number of variables and the number of observations in the dataset are correct.
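A completeness check along these lines might look something like the sketch below; the expected shape and the column name are made-up figures used purely to illustrate the pattern.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Confirm the dataset has the expected number of observations and variables
expected_rows, expected_cols = 1200, 45      # assumed figures for this example
assert df.shape == (expected_rows, expected_cols), f"unexpected shape: {df.shape}"

# How much of a key variable is effectively missing?
non_responses = ["Refused to answer", "Don't know"]
std_col = "std_prevalence"                   # assumed column name
missing_share = (df[std_col].isna() | df[std_col].isin(non_responses)).mean()
print(f"{missing_share:.1%} of responses are missing or non-informative")
```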
Consistency: Look out for inconsistent entries. For example, it seems a bit fishy if a respondent’s age is recorded as 65 years but the date of his/her marriage is two years ago. It is also essential to check that all the skips in the questionnaire are correctly coded.
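Cross-variable and skip-pattern checks can also be expressed as simple filters. The sketch below assumes columns named years_since_marriage and ever_married, which are illustrative only.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Cross-check related variables: years married cannot exceed the respondent's age
inconsistent = df[df["years_since_marriage"] > df["age"]]
print(len(inconsistent), "records where marriage duration exceeds age")

# Skip logic: respondents who never married should have no marriage details
skip_violations = df[(df["ever_married"] == "No") & df["years_since_marriage"].notna()]
print(len(skip_violations), "skip-pattern violations")
```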
Uniformity: The data for each variable, that is, in each column, must be in the same unit. This is to say that if a question records the age of a child in days, the age of every child in the dataset must be in days - none in years or months. If Panchayat is a variable in your dataset, make sure you are using standardised names for Panchayats, and do the same for other such variables. You must translate all text fields into the same language and change the case of all text fields to match.
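Unit conversions and name standardisation are equally easy to script. The sketch below assumes the age is split into a value column and a unit column, and that Panchayat names are free text - both assumptions made for the sake of illustration.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Bring mixed-unit ages onto a single scale (days), using rough conversion factors
factor = df["age_unit"].map({"days": 1, "months": 30, "years": 365})
df["age_days"] = df["age_value"] * factor

# Standardise free-text Panchayat names: trim stray spaces and harmonise the case
df["panchayat"] = df["panchayat"].str.strip().str.title()
```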
These data cleaning checks are generic and can be further customised for any dataset. However, before anything else, it is crucial that you are familiar with the survey tool, inside and out. Understand its purpose, its flow, the skips, the number of variables, the units of the variables, and so on. Once that is done, the data cleaning protocol will become much easier and may even seem to develop on its own! That is to say, once you are thorough with your survey tool, you will intuitively know what to look for in a dataset while cleaning, instead of having to refer to a list of checks. We will discuss a few specific ways of performing data cleaning in the second article of this series.