When pioneering researchers set out, they know their tools and their sample well.
Data collection begins after a researcher has clearly defined and articulated the research problem. Data can be collected from two kinds of sources: primary and secondary. Primary data is collected in the field and thus serves as first-hand information for addressing a specific research problem. Secondary data has already been collected by someone else and has typically been through a statistical process. Before collecting data, the researcher must decide which kind of data the study requires. The methods of collecting primary and secondary data differ accordingly: primary data is collected originally, while working with secondary data is largely a matter of compiling what is already available.
Steps for Data Collection
Various Methods of Data Collection
There are various methods of collecting data. A researcher must know the pros and cons of each method so as to choose the best one for addressing the research problem and making recommendations on the basis of proper data analysis.
Collecting the Primary Data
In the social sciences, primary data is collected through surveys such as the census and sample surveys, and also through observation or by conversing with and interviewing respondents. The main methods for collecting primary data include the observation method, interviews, questionnaires, schedules, content analysis, etc.
A. Observation Method
This method is commonly used in the behavioural sciences. It can serve as a scientific tool only if the researcher has planned it well and records all incidents systematically. In this method, the researcher observes the situation and how respondents behave in a specific setting, structuring the observation around questions such as: What should be observed? How will observations be recorded? How can the accuracy of observation be ensured? There are two ways to carry out observation: participant and non-participant. In participant observation, the researcher becomes a member of the group so as to experience how the other members feel; a researcher who stays detached from the group is a non-participant observer.
B. Interview Method
The interview method involves interaction between the researcher and the respondents on a particular issue, or to capture the impact of an issue of interest. Interviews can be conducted in person or via techno-aids such as Skype, Google Hangouts, email, telephone, etc.
C. Collecting data through questionnaires
When collecting data from a large sample, questionnaires can be designed simply to generate quantitative inputs for testing the hypothesis the researcher has in mind. In this method, the researcher prepares a questionnaire, asks the respondents to fill it in, and collects the completed questionnaires. For each question, the researcher selects a measurement scale and analyses the data accordingly.
Collection of Secondary Data
Secondary data is the data which has been collected and analysed by someone else. Secondary data is available in various research journals; publications of international organisations, governmental organisations; books, magazines, newspapers, public records and statistics, etc.
Data collection is contextual, and the collected data should be used only for research/academic purposes. Research ethics therefore need to be followed during the collection of data, and the researcher must ensure that the data is reliable, suitable, adequate, and accurate.
Selection of Appropriate Method for Data Collection
The most desirable approach for collecting data thus depends on the nature of the problem, the time and resources available, and the level of precision required for addressing the research problem. The selection of a method also depends on the experience and ability of the researcher. As Dr. Bowley said, while collecting data, ‘common sense is the chief requisite and experience is the chief teacher.’
Kothari, C.R. (2013). Research Methodology: Methods and Techniques. New Age International Publishers: Mumbai.
Ontario Human Rights Commission (2019). What is involved in collecting data: Six steps to success. Accessed on June 18, 2019, from http://www.ohrc.on.ca/en/count-me-collecting-human-rights-based-data/6-what-involved-collecting-data-%E2%80%93-six-steps-success
Democracy as a system of governance is supposed to allow extensive representation and inclusiveness, letting people from diverse backgrounds and perceptions feed into the functioning of a fair and just society. Democracy can only be understood in the social and individual contexts that a government provides its citizens through plans, programmes, policies and schemes.
Democratic ideals represent various aspects of the broad idea of “government of the people, by the people and for the people.” India is proud to be the largest democracy in the world. For more than sixty-five years, we have witnessed the conduct of successful elections, peaceful transfers of power at the Centre and in the States, and people exercising their rights and performing their civic duties. At the same time, we quite often experience rampant inequalities, injustice, and the non-fulfilment of social expectations.
Today, people believe that the government is unable to fulfil their expectations. In the last decade, many incidents have taken place that led to unrest among citizens. In July 2018, Union Health Secretary C.K. Mishra made an honest acknowledgement, stating that there are serious problems with India’s public health statistics.
He also mentioned that data from the latest round of the National Family Health Survey (NFHS-4), the major source of detailed health statistics in India, conducted under the umbrella of the Ministry of Health and Family Welfare (MoHFW), is itself unreliable for certain states.
On top of that, the Health Management Information System (HMIS), which Mishra called “a data mine”, is not being used effectively. “We use very little of it in the planning process,” he stated, owing to a lack of expertise to read and understand the data.
The health secretary’s statement raises concerns: how can the country formulate evidence-based policy or plan wisely for the future without credible data? Also, a recent paper by the Health Team of the National Institute of Public Finance and Policy, New Delhi, found that the country’s health data was unreliable, irregularly published, and failed to cover a broad-enough population. And such problems are not restricted to the health sector alone. The entire Indian data ecosystem needs improvement.
The debate over whether India’s data, GDP and other economic statistics included, is unimpeachable remains unsettled. Unemployment is a major challenge for the government and a key socio-economic concern, yet economists cannot measure the problem’s magnitude because they do not have credible figures and surveys. India’s agricultural statistics have also come under the scanner. As for crime, all aggregated data comes from FIRs; no official crime victimization surveys have been instituted yet, though discussions are underway.

Official data sets are required for understanding situations and issues, and every data set comes with caveats that must be considered while making interpretations. Indian data sets, however, are unable to meet standard expectations. Meanwhile, the digitized world of today is producing data at a pace that is unprecedented in human history. It is estimated that more than 3 billion people are connected to the internet today (compared with only 2.3 million in 1990). Access to the internet has led to the rise of big data analytics, commonly defined using the four Vs: volume, variety (of sources), velocity (effectively around the clock) and veracity (given abundance, quality assurance becomes key).
If used effectively, big data analytics can be a powerful tool. It holds performance-enhancement potential for the public sector in the form of better policies, more tailored government services, and more effective and efficient distribution of resources. Used incorrectly, however, it can also lead to negative outcomes, in addition to the much-discussed issue of privacy.
To begin with, there isn’t enough data. The data that does exist is mostly unreliable but is used because there is no alternative. Several important data sets are released with a huge time lag; others are missing granular low-level estimates. Even where such estimates exist, they are not always used for policy making or governance. And even when data sets are good and people want to use them, only a few understand how to work with them to produce findings, analysis or recommendations. All these shortcomings add up to an Indian statistical ecosystem that falls short of the needs of the world’s largest democracy. The problem arises partly because the government employs too few enumerators and internal staff; the government sector must therefore sign contracts or MoUs with private agencies engaged in data-based research.
Experts say that technology can be leveraged to improve data collection systems. Private data collection agencies are already using apps and tools to conduct surveys electronically rather than on paper. Outline India, for instance, has developed an app known as Track your Metrics (TYM), a simple tech-based, self-reporting tool drawing on the work of sectoral leaders and internationally recognized bodies. The platform comes with pre-loaded survey questions to match your study objectives. TYM is an all-in-one platform and application that allows survey formulation, data collection, data monitoring, and outcome and impact evaluation.
While working with the government and various non-profits, I found that many trainees had never used a smartphone. Data collection technology must be made simple, and appropriate training must be conducted, so that people can use it without much trouble. A hit-and-miss approach is not acceptable for data that forms the core of our policy-building process.
Disclaimer: Through this article, I do not wish to educate the reader about field planning; rather it is an experiential account of a novice learner in the field of data science and its usefulness in the social sector.
As an Economics major, I have spent a considerable amount of time learning about policy drafts and their implementation. I have used databases and statistical packages to run regressions on highly complex variables, only to arrive at conclusions that met the demands of the paper alone. Often, I found myself wondering about the composite structure of data, its procurement, and how it serves as fertile ground for policy change. In due time I realised that a holistic learning approach needed much more than reading massive amounts of literature on data; it necessitated interaction with skilled researchers who have mastered the art of data collection and have been involved in it as first-hand users and collectors.
Outline India is an organization that sincerely believes in enabling social impact through data. Inspiration runs rife in the work that it does and the impact that it creates, and by all means, the place captured my inquisitiveness. To a neophyte in the field of data science, the common parlance sounded like a maze of technical mumbo jumbo, and the people seemed very distant. In due course, however, I caught up on the lingo and found myself enjoying the sheltered confines of the organisation.
One afternoon as my colleagues sat down to have lunch, I heard them reminisce about their days on the field, from hilarious accounts of getting ambushed by goats to getting sunburnt in Rajasthan. It was a treat to hear them speak so ardently about what they love doing most. Their anecdotes piqued my curiosity, and I could not help but ask them about the things one should keep in mind before going on the field, the challenges that plague field operations, the most prevalent uses of data collection, and the like. Their recollection of their field days is what prompted me to write this feature. So, for first-time researchers and data collectors, below is a list of things to remember before you go on the field.
It is essential to have additional resources, for the field is an uncertain terrain. One should carry backups of mobile chargers, SIM cards, training material, clothes, food and water, etc.
Preparedness is the key to a successful field operation. One should make lists and create realistic schedules. It is wise to start early in the day to avoid last-minute hassles, and it is imperative to return to one’s accommodation before dark; it is unsafe to be out during the night.
The permission letters should always be kept handy.
Should any discomfort be felt, the interview is to be discontinued with immediate effect and the enumerator is to leave the premises at the earliest.
Women and children ought not to be interviewed alone in closed rooms. They should always be accompanied by a member of their family or a female fieldworker.
At all points, the enumerator must have the contact details of the following:
a) At least one fieldworker (male and female)
b) Your colleagues
c) Hotel reception
Carry a notebook and a pen to make notes; memory starts losing detail within twenty-four hours.
While using recorders/tablets or any other digital devices, the following is to be kept in mind:
a. Ensure that they are switched off when not in use.
b. All devices are to be charged daily. The power supply may be erratic, so one should charge devices whenever time allows.
c. Ensure that devices given out to field workers are capable of being tracked. The devices are to be collected on departure, and it is to be ensured that they are in working condition upon receipt.
d. Ensure devices are being used only for research purposes and not for other activities.
The following is to be ensured when dealing with field staff:
a. Avoid being over-friendly with the field staff.
b. One meets new field members whose backgrounds one is often oblivious of. Should something appear strange, the field supervisor and the team should be informed immediately.
c. Check and re-check the data from the first interview before proceeding to the second respondent.
d. All fields of the questionnaire should necessarily be filled.
In the event that you do not have network connectivity, be accompanied by at least one team member at all times. (This is especially important for female field workers.)
So that is all for now. I hope that, by means of this article, I have been able to reach a few of you and aid you with your field woes.
Lakshita Arora is interning with Outline India.
Back-checks are quality control measures undertaken to corroborate the credibility and legitimacy of data collected from a survey. Back-checks necessitate setting up a back-check team. This team is responsible for re-evaluating survey outcomes by returning to a randomly selected subset of households from which the data was originally collected. These households are re-interviewed using a back-check survey, which contains a subset of questions from the original survey.
Why are back-checks conducted?
Back-checks help to legitimize the data collected from a survey. They act as a tool to assess the enumerator’s effectiveness in conducting a survey and drawing outcomes from it, help detect discrepancies, if any, in survey outcomes, and help identify malfunctioning or obsolete survey tools and under-equipped enumerators.
How are back-checks conducted?
The following steps are to be undertaken for the purpose of back-checking a survey.
Coordinating and sampling for back-checks:
Back-checks are conducted by a back-check team consisting of enumerators of sound credibility and immaculate character, so as to safeguard against biases and errors.
The duration of back-check surveys should ideally be 10-15 minutes.
20% of the back-checks should be conducted within two weeks of the commencement of fieldwork. This ensures that the workability of the questionnaire and the effectiveness of the enumerators are communicated to the researchers in time, helping them adjust their means of gathering high-quality data.
10-20% of the total observations should ideally be back-checked. Missing respondents should be included in the back-check sample.
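The sampling guideline above can be sketched in code. This is a minimal Python illustration, not a tool the article prescribes; the household IDs, the 15% fraction, and the fixed seed are all assumptions for the example:

```python
import random

def draw_backcheck_sample(household_ids, fraction=0.15, missing_ids=(), seed=42):
    """Randomly select a fraction of surveyed households for back-checks.

    Missing respondents are always included, per the guideline above.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(fraction * len(household_ids)))
    sample = set(rng.sample(list(household_ids), k))
    sample.update(missing_ids)  # missing respondents always enter the sample
    return sorted(sample)

surveyed = list(range(1, 101))  # 100 surveyed household IDs (hypothetical)
sample = draw_backcheck_sample(surveyed, fraction=0.15, missing_ids=[7, 23])
print(len(sample))  # 15 randomly drawn households, plus the missing ones
```

Fixing the random seed keeps the back-check sample reproducible, which helps when the field team and the research team need to agree on the same household list.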
Designing back-check surveys:
Data cleaning bridges the gap between data capture and data analysis. Even though data cleaning remains a skill that is rarely taught and almost always picked up on the job, its importance in research studies is unquestionable. When data informs policies that affect lives, it is imperative that the data be reliable, accurate, and precise. “Data cleaners” work on precisely this. Here, we try to give you an insight into why data cleaning is essential, what it is, how we go about cleaning data, and a few tips and tricks that we picked up along the way.
Why data cleaning is important
Imagine trying to cook something delicious with spoiled ingredients. That is what data analysis is like with an unclean dataset. If we had a nickel for every time we hear of policies being based on unreliable data, we would not need to work a day in our lives. As development practitioners, we understand that the stakes are high when policies are informed by shoddy data. Dirty data churns out erroneous results, compromises the effectiveness of policies and programmes, and wastes resources. Data cleaning can prevent this chain of events, ensuring that policies have a real impact and that lives do change.
What is data cleaning?
To get reliable and accurate results from a dataset, that dataset must be of the best quality, because, as they say, “garbage in, garbage out”. Data cleaning is the process of making sure that the quality of your data is top-notch. What this means depends on the specific dataset you are dealing with, but there are a few general guidelines that may be followed.
Data cleaning essentially starts with identifying the issues your dataset may be suffering from. For instance, if you are collecting information on the reproductive health of adolescent girls, you would not want your dataset throwing up information on the reproductive health of women in their thirties. To streamline this discovery of errors, something we learned early on from various resources was this: the data must be relevant, valid, accurate, complete, consistent, and uniform. These terms are illustrated below with relevant examples.
Relevance: Make sure the dataset meets the purpose of the study. A study concerning the impact of skill development programme on girls renders data collected on its effects on boys irrelevant.
Validity: The entries in each cell must be valid, and adhere to constraints imposed. For example, when recording the height of a respondent, the entry must not be negative or an outlier (for example 50 feet). Similarly, age, number of assets, number of family members, etc. must not be negative. Text fields must not have numbers and vice-versa. Make sure to figure out the validity constraints of each of the columns of your dataset, and check for any invalid entries. For example, some questions may be mandatory, and the recorded response must not be empty. Another validity constraint could be on the range of responses that can be entered (gender can only be male or female, age may be constrained to 18-65 years, etc.)
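The validity checks just described can be automated. The following Python sketch uses invented column names and constraints (gender restricted to male/female, age to 18-65, height to a plausible range); adapt the rules to your own codebook:

```python
# Illustrative survey rows; field names and constraint values are assumptions.
rows = [
    {"id": 1, "gender": "female", "age": 34, "height_ft": 5.4},
    {"id": 2, "gender": "male",   "age": -2, "height_ft": 5.1},   # negative age
    {"id": 3, "gender": "other",  "age": 45, "height_ft": 50.0},  # bad gender code, outlier height
]

def validity_problems(row):
    """Return the list of validity constraints this row violates."""
    problems = []
    if row["gender"] not in {"male", "female"}:
        problems.append("gender out of range")
    if not (18 <= row["age"] <= 65):          # age constrained to 18-65 years
        problems.append("age outside 18-65")
    if not (1.0 <= row["height_ft"] <= 8.0):  # 50 ft is clearly an outlier
        problems.append("height outlier")
    return problems

flagged = {row["id"]: validity_problems(row) for row in rows}
print(flagged[1])  # [] -- row 1 passes every check
```

Writing each constraint as an explicit rule also doubles as documentation of the codebook.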
Accuracy: The data collected must be as close to the true value as possible. This could be as simple as looking at phone numbers. See if any of them start with an area code of a whole other region, check to see if it is something like 9999999999. This exercise could be a little bit more complicated too. Say a survey asks the number of children for a female over 14 years of age. For a 14-year-old girl, the record says she has four children. This information is potentially inaccurate and necessitates investigation into whether there was an error at the time of data entry or in the respondent’s understanding of the question.
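The phone-number check mentioned above can be sketched roughly as follows; the 10-digit rule and the repeated-digit filler test are simplifying assumptions, not a complete validation:

```python
import re

def suspicious_phone(number):
    """Flag entries that are the wrong length or obvious filler values."""
    digits = re.sub(r"\D", "", str(number))  # keep digits only
    if len(digits) != 10:                    # assume 10-digit mobile numbers
        return True
    if len(set(digits)) == 1:                # 9999999999-style placeholder
        return True
    return False

print(suspicious_phone("9999999999"))  # True  -- filler entry
print(suspicious_phone("9812045673"))  # False -- plausible number
```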
Completeness: While missing information is a common malady that plagues data, it is crucial to find out what is missing, since missing information may lead to misleading results. For example, in a study of contraception practices, the prevalence of sexually transmitted diseases is a vital variable. If this variable has a high number of “Refused to answer” or “Don’t know” responses, the study will not be able to communicate much. In such cases, the best practice, where possible, is to go back to the field and re-interview the respondents. Moreover, do check that the number of variables and the number of observations in the dataset are correct.
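One way to quantify completeness is to compute the non-response rate for a variable. A small Python sketch with made-up responses and codes:

```python
from collections import Counter

# Hypothetical answers to a sensitive question; the codes are illustrative.
responses = ["yes", "no", "refused", "dont_know", "no", "refused", "yes", "refused"]

counts = Counter(responses)
nonresponse = counts["refused"] + counts["dont_know"]
nonresponse_rate = nonresponse / len(responses)
print(nonresponse_rate)  # 0.5 -- half the answers carry no information
```

A high rate like this is the signal to consider re-interviewing, as described above.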
Consistency: Look out for inconsistent entries. For example, it seems fishy if a respondent’s age is 65 years but the date of his/her marriage is two years ago. It is also essential to check that all the skips in the questionnaire are correctly coded.
Uniformity: The data for each variable, that is, in each column, must be in the same unit. If a question records the age of a child in days, the age of each child in the dataset must be in days - none in years or months. If Panchayat is a variable in your dataset, make sure you use standardised names for it and for other such variables. Translate all text fields into the same language, and normalise the case of all text fields to match.
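The uniformity fixes above, converting units and standardising spellings, can be sketched as follows; the field names and the years-to-days conversion are illustrative assumptions:

```python
# Two hypothetical records with mismatched units and inconsistent spellings.
records = [
    {"panchayat": "Rampur",  "age_value": 2,   "age_unit": "years"},
    {"panchayat": "RAMPUR ", "age_value": 400, "age_unit": "days"},
]

def normalise(rec):
    out = dict(rec)
    out["panchayat"] = rec["panchayat"].strip().title()  # one standard spelling
    if rec["age_unit"] == "years":
        out["age_value"] = rec["age_value"] * 365        # convert all ages to days
    out["age_unit"] = "days"
    return out

clean = [normalise(r) for r in records]
print(clean[0]["age_value"])  # 730 -- both rows are now measured in days
```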
These data cleaning checks are generic and can be further customised for any dataset. However, before anything else, it is crucial that you are familiar with the survey tool, inside and out. Understand its purpose, its flow, the skips, the number of variables, the units of variables, etc. Once that is done, the data cleaning protocol will become much easier, and may seem to be developing on its own! This is to say that once you are thorough with your survey tool, you will be able to intuitively know what to look for in a dataset while cleaning, instead of having to refer to a list of checks. We will discuss a few specific ways of performing data cleaning in the second article of the series.
Data cleaning is about understanding the data at hand as well as the expanse of the study under consideration. Data cleaning is a simple exercise when done using proper methods and planning. It is vital to start from the basics and build your way up.
Things to Remember
The first and foremost thing to keep in mind when working with multiple datasets or multiple copies of the same dataset is the name assignment on files. It is easy to get swamped by the sea of complicated and huge master databases. The approach that we follow is to note down not only the date of creation of the file but also the number of data points contained in it. This is especially useful in case the data needs to be split up for any reason. For more clarity, save your files in dated folders to keep track of your daily work.
It is also imperative to keep a tab on the number of observations in the database. Hence, a rule of thumb when dealing with data: the count is of utmost importance! (Also, always subtract 1, for the first row with variable names, from the count of observations in a single column generated in Excel, unless you want to spend 20 minutes trying to find the missing data!)
Every beginner in the world of data cleaning wonders what tool would be the best for data cleaning. From experience, we realised that Stata, R, and Excel are capable of performing the basic checks discussed in this article. Ultimately, the choice of the tool depends on how comfortable you are with it and how accessible it is.
The aforementioned points should be kept in mind while dealing with any kind of data and can make the data cleaning exercise more efficient and straightforward.
Things to look out for
Almost all primary datasets have a unique identifier attached to each observation. This can be a unique ID, the name of the respondent, or another related variable. These are key variables for the examination and analysis of the data, since the information we want to understand is contained at the unit level. However, duplication is a common issue when dealing with household-level data. A duplicate signifies either multiple visits to the same household or entry of the wrong ID for different households.
A two-step approach should be followed to make corrections for duplicates:
Step 1: Identification
We first need to identify the duplicate values in the database, using the unique identifier as the key variable. Finding duplicates of numeric or alphanumeric IDs can be done using simple commands in STATA (the duplicates command) or in Excel (the highlight-duplicates function). It is possible that a household was revisited because the respondent was unavailable on the first visit (a consent of “no” will be recorded for such a survey). In that case, the entry is not a duplicate and may be controlled for during the analysis.
Using the respondent’s name as an identifier comes with some caveats. An obvious issue is that two or more people can have the same name. In that case, the survey entries should be compared to ascertain whether duplicate values have been recorded. It is advisable to compare more than one variable when checking for duplication; key variables to compare are personal details furnished by the respondent, such as address, age, education level, and marital status.
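The comparison logic above, treating an ID as a true duplicate only when the personal details also match, can be sketched in Python with hypothetical records:

```python
from collections import defaultdict

# Hypothetical survey entries; IDs, names, and fields are invented.
surveys = [
    {"resp_id": "H001", "name": "Asha",  "age": 32, "consent": "yes"},
    {"resp_id": "H002", "name": "Meena", "age": 41, "consent": "no"},   # revisit candidate, not a duplicate
    {"resp_id": "H001", "name": "Asha",  "age": 32, "consent": "yes"},  # true duplicate
]

by_id = defaultdict(list)
for s in surveys:
    by_id[s["resp_id"]].append(s)

# An ID counts as a true duplicate only if the personal details also match.
true_duplicates = [
    rid for rid, group in by_id.items()
    if len(group) > 1
    and all(g["name"] == group[0]["name"] and g["age"] == group[0]["age"]
            for g in group[1:])
]
print(true_duplicates)  # ['H001']
```

In practice more fields (address, education, marital status) would enter the comparison, exactly as the paragraph above recommends.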
Step 2: Rectification
Having identified the duplicate values in the database, a decision needs to be taken on which of the multiple recordings to keep. Complete surveys containing information on the vital parameters of the study should always take precedence over alternative or incomplete entries.
After completing the aforementioned steps, the new dataset will contain unique observations, and any further cleaning of the database has to be carried out after removing the duplicate values.
An efficient way to study the dataset is to observe it column-wise. It is imperative to have knowledge of which question of the survey tool the variable represents, and any relevant validity constraints.
The next thing to look out for is typing errors in the dataset. These can exist in entry fields for names, addresses or numeric entries for multiple choice questions. For example, a “don’t know” response can be coded as “999” but the response entry may contain “99” or “9” instead. Skimming through the observations in the filter set for the particular column in Excel is an easy approach to spot typing errors in the dataset. Another approach is using the tabulate command in STATA. This command will generate a table that will list out all the recorded entries and the corresponding frequencies of a particular variable. Typing errors may be spotted in this list.
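A frequency table like the one STATA’s tabulate command produces can also be built in a few lines of Python; the codebook values here are invented for illustration:

```python
from collections import Counter

# "Don't know" is coded 999; stray 9s and 99s are likely typing errors.
codes = [1, 2, 999, 2, 99, 1, 9, 999, 3]
freq = Counter(codes)

valid = {1, 2, 3, 999}  # valid values per the (illustrative) codebook
typos = {c: n for c, n in freq.items() if c not in valid}
print(typos)  # {99: 1, 9: 1} -- entries to flag for correction
```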
Another issue that can come up is erroneous negative numeric data entries. They can be identified using the methodology delineated above for typing errors. For example, calculated fields such as total spending or earnings can have negative numbers that must be flagged. These fields are automatically computed from responses given in the survey. Say we ask respondents the number of days they worked in the last month and their average daily earnings; the survey platform then calculates total earnings by multiplying the two for each respondent. However, sometimes a respondent may not remember or may not want to answer these questions. In such cases, if “Do not remember” has been coded as -777, the calculated field for total earnings will contain an erroneous value. This has been illustrated below.
[Illustrative table: Number of Days of Work, Average Daily Earning, and the calculated Total Earning, showing how a -777 sentinel in either input produces a spurious negative total.]
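A minimal sketch of the sentinel problem just described, assuming -777 codes “Do not remember”: the calculated field should be flagged rather than computed.

```python
DONT_REMEMBER = -777  # sentinel code, as in the example above

def total_earning(days_worked, daily_earning):
    """Monthly earnings, or None when either answer is the sentinel code."""
    if DONT_REMEMBER in (days_worked, daily_earning):
        return None  # flag, rather than multiplying the sentinel through
    return days_worked * daily_earning

print(total_earning(20, 300))             # 6000
print(total_earning(DONT_REMEMBER, 300))  # None -- naive multiplication gives -233100
```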
In a survey, there are cases wherein personal opinions are recorded. These can correspond to perceptions about an issue, or simply to reasons for non-availability or refusal. Such opinions will, most of the time, be recorded in the local language of the respondent, or as approximate translations entered by the enumerator. The appropriate way to deal with such inconsistencies is to take note of the target users of the dataset and then use appropriate translations. I recommend writing the translated answers in a new column next to the original entry, to maintain the authenticity of the data collection exercise. For example, the entry “pair mein dard” may be translated to “pain in the legs” (in another column) for the question asking what ailments the respondent is currently suffering from.
There is a very thin line between data cleaning and data analysis. While one may perceive replacements as a function performed by a data cleaner, in reality a data cleaner ensures that the data is consistent, of good quality, and in a ready-to-use state for the analysis team. Replacements for missing data or outlier values are performed in tandem with the analysis of the dataset; this ensures that the replacements are suitable for the purpose of the study.
Recommendations for STATA users
Users of STATA know how easy it is to perform basic checks on a dataset. The commands tabulate, summarize and duplicates, when combined with conditions, come in handy for any kind of database. To illustrate: out of 505 respondents, some consented to the survey and some did not. To see the number of respondents who consented, split between males and females, the tabulate command may be used, for example: tabulate gender if consent == 1. Here, 1 for consent corresponds to “yes”.
The summarize command is helpful when you want descriptive statistics (average, range, standard deviation, etc.) for a numeric variable such as rainfall, age, income, or cholesterol level. This command also helps detect outlier entries in the variable.
The duplicates command can be used to list and tag duplicate entries for any variable. The tagging exercise generates a new variable that takes the value 1 if the observation has one duplicate, 2 if it has two duplicates, and so on, and 0 if it has no duplicates. This variable is useful for identifying, ordering and studying the duplicate values in the dataset.
To list duplicates for a variable, use: duplicates list variable_name
To tag duplicates for a variable use: duplicates tag variable_name, generate(new_variable)
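For readers without STATA, the behaviour of duplicates tag can be mimicked in a few lines of Python: each observation receives the number of other observations sharing its value.

```python
from collections import Counter

# Hypothetical respondent IDs; H001 appears three times.
ids = ["H001", "H002", "H001", "H003", "H001"]
counts = Counter(ids)

# Mirror STATA's `duplicates tag`: 0 for unique observations, otherwise the
# number of *other* observations sharing the same value.
tags = [counts[i] - 1 for i in ids]
print(tags)  # [2, 0, 2, 0, 2]
```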
Use the generate command to create dummies wherever possible. Dummy variables can be useful when one wants to apply multiple conditions on one or more variables. For example, we want to understand the newspaper reading habits of males who are over 25 years of age, with higher education, who live in state A. We will start by generating a dummy variable to identify these respondents in the dataset by using the following set of commands. For gender, 1 corresponds to male, and for education (edu), 4 corresponds to higher education.
generate a = 1 if gender == 1 & state == "A" & edu == 4 & age > 25
tabulate newspaper_var if a == 1
The first step tags the observations for which all of the conditions are satisfied. The second step lists out the responses of the variable for the identified group of individuals. When carrying out your analysis, we recommend using the two-step approach of identification and rectification listed out for duplicate values, as it is vital to examine the nature of errors in the dataset before proceeding with the rectification exercise.
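The same two steps, tagging with a dummy and then tabulating conditionally, translate into a Python sketch as follows; the respondent records are hypothetical, with codes matching the STATA example (gender 1 = male, edu 4 = higher education):

```python
from collections import Counter

# Hypothetical respondent records mirroring the example above.
respondents = [
    {"gender": 1, "state": "A", "edu": 4, "age": 30, "newspaper": "daily"},
    {"gender": 1, "state": "A", "edu": 4, "age": 52, "newspaper": "weekly"},
    {"gender": 2, "state": "A", "edu": 4, "age": 40, "newspaper": "daily"},
    {"gender": 1, "state": "B", "edu": 2, "age": 28, "newspaper": "never"},
]

# Step 1: tag the group of interest (males over 25, higher education, state A).
for r in respondents:
    r["a"] = int(r["gender"] == 1 and r["state"] == "A"
                 and r["edu"] == 4 and r["age"] > 25)

# Step 2: tabulate the reading habits of the tagged group only.
table = Counter(r["newspaper"] for r in respondents if r["a"] == 1)
print(dict(table))  # {'daily': 1, 'weekly': 1}
```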
Automating the cleaning process by creating do-files that can be replicated on a small section of the master database can make our lives a lot easier, and the data cleaning exercise more fun. Remember that writing STATA commands is like writing sentences, but in STATA’s language. It is advisable to keep your commands as simple, and your do-files as self-explanatory, as possible.
Notwithstanding how exciting one may find data cleaning to be, the best way to clean a dataset is to minimise the possibility of receiving incorrect, irrelevant or missing data. As an agency that collects data from the ground, we make sure to make our surveys as foolproof as possible, and we train the enumerators to collect quality data. Moreover, the data cleaning exercise complements data collection and monitoring. For instance, for a survey that would span a few months, initial sets of data received from the field can shed light on where the data is subpar and also let us know the kind and the extent of errors the enumerators are making. Such monitoring will allow for early detection and speedy action to amend further data collection.
With an ever-growing dependence on data for policy-making, there is an immediate requirement to standardise the protocols for cleaning and maintaining databases. This series is a small step in that direction.
Ashmika Gouchwal is a Quantitative Researcher at Outline India. Himanshi Sharma is a Research Associate at Outline India.