Types of Missing Data in Data Science

Missing data is a common problem in data analysis, and it can have a significant impact on the results of statistical analysis. Values can go missing for different reasons, and the reason matters: it determines which handling methods are appropriate and which will introduce bias. In this article, we will discuss the four types of missing data: structurally missing data, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Structurally Missing Data

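Structurally missing data refers to values that are missing because there is logically nothing to record: the question does not apply to that participant, so no value was ever expected. For example, a participant who does not have asthma has no inhaler brand to report. This type of missing data cannot be recovered, because there is no true value behind the blank. It is a consequence of how the data is structured, not of how participants behaved. Structurally missing data mainly causes trouble when it is treated like ordinary missing data: imputing values that should not exist, or dropping rows that are actually complete, can lead to biased estimates and reduced statistical power.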

Let’s say that a section of our health survey asks about common respiratory conditions, and part of our data looks like the following:

ParticipantID  AsthmaFlag  InhalerFrequency  InhalerBrand
100            TRUE        Twice daily       Breathe-Rite
101            TRUE        Once weekly       Breathe-Rite
102            FALSE
103            TRUE        Once daily        Asthm-Away
104            FALSE

As we can see, two rows do not have any data for the frequency and brand. This is expected: the AsthmaFlag field shows that these participants do not have asthma, so they have no inhaler use to report.
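The check described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows are typed in by hand from the table, `None` stands in for an empty cell, and the helper name `classify_gap` and its three labels are hypothetical, not part of any library.

```python
# Rows mirror the asthma table above; None marks an empty cell.
rows = [
    {"ParticipantID": 100, "AsthmaFlag": True,
     "InhalerFrequency": "Twice daily", "InhalerBrand": "Breathe-Rite"},
    {"ParticipantID": 101, "AsthmaFlag": True,
     "InhalerFrequency": "Once weekly", "InhalerBrand": "Breathe-Rite"},
    {"ParticipantID": 102, "AsthmaFlag": False,
     "InhalerFrequency": None, "InhalerBrand": None},
    {"ParticipantID": 103, "AsthmaFlag": True,
     "InhalerFrequency": "Once daily", "InhalerBrand": "Asthm-Away"},
    {"ParticipantID": 104, "AsthmaFlag": False,
     "InhalerFrequency": None, "InhalerBrand": None},
]

INHALER_FIELDS = ("InhalerFrequency", "InhalerBrand")


def classify_gap(row, field):
    """Label one cell as 'present', 'structural', or 'unexpected' (hypothetical labels)."""
    if row[field] is not None:
        return "present"
    # No asthma means there is no inhaler use to report, so the gap is expected.
    if not row["AsthmaFlag"]:
        return "structural"
    # An asthma participant with a blank inhaler field needs a closer look.
    return "unexpected"


labels = {
    row["ParticipantID"]: [classify_gap(row, field) for field in INHALER_FIELDS]
    for row in rows
}

for participant_id, cell_labels in labels.items():
    print(participant_id, cell_labels)
```

Separating structural gaps from unexpected ones first means that later steps, such as dropping or imputing rows, only touch the gaps that represent a real unknown value.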

Missing Completely at Random (MCAR)

Missing completely at random (MCAR) is a type of missing data that occurs randomly and independently of any other variables or factors in the dataset. In other words, the missingness is not related to any observed or unobserved characteristics of the study participants or the study design. For example, if some survey participants did not complete the survey due to a power outage, this would be considered MCAR. MCAR data can be handled with complete case analysis, where only complete cases are used in the analysis. Under MCAR this gives unbiased estimates, although discarding rows reduces statistical power. If the data are not actually MCAR, complete case analysis can produce biased estimates.

Let’s imagine our health survey data again. Step counts are recorded by a device as part of tracking activity, and let’s say there is a bug in the software that causes the device to not record steps. Whether someone has the bug in their device is completely random, and we know from the developers that about 20% of devices are affected. Therefore, we might expect that any missing step counts are MCAR. The data below shows a sample of our data:

ParticipantID  Walked  Steps (1,000)
25             TRUE    2.1
43             TRUE    15
61             TRUE    6
62             TRUE
78             TRUE    3
84             TRUE
90             TRUE    0.5
102            TRUE    1.5
110            TRUE    0.01
115            TRUE    4.1

Since all of these people have indicated that they walked on that day, we can assume that the value for Steps should not be missing. We also know about this bug, and we see that 2 of the 10 participants shown (20%) have no recorded steps, which matches the rate the developers reported. Therefore we may be able to assume that the missing values are simply missing by chance, and there isn’t an underlying reason for the missingness (this is actually a really big assumption, but we know about the bug, so it seems ok).
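As a rough sketch of the reasoning above, the snippet below computes the share of missing step counts in the sample and then runs a complete case analysis by averaging only the recorded values. The values are copied by hand from the table, `None` marks a missing count, and the variable names are made up for illustration. In pandas, the same idea is usually expressed with `isna()` and `dropna()`.

```python
from statistics import mean

# Steps in thousands, keyed by ParticipantID; None marks a missing count.
steps_thousands = {
    25: 2.1, 43: 15.0, 61: 6.0, 62: None, 78: 3.0,
    84: None, 90: 0.5, 102: 1.5, 110: 0.01, 115: 4.1,
}

# Share of participants with no recorded step count.
missing_ids = [pid for pid, steps in steps_thousands.items() if steps is None]
missing_rate = len(missing_ids) / len(steps_thousands)

# Complete case analysis: keep only the rows where steps were recorded.
complete_cases = [steps for steps in steps_thousands.values() if steps is not None]
complete_case_mean = mean(complete_cases)

print(f"Missing rate: {missing_rate:.0%}")
print(f"Mean steps (thousands), complete cases only: {complete_case_mean:.2f}")
```

A missing rate close to the 20% the developers reported is consistent with the bug explanation, but it does not prove MCAR. The complete case mean is only trustworthy if that assumption holds.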

Missing at Random (MAR)

Missing at random (MAR) is a type of missing data that occurs when the probability of missing data depends on other observed variables in the dataset. In other words, once we account for the observed variables, the missingness is not related to the unobserved values themselves. For example, suppose participants with a higher education level are more likely to answer a question about income, and education level is recorded for everyone. If, among people with the same education level, answering does not depend on income itself, the missing income values would be considered MAR. MAR data can be handled with methods such as multiple imputation, where missing values are imputed based on observed data.

Let’s look at a sampling of our survey data.

ParticipantID  Height (cm)  Weight (kg)
1              176          84
2              190
3              160          61
4              180
5              184
6              158          72
7              152          50
8              156
9              194          104
10             180          79

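As we can see, some weights are missing, while every height is recorded. We know that some people don’t like to report their weight, and we also know that not everyone feels that way. If the chance of skipping the weight question depends only on information we did record, such as height, and not on the unreported weight itself, then this data is Missing At Random (we know that this is not what “random” usually means, but it’s the statistical definition). If instead heavier people were the ones more likely to skip the question, the data would be MNAR, which is described in the next section.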
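To make "imputed based on observed data" concrete, here is a minimal sketch that fills each missing weight from height using a least-squares line fitted on the complete rows. Note what it is and is not: this is single regression imputation, a simpler relative of multiple imputation. It also assumes that missingness depends only on height, which the survey data does not establish. Six complete rows are far too few for a real model, and all variable names are illustrative.

```python
# (height_cm, weight_kg) keyed by ParticipantID; None marks a missing weight.
records = {
    1: (176, 84), 2: (190, None), 3: (160, 61), 4: (180, None),
    5: (184, None), 6: (158, 72), 7: (152, 50), 8: (156, None),
    9: (194, 104), 10: (180, 79),
}

# Fit on complete cases only.
complete = [(h, w) for h, w in records.values() if w is not None]
n = len(complete)
mean_h = sum(h for h, _ in complete) / n
mean_w = sum(w for _, w in complete) / n

# Ordinary least squares for: weight = intercept + slope * height.
s_hw = sum((h - mean_h) * (w - mean_w) for h, w in complete)
s_hh = sum((h - mean_h) ** 2 for h, _ in complete)
slope = s_hw / s_hh
intercept = mean_w - slope * mean_h

# Predict a weight for every participant whose weight is missing.
imputed = {
    pid: intercept + slope * h
    for pid, (h, w) in records.items()
    if w is None
}

for pid, weight in imputed.items():
    print(f"Participant {pid}: imputed weight {weight:.1f} kg")
```

Multiple imputation goes further: it adds random variation to each prediction, repeats the process to create several completed datasets, and pools the results so that the uncertainty from the missing values shows up in the final estimates.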

Missing Not at Random (MNAR)

Missing not at random (MNAR) is a type of missing data that occurs when the probability of missing data depends on unobserved information, often the missing value itself. In other words, even after accounting for every observed variable, the missingness is still related to what we did not get to see. For example, if survey participants with high levels of income are less likely to respond to a question about income, this would be considered MNAR. MNAR data is challenging to handle because the observed data alone cannot reveal the missingness mechanism. Methods such as selection models, which model the outcome and the missingness together, may be used to account for it.
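The income example can be illustrated with a small simulation. Everything in it is hypothetical: the income distribution, the response rates of 90% and 40%, and the median cutoff are made-up numbers chosen only to show the direction of the bias, not estimates from any real survey.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical right-skewed incomes for 10,000 simulated participants.
incomes = [random.lognormvariate(10.5, 0.6) for _ in range(10_000)]
cutoff = sorted(incomes)[len(incomes) // 2]  # roughly the median income


def responds(income):
    """Simulate whether a participant answers the income question."""
    # Hypothetical response rates: high earners answer less often.
    probability = 0.4 if income > cutoff else 0.9
    return random.random() < probability


# Only the incomes of participants who answered are observed.
observed = [income for income in incomes if responds(income)]

true_mean = sum(incomes) / len(incomes)
observed_mean = sum(observed) / len(observed)

print(f"True mean income:     {true_mean:,.0f}")
print(f"Observed mean income: {observed_mean:,.0f}")
print(f"Response rate:        {len(observed) / len(incomes):.0%}")
```

Because missingness depends on the income value itself, the observed mean falls below the true mean, and nothing in the observed incomes alone signals that this has happened. That is why MNAR requires outside knowledge or an explicit model of the missingness.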

So, missing data can occur in various ways, and different types of missing data require different methods of handling. Understanding the types of missing data is essential for ensuring unbiased estimates and accurate conclusions in statistical analysis. Careful consideration should be given to the reasons for missing data and the appropriate methods for handling missing data.
