EDA step of the Data Science Process
A critical step in the data science process is Exploratory Data Analysis, or EDA. Linus Torvalds once said, “Bad programmers worry about the code. Good programmers worry about the data structures and their relationships.” Although Torvalds said this in the context of programming, it applies just as well in the data science setting. Examining your data thoroughly to understand its underlying structure is imperative to building good, and even better, models.
Recently, as I started on the EDA step of my first data science project, I searched for best practices for this step. I knew why EDA was important, but I wanted to understand the what and the how. I learned that EDA is about how you look at data, what you look for in the data, and how you interpret what you find. Although this blog might only seem to add to the plethora of information on EDA already online, I hope to provide a meaningful approach to the EDA process.
There are no set-in-stone rules for exactly how to explore data. It is important to understand that the EDA process is iterative. It is not a one-time step in the data science process, but rather something you come back to again and again until the data science process is complete. You start out by generating questions about the data. To answer those questions, you use various visualization tools and data transformations, and try out different models. Then you refine your questions and/or generate new questions using what you have already learned, go back to answering those questions, and so on. So let’s begin.
Look for Missing Values
“Are there missing values in my data?” Once you have investigated the shape of the data set, identified the number of observations and independent variables or features, and checked the data type of each variable or feature, you need to look for missing values in your data set. If any missing or ‘NaN’ values are discovered, you are left with three options.
# import necessary libraries
import pandas as pd

# load and save the data set as a pandas dataframe (the file path is a placeholder)
df = pd.read_csv('data.csv')

df.shape  # the number of rows (observations) and columns (variables or features) in the dataframe df
df.head()  # previews the first five observations of the dataframe df
df.info()  # lists each variable, its number of non-null data points and its data type
df.isna().sum()  # returns a series with the total number of missing values in each column of the dataframe df
Firstly, you can replace or impute them. When replacing or imputing missing values, consider how the replacement might change or affect the distribution of the data or the meaning of the observation. You may impute the missing value with a statistic that describes the central tendency (mean, median, or mode) of the data, or with another value from the data (be sure to explain your reasoning regardless of what you choose). What you choose will depend on the type of data at hand, and ideally it will be a value that does not change the distribution of the data or the representation of the observation.
The second option is to remove them altogether. Sometimes you will encounter missing values for which any imputation might skew the distribution or the meaning of the data. Then it might be best to remove the observation altogether. (Be wary of your sample size and/or number of features, in case you don’t have a large n.)
The third option is to keep them as is. Sometimes missing values carry meaning in the data set, and removing them could actually skew the distribution or the meaning of the observations. In that case, keeping them as is might be the best approach.
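As a minimal sketch of the first two options, continuing with the dataframe df from above (the column names and the choice of median/mode are just assumptions for illustration):
# impute a numeric column with its median so the centre of the distribution is preserved
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())

# impute a categorical column with its most frequent value (mode)
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

# or drop the rows that still have a missing value in a specific column
df = df.dropna(subset=['numeric_column'])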
Look at the descriptive statistics
“What are the descriptive statistics of my data?” Descriptive statistics describe the data in terms of central tendency, measures of spread or dispersion, and the shapes of distributions. Generating the descriptive statistics of a dataset provides the key statistics: count, mean, standard deviation, min, max and the values at the different quartiles.
df.describe()  # returns a summary table of the descriptive statistics for all the numeric variables in the dataframe df
df['column_name'].describe()  # returns the descriptive statistics for the column named in the code
Look at the count values for each variable. This should also tell you whether you are missing data points. Some questions to consider would be:
- Is the variable continuous or categorical?
- Are there any variables for which the descriptive statistics were not calculated? Why? (See the snippet after this list.)
- Is the mean above 0 and of a reasonable value? How does it relate to the median (50%) value, and what does this mean?
- Is the standard deviation above 0 and of a reasonable value? What does this tell you about the variation in the variable data?
- Looking at the max and min, are the data points within a reasonable range? If the range is too wide, could there be outliers?
- Are the quartile values reasonable?
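Note that pandas’ describe() only summarises numeric columns by default, which is one reason a variable may be missing from the summary table. To include every column:
df.describe(include='all')  # also summarises categorical/object columns (count, unique, top, freq)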
Look at the Variation
“What type of variation occurs within my variables?” Variation is the measure of dispersion or spread of the data. It tells you about the tendency of a variable’s values to change from observation to observation. The best way to understand the variation of a variable is to visualize it. You can also apply the .value_counts() method to a dataframe column to check the distribution of the unique values in that column or variable. Use a bar chart to visualize categorical variables or features and a histogram to examine the distribution of continuous variables or features. Other helpful visualizations are scatterplots and boxplots.
import matplotlib.pyplot as plt

df.hist()  # returns a histogram for each numeric column (variable or feature) of the dataframe
plt.bar(x, y)  # returns a bar plot of the categories x measured by the heights y
df['column_name'].value_counts()  # returns a series with counts of unique values in the column
Some of the queries to investigate when you examine the plots are:
- Which values are the most common and why?
- Which values are rare? Why? Does that match expectations?
- Can you see any unusual patterns? What might explain them?
If you notice clusters of similar values with multiple peaks in the visuals, this might suggest that subgroups exist in your data (one way to compare suspected subgroups is sketched after the list below):
- How are the observations within each cluster similar to each other?
- How are the observations in separate clusters different from each other?
- How can you explain or describe the clusters?
- Could the appearance of clusters be misleading?
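As one way to compare suspected subgroups, here is a minimal sketch; it assumes seaborn is available and uses 'value_column' and 'group_column' as placeholder names for a continuous variable and a candidate grouping variable:
import seaborn as sns

# overlay the distribution of the continuous variable for each candidate subgroup
sns.histplot(data=df, x='value_column', hue='group_column', element='step')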
There may be certain unusual data points that lie far away from all the other data points. These data points are called outliers. Look at your boxplots here. The following are some questions you could consider as you decide what to do with the outliers (a small sketch for flagging them follows the list):
- Is this outlier observation a data entry error? Should I replace this observation or remove it entirely? (you will have to explain your decision in the project documentation)
- Is this outlier significant and relevant to the data science question at hand? (try a regression plot with and without the outliers to observe difference)
- Does removing the outlier have a significant effect on the distribution or overall representation of the data?
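One common heuristic for flagging outliers is the 1.5 × IQR rule that boxplot whiskers are based on; the sketch below uses a placeholder column name and is only one possible definition of an outlier:
# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as potential outliers
q1 = df['column_name'].quantile(0.25)
q3 = df['column_name'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['column_name'] < q1 - 1.5 * iqr) | (df['column_name'] > q3 + 1.5 * iqr)]
print(outliers.shape[0], 'potential outliers flagged')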
Look at the correlation
“What type of correlation occurs between my variables?” Correlation describes the relationship between two variables. It tells you how a change in variable A is related to a change in variable B: when variable A increases, does variable B also increase, does it decrease, or does it not change? The correlation coefficient ranges from -1 to 1. As its absolute value approaches 1, the correlation gets stronger. A positive value indicates a positive relationship, while a negative value indicates an inverse or negative relationship. At 0, there is no linear relationship between the variables.
In data science we want our dependent or target variable (the variable or feature of the data set about which we are trying to gain a deeper understanding) to be highly correlated with the other, independent variables or features (the variables or features of the data set which allow us to learn about the target variable). We do not want our independent variables to be correlated with each other. The correlation values between all the variables can be calculated using the .corr() method in pandas. You can also generate a seaborn heatmap for a visual representation of the correlation between variables.
import seaborn as sns

corr = df.corr()  # returns a table of the pairwise correlation coefficients of the columns and assigns it to 'corr'
sns.heatmap(corr, cmap="BuPu", annot=True)  # produces a heatmap showing the strength of correlation between variables in 'corr'
df['column_name1'].corr(df['column_name2'])  # returns the correlation coefficient between column_name1 and column_name2
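If you want to spot pairs of independent variables that are strongly correlated with each other, here is a rough sketch using the 'corr' table above (the 0.8 cutoff is an arbitrary assumption):
# unstack the correlation matrix and keep strongly correlated pairs, ignoring the self-correlations of exactly 1
pairs = corr.unstack()
strong_pairs = pairs[(pairs.abs() > 0.8) & (pairs.abs() < 1.0)].sort_values(ascending=False)
print(strong_pairs)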
You can also use a boxplot to explore relationships by displaying the distribution of a continuous variable grouped by a categorical variable.
df.boxplot(column='target_variable', by='categorical_variable')  # one box of the target variable per category
Viewing the relationship between two continuous variables can be achieved using scatterplots (a small sketch follows the list below). As patterns emerge during data visualization, here are some questions to consider:
- Could this pattern be due to coincidence (i.e. random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?
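A minimal scatterplot sketch for two continuous variables, with placeholder column names:
import matplotlib.pyplot as plt

plt.scatter(df['column_name1'], df['column_name2'], alpha=0.5)  # each point is one observation
plt.xlabel('column_name1')
plt.ylabel('column_name2')
plt.show()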
Recognizing meaningful and significant relationships between variables can help with feature engineering and transformations. You can create new features depending on your end goal, whether that is a strong predictive model, a highly interpretable model, etc.
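As a hedged illustration, two common transformations are sketched below; the column names are made up, and whether these features help depends entirely on the relationships you actually observed:
import numpy as np

df['ratio_feature'] = df['column_name1'] / df['column_name2']  # ratio of two related variables
df['log_feature'] = np.log1p(df['column_name1'])  # log transform to reduce right skew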
Statistical Tests
Consider looking at the individual variables or features and testing whether each variable, or a group of variables, is statistically significant in explaining the variation in the target variable. This can be achieved through quantitative techniques such as hypothesis testing, analysis of variance, point estimates and confidence intervals, and least squares regression.
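As one example of such a test, an ordinary least squares fit with statsmodels reports a coefficient, confidence interval and p-value for each feature; the formula and column names below are placeholders:
import statsmodels.formula.api as smf

# fit target ~ features and inspect coefficients, confidence intervals and p-values
model = smf.ols('target_variable ~ feature_1 + feature_2', data=df).fit()
print(model.summary())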
The EDA process can seem tedious and repetitive, but be mindful that the steps taken here will give you insights into what your data represents, how to effectively use this information to best suit your project, and may even tell you what is not in the data (and what you might need more of). I hope this gives you a good understanding of what you are expected to do at the EDA step of the data science process.
The main resource used in writing this blog came from here. Please feel free to leave any feedback that might help correct errors or suggest additional content.