Chapter 12 Data Analysis Workflow
In previous lessons, we went over the basics of data analysis. We went from how to ask data science questions and finding data to inferential and predictive data analysis. Since as a data scientist you may end up working on different projects at the same time. To prevent forgetting important steps that you learned in this course it’s crucial that you follow all the steps. In this lesson, we’re going to talk about a workflow for your data analysis projects.
12.0.1 What are the steps?
To begin with, these are the main steps you have to follow in your data analysis question:
- Define the question
- Define the ideal dataset
- Determine what data you can access and obtain the data
- Clean the data
- Exploratory data analysis
- Statistical analysis
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code
12.0.2 An example
Let’s start with a hypothetical example and go through each step in our example. Imagine we’re interested to automatically detect emails that are SPAM from the ones that are not. So our general question is “can I automatically detect emails that are SPAM from the ones that are not?”
12.0.2.1 Define the question
However, this question is not written entirely in data science terms. We have to make sure our question can be measured and quantified with data. We have to make our question more concrete. So a better way to ask the question is this: “Can I use quantitative characteristics of the emails to classify them as SPAM?”
12.0.2.2 Define the ideal dataset
The second step in data analysis is to imagine an ideal dataset for our analysis. You don’t have to be practical in your thinking. Just imagine what type of data would be best for your analysis. In an ideal world, you would want a dataset of all emails received through major email providers such as Gmail or Yahoo and whether the email was flagged as SPAM or not.
12.0.2.3 Determine what data you can access and obtain the data
You may be lucky to have such dataset somehow. However, it’s unlikely that due to privacy reasons you can access other people’s emails. Even if you can, the data will be gazillions of bytes and it won’t be practical to analyze such a large dataset. So our best bet is to see if there’s any dataset online.
One of the best datasets for analyzing SPAM data is the spam data in the kernlab
package in R. The spam dataset is collected at Hewlett-Packard Labs and classifies 4601 e-mails as spam or non-spam. Additionally, there are 57 variables indicating the frequency of certain words and characters in the e-mail. Let’s install the package first.
library(kernlab)
data(spam)
12.0.2.4 Clean the data
In most cases, your data doesn’t come clean. In fact, it may come from different sources with different standards. Therefore, you should first tidy up the data. Lucky for us, the spam dataset in the kernlab
package is already tidy, so we can skip this step. However, if we’re doing predictive analysis, it’s better to have a training and a test set. The code below creates the train and test sets.
12.0.2.5 Exploratory data analysis
We learned about the following steps for doing exploratory data analysis:
- Look at summaries of the data
- Check for missing data
- Create exploratory plots
- Perform exploratory analyses
First, we look at column names.
And the first few rows of our training data.
Let’s see how many of the emails are flagged as SPAM and how many are not.
We can also plot the average length of capital letters in the text of the email for SPAM and non-SPAM emails. The variable in the data that measures the average length of capital letters in the text is called capitalAve
.
To better distinguish the difference in capitalAve
for SPAM and non-SPAM emails, we can use the log scale. We can convert the variable to log. Be careful that if you have zeros in your data (which you may have), by transforming the variable into log you will run into trouble (log of zero is infinity). To avoid this, we can add 1 to the variable.
We can see if there is any relationship between some of the predictors such as free
, original
, and receive
.
12.0.2.6 Statistical analysis
The type of analysis that we need is predictive analysis since, at the end of the day, our algorithm should predict whether an email is SPAM or not. Note that:
- Exact methods depend on the question of interest
- Transformations/processing should be accounted for when necessary
- Measures of uncertainty should be reported
We can use the following code to perform our prediction analysis using the training set. Note that the function cv.glm()
calculates the estimated K-fold cross-validation prediction error for generalized linear models (glms). The code chunk below finds the variable (among all of our variables) that has the lowest prediction error in finding the probability of being SPAM.
We can then use the test set to get a measure of uncertainty (or accuracy) of the model. For each observation in the test set, we predict whether the observation is a SPAM or not. Note that we already know whether the observation is a SPAM or not but we want to test our model’s ability. Once we find the predicted values, we can use them along with the actual values (whether the observations are indeed SPAM or not) and create an error matrix. The error matrix shows how many of the SPAM emails we thought were SPAM and how many we didn’t. The same for non-SPAM emails. The line table(predictedSpam,testSpam$type)
shows that there are 61 non-SPAM emails that our model predicted as SPAM and 458 SPAM emails that our model predicted as non-SPAM. The rest of the observations were predicted correctly. The last line of the code calculates the prediction error.
12.0.2.7 Interpret results
Once you have done the preliminary analysis, you should interpret the results so others know what conclusions can be made from your analysis. You should be careful not to confuse the following words:
- Describes (only if you observe a phenomenon without doing any inferential or predictive analysis)
- Correlates with/associated with (only if you look at the association between variables without any causal interpretation)
- Leads to/causes (only if you have performed causal inference analysis)
- Predicts (only if you have performed predictive analysis)
Make sure you give enough explanation to your analysis.
- Give an explanation as to what your numbers are telling (and not telling)
- If you do regression analysis, interpret the coefficients
- Interpret measures of uncertainty
In our example, here are some of the interpretations we can give.
- The fraction of characters that are dollar signs can be used to predict if an email is Spam
- Anything with more than 6.6% dollar signs is classified as Spam
- More dollar signs always means more Spam under our prediction
- Our test set error rate was 22.4%
12.0.2.8 Challenge results
A good analyst is a good critique of his/her work since the analyst knows the data and the methods best. Your self-critique and self-challenge should start from the start: the question. Challenge
- Question: Start from your question. Is your question asked properly?
- Data source: Challenge your data and make sure your data is good enough for your question.
- Processing: Check whether the data cleaning and processing are done properly.
- Analysis: Challenge your methods; is it the best method you could use given your question and your data.
- Conclusions: Check whether your conclusions are drawn properly.
- Measures of uncertainty: There are various ways to measure the uncertainty of your model. Check whether you have used the best measure.
- Choices of terms to include in models: You always include specific variables in your model. Make sure you have included the ones that make sense regardless of whether they make your results look good.
- Finally, think of potential alternative analyses: Recognize that there might be alternative approaches to your question such as the use of data, method, etc. Recognizing them will show some honesty on your part and will pave the way for future analyses.
12.0.2.9 Synthesize/write-up results
Once you’re done with the analysis and checking the credibility of your analysis, start writing up your results. These are some of the important steps:
- Lead with the question
- Summarize the analyses into the story
- Don’t include every analysis, include it
- If it is needed for the story
- If it is needed to address a challenge
- Order analyses according to the story, rather than chronologically
- Include “pretty” figures that contribute to the story
In our example we should:
- Lead with the question
- Can you use quantitative characteristics of the emails to classify them as SPAM/HAM?
- Describe the approach
- The source of our SPAM data and how we created training/test sets
- Explored relationships
- Choose logistic model on the training set by cross-validation
- Applied to test, 78% test set accuracy
- Interpret results
- Number of dollar signs seems reasonable, e.g. “Make CASH from home \$ \$ \$ \$!”
- Challenge results
- 78% isn’t that great
- I could use more variables
- Why logistic regression?
12.0.2.10 Create reproducible code
As you learned in previous lessons, make sure you document every step. This is important for two reasons: for you in future and for others to redo your analysis. Using Rmarkdown is a good way to accompany your analysis with good documentation. So it’s important that:
- Files are properly named.
- There is some explanation of the data.
- Each code file has some description as to what it does.
- Wherever you should add comments for important code chunks within your code files.