Chapter 51 Translating Questions to Data Science Questions

The approach to data analysis that we prefer is “problem forward, not solution backward”. The main point of this approach is to start with the question that you want to ask before you look at the data or the analysis.

In some cases, the question that you want to answer will be a question driven by your curiosity. For example, you may be interested in a question about your fitness. You could collect data using a Fitbit and the MyFitnessPal app. Or you may have a question about what kind of songs you like best using data from your Spotify profile. You might also be interested in where the potholes are most common in your city. You could collect information from your city’s open data website.

Another really common situation is that someone else is coming up with the question. When you are working as a data scientist this might be the marketing team, an executive, or an engineer who has a question that they would like to answer with the data. You might be asked to categorize photos on a website like Airbnb. They might bring you the question, the data, or both. Part of your job as a data scientist is to translate general questions into data science questions.

51.1 Getting Specific with Questions

The first step in translating a general question into a data science question is to make it as concrete as possible. For example here are some generic questions you might be interested in:

  1. When I run more do I lose weight?
  2. Are customers more likely to click on ads with puppies?
  3. Do I need to take an umbrella with me when I leave the house today?

These questions are interesting but they aren’t very specific. This is how most good data science projects start. To make a question more concrete you need to think about the data you would use to answer the question. This could either be data that you have or data that you think you could find.

For each of these questions you need to ask these questions:

  • What or who am I trying to understand with data?
  • What measurements do I have on those people or objects that help me answer the question?
  • How do the data I have limit the type of question I can answer?
  • What is the type of data science question we are trying to answer?

What or who am I trying to understand with data?

The first question is focused on figuring out who or what you are trying to study. In the world of statistics this is sometimes called the “population” you are trying to study. When you ask a question it is best to be as specific as possible about what you are trying to study. A good way to be specific is to imagine the individual people or objects you are going to measure data on. Realistically, what is the group that you have or will collect data from?

What measurements do I have on those people or objects that help me answer the question?

The second question focuses on figuring out which variables are measured or will be measured in the data that you have. We have discussed previously about all the different potential data types you might have, including standard quantitative or qualitative data, text, images, or videos can be data. When answering this question it helps to be specific. For example, unstructured text from a social media post may not be helpful, but words and labels for the words in that post may be the data that you are looking for.

How do the data I have limit the type of question I can answer?

The third question is critical for being careful in a data analysis. When you use the problem forward approach, you might start with a general question. But it might not be possible to answer that question with data we have. For example, it may be difficult to study directly the way that cigarette smoke affects children since most children don’t smoke. You might have to change your question to studying the way that second-hand smoking affects children or the way that parents smoking habits affect children. A key part of translating a general question into a data science question is identifying these limitations.

What is the type of data science question we are trying to answer?

The fourth question is focused on figuring out what type of analysis you are doing. We introduced the flowchart for defining the question type in an earlier lesson. The key questions to ask are how the data are summarized, what are the goals you are trying to achieve, and what does success for your analysis look like? One of the most common errors that people make in doing a data analysis is to answer the wrong type of data science question.

51.2 Translating Our Questions

Let’s try this approach out on a couple of made up examples and a real example to help you understand how to translate general questions into data science questions.

When I run more do I lose weight?

  • What or who am I trying to understand with data? I’m trying to understand something about only myself, not about others.
  • What measurements do I have on those people or objects that help me answer the question? I have data on how many steps I take with Fitbit and measure my weight with a scale.
  • How do the data I have limit the type of question I can answer? I only have data on me. I only have measurements on my weight every day and I need to summarize my Fitbit data to understand my runs, but won’t have information on whether I ran up and down hills or any information on my diet.
  • What is the type of data science question we are trying to answer? In this case we are looking for a relationship between two variables that we measured. We’re only using the data we have so it is an exploratory analysis.

When I run more do I lose weight?

Are customers more likely to click on ads with puppies?

  • What or who am I trying to understand with data? I’m trying to understand something about customers. I would need to figure out which customers. Customers trying to buy motorbikes might be different than customers going to Pets.com. We’d have to ask further questions to figure out which customers we are talking about.
  • What measurements do I have on those people or objects that help me answer the question? Suppose we have all the data from a set of customers who visited a website for buying dog food for a single day. Some of the customers saw a puppy ad and some didn’t. We also have data on how much dog food they bought.
  • How do the data I have limit the type of question I can answer? I only have data on a single website and only on a single day. So I might not be able to say things about other websites or other days.
  • What is the type of data science question we are trying to answer? In this case we are looking for a relationship between two variables and trying say something about all the customers for a website. So this is an inferential analysis.

Are customers more likely to click on ads with puppies?

Do I need to take an umbrella with me when I leave the house today?

  • What or who am I trying to understand with data? I’m trying to predict the weather in my hometown on a particular day so I know whether to take an umbrella.
  • What measurements do I have on those people or objects that help me answer the question? I could take predictions from different weather services, look out the window and see if it is cloudy, or go out and feel if it is humid outside. To build my prediction I could collect these measurements for a year and also record whether I needed an umbrella that day.
  • How do the data I have limit the type of question I can answer? I only have data on my hometown, I’ve only collected data from a few weather prediction services, and the data are only collected over one year. So it might be hard to say things for other people, other places, or other times.
  • What is the type of data science question we are trying to answer? In this case we are looking to use historical data to predict something about a single day. So this is a prediction analysis.

Do I need to take an umbrella with me when I leave the house today?

51.2.1 A real example

Let’s practice translating questions to data science questions through an example. We briefly mentioned this example in Forming Questions section. The analysis is based on data scientist Dimiter Toshkov’s blog.

Dimiter Toshkov’s IMDB Blog

The blog author wanted to optimize his movie watching time by not watching movies which he would not enjoy. His model used a combination of metadata (data about data) such as movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. What or who am I trying to understand with data? We are trying to predict the author’s movie score based on the inputs from IMDb:

Will I like a movie? I will base it on a new score. The new score combines the official rating but also the other input variables as well.

What measurements do I have on those people or objects that help me answer the question? Many data we generate has what is called metadata associated with it, information about the data itself. When you watch a movie on a streaming platform, the bulk of the data will be the video file, but there are also other characteristics surrounding that file. The genre, country of origin, language, duration, and year are all metadata which hold a huge amount of information for the budding data scientist. These are all input variables into our model, but we also require an output variable; what we are trying to predict. We need a sample of data which as both the input and output data, where the latter is the personal score of the movie. Once we have these 2 parts we can use them to train out model. If the model is working well, it means we can take a new movie, unseen by us, and predict what we will think the score would be if we watched it. That is the billion dollar question that the movie streaming industry is trying to solve today!

How do the data I have limit the type of question I can answer?

The main limitation here is that the user needs to rate a number of movies before we can even start this process. The other limitation is that we are only trying to predict the personal score. This is okay because we aren’t trying to predict any of the other variables (for example, we don’t care about predicting the year a movie came out).

What is the type of data science models are we using?

This type of analysis can be broadly labelled as linear regression, where we think there is a steady and constant relationship between 1 or more variables, and the ideal solution is a straight line through it. This is usually the first port of call for most data scientists as it accomplishes 2 things:

  • It is quick and easy to implement
  • It will show if a simple relationship is sufficient to answer the problem at hand

The simple approach

To further dive into this problem, the author subsequently adds more and more complexity to his approach.

He initially just started by comparing IMDb ratings to his own, this referred to as a univariate or simple linear regression. The next step to improve the prediction is to add more variables to the model. If you are using many variables (year, genre, director), to predict your own score this is called multiple linear regression.

Multiple linear regression Analysis

The author is also very aware that this analysis introduces selection bias, more specifically survivorship bias. Any movie made before the 1990s that he has watched has likely survived the test of time, so older movies in this sample appear to have higher scores, but it’s unlikely that a random movie from the 1970s would be any better than a random movie from the 2010s.

51.2.2 Summary

In conclusion, the author has demonstrated how you can use regression to try to build a model to predict an outcome, in this case, movie ratings. Another critical point to extract from this example is how valuable it can be to have the thought process of an analysis written out, and available for others to see and give feedback on. The author wrote this up in a blog post for easy accessibility.