Chapter 25 Finding Data

Now that we know what data are, how to work with them in Posit Cloud, and how to get them into Posit Cloud, if you have a question you want to answer with data, where do you find data to work with? In some cases you’ll have to create your own data set but in other cases you can find data that others have already generated and start from there! In this lesson, we’ll discuss the difference between public and private data and direct you to a number of resources where you can find helpful data sets for data science projects!

25.0.1 Public versus Private Data

Before discussing where to find data, we need to know the difference between private and public data. Private data are datasets to which a limited number of people or groups have access. There are many reasons why a dataset may remain private. If the dataset contains personally identifiable information (addresses, phone numbers, etc.), then it may be kept private for privacy reasons. Or, if the dataset has been generated by a company, the company may hold onto it to keep an advantage over its competitors. Often, you will not have access to private data (although sometimes you can request access or pay for it). But that’s OK because public data are, in general, freely available. Unlike private data generated by companies, data generated by governments are often made public and are available to anyone for use.

25.0.2 Publicly-available data

As a data scientist, there’s a good chance you may work with private company data as part of your job. However, before you have that job, it’s great practice to work with datasets that are publicly-available and waiting for you to use them! In this section, we’ll direct you to sources of different datasets where you can find a dataset of interest to you and get working with it!

25.0.2.1 Open Datasets

There are a number of companies dedicated to compiling datasets into a central location and making these data easy to access. Two of the most popular are Kaggle and data.world. On each site, you’ll have to register for a free account. After registering you’ll have access to many different types of datasets! Explore what’s available there and then start playing around with a dataset that interests you!

Kaggle and data.world are great places to look for datasets

Publicly-available datasets are also curated at Awesome Public Datasets, so feel free to look around there as well!

25.0.2.2 Government Data

Government data can provide a wealth of information to a data scientist. Government datasets cover topics from education and student loan debt to climate and weather. They include business and finance datasets as well as law and agriculture data.

Here we provide lists of governments’ open data just to give you an idea of how many datasets are out there. These lists cover only a tiny portion of the city and federal government data available for you to use. So, if there’s a place whose data you want to work with, search Google for “open data” from that place!

25.0.2.2.1 US Data

If you’re interested in working with government data from the United States, data.gov is the place to get datasets that have been released by the United States government. Here you can find hundreds of thousands of datasets covering many topics, so data.gov is a great place to start!

data.gov has hundreds of thousands of datasets

25.0.2.2.2 Census Data

The US Census Bureau is responsible for collecting data about the people and economy of the United States, including the population census conducted every ten years. These data are accessible online, and they can be worked with in R using the very helpful tidycensus package!

The US Census provides data about the US people and economy
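
As a rough sketch of what working with tidycensus can look like, the function below requests the total population of each state from the 2010 decennial census. It assumes you have requested a free Census API key and registered it with census_api_key(). The helper name get_state_population and the variable code "P001001" (taken here to be total population in the 2010 summary file) are choices made for this illustration, not something prescribed by this lesson.

```r
# install.packages("tidycensus")  # run once if the package is not installed

# Hypothetical helper (not part of tidycensus): total population by state
# from the 2010 decennial census.
# Assumes a Census API key was registered first, for example with
# tidycensus::census_api_key("YOUR_KEY_HERE").
get_state_population <- function() {
  tidycensus::get_decennial(
    geography = "state",
    variables = "P001001",  # assumed code for total population in 2010
    year = 2010
  )
}

# Needs an internet connection and a registered key:
# state_pop <- get_state_population()
# head(state_pop)
```

Variable codes differ between census years, so it is worth checking the codes for the year you want before reusing this pattern.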

25.0.2.2.3 Open City Data

The US’s federal government is of course not the only place to obtain government data. More and more cities across the world are starting to release open data at the city level. A few of these cities and their respective open city data links are provided below:

Additionally, to see a summary of what datasets are available from cities across the USA, check out the US City Open Data Census from the Sunlight Foundation.

US City Open Data Census

25.0.2.2.4 Global Data

Beyond the United States, many other countries provide access to open data, and more of them release new and updated datasets each year. These include (but are not limited to!) datasets from many countries within Africa and Latin America as well as Canada, Ireland, Japan, Taiwan, and the UK.

Additionally, to see what datasets are available globally, the Global Open Data Index is a great place to start!

Global Open Data Index

25.0.2.3 APIs

We’ve mentioned APIs previously, but it’s important to include them here as well. APIs provide access to data you’re interested in obtaining from websites. There are APIs for so many of the websites you access regularly. Google, Twitter, Facebook, and GitHub (among many others) all have APIs that you can access to obtain the dataset you’re interested in working with!
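
To make this a little more concrete, here is a minimal sketch of requesting data from an API in R, using GitHub’s public REST API as the example. The helper names github_repo_url and fetch_repo_info are made up for this illustration, and the sketch assumes the jsonlite package is installed. Unauthenticated requests are rate limited, so treat this as a starting point rather than a complete client.

```r
# Build the URL of the GitHub REST endpoint that describes one repository.
# (Hypothetical helper written for this example.)
github_repo_url <- function(owner, repo) {
  paste0("https://api.github.com/repos/", owner, "/", repo)
}

# Request the endpoint and parse the JSON response into an R list.
# Needs an internet connection and the jsonlite package.
fetch_repo_info <- function(owner, repo) {
  jsonlite::fromJSON(github_repo_url(owner, repo))
}

# The URL we would request for the readr package's repository:
github_repo_url("tidyverse", "readr")

# Run interactively to fetch and inspect the data:
# info <- fetch_repo_info("tidyverse", "readr")
# info$stargazers_count
```

Most APIs follow the same pattern: build a URL that describes what you want, request it, and parse the response (often JSON) into an R object.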

25.0.2.4 Company Data

Finally, we mentioned above that companies often keep their data private for a number of reasons, and that’s OK! When companies do release their data, the data can often be found on websites like Kaggle and data.world. If there is a company whose data you’re interested in, you can search for the company’s data on either of these two data repositories or on the company’s website directly to see if they provide the data there or if you can scrape their website to obtain the information you need! There may not always be a way to get the exact dataset you’re looking for, but you can often find something that will work!

25.0.3 Data You Already Have

Sometimes, it’s not about finding data someone else has already collected on a bunch of individuals in a population. Rather, getting data sometimes just involves taking a look at things you already have but just haven’t yet realized are data you can analyze.

For example, MP3 files you’ve bought and have on your computer are data! They can be analyzed using the tuneR and seewave packages. You could use this type of data to categorize the music in your library or to build a model that uses data on songs that were already big hits to determine which qualities of a song predict that it may become one.

Alternatively, you could scrape the websites you frequently visit (using rvest!) to answer interesting questions. For example, if you were interested in writing a really great title for the newest video of your pet doing something super cute, you might scrape the web for titles of pet videos that have recently gone viral. You could then craft the perfect title to use when you upload your pet video. Granted, this may not be an example answering the most important type of data science question; however, writing up how you did this would make a really great blog post, which is something we’ll discuss in a lesson in a few courses!
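
The scraping idea above can be sketched with rvest as follows. To keep the example self-contained, it parses a tiny made-up page instead of a real website; the HTML, the video-title class, and the titles themselves are all invented for illustration. On a real site you would pass the page’s URL to read_html() and choose a selector that matches that site’s markup.

```r
library(rvest)

# A tiny made-up page standing in for a real website (hypothetical content).
sample_html <- '<html><body>
  <h2 class="video-title">Cat discovers snow</h2>
  <h2 class="video-title">Puppy meets mirror</h2>
  <p>Some other text we do not want.</p>
</body></html>'

# Parse the HTML, select the title elements, and pull out their text.
page <- read_html(sample_html)
title_nodes <- html_elements(page, "h2.video-title")
video_titles <- html_text2(title_nodes)

video_titles
```

The selector "h2.video-title" picks out only the headings with that class, so the unrelated paragraph is left behind.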

Finally, social networking websites like Facebook and Twitter collect a lot of data about you as an individual. You have access to this information through the websites’ APIs, but you can also download your data directly. After news of the Facebook and Cambridge Analytica data breach, many articles were published about how to download your Facebook data. These data can be downloaded and then analyzed to look at trends over time. How many pictures have you uploaded and been tagged in over time, and has that changed? What topics do you most frequently discuss in Messenger? Or, maybe you’re interested in mapping the places you’ve been based on where you’ve checked in. All of these questions can be answered with data that are already there, just waiting for you to work with them!

In all, sometimes getting the data just means recognizing the data you already have at your disposal, figuring out how to get the data into a format you can use, and then working with the data using the tools you have!

25.1 Where to get open source data

Often, for a particular project, you will have data given to you to work with. But sometimes you may want to go out and find data to work with for a particular question. In this section, we will give you some places you can go to get open source data that are freely available for you to use! Just make sure that you attribute where you originally got the data from.

25.2 dslabs

The dslabs package has 26 datasets already set up in R for you. To use the datasets in this package, follow steps like these:

# Install the package
install.packages("dslabs")
library(dslabs)

For the admissions dataset, for example, you first need to load it with the data() function, and then you can call it as an object.

data(admissions)
admissions
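
If you are not sure which datasets a package offers, one way to find out is to ask R for the list. The sketch below defines a small helper, list_package_datasets (a name made up for this example), and demonstrates it on base R’s built-in datasets package so that it runs even before dslabs is installed.

```r
# List the names of the datasets bundled with an installed package.
# (Hypothetical helper written for this example.)
list_package_datasets <- function(pkg) {
  found <- utils::data(package = pkg)
  found$results[, "Item"]
}

# Base R's own "datasets" package is always available:
head(list_package_datasets("datasets"))

# With dslabs installed, the same call lists its datasets:
# list_package_datasets("dslabs")

# After data(admissions), str() shows the column names and types:
# str(admissions)
```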

25.3 FiveThirtyEight Data

FiveThirtyEight makes a variety of its datasets available as CSVs.

Go to the FiveThirtyEight data repository on GitHub. When you find the dataset you are interested in, click on that folder and then click on the data file you want. Then click Raw, and copy the URL of the page that opens. For example, for the airline-safety dataset, the URL would look like this:

https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv

Then you can use that URL in a read_csv function like this:

airline_safety <- readr::read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv")
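
If you plan to read several FiveThirtyEight datasets, you could build the raw URL in code instead of copying it by hand each time. The helper below is a hypothetical convenience written for this example. It assumes the file sits at folder/file on the master branch of the repository, as in the airline-safety URL above, and that the file shares the folder’s name unless you say otherwise. Not every dataset follows that naming pattern, so check the folder on GitHub first.

```r
# Hypothetical helper: build the raw CSV URL for a file in the
# fivethirtyeight/data GitHub repository.
# Assumes the file lives at <folder>/<file> on the "master" branch.
fivethirtyeight_url <- function(folder, file = paste0(folder, ".csv")) {
  paste0(
    "https://raw.githubusercontent.com/fivethirtyeight/data/master/",
    folder, "/", file
  )
}

airline_url <- fivethirtyeight_url("airline-safety")
airline_url

# Needs an internet connection and the readr package:
# airline_safety <- readr::read_csv(airline_url)
```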

25.4 Kaggle

Kaggle has a variety of datasets that are generally available as CSVs. You’ll have to create an account before you can download anything, but you can sign in with a Google account.

To download a dataset, browse what’s available and, when you find one you like, download it. The CSV will arrive inside a zip file. Upload this zip file to your RStudio server using the Upload button.

Then you will see a CSV included in the uploaded files, and you can use readr::read_csv() to read in the file.

For example, for the Data Science Job Salaries dataset, you can click Download and then upload the resulting zip file. Then you can read in the file like:

readr::read_csv("ds_salaries.csv")
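
If you are working somewhere the zip file is not extracted for you (for example, on your own computer), a sketch like the following can pull the CSV out of the zip and read it. The helper name read_csv_from_zip and the file name ds_salaries.zip are assumptions made for this illustration; use whatever name your downloaded file actually has.

```r
# Hypothetical helper: extract one CSV from a downloaded zip and read it.
read_csv_from_zip <- function(zip_path, csv_name) {
  # Stop early with an error if the zip file is not where we expect.
  stopifnot(file.exists(zip_path))
  # Extract only the requested file into a temporary folder.
  extracted <- utils::unzip(zip_path, files = csv_name, exdir = tempdir())
  readr::read_csv(extracted)
}

# See which files a zip contains before extracting anything:
# utils::unzip("ds_salaries.zip", list = TRUE)

# Read the CSV from the zip (both file names here are assumptions):
# salaries <- read_csv_from_zip("ds_salaries.zip", "ds_salaries.csv")
```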

25.5 More places to get data

There is an endless number of places on the internet to get data. The sources above are great ones to get started with. But as you get more comfortable with data analysis, you may want to look at some of these other sources. Some of them will require a bit more work to get the data, but they are just as available for you to use.