Sources of open data for statistics, data science, and machine learning

Rohit Farmer

Sources of open data for statistics, data science, and machine learning

Even though the default data included in R and other languages are carefully selected and valuable. They are old, limited in sample size, and may not be more than toy examples in the current context.

Author

Rohit Farmer

Published

October 25, 2022

Updates

2022-10-26 Added sections for Kaggle and other platforms
2022-10-25 Initial uncomplete post and Kaggle notebook

Introduction

R and Python, the two most popular data science and machine learning programming languages, come with a default data set for demonstration and educational purposes. Moreover, many popular data science libraries such as tidyverse, lme4, nlme, MASS, survival, Bioconductor, and sklearn, amongst others, also contain example datasets for unit testing and demonstration. However, even though the included data are carefully selected, most of them are old (from the 1970s and ’80s) and hence limited in sample size (due to the limitation in the then available computing power) and may not be more than a toy example in the current context. Therefore, in a learning path after utilizing the default datasets for initial testing of a statistical function or a machine learning model, it’s both desired and recommended to practice on a real-life dataset. Working on a real-life dataset that is apt in the current context of data science and computing resources would not only teach the desired statistical/machine learning technique but also expose the learner to challenges that are usually not associated with toy datasets—for example, class imbalance, missing values, wrongly labeled datatypes, statistical noise, etc.

For a real-life dataset, I recommend using open data released by government agencies and other non-government organizations as part of their openness in operations. These datasets are not only of adequate size but also represent what is happening around us in a field of interest. Many countries around the world release data to inform citizens about their operations and fair practices. However, I do not think, other than the established data analytics-based companies and research groups, ordinary citizens ever look into such resources. I bet many would not even know that their government agencies are making ample amounts of data available for scrutiny. Therefore, in my opinion, budding data scientists should take it upon themselves to utilize these datasets in place of toy examples to not only justify the existence of such a resource but also, through analysis, gain first-hand insight and transfer it to their friends, family, and broader audience. Such a practice by amateur data scientists may have far-reaching implications than what I can convey here.

In the sections below, for completeness, I will briefly discuss how to access default datasets available in R and then move on to other resources, including open government data mentioned earlier.

Tip

Please check out this notebook on Kaggle to work with the code presented in this post https://www.kaggle.com/code/rohitfarmer/007-open-data-for-datascience. Some of the code is changed slightly to match Kaggle’s notebook environment.

Code

suppressMessages(library(DT))
suppressMessages(library(tidyverse))
suppressMessages(library(kableExtra))

Default datasets in R

In R (v4.1.3), there are 104 datasets for various statistical and machine-learning tasks. The commands in the cell below list all the datasets available by default (Table 1) and across all the installed packages, respectively. This article summarizes some of R’s popular datasets, namely mtcars, iris, etc.

# Default datasets
data()

# Datasets across all the installed packages
data(package = .packages(all.available = TRUE))

Code

dat <- data()
dat <- as_tibble(dat$results) %>% dplyr::select(-LibPath) %>%
  dplyr::filter(Package == "datasets")
knitr::kable(dat) %>%
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width = "100%", height = "300px")

Table 1: Default datasets in R
Package	Item	Title
datasets	AirPassengers	Monthly Airline Passenger Numbers 1949-1960
datasets	BJsales	Sales Data with Leading Indicator
datasets	BJsales.lead (BJsales)	Sales Data with Leading Indicator
datasets	BOD	Biochemical Oxygen Demand
datasets	CO2	Carbon Dioxide Uptake in Grass Plants
datasets	ChickWeight	Weight versus age of chicks on different diets
datasets	DNase	Elisa assay of DNase
datasets	EuStockMarkets	Daily Closing Prices of Major European Stock Indices, 1991-1998
datasets	Formaldehyde	Determination of Formaldehyde
datasets	HairEyeColor	Hair and Eye Color of Statistics Students
datasets	Harman23.cor	Harman Example 2.3
datasets	Harman74.cor	Harman Example 7.4
datasets	Indometh	Pharmacokinetics of Indomethacin
datasets	InsectSprays	Effectiveness of Insect Sprays
datasets	JohnsonJohnson	Quarterly Earnings per Johnson & Johnson Share
datasets	LakeHuron	Level of Lake Huron 1875-1972
datasets	LifeCycleSavings	Intercountry Life-Cycle Savings Data
datasets	Loblolly	Growth of Loblolly pine trees
datasets	Nile	Flow of the River Nile
datasets	Orange	Growth of Orange Trees
datasets	OrchardSprays	Potency of Orchard Sprays
datasets	PlantGrowth	Results from an Experiment on Plant Growth
datasets	Puromycin	Reaction Velocity of an Enzymatic Reaction
datasets	Seatbelts	Road Casualties in Great Britain 1969-84
datasets	Theoph	Pharmacokinetics of Theophylline
datasets	Titanic	Survival of passengers on the Titanic
datasets	ToothGrowth	The Effect of Vitamin C on Tooth Growth in Guinea Pigs
datasets	UCBAdmissions	Student Admissions at UC Berkeley
datasets	UKDriverDeaths	Road Casualties in Great Britain 1969-84
datasets	UKgas	UK Quarterly Gas Consumption
datasets	USAccDeaths	Accidental Deaths in the US 1973-1978
datasets	USArrests	Violent Crime Rates by US State
datasets	USJudgeRatings	Lawyers' Ratings of State Judges in the US Superior Court
datasets	USPersonalExpenditure	Personal Expenditure Data
datasets	UScitiesD	Distances Between European Cities and Between US Cities
datasets	VADeaths	Death Rates in Virginia (1940)
datasets	WWWusage	Internet Usage per Minute
datasets	WorldPhones	The World's Telephones
datasets	ability.cov	Ability and Intelligence Tests
datasets	airmiles	Passenger Miles on Commercial US Airlines, 1937-1960
datasets	airquality	New York Air Quality Measurements
datasets	anscombe	Anscombe's Quartet of 'Identical' Simple Linear Regressions
datasets	attenu	The Joyner-Boore Attenuation Data
datasets	attitude	The Chatterjee-Price Attitude Data
datasets	austres	Quarterly Time Series of the Number of Australian Residents
datasets	beaver1 (beavers)	Body Temperature Series of Two Beavers
datasets	beaver2 (beavers)	Body Temperature Series of Two Beavers
datasets	cars	Speed and Stopping Distances of Cars
datasets	chickwts	Chicken Weights by Feed Type
datasets	co2	Mauna Loa Atmospheric CO2 Concentration
datasets	crimtab	Student's 3000 Criminals Data
datasets	discoveries	Yearly Numbers of Important Discoveries
datasets	esoph	Smoking, Alcohol and (O)esophageal Cancer
datasets	euro	Conversion Rates of Euro Currencies
datasets	euro.cross (euro)	Conversion Rates of Euro Currencies
datasets	eurodist	Distances Between European Cities and Between US Cities
datasets	faithful	Old Faithful Geyser Data
datasets	fdeaths (UKLungDeaths)	Monthly Deaths from Lung Diseases in the UK
datasets	freeny	Freeny's Revenue Data
datasets	freeny.x (freeny)	Freeny's Revenue Data
datasets	freeny.y (freeny)	Freeny's Revenue Data
datasets	infert	Infertility after Spontaneous and Induced Abortion
datasets	iris	Edgar Anderson's Iris Data
datasets	iris3	Edgar Anderson's Iris Data
datasets	islands	Areas of the World's Major Landmasses
datasets	ldeaths (UKLungDeaths)	Monthly Deaths from Lung Diseases in the UK
datasets	lh	Luteinizing Hormone in Blood Samples
datasets	longley	Longley's Economic Regression Data
datasets	lynx	Annual Canadian Lynx trappings 1821-1934
datasets	mdeaths (UKLungDeaths)	Monthly Deaths from Lung Diseases in the UK
datasets	morley	Michelson Speed of Light Data
datasets	mtcars	Motor Trend Car Road Tests
datasets	nhtemp	Average Yearly Temperatures in New Haven
datasets	nottem	Average Monthly Temperatures at Nottingham, 1920-1939
datasets	npk	Classical N, P, K Factorial Experiment
datasets	occupationalStatus	Occupational Status of Fathers and their Sons
datasets	precip	Annual Precipitation in US Cities
datasets	presidents	Quarterly Approval Ratings of US Presidents
datasets	pressure	Vapor Pressure of Mercury as a Function of Temperature
datasets	quakes	Locations of Earthquakes off Fiji
datasets	randu	Random Numbers from Congruential Generator RANDU
datasets	rivers	Lengths of Major North American Rivers
datasets	rock	Measurements on Petroleum Rock Samples
datasets	sleep	Student's Sleep Data
datasets	stack.loss (stackloss)	Brownlee's Stack Loss Plant Data
datasets	stack.x (stackloss)	Brownlee's Stack Loss Plant Data
datasets	stackloss	Brownlee's Stack Loss Plant Data
datasets	state.abb (state)	US State Facts and Figures
datasets	state.area (state)	US State Facts and Figures
datasets	state.center (state)	US State Facts and Figures
datasets	state.division (state)	US State Facts and Figures
datasets	state.name (state)	US State Facts and Figures
datasets	state.region (state)	US State Facts and Figures
datasets	state.x77 (state)	US State Facts and Figures
datasets	sunspot.month	Monthly Sunspot Data, from 1749 to "Present"
datasets	sunspot.year	Yearly Sunspot Data, 1700-1988
datasets	sunspots	Monthly Sunspot Numbers, 1749-1983
datasets	swiss	Swiss Fertility and Socioeconomic Indicators (1888) Data
datasets	treering	Yearly Treering Data, -6000-1979
datasets	trees	Diameter, Height and Volume for Black Cherry Trees
datasets	uspop	Populations Recorded by the US Census
datasets	volcano	Topographic Information on Auckland's Maunga Whau Volcano
datasets	warpbreaks	The Number of Breaks in Yarn during Weaving
datasets	women	Average Heights and Weights for American Women

Open government data

As I mentioned in the introduction, many governments worldwide release data for transparency and accountability; for example, https://data.gov, the US federal government’s open data site. Data.gov also maintains a list of websites at https://data.gov/open-gov/ pointing to data repositories related to US cities and counties, US states, and international countries and regions. The primary aim of these repositories is to publish information online as open data using standardized, machine-readable data formats with their metadata.

Depending upon the type of data requested, most of the data can be downloaded in multiple machine-readable formats either via the export option on the website or programmatically through APIs (see section Section 2.4). For example for tabular data popular formats include CSV, XML, RDF, JSON, and XML.

Interactive and exportable tables below show the list of websites at https://data.gov/open-gov/.

Code

open_gov <- read.csv("https://data.gov/datagov/wordpress/2019/09/opendatasites91819.csv", header = FALSE)
colnames(open_gov) <- c("Item", "Website", "Type")
cat("Total number of entries: ", nrow(open_gov))

Total number of entries:  314

US cities and counties

Code

city_county <- dplyr::filter(open_gov, Type == "US City or County")
DT::datatable(city_county, options = list(pageLength = 5))

Show entries

Search:

	Item	Website	Type
1	Albuquerque	http://www.cabq.gov/abq-data/	US City or County
2	Ann Arbor	https://www.a2gov.org/services/data/Pages/default.aspx	US City or County
3	Arvada	http://arvada.org/opendata/	US City or County
4	Asheville	http://data.ashevillenc.gov/	US City or County
5	Atlanta	http://gis.atlantaga.gov/	US City or County

Showing 1 to 5 of 48 entries

Previous1 2 3 4 5…10Next

US states

Code

us_state <- dplyr::filter(open_gov, Type %in% c("US State", "Other State Related"))
DT::datatable(us_state, options = list(pageLength = 5))

Show entries

Search:

	Item	Website	Type
1	Alabama	http://open.alabama.gov/	US State
2	Alaska	http://dggs.alaska.gov/pubs/series/dggs/digital-data-series	US State
3	Arizona	http://openbooks.az.gov/app/transparency/index.html	US State
4	Arizona Goelogical Survey	https://azgs.arizona.edu/index.php	Other State Related
5	Arkansas	https://transparency.arkansas.gov/	US State

Showing 1 to 5 of 59 entries

Previous1 2 3 4 5…12Next

International countries and regions

Code

int_count <- dplyr::filter(open_gov, Type %in% c("International Country", "International Regional"))
DT::datatable(int_count, options = list(pageLength = 5))

Show entries

Search:

	Item	Website	Type
1	Aarhus	http://www.odaa.dk/	International Regional
2	Alberta	https://open.alberta.ca/opendata	International Regional
3	Alkmaar	http://www.alkmaar.nl/opendata	International Regional
4	Allerdale	http://datacatalogs.org/catalog/allerdale	International Regional
5	Amsterdam O+S	http://www.os.amsterdam.nl/	International Regional

Showing 1 to 5 of 206 entries

Previous1 2 3 4 5…42Next

An example of using Maryland state open data via an API

Since I live and work in Maryland, I want to see how wages in Maryland and its counties have changed over time. I also want to test if Montgomery county (where I live) has different wages compared to Frederick, Howard, and Prince George’s counties which borders Montgomery on the north, east, and south sides. Therefore, in this example, I will fetch Maryland Average Wage Per Job (Current Dollars): 2010-2020 data via API using RSocrata library in R and carry out some analysis.

Note

See https://dev.socrata.com/ to learn more about how to work with open data APIs in various programming languages.

In Table 2, each row has an average wage for a year for Maryland, and each of its counties (columns) from 2010-2020 and Figure 1 shows the same data as a line graph depicting the change in wages (y-axis) over time (x-axis).

Table 3 lists the results of an unpaired two-sample t-test between wages from Montgomery and Frederick, Howard, and Prince George’s counties. As you can see from the t-test results, wages differ between Montgomery and Frederick, Howard, and Prince George’s counties, with Montogomery county residents earning higher than all its three bordering counties.

Code

library(RSocrata)
# Fetch the data using the API endpoint
maw <- read.socrata("https://opendata.maryland.gov/resource/mk5a-nf44.json")
knitr::kable(dplyr::select(maw, -date_created)) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "400px")

Table 2: Maryland Average Wage Per Job (Current Dollars): 2010-2020
year	maryland	allegany_county	anne_arundel_county	baltimore_city	baltimore_county	calvert_county	caroline_county	carroll_county	cecil_county	charles_county	dorchester_county	frederick_county	garrett_county	harford_county	howard_county	kent_county	montgomery_county	prince_george_s_county	queen_anne_s_county	somerset_county	st_mary_s_county	talbot_county	washington_county	wicomico_county	worcester_county
2010	53096	35771	56745	55640	49986	42726	34616	38027	42027	41290	35489	48018	31591	46741	58130	36334	65178	51808	36018	38228	60032	37845	38228	38472	30799
2011	54517	36677	58011	57027	50914	43431	35981	39039	42465	42200	35718	48794	32484	48558	60448	36815	67247	52844	36437	39652	63057	38462	39420	38915	31438
2012	55466	36983	58706	58876	51722	44239	37506	39919	43260	42888	37172	49972	32506	49772	62371	36622	68159	53292	37258	39853	63698	39807	39564	39066	31641
2013	55555	37827	59384	59318	51778	44126	38404	40736	44214	42909	37773	49570	33477	49624	62271	37572	67437	53441	36848	40744	63501	39901	40032	39714	32384
2014	56924	38449	60551	61112	52961	45162	39383	41607	45051	44260	39094	50747	34195	50205	64784	38411	68731	54985	37932	41802	64691	40118	41018	40863	33635
2015	58729	39888	62195	63389	54248	48825	41043	43325	46776	44919	40022	51510	35067	52418	66677	38741	71480	56456	38970	43397	65497	41313	42270	42599	34524
2016	59710	40708	63147	64481	55159	53657	40832	43815	47300	46958	40431	51630	34925	52862	67621	39504	72904	57251	39941	43575	65937	41740	42725	43875	35260
2017	61298	42143	64629	66365	56887	55922	42034	45576	48662	47673	41711	52270	35971	53775	68958	40446	74709	58829	42099	45988	67622	43105	44039	45491	35802
2018	62836	43197	66458	67005	58793	53557	43190	45690	49981	48225	41987	53624	37575	54921	71300	42422	76867	60383	43582	45381	68887	44670	45846	45567	37231
2019	64690	44692	68586	69930	60116	51598	45190	47189	52177	49193	43271	55621	38290	57349	74136	42575	78386	62096	44011	49234	70807	45115	46965	46620	38234
2020	70446	48294	74533	74483	65743	55903	49336	51470	55854	53404	47182	60646	40690	62395	82780	45891	86138	66777	48385	53880	77490	48338	50743	50556	41605

Code

maw_gather <- maw %>% dplyr::select(-date_created) %>%
  gather(key = "county", value = "wage", -year ) %>% as_tibble()
ggplot(maw_gather, aes(x = year, y = as.numeric(wage), color = county)) +
  geom_line(aes(group = county)) + 
  labs(x = "Year", y = "Wage", color = "") +theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Figure 1: Maryland Average Wage Per Job (Current Dollars): 2010-2020

Code

mft <- broom::tidy(t.test(as.numeric(maw$montgomery_county), 
                   as.numeric(maw$frederick_county))) %>% 
                   dplyr::mutate("test" = "Montgomery vs. Frederick")

mht <- broom::tidy(t.test(as.numeric(maw$montgomery_county), 
                   as.numeric(maw$howard_county))) %>% 
                   dplyr::mutate("test" = "Montgomery vs. Howard")

mpgt <- broom::tidy(t.test(as.numeric(maw$montgomery_county), 
                    as.numeric(maw$prince_george_s_county))) %>% 
                    dplyr::mutate("test" = "Montgomery vs. Prince George's")

all_t <- dplyr::bind_rows(mft, mht, mpgt) %>%
  dplyr::select(all_of(c("test", "estimate", "estimate1", 
  "estimate2", "statistic", "p.value")))

knitr::kable(all_t) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))

Table 3: T-test results between wages from Montgomery and Frederick, Howard, and Prince George’s county
test	estimate	estimate1	estimate2	statistic	p.value
Montgomery vs. Frederick	20439.455	72476	52036.55	9.452393	0.0000001
Montgomery vs. Howard	5250.909	72476	67225.09	1.858408	0.0781124
Montgomery vs. Prince George's	15370.364	72476	57105.64	6.597893	0.0000030

Kaggle

Kaggle is a website that hosts data science and machine learning competitions. Users can compete to win prizes, and the site also has a public dataset repository. In addition to hosting competitions and disseminating public datasets, Kaggle also hosts tutorials and Jupyter-like notebook environments for Python and R runtimes. Users can clone an existing notebook on Kaggle, create it from scratch or upload it from their local computer and provision free hardware resources, including GPUs, to execute their notebooks.

While uploading public datasets, users can also generate a DOI for their dataset to give them a permanent identifier on the internet. For example, I am distributing “Classify the bitter or sweet taste of compounds” (Rohit Farmer 2022a) and “Tweets from heads of governments and states” (Rohit Farmer 2022b) datasets on Kaggle for classification and natural language processing analysis, respectively.

Datasets on Kaggle can be downloaded using a web browser or via its API, which also interacts with its competition environment. Users can find datasets for various statical and machine learning tasks available in various formats, including CSVs, MS Excel, images, SQLite databases, pandas pickle, etc.

Note

I am still determining if Kaggle datasets similar to the ones from Zenodo, Figshare, and Dryad (mentioned in the section Section 4) are used in research/scientific studies or only in testing and learning data science. If you know the answer, then please put it in the comments.

Zenodo, Figshare, and Dryad

In contrast to Kaggle (mentioned in section Section 3) which is mainly a competition hosting and data science learning community Zenodo, Fighsare, and Dryad are data hosting platforms primarily used by scientists and researchers to serve data, manuscripts, and articles. The data hosted on these three platforms usually do not qualify for a domain-specific data respository or too large to attach as supplementary material to an article. They also aim to serve as a long-term archiving solution; therefore, even if the data/manuscript is published elsewhere under the correct licensing term, it can be redistributed here.

Zenodo is a digital repository that allows users to upload and share digital content such as datasets, software, and research papers. Zenodo is designed to preserve and provide long-term access to digital content. Zenodo is developed and operated by OpenAIRE, a European consortium that promotes open access to research. A unique feature of Zenodo is to archive releases from a GitHub repository and provide a DOI making it easier to cite a GitHub repository, for example, (Farmer 2022b). Another excellent feature is the ability to create communities to organize data and manuscripts of a similar kind. Anyone can create a community, and the owner can accept or reject a request for an item to be indexed in their community, for example, https://zenodo.org/communities/data-all-the-way/.

Tip

Besides data and manuscripts, Zenodo can also be utilized to archive a blog post and generate a DOI, for example, (Farmer 2022a).

Like Zenodo, Figshare is a web-based platform for storing, sharing, and managing research data. Figshare also provides a unique identifier (DOI) for each data set, which can be used to cite the data set in publications. Data can be stored privately or publicly, and Figshare provides tools for managing data access and permissions. Data scientists and machine learning engineers often use Figshare to share data sets and models with collaborators or the public.

Dryad is also a digital repository primarily used for scientific and medical data associated with a publication in a scholarly journal. Dryad makes data discoverable, usable, and citable by integrating it into the scholarly publication process.

Zenodo, Figshare, and Dryad are available to anyone to upload data (download is always free and allowed); however, limitations may apply to the file size uploaded or the private vs. public data status. Additionally, many educational/research institutions may partner with one or all of these platforms, thus providing enhanced options.

Tip

Unlike open government data sources that are primarily used to download data, Kaggle, Zenodo, Figshare, and Dryad can be used to find the suitable dataset for your analysis and share your curated dataset publicly in a permanent citable manner.

References

Farmer, Rohit. 2022a. “A Case for Using Google Colab Notebooks as an Alternative to Web Servers for Scientific Software,” October. https://doi.org/10.5281/ZENODO.7232108.

———. 2022b. ColabHDStIM: A Google Colab Interface to HDStIM (High Dimensional Stimulation Immune Mapping). Zenodo. https://doi.org/10.5281/ZENODO.7231731.

Rohit Farmer. 2022a. “Classify the Bitter or Sweet Taste of Compounds.” Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4234193.

———. 2022b. “Tweets from Heads of Governments and States.” Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4208877.

Back to top

Citation

BibTeX citation:

@online{farmer2022,
  author = {Farmer, Rohit},
  title = {Sources of Open Data for Statistics, Data Science, and
    Machine Learning},
  date = {2022-10-25},
  url = {https://dataalltheway.com/posts/007-open-data-for-datascience},
  langid = {en}
}

For attribution, please cite this work as:

Farmer, Rohit. 2022. “Sources of Open Data for Statistics, Data Science, and Machine Learning.” October 25, 2022. https://dataalltheway.com/posts/007-open-data-for-datascience.