Data Sources Page
Collection of Datasets:
- FEC contributions data (as
part of Hadley Wickham’s dplyr package)
- Medicare
dataset (discussed on
- Yahoo big data datasets
- SF OkCupid Users: Everett Wetchler wrote a Python script to scrape
the public profiles of San Francisco OkCupid
users. He pulled one snapshot (June 26, 2012) of all OkCupid users who
lived within 25 miles of San Francisco, subject to some other caveats.
It might be of interest to students given the recent press that
data-driven approaches to online dating have been getting, specifically
the Wired article "How a Math Genius Hacked OkCupid
to Find True Love" and Amy Webb's TED Talk "How I hacked online dating".
- This
growing dataset repository presents raw data from real medical studies and
offers (a) a vignette summarizing the study, research question and study
design; (b) a data dictionary with clear documentation of variables and
codes; (c) a complete citation for the associated study publication; and
(d) a variety of data formats compatible with the majority of statistical
- CAUSEweb data
- Data
formatted to use in R
- Finding
Data on the Internet
- State
Health Facts:
- – a fascinating website
with amazing graphics (social and economic data broken down by
country). Click on the spreadsheet
links to download the data.
- Wolfram/Alpha: This
is billed as a computational search engine. Put in "nachos" and you get a
detailed nutritional analysis; put in "GDP of Albania" and
you get several forms of GDP, a historical graph and other economic
variables; put in your favorite college and you get lots of info (including
number of degrees in mathematics in 2009, location on a map and a link to a
satellite view of campus). While
the case-by-case data display is not so convenient for building datasets,
there are pretty good links to the sources that Wolfram is pulling data
from. For example, the
Wolfram/Alpha page of info on a college or university has a data source
link at the bottom to the National Center for Education Statistics
website, where you can download your own custom data files from IPEDS
(Integrated Postsecondary Education Data System). Want to know the
average faculty salary by rank for all the schools in your comparison
group? Or the nacho search gives a link to the
USDA's National Nutrient Database, and a few clicks later I've got a spreadsheet
with data on 50+ nutrients in 7400+ foods (and that's the abbreviated
- Many
Eyes: This is billed
as a wiki for data and visualizations of data. Users can contribute
datasets and graphics as well as comment on what others have
contributed. Some of the
"visualizations" are pretty bizarre; others are interesting.
For example, I'm not sure where else I could find different datasets (e.g. current
average home rental prices) from counties in Ireland,
display the data by shading a map of Ireland with the variable I
choose, and have a link to the report where the data appear. A search with
keyword "golf" produced 14 hits, including several that referred
to the Volkswagen Golf, a couple where individual golfers posted datasets
with their own scores (and quite detailed info for each round), listings
of the length and price to play of golf courses in the Toronto area, the
World Golf Rankings Top 250 golfers (from 2007) and data on PGA Tour
golfers (from ESPN) for the 2007 season.
- -- time-series data
sets, uploaded by users.
- UC Irvine’s Machine Learning Repository
- Journal
of Statistics Education Data Archive – datasets contributed by
statistics teachers. Raw data are given in a .dat file,
with explanations of the variables in an accompanying
.doc file. Several of these
datasets are tied to longer JSE articles discussing their use in statistics classes. For example, try televisions.dat
and the related Rossman article for some data on life expectancy
and numbers of televisions in various countries.
- Baby names (popularity by
year and state), compiled by the Social Security Administration
- DASL is the Data and Story
Library – a collection of datasets and related documentation which may be
searched by data subjects or by statistical techniques
- DASL in Australia
- Statlib
Dataset Archive – one of the original
sources for archived data
- National
Institute of Standards and Technology (NIST) education data sets
- CHANCE Project Datasets – data from recent media coverage
of current events. There are only a few
datasets here, but many excellent references to teaching applications of statistics in the news can be
found at the main CHANCE site.
- Electronic Dataset Service – a
collection of links to datasets organized by statistical methods
- Data – a collection
of datasets from the book DATA by Andrews and Herzberg, stored at Statlib
- FEDSTATS – links to data on the Web
produced by US Government agencies
- Sports Data Page
- Statistical
Resources on the Web
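
Several of the archives listed above, the Journal of Statistics Education Data Archive among them, distribute raw data as plain-text .dat files with the variables documented in a separate file. The following is a minimal Python sketch of loading such a file. It assumes whitespace-separated columns, no header row, and "*" as the missing-value token; the function name `read_dat`, the column names and the two sample rows are hypothetical placeholders, not values from any archive file. Real files may use fixed-width fields or contain spaces inside a field, which this sketch does not handle.

```python
import io


def _convert(raw):
    """Return a float when the token is numeric, otherwise the original string."""
    try:
        return float(raw)
    except ValueError:
        return raw


def read_dat(handle, columns, missing="*"):
    """Parse whitespace-delimited rows into a list of dicts.

    columns: names taken from the accompanying documentation file
             (assumed: the data file itself has no header row).
    missing: token that marks a missing value (assumed to be "*").
    """
    rows = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(columns):
            raise ValueError(
                f"line {line_number}: expected {len(columns)} fields, "
                f"got {len(fields)}"
            )
        row = {}
        for name, raw in zip(columns, fields):
            row[name] = None if raw == missing else _convert(raw)
        rows.append(row)
    return rows


# Hypothetical placeholder content standing in for a downloaded .dat file.
sample = io.StringIO("Alpha 10.5 3\nBeta * 7\n")
data = read_dat(sample, columns=["label", "x", "y"])
print(data)
```

For a file on disk, pass an open file object instead of the in-memory sample, for example `with open(path) as handle: rows = read_dat(handle, columns)`, where `path` and `columns` come from the download and its documentation file.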