Sebastian Raschka
Last updated: 01/16/2015
A collection of links to various free and open-source datasets.
I look forward to extending this little collection! If you don't find your favorite datasets listed here, just let me know (via email or Twitter) and I will add them in no time!
## Sections
- Kaggle - The leading platform for predictive modeling competitions.
- UCI MLR - UC Irvine Machine Learning Repository.
- google.com/publicdata - Public data maintained by Google.
- Freebase - A community-curated database of well-known people, places, and things.
- mldata.org - Machine learning data set repository for uploading and finding data sets.
- Infochimps - A huge collection of large data sets.
- Amazon Web Services - Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- Databib - A searchable catalog / registry / directory / bibliography of research data repositories.
- figshare - An online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.
- reddit r/datasets - Datasets shared on reddit.
- datahub - The free, powerful data management platform from the Open Knowledge Foundation.
- Quandl - A search engine for numerical data.
- enigma - A search engine for public records published by governments, companies, and organizations.
- Tiny Images Dataset - A dataset of 79,302,017 images, each being a 32x32 color image.
- ImageNet - A searchable image database.
- CAT Dataset - A dataset of 10,000 cat images.
- Amsterdam Library of Object Images (ALOI) - A color image collection of one thousand small objects, recorded for scientific purposes.
- Face Recognition Databases - A large collection of datasets for face recognition.
- INRIA Holidays and Copydays datasets - Datasets of personal holiday photos.
- Mobio - Bi-modal (audio and video) data taken from 152 people.
- Million Song Dataset - A freely available collection of audio features and metadata for a million contemporary popular music tracks.
- Music Data Mining - A collection of research done on music analysis and links to various datasets.
- CMU Audio Databases - A collection of databases for speech recognition.
- CMU_ARCTIC speech synthesis databases - Phonetically balanced, US English single-speaker databases designed for unit selection speech synthesis research.
- VoxForge - GPL speech audio corpora.
- TechTC - Technion Repository of Text Categorization Datasets containing 300 labeled datasets with categorization difficulties indicated by baseline SVM accuracies.
- SMS Spam Collection - A public dataset of 5572 SMS messages that are labeled as either "spam" or "ham" (not spam).
- musiXmatch - A dataset of lyrics for the songs in the Million Song Dataset. The lyrics are pre-processed and available as "bag of words" after stemming.
- Google Books Ngram Viewer - The corpus of Google Books as n-grams available for quick online queries or download.
- Jeb Bush's email archive - Jeb Bush's emails during his days as the governor of Florida.
- Amazon Google Books Ngrams - A data set containing Google Books n-gram corpora.
- The Wayback Machine - 80 terabytes of archived web crawl data available for research.
- SMS Spam Collection - A collection of 425 SMS spam messages manually extracted from the Grumbletext website.
- Yahoo News Feed dataset - A 1.5 TB dataset for building machine learning recommendation systems.
- The full Reddit Submission Corpus 2006-2015 - All publicly available Reddit submissions from January 2006 through August 31, 2015.
- NGAFID - National General Aviation Flight Information Database. Time series data from various flight data recorders for flights that are approximately an hour long each.
- 1000 Genomes Project - A deep catalog of human genetic variation.
- Cancer Program Data Sets - A collection of genomic datasets.
- Meteorites - Registered meteorites that have impacted Earth.
- Social Network Analysis Interactive Dataset Library - An accessible library of many of the "open" social network analysis datasets.
- SNAP - Stanford Large Network Dataset Collection.
- Click Dataset - A large dataset of about 53.5 billion HTTP requests made by users at Indiana University.
- Common Crawl 2012 web corpus - A hyperlink graph of 3.5 billion web pages and 128 billion hyperlinks between these pages.
- PyPi/Maven Dependency Data - State of the Maven/Java dependency graph and state of the PyPi/Python dependency graph.
- Titanic Survivors - A dataset with 1313 samples and 10 features about Titanic survivors.
- Pass rates, race & gender - Detailed data on pass rates, race, and gender for 2013.
- Modeling Online Auctions - Datasets of bidding for different eBay auctions.
- NYPD Crash Data Band-Aid - NYPD traffic crash data as a geocoded CSV.
- aiHit Datasets - Information on 10,000 UK companies randomly sampled from the aiHit DB.
- Crunchbase Companies Datasets - 2 million Crunchbase company listings with over 100 data points.
- United Nations Data - Data about health, environment, energy, etc.
- United States Government Data - The home of the U.S. Government’s open data.
- EconData - Economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media.
- USGovXML - An index to publicly available web services and XML data sources that are provided by the US government.
- Nominate/vote data - Datasets including all the D-NOMINATE and W-NOMINATE scores.