Data is everywhere. Finding the dataset you need ain’t so easy, though, which is why people often end up building their own. But before you do that, it’s worth checking whether at least a toy dataset similar to what you need is already available.
Here’s where you can search for existing datasets.
- Google’s Dataset Search Engine
- Registry of Research Data Repositories
- Datacite.org
- NASA
  - Earth Data by NASA – https://search.earthdata.nasa.gov/search
- US Gov Open Data – https://data.gov/
- Kaggle – https://www.kaggle.com/
- UC Irvine ML Repository – https://archive.ics.uci.edu/
- Baseball Savant – https://baseballsavant.mlb.com/
- Sports Reference sites – https://www.baseball-reference.com/
- GitHub
  - BuzzFeed News – https://github.com/BuzzFeedNews
  - Gapminder Open Numbers – https://github.com/open-numbers
- Within Python Libraries
  - Seaborn
  - scikit-learn
  - PyTorch
- Hugging Face – https://huggingface.co/docs/datasets/…
- AWS – https://aws.amazon.com/opendata/?wwps…
- Datahub – https://datahub.io/search
- Pew Research Center – https://www.pewresearch.org/download-…
- 538 – https://data.fivethirtyeight.com/
- World Bank – https://data.worldbank.org/
- CERN Open Data Portal – https://opendata.cern.ch/
- NOAA – https://www.ncei.noaa.gov/cdo-web/
- British Film Institute – https://www.bfi.org.uk/industry-data-…
- Reddit Datasets – r/datasets
- NYC taxi data – https://www.nyc.gov/site/tlc/about/tl…
- Data World – https://data.world/datasets/machine-l…
- Data One – https://www.dataone.org/
- Shaip – https://www.shaip.com/offerings/speec…
- IBM – https://developer.ibm.com/exchanges/d…
- paperswithcode – https://paperswithcode.com/datasets
- OpenML – https://www.openml.org/search?type=da…
- Computer Vision Dataset – https://visualdata.io/discovery
- Crime Data Explorer – https://cde.ucr.cjis.gov/LATEST/webap…
- World Health Dataset – https://www.who.int/data/gho
- Web Scraping
- Use APIs
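
Several of the sources listed above offer plain CSV downloads. As a rough sketch, not tied to any particular site, here is one way to pull such a file into Python using only the standard library. The helper names `parse_csv` and `download_csv`, the tiny sample data, and the `example.org` URL are all made up for illustration; swap in the real download link from whichever source you pick.

```python
import csv
import io
import urllib.request


def parse_csv(text):
    """Parse CSV text that has a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def download_csv(url, timeout=30):
    """Download a CSV file over HTTP(S) and parse it into a list of dicts."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # "utf-8-sig" also strips a byte-order mark if the file starts with one.
        text = response.read().decode("utf-8-sig")
    return parse_csv(text)


# Offline demonstration with a tiny, made-up sample (not real data).
SAMPLE = "city,population\nSpringfield,30000\nShelbyville,25000\n"
rows = parse_csv(SAMPLE)
print(rows[0]["city"], len(rows))

# With a real source, replace the placeholder URL below:
# rows = download_csv("https://example.org/path/to/dataset.csv")
```

Every value comes back as a string, so convert numeric columns yourself (or hand the file to a library like pandas) before doing any analysis.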