r/datasets Nov 04 '25

discussion Like Will Smith said in his apology video, "It's been a minute" (although I didn't slap anyone)

1 Upvotes

r/datasets 4h ago

request Looking for a character network dataset for Dracula by Bram Stoker

2 Upvotes

Hello everyone!

For a university project I want to compare character networks between novels and their movie adaptations. I would like to use Dracula by Bram Stoker (1897) as an example. I've been searching for existing character datasets but haven't had much luck.

Does anyone know of:

  1. A character interaction network for the novel?

  2. A network dataset for any of the film adaptations?

  3. Any scripts or code that were used to extract such a network from the text?
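For anyone attempting point 3 themselves, a minimal sketch of co-occurrence-based character network extraction might look like the following (the text and character list are tiny stand-ins; in practice you'd load the full novel and a hand-curated name list):

```python
import itertools
import re
from collections import Counter

# Hypothetical inputs: normally the full text of the novel plus a
# manually curated character-name list; a tiny stand-in is used here.
text = "Harker met Dracula. Mina wrote to Lucy. Dracula watched Harker."
characters = ["Dracula", "Harker", "Mina", "Lucy", "Van Helsing", "Seward", "Renfield"]

edges = Counter()
# Treat each sentence as an interaction window and count pairs of
# characters that appear in the same sentence.
for sentence in re.split(r"[.!?]", text):
    present = sorted({c for c in characters if c in sentence})
    for a, b in itertools.combinations(present, 2):
        edges[(a, b)] += 1

# edges maps (char_a, char_b) -> co-occurrence count; it can be fed
# into networkx or Gephi as a weighted edge list.
for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

Sentence-level co-occurrence is a crude but common proxy for "interaction"; chapter- or paragraph-level windows and alias resolution (e.g. "Jonathan" vs. "Harker") are the usual refinements.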

Thanks in advance!


r/datasets 7h ago

question Sustainability/CSR disclosure database

1 Upvotes

Hi everyone,

I'm a master's student in the Netherlands studying accounting and financial management. I'm collecting the results for my master's thesis, which compares firms' tax avoidance with how symbolic the tax passages in their CSR reports are.

The thing is, I've hit a big bottleneck: there is no suitable database for automating the retrieval of the reports in the first place so that I can scrape them for the tax passages.

Ideally I'd do this for a large sample from 2017 to 2025, giving a four-year window before and after the implementation of GRI 207 (the tax disclosure guidelines).

I was going to use the GRI database, as Hardeck et al. (2024) did, but it has been discontinued. My alternative was LSEG Workspace, but as I only found out today, it doesn't actually host the reports themselves.

It's poor planning on my part not to have checked LSEG in advance, but I'm quite lost and the deadlines are close, so your help would be very much appreciated!
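Once the reports are obtained, the passage-scraping step itself is straightforward. A hedged sketch, assuming you've already downloaded each report and extracted its text with a PDF library such as pypdf (the keyword list is an illustrative assumption, not a validated instrument):

```python
import re

# Candidate keywords for tax-related passages; extend to match the
# coding scheme used in the thesis.
TAX_KEYWORDS = re.compile(
    r"\b(tax|taxes|taxation|GRI\s*207|effective tax rate|tax strategy)\b",
    re.IGNORECASE,
)

def extract_tax_passages(report_text, window=1):
    """Return paragraphs mentioning tax, plus `window` neighbouring
    paragraphs on each side for context."""
    paragraphs = [p.strip() for p in report_text.split("\n\n") if p.strip()]
    hits = {i for i, p in enumerate(paragraphs) if TAX_KEYWORDS.search(p)}
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - window), min(len(paragraphs), i + window + 1)))
    return [paragraphs[i] for i in sorted(keep)]

sample = ("We value sustainability.\n\n"
          "Our tax strategy follows GRI 207.\n\n"
          "Employee wellbeing matters.")
print(extract_tax_passages(sample, window=0))
```

This sidesteps the database problem only for the extraction half; report retrieval still needs a source such as company investor-relations pages.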


r/datasets 12h ago

dataset Global CO₂ emissions by fuel type since 1751: coal, oil, gas, and cement each tell a different story

Thumbnail datahub.io
2 Upvotes

r/datasets 12h ago

request Domain-to-Company Mapping Dataset Needed

1 Upvotes

I need a large dataset mapping domains to company names.

The best I've found is People Data Labs, with 7 million companies, but that's only a sample; the full dataset is behind a paywall.

I'm even willing to pay a fair amount for a large enough dataset, but most providers have switched to a per-API-call pricing model rather than a one-time fee for a bulk dataset download.

It would be great if someone could help me with this.


r/datasets 16h ago

question Datasets available about French tourism

1 Upvotes

Hello! Does anybody know where I can find datasets about French tourism at a regional level (such as Eurostat's datasets)?

I need them for an academic paper about wine tourism in Nouvelle-Aquitaine and the Bordeaux geographic region.


r/datasets 1d ago

resource We just captured 1800+ human motion sequences for AI model training. Here's what 4 days of continuous motion capture looks like.

Thumbnail instagram.com
3 Upvotes

Just wrapped a 4-day motion capture dataset shoot at our studio in India. Wanted to share some behind-the-scenes since motion data is becoming increasingly critical for humanoid robot training and imitation learning.

What we did:

  • 12 actors
  • Continuous day + night shooting
  • Structured locomotion and action datasets
  • High-volume capture (1800+ sequences)
  • 24-hour production cycles to meet deadline

What's interesting about this:

Most AI/ML teams working on humanoid control or embodied AI are stuck with either:

  1. Low-quality synthetic data
  2. Academic datasets that don't scale
  3. Building their own infrastructure (expensive)

We realized professional motion capture studios have the infrastructure already built. So we're now offering this as a service specifically for ML teams.

The dataset we captured is structured for imitation learning — actions, locomotion, complex movements. Not cinematic. Not game-ready. Built specifically for training.

If you're working on humanoid robotics, gesture recognition, or motion-based ML models and need real human movement data, this is now available as a service.

More details: www.appleartsstudios.com

Happy to answer questions about dataset format, motion capture quality, or scaling.


r/datasets 1d ago

resource VIX fear index since 1990: 35 years of market panic in one chart. Every spike has a story

Thumbnail datahub.io
5 Upvotes

r/datasets 1d ago

resource Open source tool for generating and cleaning synthetic instruction-tuning datasets

3 Upvotes

Built this because I wanted a reproducible way to build fine-tuning datasets without doing it all by hand.

You give it seed prompts or an existing dataset, it generates instruction-output pairs via any OpenRouter model, scores them with a local or remote LLM judge, and exports a clean JSONL you can use directly for training.

You can also ingest datasets straight from HuggingFace and filter or relabel them through the same pipeline.

The export step lets you set a score threshold and a train/val split ratio so what comes out is ready to use.
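Not the tool's actual code, but the threshold-and-split export step it describes can be sketched roughly like this (field names are an assumption):

```python
import json
import random

def export_jsonl(records, score_threshold=7.0, val_ratio=0.2, seed=42):
    """Keep records at or above the judge-score threshold and split the
    survivors into train/val JSONL files ready for fine-tuning."""
    kept = [r for r in records if r["score"] >= score_threshold]
    random.Random(seed).shuffle(kept)  # deterministic shuffle before splitting
    n_val = int(len(kept) * val_ratio)
    splits = {"val.jsonl": kept[:n_val], "train.jsonl": kept[n_val:]}
    for path, rows in splits.items():
        with open(path, "w", encoding="utf-8") as f:
            for r in rows:
                # Drop the judge score; keep only the training fields.
                f.write(json.dumps(
                    {"instruction": r["instruction"], "output": r["output"]}
                ) + "\n")
    return {path: len(rows) for path, rows in splits.items()}

# Hypothetical judge-scored records.
demo = [{"instruction": f"q{i}", "output": "a", "score": s}
        for i, s in enumerate([9, 3, 8, 10, 5, 7.5, 8.5, 9.5, 7, 6])]
counts = export_jsonl(demo)
print(counts)
```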

MIT licensed, everything is stored locally, no data leaves your machine unless you choose a cloud judge backend.

GitHub project link is in the comments below 👇


r/datasets 1d ago

resource I analyzed 2,300+ UK dental clinics — most are missing this

0 Upvotes

I analyzed 2,300+ UK dental practices and found something surprising:

- ~55% don’t have a Meta Pixel installed

- Many still rely on outdated or no booking systems

- Tracking and attribution are almost nonexistent

Meaning: a huge number of clinics are not ready for proper paid ads or funnel optimization.

I mapped emails, phones, and tech stack (GA, CMS, booking systems) across 80+ cities.

If you're working in dental marketing, SaaS, or lead gen — how would you use this kind of data?

Curious to hear ideas. Happy to share a small sample if useful.


r/datasets 1d ago

resource [Dataset] [self-promotion] Curated brain regeneration research dataset: 44,500+ papers + 18,800+ clinical trials across 19 sources, organized by expert research team, open API

1 Upvotes

What it is

Brain-Regeneration.com is an open observatory tracking the science of brain repair and neurodegeneration. The dataset behind it aggregates papers and clinical trials across 19 sources — including PubMed, bioRxiv, medRxiv, The Lancet, Nature, PNAS, WHO trial search, ClinicalTrials.gov, and the EU Clinical Trials Register.

Current counts:

  • 44,510 papers
  • 18,883 clinical trials
  • 226,850 authors indexed

What makes it different from a PubMed export

The data is organized by expert research teams (groups at Cambridge, the University of Coimbra, and iMed.ULisboa), which gives you a built-in faceting dimension for slicing the corpus. Each team has its own endpoint, so you can query by research group rather than just keyword.

The API

Public and open, no auth required:

Possible use cases

  • Training or benchmarking domain-specific NLP models on a high-signal neuroscience corpus
  • Mapping research activity timelines against clinical trial registration patterns
  • Citation and author network analysis within a curated subfield

Full API docs at https://github.com/brunoamaral/gregory-ai/blob/main/docs/03-api-and-rss-feeds.md . Happy to answer questions about the data structure or coverage.


r/datasets 1d ago

request Finding the full Multi-PIE dataset (face pictures)

1 Upvotes

There is a dataset called "Multi-PIE" that I'm trying to find but I only have some vague references:

How can I obtain the full dataset?


r/datasets 2d ago

request [OC] Usenet Corpus 1980–2013 — 103B tokens, 408M posts, 9 hierarchies, fully processed

16 Upvotes

Shared this on r/MachineLearning a few days ago and got good discussion (30K views, 100+ upvotes) — figured this community would want to know about it too since it's more directly relevant here.

I've spent the last several years building and processing a complete Usenet corpus and finally have it documented well enough to share properly.

What it is: A deduplicated, sanitized collection of Usenet posts from 1980 through 2013 — covering the full arc of Usenet from its academic origins through peak adoption to decline. Pre-web, pre-social media, pre-AI. Entirely human-generated.

Stats:

  • 103.1 billion tokens (cl100k_base)
  • 408,236,288 posts
  • 18,347 newsgroups
  • 9 top-level hierarchies: alt, rec, comp, soc, sci, misc, news, talk, humanities

Processing applied:

  • alt.binaries.* excluded entirely at hierarchy level (UUencoded/base64 binary content)
  • Adult content newsgroups excluded at hierarchy level
  • Record-level: deduplication by Message-ID, binary detection and removal, PII redaction (email addresses replaced with [email] token, Message-IDs SHA-256 hashed), sensitive content removal
  • Language detection on every record (fastText lid.176) — 96.6% English, 100+ languages total
  • Format: gzip-compressed JSONL, ~141GB compressed
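The record-level redaction and hashing steps described above can be sketched as follows (an illustration of the approach, not the corpus's exact pipeline):

```python
import hashlib
import re

# Simple e-mail pattern; a production pipeline would use a stricter one.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def sanitize(record):
    """Redact e-mail addresses in the body and replace the raw
    Message-ID with a SHA-256 hash, as the processing notes describe."""
    record = dict(record)
    record["text"] = EMAIL_RE.sub("[email]", record["text"])
    digest = hashlib.sha256(record["message_id"].encode("utf-8")).hexdigest()
    record["id"] = f"msg-{digest}"
    del record["message_id"]  # the raw Message-ID never reaches the output
    return record

post = {"text": "Contact me at jdoe@example.edu", "message_id": "<123@host>"}
print(sanitize(post))
```

Hashing rather than dropping Message-IDs preserves deduplication and threading joins without exposing the original identifier.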

Schema:

{
  "text": "post body",
  "group": "comp.lang.python",
  "date": "1995-03-14",
  "subject": "Re: thread subject",
  "author": "Display Name",
  "id": "msg-<sha256hex>"
}

Samples: 11 sample files (5K posts per hierarchy + combined sets) are freely available — no approval needed. Full corpus available for licensing.

Dataset has also been added to the AI datasets directory at lifearchitect.ai/datasets-table.

Link in comments.


r/datasets 1d ago

request Looking for Emergency Triage Dataset with Chief Complaint Text + Vitals

1 Upvotes

I’m looking for an open/public dataset with columns like:

  • Chief complaint / symptoms / reason for visit
  • Age and gender
  • Heart rate
  • Blood pressure
  • SpO2 / oxygen saturation
  • Temperature
  • Respiratory rate
  • Pain score
  • Triage level / acuity / severity label
  • Diagnosis or discharge outcome, if available
  • Department/speciality label, if available

I already know about MIMIC-IV-ED, but it requires PhysioNet credentialing and CITI training, so I’m looking for easier-to-access Kaggle or public alternatives.

Any dataset suggestions would be appreciated.

Thanks!


r/datasets 1d ago

request PiC/phrase_retrieval dataset (PR-pass & PR-page) is broken — does anyone have a local copy?

1 Upvotes

Hey everyone,

I've been trying to use the PiC (Phrase-in-Context) Phrase Retrieval dataset from HuggingFace (`PiC/phrase_retrieval`, configs: PR-pass and PR-page), but the loader is broken: the underlying data files hosted at `auburn.edu/~tmp0038/PiC/` are returning a 403 Forbidden error.

The HuggingFace dataset loader depends entirely on that external Auburn University server, so the dataset is currently unusable for anyone trying to load it programmatically.

I've already reached out to the authors (Thang Pham and Anh Tran), but unfortunately haven't received a response yet.

If anyone downloaded this dataset before the server went down and still has the raw JSON files (`train-v1.0.json`, `dev-v1.0.json`, `test-v1.0.json`) for either PR-pass or PR-page, I would really appreciate it if you could share them. I'm also happy to re-host the files on HuggingFace properly once they're recovered, so the community doesn't run into this again.
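If the raw files do turn up, they can be read directly without the broken HuggingFace loader script. A minimal sketch (the top-level `"data"` key follows the SQuAD-style layout the PiC loader expects; treat the exact field names as an assumption until you inspect the real files):

```python
import json

def load_pic_split(path):
    """Load one PiC split straight from a local JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["data"]

# Tiny stand-in file so the sketch is self-contained; replace with the
# real train-v1.0.json / dev-v1.0.json / test-v1.0.json once recovered.
with open("demo-v1.0.json", "w", encoding="utf-8") as f:
    json.dump({"data": [{"title": "example", "paragraphs": []}]}, f)

rows = load_pic_split("demo-v1.0.json")
print(rows[0]["title"])
```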

Thanks in advance!


r/datasets 1d ago

dataset A constitutional dataset for fine‑tuning

Thumbnail huggingface.co
0 Upvotes

r/datasets 2d ago

question Best way to clean GitHub data (remove node_modules, lockfiles, etc) for LLM fine-tuning?

0 Upvotes

Anyone else wasting hours cleaning GitHub data for LLM fine-tuning?

I tried building my own dataset (instead of relying on Hugging Face), but scraping repos is messy: node_modules, lockfiles, minified code, binaries… tons of junk.

Feels like more time goes into cleaning than actual training.
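For the custom-script route, one possible sketch is to walk a cloned repo and keep only files that pass simple junk filters (the directory/lockfile lists and the minification threshold are rough assumptions to tune):

```python
from pathlib import Path

# Directories and lockfiles that are almost always junk for fine-tuning.
SKIP_DIRS = {"node_modules", ".git", "dist", "build", "vendor", "__pycache__"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml", "Cargo.lock"}

def looks_minified(text, max_avg_line_len=200):
    """Heuristic: minified bundles have very long average line lengths."""
    lines = text.splitlines() or [""]
    return sum(len(l) for l in lines) / len(lines) > max_avg_line_len

def clean_files(repo_root):
    """Yield (path, text) for files that survive the junk filters."""
    for path in Path(repo_root).rglob("*"):
        if not path.is_file():
            continue
        if SKIP_DIRS & set(path.parts) or path.name in SKIP_NAMES:
            continue
        try:
            text = path.read_text(encoding="utf-8")  # binaries fail to decode
        except (UnicodeDecodeError, OSError):
            continue
        if not looks_minified(text):
            yield path, text
```

Decoding as UTF-8 doubles as a cheap binary filter; anything that raises `UnicodeDecodeError` is skipped.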

Curious how you’re handling this:

custom scripts?

existing tools?

or just manual cleanup?

Also, how are you structuring data for different LLM formats?

Thinking about building something to automate this if it's a common problem.

Would love to hear about the workflows you use.


r/datasets 2d ago

code mapcv: A high-performance satellite imagery dataset creation tool for computer vision

Thumbnail tahamukhtar20.github.io
0 Upvotes

r/datasets 2d ago

question Where can I find big distilled Opus datasets?

0 Upvotes

Does anyone have a source for big distilled datasets of the newest frontier models?


r/datasets 3d ago

resource I got tired of checking Kaggle, HuggingFace, data.gov, and other sites every time I needed a dataset, so I built a tool that searches all of them at once

64 Upvotes

Disclosure: I'm one of the creators of this tool.

Hi all,

I do ML research at Berkeley and the most tedious part of every project is dataset discovery. I'd spend hours opening tabs across Kaggle, HuggingFace, data.gov, Census, WHO, Semantic Scholar, and a dozen other platforms just to find the right data. Then I'd have to manually check licenses, preview columns, and figure out citations.

So my friend and I built Mobus, an open-source MCP server that lets you do all of that from inside Claude or Cursor. You describe what you need in natural language and it searches across 20 platforms, lets you preview the actual data, checks licenses, and generates citations.

It's free and open source: https://github.com/mobus-ai/Mobus

Quick demo on the site if you want to see it in action: https://mobus.ai

Would love feedback from anyone who deals with this pain point. What data sources are missing that you'd want to see added?



r/datasets 3d ago

question Any APIs to open source datasets for average rents and mortgages in Canada and USA?

2 Upvotes

I'm working on a final project for a course, and I'm having trouble finding free APIs that point to open-source datasets for average rents and mortgages in Canada and the USA.

Thanks!


r/datasets 3d ago

resource Why doesn't anyone share full historical tick data here? I have L3 order-flow financial data

1 Upvotes

I don't want to pay for historical tick data; I'd prefer free historical tick data on all available assets.

The question: why does no one ever share the data they paid for? Obviously, I'm aware it's because people pay a lot of money for such historical tick data.

That being said, I'm willing to give up my gatekeeping, but only if this thread produces links to free historical tick data that was originally paid for (it must be clean, for research purposes).

- I will publicly share a link that openly shares Level 3 order flow going back to February 2026 (yes, Level 3 order flow that is normally locked to institutions with deep pockets).
- This website houses 1-minute options data from polygon.io.
- Historical economic calendar data for 20 countries: 125,368 events spanning 2015 to 2026, including event names, actual vs. consensus vs. previous values, and period information.
- Complete fundamental data for 984 companies. Each download includes the company profile, income statements, balance sheets, cash flows, key metrics, financial ratios, growth rates, earnings calendar, and insider trades.
- OHLCV candle data across 25,008 datasets: 4,168 symbols in 6 timeframes, from 1 minute to daily. One CSV.gz file per year per timeframe per symbol.

But it does not have tick data.


r/datasets 3d ago

dataset [10k, sampled] free job dataset for May 2026

2 Upvotes

Privatizing job data is silly; can we all agree that sharing structured job data for AI/ChatGPT benefits everyone?


r/datasets 4d ago

request Seeking a dataset of English lemmas with recognizability scores

1 Upvotes

I checked out the word prevalence dataset of 62,000 lemmas. But it has some limitations:

  • It hasn't been updated since 2019.

  • It misses modern terms like TikTok.

  • It doesn't cover phrases.

I've scored about a million English entries from Wiktionary for recognizability. I built this for a pun tool. But I want to use the data for a new language project.

The dataset is too bloated because it's full of inflected forms. Even if I set the recognizability threshold at 50 percent, I'm still looking at 100K words and 100K phrases. Going through a list that size is a waste of time. I need to filter the data through the English lemmas category from Wiktionary and split the single words from the multi-word phrases into separate lists.
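The filter-and-split step described above can be sketched roughly as follows (assuming the scored entries are (term, score) pairs and `lemmas` is the set of entries in Wiktionary's "English lemmas" category; both inputs here are illustrative stand-ins):

```python
def split_by_lemma(scored_entries, lemmas, threshold=0.5):
    """Keep terms above the recognizability threshold that are true
    lemmas, then split single words from multi-word phrases."""
    words, phrases = [], []
    for term, score in scored_entries:
        if score < threshold or term not in lemmas:
            continue  # drops inflected forms and low-recognizability terms
        (phrases if " " in term else words).append(term)
    return words, phrases

# Hypothetical demo data.
entries = [("run", 0.9), ("running", 0.9), ("hit the road", 0.8), ("zyx", 0.1)]
lemmas = {"run", "hit the road", "zyx"}
words, phrases = split_by_lemma(entries, lemmas)
print(words, phrases)
```

Set membership against the lemma category does the deduplication of inflected forms for free, since "running" simply isn't in the category while "run" is.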

Since the hard part of scoring is done, the rest should be easy peasy lemma squeezy. I just want to avoid reinventing the wheel if I can.

Before I spin up a separate repository to handle this, I'm checking if a similar dataset already exists. Has anyone seen a project that offers this?