Datasets are the absolute foundation of machine learning development. I built a data-discovery and routing layer that lets AI agents know which real-world data source to use, how to access it, whether it is usable, and what it can be joined to. Existing public-data directories are human/topic-centric, the aim of this repo is to be agent-centric. It asks: given a research question, which dataset should an agent reach for, how does it authenticate, what can it join on, and is there an API/MCP/tooling path?
I am interested in agent-readable repos which mirror to more intuitive human readable formats, check it out here.
The mission…
Most scientific, healthcare, finance, and public datasets are not operationally discoverable. They are scattered across APIs, portals, FTP dumps, PDFs, registries, and institutional databases. This repo turns datasets into machine-readable source cards. Each entry answers four questions: can I access it, can I trust/reuse it, what identifiers can I join on, and is there an API/MCP/tooling path for autonomous use?
The core insight is the join-key layer. The generated join-key index maps canonical identifiers such as NCT_ID, DOI, PMID, CHEMBL_ID, ENSEMBL_ID, CIK, ROR, etc. to the datasets that expose them. That means agents can plan cross-source workflows instead of guessing. The repo currently has 111 datasets, 60 canonical keys, and 416 key-to-source links in that reverse index.
Why it is cool…

The repo covers domains like academic literature, clinical biotech, genomics, public health, claims, finance, corporate registries, news/events, consumer signals, government open data, and geospatial sources. It is closer to a planning substrate for research agents. For example: “Find me datasets that connect a clinical trial to a drug, target, paper, company, regulatory label, adverse-event signal, and equity ticker.” A human analyst thinks in that graph naturally. Most agents do not. Perhaps the development of such repos, if scaled, gives the agent the metadata needed to construct that graph. It maps the world’s useful research datasets, expressed in the way an agent needs to reason: access, licensing, freshness, rate limits, identifiers, joinability, and tool-readiness.
I will be exploring the tool in some more quantitative personal projects and iterating on how/if it yields improvement in agent reasoning and handling of dataset(s) in projects. Will report back ;)