Bringing the regulatory and clinical landscape to Paperclip
Today, we're adding 225K FDA regulatory documents, 1M+ clinical trials from 19 registries (ClinicalTrials.gov, EudraCT, CTIS, ISRCTN, UMIN, JRCT, ChiCTR, and 13 WHO ICTRP registries spanning India, Iran, Australia/NZ, Germany, Netherlands, Korea, Thailand, Brazil, and Africa), along with international regulatory filings from EMA and Japan PMDA, to Paperclip.
For current Paperclip users: run paperclip update and you can immediately search FDA documents and clinical trials. Prompt your agent to search FDA documents, ClinicalTrials.gov registrations, or international regulatory filings.
225K US FDA Documents
580K ClinicalTrials.gov
Why this matters
Drug development workflows depend on regulatory and clinical evidence distributed across incompatible sources. Each database exposes its own search interface, metadata schema, and update cadence. International registries introduce additional constraints of language, jurisdiction, and identifier conventions, which makes cross-source synthesis difficult to automate.
In practice, regulatory affairs and clinical development teams spend considerable time on retrieval: locating NDA and BLA review packages on Drugs@FDA, trial registrations on ClinicalTrials.gov, EPARs from the EMA, PMDA assessment reports, and entries in ChiCTR, JapicCTI, EU CTR, and other national registries. The underlying documents are public, but they are not indexed or queryable in a unified form.
Paperclip indexes US FDA documents, ClinicalTrials.gov registrations, and international regulatory and trial corpora into a single virtual filesystem, accessible via search, grep, SQL, and standard Unix-style pipelines.
The three case studies below compare agent performance with and without this corpus on tasks that require synthesis across these sources: replicating a published regulatory analysis across hundreds of drug approvals, planning a first-in-human clinical trial using precedent from international registries and NDA pharmacology reviews, and simulating FDA reviewer advice grounded in each reviewer's actual review history.
Replicating a published analysis
By Jake Silberg, PhD at Stanford University, Biomedical Data Science
About 10% of drugs approved by the FDA between 2018 and 2021 failed at least one primary endpoint in their pivotal trials. Identifying these cases matters for understanding how the agency balances statistical rigor against unmet medical need. We tested whether agents could replicate this analysis from the source documents.
Results
[Figure: drugs flagged from the original paper, median cost per drug, and median wall time per drug]
Both agents did a solid job replicating the key analysis, but Paperclip clearly helped. The agent with Paperclip flagged 18 of the 21 drugs with failed endpoints from the paper (86% recall), while the agent without Paperclip flagged only 15 (71% recall). The agent with Paperclip flagged 27 drugs overall, 9 of which were not in the paper. The agent without Paperclip flagged 11 drugs not in the paper.
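The recall figures, and the matching precision against the paper's list, follow directly from the counts reported above. The sketch below is a minimal Python check; treating the paper's 21 drugs as ground truth is an assumption, since the paper's own list may be debatable at the margins.

```python
# Counts reported in this post. "Extra" drugs are those an agent flagged
# that do not appear in the original paper's list.
PAPER_POSITIVES = 21

AGENTS = {
    "with Paperclip": {"matched": 18, "extra": 9},
    "without Paperclip": {"matched": 15, "extra": 11},
}


def recall(matched: int, positives: int) -> float:
    """Share of the paper's flagged drugs that the agent also flagged."""
    return matched / positives


def precision(matched: int, extra: int) -> float:
    """Share of the agent's flagged drugs that appear in the paper."""
    return matched / (matched + extra)


for name, counts in AGENTS.items():
    r = recall(counts["matched"], PAPER_POSITIVES)
    p = precision(counts["matched"], counts["extra"])
    print(f"{name}: recall {r:.0%}, precision vs. paper {p:.0%}")
```

Precision here penalizes every extra drug, including ones where the agent may have a defensible case, so it is best read as agreement with the paper and not as accuracy.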
The other key difference was speed. The agent with Paperclip needed 9 turns to write a narrative describing the pivotal trials and noting any failures; the agent without Paperclip needed 21. The agent with Paperclip spent $0.45 in Claude credits per drug, compared to $0.85 for the agent without Paperclip. Wall time was 65 seconds per drug with Paperclip, compared to 136 seconds without. All numbers are medians.
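To put the three medians on a common scale, the following sketch computes the relative reduction for each one. It uses only the numbers reported above. Note that a ratio of medians is not the same as a median of per-drug ratios, which the post does not report.

```python
# Median values reported in this post: (with Paperclip, without Paperclip).
MEDIANS = {
    "turns": (9, 21),
    "cost_usd": (0.45, 0.85),
    "wall_time_s": (65, 136),
}


def relative_reduction(with_pc: float, without_pc: float) -> float:
    """Fractional reduction of the Paperclip median relative to the baseline median."""
    return 1 - with_pc / without_pc


for metric, (with_pc, without_pc) in MEDIANS.items():
    print(f"{metric}: {relative_reduction(with_pc, without_pc):.0%} lower with Paperclip")
```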
Examples
We can look at some specific successes by the agent with Paperclip. Helpfully, the agent quotes directly from the Drugs@FDA documents it accessed. For Aduhelm, for example, it cites the Statistical Reviewer saying, “We have a second large adequate well controlled study that ... is not even close to significance,” providing a line number to verify the quote exactly matches the FDA document. These verbatim details can help users understand a complex series of endpoints, like for Tavneos, where the pre-specified criteria involved assessing non-inferiority, then proceeding to superiority. The agent again provides the line number to a key point, “At Week 26, the non-inferiority comparison was statistically significant ... but superiority was not demonstrated” and the agent notes that “Endpoint did NOT meet all prespecified criteria.”
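Because each quote comes with a line number, it can be checked mechanically. The sketch below shows one way to do that in Python. It is not Paperclip's API: the function names, the three-line window, and the toy document are all assumptions made for illustration. Quotes that elide text with "..." are checked fragment by fragment, in order.

```python
import re


def normalize(text: str) -> str:
    """Unify curly quotes and collapse whitespace so layout differences do not matter."""
    for curly, straight in (("\u201c", '"'), ("\u201d", '"'), ("\u2019", "'")):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip()


def quote_matches(doc_lines, line_no, quote, window=3):
    """Return True if every fragment of `quote` appears, in order, within
    `window` lines of the document starting at the 1-indexed `line_no`."""
    start = max(line_no - 1, 0)
    haystack = normalize(" ".join(doc_lines[start:start + window]))
    position = 0
    for fragment in quote.replace("\u2026", "...").split("..."):
        fragment = normalize(fragment)
        if not fragment:
            continue
        found = haystack.find(fragment, position)
        if found == -1:
            return False
        position = found + len(fragment)
    return True


# Toy document (invented text, not an FDA review).
toy_review = [
    "3.2 Efficacy Results",
    "The primary endpoint was met at Week 12,",
    "as planned, in the full analysis set.",
    "Secondary endpoints are summarized in Table 4.",
]

print(quote_matches(toy_review, 2, "The primary endpoint was met ... in the full analysis set."))
print(quote_matches(toy_review, 2, "The primary endpoint was not met"))
```

Matching on normalized text keeps the check strict about wording while tolerating line wraps and quote-style differences between the agent's output and the source document.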
Where the agent missed drugs flagged by the original paper, the cause was usually a judgment call documented in its reasoning. For example, both agents missed a failed endpoint for Danyelza, where the FDA had reassessed the lower bound of the confidence interval to be 20% when the prespecified criterion was ‘above 20%.’ Although the Paperclip agent did not count this as a failure, it noted that the interval “rounds to exactly 20% creating ambiguity.” As another example, when assessing Nourianz, the Paperclip agent noted that different documents listed different pivotal trials, so while it spotted the failed endpoint, it did not consider that trial to be officially pivotal. Thus, although we focus on the agent’s final binary calls, its detailed reasoning based on the Paperclip documents often did capture the nuances.
Finally, we can analyze the extra drugs flagged by the agent with Paperclip. In some cases the agent was simply overinclusive: for Tauvid, it mistook a trial for a non-approved indication as pivotal. But for others, the agent may have a point. For example, Barhemsys had a failed endpoint in its DP10018 trial, and the FDA remarked “Based on the pre-specified Bonferroni procedure, neither p-values is statistically significant at one-sided 0.0125 level.” This was later revised to be significant under a different post-hoc procedure. But there is a good case to be made that Barhemsys could have been included in the original paper as well. This demonstrates the value of a capable agent that accesses the right documents and surfaces additional nuance for human review.
Method
When the FDA approves a drug despite a failed primary endpoint, it signals that the totality of evidence, including secondary endpoints, subgroup analyses, and unmet medical need, outweighed a strict reading of the pre-specified statistical plan. These cases are important for sponsors designing their own pivotal trials: they reveal the conditions under which the agency exercises flexibility and the evidentiary thresholds that informed those decisions. Systematically identifying such cases requires reading hundreds of dense review documents and cross-referencing statistical analysis plans against reported outcomes.
Because the Drugs@FDA database contains so much information about the approval process, it is a rich data source for biomedical researchers. Recent papers have analyzed everything from how the FDA extrapolates from clinical trials to determine approved populations to uncertainties in the approval process of oncology drugs that were not reported in the final drug label.
One particularly detailed paper is this analysis of drugs from 2018–2021, which found that about 10% (21 drugs) failed one of the primary endpoints in pivotal trials but were approved by the FDA anyway. We wanted to see if agents, without access to the results from the original paper, could replicate the analysis. To test this, we tried two experimental setups: first, a Claude Code SDK agent using Opus 4.7 with access to WebFetch and web search to grab PDFs from the FDA website; and second, an agent that, in addition to those tools, could directly read key review documents using Paperclip. Apart from how the agents accessed review documents, the prompts were identical, with a description of the methodology used in the paper (though NOT the specific results from the paper).
Both agents were provided with a list of the NDAs and BLAs from the time period of the study. A runner script activated concurrent Claude Code SDK sessions provided with identical instructions containing the methodology of the paper. From the prompt:
"1. Identify all pivotal trials in the approval package. 2. For each pivotal trial, identify the associated primary efficacy endpoint(s). 3. For each primary endpoint, determine whether it met the prespecified criteria."
and a structured format in which to respond. The Paperclip agent was provided a script that would pull the documents for that application for it to access, while the other agent was told:
"You have web access (WebFetch, Bash with curl/wget). FDA resources:
- Drugs@FDA: https://www.accessdata.fda.gov/scripts/cder/daf/
- openFDA API: https://api.fda.gov/drug/drugsfda.json
- ApplicationDocs: PDFs under
/drugsatfda_docs/<nda|bla>/<year>/<basename>.pdf
Read the FDA integrated review document(s). Basenames typically contain
MedR, SumR, IntegratedR, MultidisciplineR. You may need StatR for
analysis details."
Other tools are blocked to prevent hacking. After writing its reasoning about each pivotal trial and any failed endpoints, the agents had to give the final answer as YES, NO, or UNCLEAR. We then provide the UNCLEAR drugs to a second resolver agent that only reads the narratives (not the original documents) and chooses a final answer. The prompt for this resolver is identical across the two setups. Costs and time for the resolver are de minimis as it only reads the short narratives and provides a two-sentence answer.
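The three-step methodology and the YES / NO / UNCLEAR answer can be expressed as a small decision rule. In the experiment the agents made this call in free-text reasoning, so the Python sketch below is only one way to encode it. The class names, field names, and placeholder drug names are hypothetical, and the actual structured response format is not reproduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Verdict(str, Enum):
    YES = "YES"          # at least one primary endpoint failed its prespecified criteria
    NO = "NO"            # every primary endpoint met its prespecified criteria
    UNCLEAR = "UNCLEAR"  # passed on to the resolver agent


@dataclass
class PrimaryEndpoint:
    description: str
    met_criteria: Optional[bool]  # None when the documents do not settle the question


@dataclass
class PivotalTrial:
    trial_id: str
    primary_endpoints: List[PrimaryEndpoint] = field(default_factory=list)


def drug_verdict(trials: List[PivotalTrial]) -> Verdict:
    """Steps 1-3 of the prompt collapsed into a single per-drug answer."""
    endpoints = [e for t in trials for e in t.primary_endpoints]
    if any(e.met_criteria is False for e in endpoints):
        return Verdict.YES
    if not endpoints or any(e.met_criteria is None for e in endpoints):
        return Verdict.UNCLEAR
    return Verdict.NO


def split_for_resolver(verdicts: Dict[str, Verdict]) -> Tuple[Dict[str, Verdict], List[str]]:
    """Separate settled answers from the drugs the resolver still has to decide."""
    settled = {drug: v for drug, v in verdicts.items() if v is not Verdict.UNCLEAR}
    pending = [drug for drug, v in verdicts.items() if v is Verdict.UNCLEAR]
    return settled, pending


# Hypothetical example with placeholder drug and trial names.
verdicts = {
    "drug_a": drug_verdict([PivotalTrial("trial_1", [PrimaryEndpoint("superiority at Week 26", False)])]),
    "drug_b": drug_verdict([PivotalTrial("trial_2", [PrimaryEndpoint("response rate", True)])]),
    "drug_c": drug_verdict([PivotalTrial("trial_3", [PrimaryEndpoint("CI lower bound", None)])]),
}
settled, pending = split_for_resolver(verdicts)
print(settled, pending)
```

Checking for a failed endpoint before checking for an undetermined one means a single clear failure is enough to flag a drug, even if other endpoints remain ambiguous.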
Planning a Phase 1 study
The questions in this case study were provided by a biotech startup founder and former FDA employee. Early-phase drug development requires synthesizing precedent from hundreds of prior programs scattered across trial registries, FDA regulatory filings, and the scientific literature. This is a case study in what Paperclip can retrieve that web search alone cannot.
Background
BTK (Bruton's tyrosine kinase) is a protein involved in the growth of certain blood cancers and autoimmune diseases. Drugs that block BTK, such as ibrutinib (Imbruvica), have been among the most important advances in blood cancer treatment over the last decade. A new generation of drugs called BTK degraders goes further: instead of simply blocking BTK, they tag it for destruction by the cell's own recycling machinery, eliminating the protein entirely. This approach can overcome resistance to standard BTK inhibitors.
Before a new drug can be tested in cancer patients, it must first be shown to be safe in a carefully controlled first-in-human (FIH) study, typically run in healthy volunteers. The company in this scenario is planning that first study in Australia, a popular choice for FIH trials given its favorable regulations and faster startup times, with an additional cohort of Japanese volunteers to support regulatory approval in Japan.
Planning this study requires three types of highly specialized knowledge that are not readily findable through a web search:
- Precedent geography: Where have similar drugs been first tested in humans, and what does that tell us about how to design our program?
- Trial design precedents: What do studies that combined a Japanese ethnobridging cohort with a standard FIH design actually look like in practice? Which Australian clinical sites have done this before?
- Pharmacodynamic biomarkers: Once the drug is in humans, how do we measure whether it is actually degrading its target protein? The answer differs depending on whether the drug blocks, reversibly inhibits, or completely destroys the target. Choosing the wrong assay wastes time and obscures whether the drug is working.
The three questions posed to each agent were:
1. Where have BTK inhibitor and degrader healthy volunteer FIH studies been run historically? Cover all geographies.
2. What are examples of similar clinical trial designs (SAD/MAD with an appended Japanese cohort) previously run in Australia? Which clinical sites and CROs were involved?
3. How have BTK target occupancy assays been developed and validated? Be technical. Cover acceptable sample matrices, mechanism of action applicability (covalent, non-covalent, degrader), and pros/cons.
Q1: Where Have Similar Drugs Been First Tested in Humans?
Understanding where prior BTK programs conducted their first human studies, and why they chose those locations, is essential for choosing the right geography and population for a new degrader program. This requires knowing the full landscape of BTK compound development, not just the handful of approved drugs.
Opus 4.6 + Paperclip
Cataloged 18 compounds across three MOA classes with specific NCT numbers, sponsors, and geographies:
- 12 covalent inhibitors (including ibrutinib, acalabrutinib, zanubrutinib, tirabrutinib, orelabrutinib, evobrutinib, tolebrutinib, remibrutinib, branebrutinib, rilzabrutinib, and elsubrutinib)
- 3 reversible inhibitors (pirtobrutinib, fenebrutinib, nemtabrutinib)
- 3 degraders/PROTACs (NX-2127, NX-5948, HSK29116)
Identified FIH across 6 geographies: USA (dominant), Australia (zanubrutinib CTN scheme), China (orelabrutinib, HSK29116), UK (tirabrutinib), Europe (evobrutinib, remibrutinib), with Japan noted as never a primary FIH site.
Key pattern: oncology BTK programs ran FIH in patients; autoimmune programs used healthy volunteers.
Opus 4.6 + Websearch
Did not produce an FIH geography catalog.
The websearch-only response focused entirely on Australian CRO sites (part of Q2) and did not systematically map where BTK FIH studies have been run. No compound-level table, no NCT numbers for FIH studies, no cross-geography analysis.
Q2: How Do You Structure a Japanese Cohort Within a FIH Study?
Japan requires its own pharmacokinetic data before approving a new drug, typically through a dedicated cohort of Japanese volunteers within the same first-in-human study. Getting this right requires knowing how prior sponsors structured these cohorts: how many subjects, at which dose levels, run in parallel or sequentially, and at which sites. This precedent lives in clinical trial protocols and FDA regulatory filings, not on CRO websites.
Opus 4.6 + Paperclip (cohort structure)
Identified 9 named precedent studies with Japanese ethnobridging cohorts appended to SAD/MAD designs:
- Nirmatrelvir, PF-07817883 (Pfizer)
- Abrocitinib, ritlecitinib (Pfizer, with NDA numbers)
- LY3509754 (Lilly), TAK-071 (Takeda)
- AZD5462 (AstraZeneca), gepotidacin (GSK)
- Remibrutinib (Novartis, a BTK inhibitor)
Documented the standard design pattern (1–3 Japanese cohorts, 6–8 subjects each, run concurrently with later Western cohorts) and PMDA descent requirements.
Opus 4.6 + Websearch (cohort structure)
Did not identify any named precedent study designs with Japanese cohorts. Could not map the standard ethnobridging design pattern or PMDA requirements from web search alone.
Opus 4.6 + Paperclip (Australian sites)
Identified 6 Australian CROs with early-phase capability:
- Scientia (Sydney), with a dedicated Japanese-language recruitment page
- Nucleus Network (Melbourne + Brisbane), with 1,500+ trials and 50% FIH
- Q-Pharm (Brisbane), now part of Nucleus
- CMAX (Adelaide), Linear (Perth), Novotech
Explained 5 strategic reasons for Australia: CTN scheme, 43.5% R&D tax offset, Japanese diaspora, regulatory acceptance (PMDA/FDA/EMA), and timeline reduction (~56 to ~32 months).
Opus 4.6 + Websearch (Australian sites)
Identified 4 Australian sites (Nucleus Network, CMAX, Scientia, Linear) with Japanese recruitment capability.
Found CMAX's I'rom Group (Japanese) ownership, Scientia's Japanese-language page, and 3 blinded ethnobridging studies at Nucleus Network ("Matching Study," "Filter Forward," "Shield Study").
Correctly identified branebrutinib (NCT02705989) at Nucleus Network Melbourne as the best public precedent for the planned design.
The websearch-only condition did well on the question of which Australian sites exist, since that information is on CRO websites. But the more valuable question is how prior sponsors actually structured Japanese cohorts within FIH designs, with specific dose levels, subject counts, and timing. That answer requires FDA regulatory filings and clinical trial protocols that web search cannot surface.
Q3: How Do You Measure Whether the Drug Is Working?
Every first-in-human study needs to demonstrate not just that the drug is safe, but that it is actually reaching and affecting its target, a measurement called a pharmacodynamic readout. For BTK degraders, this is non-obvious: the standard assay used for BTK inhibitors (which measures whether the drug is occupying the protein) gives misleading results for a degrader (which destroys the protein entirely). Choosing the wrong assay can make it impossible to tell whether the drug is working in humans. This answer requires synthesizing technical information from NDA pharmacology reviews, published bioanalytical methods, and clinical trial protocols across a dozen programs.
Opus 4.6 + Paperclip
Produced a comprehensive technical review covering 7 assay platforms:
- ABPP with biotinylated probes (PCI-41025, PRN299-001)
- Fluorescent probe-based assays
- Free BTK / Total BTK ratio (ELISA, MSD)
- Mass spectrometry (LC-MS/MS)
- Flow cytometry (intracellular BTK)
- CD63 BAT, CD69/pBTK functional readouts
- Mechanistic PK/PD modeling (ibrutinib kon/koff/KD parameters)
For each assay: sample matrices, MOA compatibility (covalent/reversible/degrader), validation status with specific NDA numbers, and pros/cons.
Concluded with a prioritized biomarker strategy for degrader programs (total BTK protein as primary; ABPP explicitly flagged as inappropriate) and a cross-MOA compatibility matrix.
Opus 4.6 + Websearch
Did not produce a target occupancy assay review.
The websearch-only response contained no technical content on BTK occupancy assays, sample matrices, MOA-specific assay selection, or biomarker strategy. This information exists primarily in FDA review documents, NDA pharmacology reviews, and specialized bioanalytical publications, sources that web search alone could not retrieve or synthesize.
What Each System Could Answer
| Question | Opus 4.6 + Paperclip | Opus 4.6 + Websearch |
|---|---|---|
| Q1: Which countries have run similar first-in-human studies? | 18 compounds mapped to 6 geographies with trial IDs | Not answered |
| Q2a: How have Japanese cohorts been structured in prior trials? | 9 named precedent studies with design details | Not answered |
| Q2b: Which Australian sites can run Japanese cohorts? | 6 sites, strategic rationales, regulatory context | 4 sites + blinded study names from CRO websites |
| Q3: How do you measure the drug working in humans? | 7 assay platforms, degrader-specific strategy, NDA validation data | Not answered |
Virtual FDA reviewer
Three virtual FDA oncology reviewers answer four questions representative of those asked during pre-submission meetings: BRCA-targeted therapies, novel ADCs, tumor-agnostic approvals, and mCRPC comparator design. Each question was answered by Claude Code + Opus 4.6 with and without Paperclip.
Method
Subjects. Three FDA oncology reviewers were selected for distinct regulatory track records:
- Suparna Wedam, MD. Breast cancer, CDK4/6 inhibitors, ADC approvals (ENHERTU tumor-agnostic, IBRANCE, PERJETA). Persona dossier: 14 drug reviews, 10 publications, 8 presentations, 45 cited sources.
- Michael Brave, MD. Prostate cancer, novel endpoints (MFS for nmCRPC), hematologic malignancies (SPRYCEL). Persona dossier: 8 drug reviews, 8 publications, 17 cited sources.
- Elaine Chang, MD. GU oncology, PARP inhibitors, combination therapy (PADCEV, ANKTIVA, TRUQAP/capivasertib ODAC). Persona dossier: 5 drug reviews, 7 publications, 22 cited sources.
Four questions. Working with pharma companies and former FDA employees, we developed four questions representative of the topics that arise in pre-submission meetings with FDA oncology reviewers. These questions were chosen specifically because the FDA does not publish bright-line answers for any of them. Sponsors cannot look these up. Useful answers require the reviewer to draw on their own prior regulatory decisions and documented positions.
Dossier creation. For each reviewer, we used Paperclip to search for and retrieve FDA review documents, approval summaries, ODAC transcripts, published regulatory papers, and workshop presentations authored by or attributed to that reviewer. These sources were synthesized into a structured persona dossier covering: biographical details, career trajectory, every documented drug review with the reviewer's stated conclusions and direct quotes, published papers, public speaking positions, and recurring regulatory philosophy themes. The resulting dossiers ranged from 367 to 527 lines and contained verbatim quotes from FDA review packages, published approval summaries, and ODAC remarks.
Experiment setup. Both conditions used Claude Code with Opus 4.6 as the underlying model. In the Paperclip condition (Claude Code + Opus 4.6 + Paperclip), the reviewer's full dossier was injected into the system prompt, and the model had access to Paperclip (biomedical literature + FDA documents) along with web search and web fetch. In the Baseline condition (Claude Code + Opus 4.6 + websearch), the model received only the reviewer's name and title with no dossier, and had access to web search and web fetch only, with no Paperclip and no local files. The only difference between conditions is access to Paperclip and the reviewer dossier it produces.
Results
| Metric (avg per response, n=12) | Opus 4.6 + Paperclip | Opus 4.6 + Websearch (no Paperclip) |
|---|---|---|
| Word count | 3,596 | 2,574 |
| Direct quotes from reviewer's own work | 27.4 | 5.8 |
| Reviewer-specific terms (avg across reviewers) | 7.8 | 0.6 |
| Hazard ratios cited | 8.5 | 4.8 |
| Named drugs | 12.3 | 8.2 |
| Named clinical trials | 6.0 | 4.2 |
| Quantitative data points (%) | 35.6 | 26.1 |
| NDA/BLA review packages cited | 2.8 | 0 |
Reviewer-specific terms are a checklist of 11–14 terms per reviewer unique to their career (e.g., drugs they reviewed, NDA/BLA numbers they authored, endpoint innovations they pioneered). A generic FDA response would contain none of them. The websearch-only condition averaged 0.6 such terms per response; Paperclip averaged 7.8, a 13× increase.
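A checklist metric like this one can be scored with a few lines of Python. The sketch below is an assumption about how such a count could be done, not the scoring code used for the table. The checklist shown is a short illustrative subset built from names this post associates with Dr. Wedam, and the real 11–14 term checklists are not reproduced here.

```python
import re


def count_checklist_terms(response: str, checklist) -> int:
    """Count distinct checklist terms that appear in the response.
    Matching is case-insensitive and anchored on non-word boundaries."""
    hits = 0
    for term in checklist:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, response, flags=re.IGNORECASE):
            hits += 1
    return hits


# Illustrative subset only (assumption): not the actual checklist.
wedam_terms = ["ENHERTU", "IBRANCE", "PERJETA", "TUKYSA", "BLA 761139", "NDA 207103"]

# Two short excerpts taken from the example responses in this post.
reviewer_specific = (
    "Having served as the primary clinical reviewer for the ENHERTU "
    "(fam-trastuzumab deruxtecan-nxki) tumor-agnostic accelerated approval "
    "(BLA 761139 S-028) in April 2024"
)
generic = (
    "accelerated approval in this setting relies on demonstration of an effect "
    "on a surrogate endpoint, typically Overall Response Rate."
)

print(count_checklist_terms(reviewer_specific, wedam_terms))
print(count_checklist_terms(generic, wedam_terms))
```

Counting each term at most once per response keeps a long answer that repeats one drug name from scoring higher than a shorter answer that draws on several distinct parts of the reviewer's record.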
Examples
These questions were chosen because the FDA intentionally does not publish bright-line answers for any of them. The Paperclip-equipped reviewers filled this gap by drawing on their own documented regulatory actions and extrapolating to new scenarios. Excerpts below show the same reviewer answering the same question under both conditions.
Q1: BRCA-Mutated Targeted Therapy
Dr. Wedam, on combination contribution of effect
Opus 4.6 + Paperclip
"When I review combination therapies, as I did for TUKYSA (tucatinib + trastuzumab + capecitabine, NDA 213411) and IBRANCE (palbociclib + fulvestrant or aromatase inhibitor, NDA 207103), the fundamental question is whether each drug in the combination contributes meaningfully to the overall clinical benefit."
Opus 4.6 + Websearch
"The FDA's July 2025 draft guidance on development of cancer drugs for novel combinations directly applies to your proposed program. Factorial randomized trials are the gold standard for proving each drug's contribution to the treatment effect."
Q2: Novel ADC
Dr. Chang, on duration of response
Opus 4.6 + Paperclip
"As I documented in the ANKTIVA review, when DOR data are immature, I have described median DOR as 'an unreliable estimate' and have instead relied on landmark DOR analyses, specifically the percentage of responders maintaining response at 12 and 24 months."
Opus 4.6 + Websearch
"For accelerated approval based on ORR, the durability of response will be a critical factor. The Agency will want to see a median duration of response that is clinically meaningful, supported by a Kaplan-Meier analysis with adequate follow-up."
Q3: Tumor-Agnostic Approval
Dr. Wedam, on her ENHERTU tumor-agnostic review
Opus 4.6 + Paperclip
"Having served as the primary clinical reviewer for the ENHERTU (fam-trastuzumab deruxtecan-nxki) tumor-agnostic accelerated approval (BLA 761139 S-028) in April 2024, the first antibody-drug conjugate ever granted a tissue-agnostic indication, and as a panelist at the 2019 FDA-ASCO-Friends of Cancer Research Workshop on Tissue-Agnostic Biomarker-Based Indications, I have had the opportunity to think carefully about what constitutes adequate evidence for this regulatory paradigm."
Opus 4.6 + Websearch
"The sponsor's question pertains to our expectations for key efficacy parameters to support a tumor-agnostic accelerated approval under Section 506(c) of the FD&C Act. As outlined in our Tissue Agnostic Drug Development in Oncology Guidance for Industry (October 2022), accelerated approval in this setting relies on demonstration of an effect on a surrogate endpoint, typically Overall Response Rate."
Q4: mCRPC Comparator Design
Dr. Brave, on ARPI switch informed by his nmCRPC framework
Opus 4.6 + Paperclip
"I have had extensive experience with this therapeutic area. I led the landmark FDA review of the three ARPI approvals for nonmetastatic CRPC (enzalutamide/PROSPER, apalutamide/SPARTAN, darolutamide/ARAMIS), where I served as corresponding author on the comprehensive regulatory review published in Clinical Cancer Research [Brave M, et al., Clin Cancer Res 26(18):4717–4722]."
Opus 4.6 + Websearch
"The Sponsor proposes a global registrational trial in taxane-naïve mCRPC with an investigator's choice comparator arm. The question at issue is what proportion of androgen receptor pathway inhibitor switching in the comparator arm would render the control insufficiently reflective of U.S. standard of care."
The pattern is consistent: Paperclip-equipped reviewers apply documented personal precedent to generate novel regulatory advice. The websearch-only baseline produces accurate but generic answers that could have come from any FDA reviewer.