Bringing the regulatory and clinical landscape to Paperclip

Today, we're adding 225K FDA regulatory documents, 1M+ clinical trials from 19 registries (ClinicalTrials.gov, EudraCT, CTIS, ISRCTN, UMIN, JRCT, ChiCTR, and 13 WHO ICTRP registries spanning India, Iran, Australia/NZ, Germany, Netherlands, Korea, Thailand, Brazil, and Africa), along with international regulatory filings from EMA and Japan PMDA, to Paperclip.

For current Paperclip users: run paperclip update and you can immediately search FDA documents and clinical trials. Prompt your agent to search FDA documents, ClinicalTrials.gov registrations, or international regulatory filings.

US FDA Documents (225K)
NDA/BLA review packages, clinical & pharmacology reviews, approval letters, labeling, advisory committee transcripts, Orange Book, drug shortages, NDC directory

ClinicalTrials.gov (580K)
Full trial registrations with study design, arms, eligibility, endpoints, results, adverse events, sponsor & investigator details, plus full text of associated documents with each trial

International Registries

  • 🇪🇺 EMA Assessment Reports: 8K
  • 🇪🇺 EU Trials: 85K
  • 🇯🇵 Japan PMDA Reviews: 22K
  • 🇯🇵 Japan Trials: 66K
  • 🇨🇳 Chinese Clinical Trials: 116K
  • 🇨🇳 China Drug Reimbursement: 14K
  • 🌐 WHO ICTRP Registries: 133K

Why this matters

Drug development workflows depend on regulatory and clinical evidence distributed across incompatible sources. Each database exposes its own search interface, metadata schema, and update cadence. International registries introduce additional constraints of language, jurisdiction, and identifier conventions, which makes cross-source synthesis difficult to automate.

In practice, regulatory affairs and clinical development teams spend considerable time on retrieval: locating NDA and BLA review packages on Drugs@FDA, trial registrations on ClinicalTrials.gov, EPARs from the EMA, PMDA assessment reports, and entries in ChiCTR, JapicCTI, EU CTR, and other national registries. The underlying documents are public, but they are not indexed or queryable in a unified form.

Paperclip indexes US FDA documents, ClinicalTrials.gov registrations, and international regulatory and trial corpora into a single virtual filesystem, accessible via search, grep, SQL, and standard Unix-style pipelines. The case studies below compare agent performance with and without this corpus on tasks that require synthesis across these sources.

Below, we show three case studies demonstrating what this corpus enables: replicating a published regulatory analysis across hundreds of drug approvals, planning a first-in-human clinical trial using precedent from international registries and NDA pharmacology reviews, and simulating FDA reviewer advice grounded in their actual review history.


Replicating a published analysis

By Jake Silberg, PhD at Stanford University, Biomedical Data Science

TL;DR: We asked Jake Silberg, a researcher and PhD student at Stanford, to use Paperclip to reproduce a paper analyzing FDA-approved drugs that failed a primary endpoint. Running under Claude Code, the Paperclip-equipped agent flagged 18/21 drugs from the original paper (86% recall), while the websearch-only agent flagged 15/21 (71% recall). The Paperclip agent was also 2.1× faster and 47% cheaper at reproducing the analysis.

About 10% of drugs approved by the FDA between 2018 and 2021 failed at least one primary endpoint in their pivotal trials. Identifying these cases matters for understanding how the agency balances statistical rigor against unmet medical need. We tested whether agents could replicate this analysis from the source documents.

Results

  • Drugs flagged from original paper: 18/21 with Paperclip, 15/21 without Paperclip
  • Median cost per drug: $0.45 with Paperclip, $0.85 without Paperclip
  • Median wall time per drug: 65 s with Paperclip, 136 s without Paperclip

With Paperclip, the agent got to the answer 2.1× faster and 47% cheaper.

Both agents did a solid job replicating the key analysis, but Paperclip clearly helped. The agent with Paperclip flagged 18 of the 21 drugs with failed endpoints from the paper (86% recall), while the agent without Paperclip flagged only 15 (71% recall). The agent with Paperclip flagged 27 drugs overall, 9 of which were not in the paper. The agent without Paperclip flagged 11 drugs not in the paper.

A key advantage of Paperclip was speed. The agent with Paperclip needed only 9 turns to write up a narrative describing the pivotal trials and noting any failures; the same task took the agent without Paperclip 21 turns. The agent with Paperclip also spent only $0.45 in Claude credits per drug, compared to $0.85 for the agent without Paperclip, and took 65 seconds per drug compared to 136 seconds. All numbers are medians.
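The headline ratios follow directly from the figures reported above. The short sketch below recomputes them; it uses only numbers stated in this post, and the "precision" figure treats the original paper as ground truth, which is an assumption made here for illustration rather than a metric the study reports.

```python
# Figures reported in this post; nothing here is newly measured.
PAPER_POSITIVES = 21  # drugs flagged in the original paper

runs = {
    "with Paperclip": {"hits": 18, "extra": 9, "cost": 0.45, "secs": 65},
    "without Paperclip": {"hits": 15, "extra": 11, "cost": 0.85, "secs": 136},
}


def recall(hits, positives=PAPER_POSITIVES):
    """Share of the paper's drugs that the agent also flagged."""
    return hits / positives


def precision(hits, extra):
    """Share of the agent's flags that appear in the paper (paper as ground truth)."""
    return hits / (hits + extra)


for name, r in runs.items():
    print(f"{name}: recall={recall(r['hits']):.1%}, "
          f"precision vs. paper={precision(r['hits'], r['extra']):.1%}")

w, wo = runs["with Paperclip"], runs["without Paperclip"]
speedup = wo["secs"] / w["secs"]        # ratio of median wall times
savings = 1 - w["cost"] / wo["cost"]    # relative reduction in median cost
print(f"speedup={speedup:.1f}x, cost savings={savings:.0%}")
```

Because some of the extra flags may be defensible on closer reading, precision against the paper is best read as a lower bound on how often the agent's flags are reasonable.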


Examples

We can look at some specific successes by the agent with Paperclip. Helpfully, the agent quotes directly from the Drugs@FDA documents it accessed. For Aduhelm, for example, it cites the Statistical Reviewer saying, “We have a second large adequate well controlled study that ... is not even close to significance,” providing a line number to verify the quote exactly matches the FDA document. These verbatim details can help users understand a complex series of endpoints, like for Tavneos, where the pre-specified criteria involved assessing non-inferiority, then proceeding to superiority. The agent again provides the line number to a key point, “At Week 26, the non-inferiority comparison was statistically significant ... but superiority was not demonstrated” and the agent notes that “Endpoint did NOT meet all prespecified criteria.”

Where the agent missed drugs flagged by the original paper, it was usually due to judgment calls that are documented in its reasoning. For example, both agents missed an endpoint for Danyelza, where the FDA had reassessed the lower bound of the confidence interval to be 20%, when the prespecified criterion was ‘above 20%.’ While the Paperclip agent did not count this as a failure, it noted that the interval “rounds to exactly 20% creating ambiguity.” As another example, when the Paperclip agent was assessing Nourianz, it noted that different documents listed different pivotal trials. So while it spotted the failed endpoint, it did not consider that trial to be officially pivotal. Thus, although we focus on the agent’s final binary calls, its detailed reasoning based on the Paperclip documents often did capture the nuances.

Finally, we can analyze the extra drugs flagged by the agent with Paperclip. In some cases the agent was simply overinclusive, for example treating a Tauvid trial for a non-approved indication as pivotal. But in others, the agent may have a point. For example, Barhemsys had a failed endpoint in its DP10018 trial, and the FDA remarked “Based on the pre-specified Bonferroni procedure, neither p-values is statistically significant at one-sided 0.0125 level.” This was later revised to be significant under a different post-hoc procedure, but there is a good case to be made that Barhemsys could have been included in the original paper as well. This demonstrates the value of a powerful agent that accesses the right documents and surfaces additional nuance for a human to review.

Method

When the FDA approves a drug despite a failed primary endpoint, it signals that the totality of evidence, including secondary endpoints, subgroup analyses, and unmet medical need, outweighed a strict reading of the pre-specified statistical plan. These cases are important for sponsors designing their own pivotal trials: they reveal the conditions under which the agency exercises flexibility and the evidentiary thresholds that informed those decisions. Systematically identifying such cases requires reading hundreds of dense review documents and cross-referencing statistical analysis plans against reported outcomes.

Because the Drugs@FDA database contains so much information about the approval process, it is a rich data source for biomedical researchers. Recent papers have analyzed everything from how the FDA extrapolates from clinical trials to determine approved populations to uncertainties in the approval process of oncology drugs that were not reported in the final drug label.

One particularly detailed paper is this analysis of drugs approved in 2018–2021, which found that about 10% (21 drugs) failed one of the primary endpoints in pivotal trials but were approved by the FDA anyway. We wanted to see if agents, without access to the results from the original paper, could replicate the analysis. To test this, we tried two experimental setups: first, a Claude Code SDK agent using Opus 4.7 with access to WebFetch and web search to grab PDFs from the FDA website; and second, an agent that, in addition to those tools, could directly read key review documents using Paperclip. Apart from how the agents accessed review documents, the prompts were identical and included a description of the methodology used in the paper (though NOT the specific results from the paper).

Both agents were provided with a list of the NDAs and BLAs from the time period of the study. A runner script activated concurrent Claude Code SDK sessions provided with identical instructions containing the methodology of the paper. From the prompt:

"1. Identify all pivotal trials in the approval package.
2. For each pivotal trial, identify the associated primary efficacy
   endpoint(s).
3. For each primary endpoint, determine whether it met the prespecified
   criteria."

The prompt also specified a structured format in which to respond. The Paperclip agent was provided a script that pulled the documents for each application, while the other agent was told:

"You have web access (WebFetch, Bash with curl/wget). FDA resources:
  - Drugs@FDA:       https://www.accessdata.fda.gov/scripts/cder/daf/
  - openFDA API:     https://api.fda.gov/drug/drugsfda.json
  - ApplicationDocs: PDFs under
                     /drugsatfda_docs/<nda|bla>/<year>/<basename>.pdf

Read the FDA integrated review document(s). Basenames typically contain
MedR, SumR, IntegratedR, MultidisciplineR. You may need StatR for
analysis details."
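As a rough illustration of how the resource hints in that prompt translate into concrete requests, here is a minimal sketch. The host prepended to the document path, the `application_number` search field for openFDA, and the example basename are assumptions made for this sketch and are not taken from the experiment code.

```python
from urllib.parse import urlencode

# Assumed host: the prompt gives only the path, not the domain.
DOCS_HOST = "https://www.accessdata.fda.gov"
OPENFDA_ENDPOINT = "https://api.fda.gov/drug/drugsfda.json"

# Basename markers named in the prompt.
REVIEW_MARKERS = ("MedR", "SumR", "IntegratedR", "MultidisciplineR", "StatR")


def doc_url(app_type, year, basename):
    """Fill in the /drugsatfda_docs/<nda|bla>/<year>/<basename>.pdf pattern."""
    app_type = app_type.lower()
    if app_type not in ("nda", "bla"):
        raise ValueError(f"expected 'nda' or 'bla', got {app_type!r}")
    return f"{DOCS_HOST}/drugsatfda_docs/{app_type}/{year}/{basename}.pdf"


def review_kind(basename):
    """Return the first review marker found in a basename, or None."""
    for marker in REVIEW_MARKERS:
        if marker in basename:
            return marker
    return None


def openfda_query_url(application_number, limit=1):
    """Build an openFDA query URL (the search field name is an assumption)."""
    params = {"search": f'application_number:"{application_number}"', "limit": limit}
    return f"{OPENFDA_ENDPOINT}?{urlencode(params)}"


# Hypothetical basename, used only to show the shape of the output.
example = "000000Orig1s000StatR"
print(doc_url("NDA", 2020, example))
print(review_kind(example))
print(openfda_query_url("NDA000000"))
```

In practice the websearch-only agent still has to discover the right basenames for each application, which is part of what makes that condition slower.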

Other tools were blocked to prevent hacking. After writing its reasoning about each pivotal trial and any failed endpoints, each agent had to give a final answer of YES, NO, or UNCLEAR. We then provided the UNCLEAR drugs to a second resolver agent that reads only the narratives (not the original documents) and chooses a final answer. The prompt for this resolver was identical across the two setups. Costs and time for the resolver are de minimis, as it only reads the short narratives and provides a two-sentence answer.
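The two-stage answer flow described above can be sketched as follows. This is a minimal illustration, not the runner script itself: the `DrugResult` type, the `finalize` function, and the stub resolver are hypothetical names, and the sketch assumes the resolver is restricted to YES or NO.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DrugResult:
    application: str  # NDA/BLA identifier from the provided list
    narrative: str    # the agent's write-up of pivotal trials and endpoints
    answer: str       # "YES", "NO", or "UNCLEAR"


def finalize(results: List[DrugResult],
             resolver: Callable[[str], str]) -> Dict[str, str]:
    """Keep YES/NO answers; send UNCLEAR narratives to the resolver."""
    final = {}
    for r in results:
        if r.answer == "UNCLEAR":
            # The resolver sees only the narrative, never the source documents.
            decision = resolver(r.narrative)
            if decision not in ("YES", "NO"):
                raise ValueError(f"resolver returned {decision!r}")
            final[r.application] = decision
        elif r.answer in ("YES", "NO"):
            final[r.application] = r.answer
        else:
            raise ValueError(f"unexpected answer {r.answer!r}")
    return final


# Stub standing in for the second agent; a real resolver would call a model.
def stub_resolver(narrative: str) -> str:
    return "YES" if "did not meet" in narrative else "NO"


results = [
    DrugResult("APP-1", "All primary endpoints met.", "NO"),
    DrugResult("APP-2", "One trial did not meet its co-primary endpoint.", "UNCLEAR"),
]
print(finalize(results, stub_resolver))
```

Keeping the resolver blind to the source documents means both setups are judged on the narratives they actually produced.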

Key takeaway: Direct access to FDA review documents allows an agent to read the actual statistical reviewer language, quote it with line numbers, and reason about ambiguous endpoints. The result is higher recall (86% vs. 71%), 2.1× faster execution, and 47% lower cost per drug analyzed.

Planning a Phase 1 study

This question was provided by a biotech startup founder and former FDA employee. Early-phase drug development requires synthesizing precedent from hundreds of prior programs scattered across trial registries, FDA regulatory filings, and the scientific literature. This case study contrasts what Paperclip can retrieve with what web search alone can.

TL;DR: Given a three-part clinical development question for a BTK degrader, Paperclip surfaced 18 prior compounds mapped to geographies, 9 precedent trial designs with Japanese cohort details, and 7 validated assay platforms. The websearch-only baseline could identify Australian CRO sites from their own marketing pages but could not answer the geography, trial design, or biomarker questions, which require FDA regulatory filings and clinical trial protocols that are not indexed by search engines.

Background

BTK (Bruton's tyrosine kinase) is a protein involved in the growth of certain blood cancers and autoimmune diseases. Drugs that block BTK, such as ibrutinib (Imbruvica), have been among the most important advances in blood cancer treatment over the last decade. A new generation of drugs called BTK degraders goes further: instead of simply blocking BTK, they tag it for destruction by the cell's own recycling machinery, eliminating the protein entirely. This approach can overcome resistance to standard BTK inhibitors.

Before a new drug can be tested in cancer patients, it must first be shown to be safe in a carefully controlled first-in-human (FIH) study, typically run in healthy volunteers. The company in this scenario is planning that first study in Australia, a popular choice for FIH trials given its favorable regulations and faster startup times, with an additional cohort of Japanese volunteers to support regulatory approval in Japan.

Planning this study requires three types of highly specialized knowledge that are not readily findable through a web search:

  1. Precedent geography: Where have similar drugs been first tested in humans, and what does that tell us about how to design our program?
  2. Trial design precedents: What do studies that combined a Japanese ethnobridging cohort with a standard FIH design actually look like in practice? Which Australian clinical sites have done this before?
  3. Pharmacodynamic biomarkers: Once the drug is in humans, how do we measure whether it is actually degrading its target protein? The answer differs depending on whether the drug blocks, reversibly inhibits, or completely destroys the target. Choosing the wrong assay wastes time and obscures whether the drug is working.

Query: We are planning a SAD/MAD first-in-human study for a BTK degrader in Australia, with a Japanese MAD cohort appended for ethnobridging. Answer the following:

1. Where have BTK inhibitor and degrader healthy volunteer FIH studies been run historically? Cover all geographies.
2. What are examples of similar clinical trial designs (SAD/MAD with an appended Japanese cohort) previously run in Australia? Which clinical sites and CROs were involved?
3. How have BTK target occupancy assays been developed and validated? Be technical. Cover acceptable sample matrices, mechanism of action applicability (covalent, non-covalent, degrader), and pros/cons.

Q1: Where Have Similar Drugs Been First Tested in Humans?

Understanding where prior BTK programs conducted their first human studies, and why they chose those locations, is essential for choosing the right geography and population for a new degrader program. This requires knowing the full landscape of BTK compound development, not just the handful of approved drugs.

Opus 4.6 + Paperclip vs. Opus 4.6 + Websearch
Opus 4.6 + Paperclip

Cataloged 18 compounds across three MOA classes with specific NCT numbers, sponsors, and geographies:

  • 12 covalent inhibitors (ibrutinib, acalabrutinib, zanubrutinib, tirabrutinib, orelabrutinib, evobrutinib, tolebrutinib, remibrutinib, branebrutinib, rilzabrutinib, elsubrutinib)
  • 3 reversible inhibitors (pirtobrutinib, fenebrutinib, nemtabrutinib)
  • 3 degraders/PROTACs (NX-2127, NX-5948, HSK29116)

Identified FIH across 6 geographies: USA (dominant), Australia (zanubrutinib CTN scheme), China (orelabrutinib, HSK29116), UK (tirabrutinib), Europe (evobrutinib, remibrutinib), with Japan noted as never a primary FIH site.

Key pattern: oncology BTK programs ran FIH in patients; autoimmune programs used healthy volunteers.

Opus 4.6 + Websearch (no Paperclip)

Did not produce a FIH geography catalog.

The websearch-only response focused entirely on Australian CRO sites (part of Q2) and did not systematically map where BTK FIH studies have been run. No compound-level table, no NCT numbers for FIH studies, no cross-geography analysis.


Q2: How Do You Structure a Japanese Cohort Within a FIH Study?

Japan requires its own pharmacokinetic data before approving a new drug, typically through a dedicated cohort of Japanese volunteers within the same first-in-human study. Getting this right requires knowing how prior sponsors structured these cohorts: how many subjects, at which dose levels, run in parallel or sequentially, and at which sites. This precedent lives in clinical trial protocols and FDA regulatory filings, not on CRO websites.

Ethnobridging Precedents
Opus 4.6 + Paperclip

Identified 9 named precedent studies with Japanese ethnobridging cohorts appended to SAD/MAD designs:

  • Nirmatrelvir, PF-07817883 (Pfizer)
  • Abrocitinib, ritlecitinib (Pfizer, with NDA numbers)
  • LY3509754 (Lilly), TAK-071 (Takeda)
  • AZD5462 (AstraZeneca), gepotidacin (GSK)
  • Remibrutinib (Novartis, a BTK inhibitor)

Documented the standard design pattern (1–3 Japanese cohorts, 6–8 subjects each, run concurrently with later Western cohorts) and PMDA requirements on Japanese descent.

Opus 4.6 + Websearch (no Paperclip)

Did not identify any named precedent study designs with Japanese cohorts. Could not map the standard ethnobridging design pattern or PMDA requirements from web search alone.

Australian Sites & CROs
Opus 4.6 + Paperclip

Identified 6 Australian CROs with early-phase capability:

  • Scientia (Sydney), with a dedicated Japanese-language recruitment page
  • Nucleus Network (Melbourne + Brisbane), with 1,500+ trials and 50% FIH
  • Q-Pharm (Brisbane), now part of Nucleus
  • CMAX (Adelaide), Linear (Perth), Novotech

Explained 5 strategic reasons for Australia: CTN scheme, 43.5% R&D tax offset, Japanese diaspora, regulatory acceptance (PMDA/FDA/EMA), and timeline reduction (~56 to ~32 months).

Opus 4.6 + Websearch (no Paperclip)

Identified 4 Australian sites (Nucleus Network, CMAX, Scientia, Linear) with Japanese recruitment capability.

Found CMAX's I'rom Group (Japanese) ownership, Scientia's Japanese-language page, and 3 blinded ethnobridging studies at Nucleus Network ("Matching Study," "Filter Forward," "Shield Study").

Correctly identified branebrutinib (NCT02705989) at Nucleus Network Melbourne as the best public precedent for the planned design.

The websearch-only condition did well on the question of which Australian sites exist, since that information is on CRO websites. But the more valuable question is how prior sponsors actually structured Japanese cohorts within FIH designs, with specific dose levels, subject counts, and timing. That answer requires FDA regulatory filings and clinical trial protocols that web search cannot surface.


Q3: How Do You Measure Whether the Drug Is Working?

Every first-in-human study needs to demonstrate not just that the drug is safe, but that it is actually reaching and affecting its target, a measurement called a pharmacodynamic readout. For BTK degraders, this is non-obvious: the standard assay used for BTK inhibitors (which measures whether the drug is occupying the protein) gives misleading results for a degrader (which destroys the protein entirely). Choosing the wrong assay can make it impossible to tell whether the drug is working in humans. This answer requires synthesizing technical information from NDA pharmacology reviews, published bioanalytical methods, and clinical trial protocols across a dozen programs.

Technical Assay Review
Opus 4.6 + Paperclip

Produced a comprehensive technical review covering 7 assay platforms:

  • ABPP with biotinylated probes (PCI-41025, PRN299-001)
  • Fluorescent probe-based assays
  • Free BTK / Total BTK ratio (ELISA, MSD)
  • Mass spectrometry (LC-MS/MS)
  • Flow cytometry (intracellular BTK)
  • CD63 BAT, CD69/pBTK functional readouts
  • Mechanistic PK/PD modeling (ibrutinib kon/koff/KD parameters)

For each assay: sample matrices, MOA compatibility (covalent/reversible/degrader), validation status with specific NDA numbers, and pros/cons.

Concluded with a prioritized biomarker strategy for degrader programs (total BTK protein as primary; ABPP explicitly flagged as inappropriate) and a cross-MOA compatibility matrix.

Opus 4.6 + Websearch (no Paperclip)

Did not produce a target occupancy assay review.

The websearch-only response contained no technical content on BTK occupancy assays, sample matrices, MOA-specific assay selection, or biomarker strategy. This information exists primarily in FDA review documents, NDA pharmacology reviews, and specialized bioanalytical publications, sources that web search alone could not retrieve or synthesize.

What Each System Could Answer

Question | Opus 4.6 + Paperclip | Opus 4.6 + Websearch
Q1: Which countries have run similar first-in-human studies? | 18 compounds mapped to 6 geographies with trial IDs | Not answered
Q2a: How have Japanese cohorts been structured in prior trials? | 9 named precedent studies with design details | Not answered
Q2b: Which Australian sites can run Japanese cohorts? | 6 sites, strategic rationales, regulatory context | 4 sites + blinded study names from CRO websites
Q3: How do you measure the drug working in humans? | 7 assay platforms, degrader-specific strategy, NDA validation data | Not answered

Key takeaway: Web search can find what CROs say about themselves on their own websites. It cannot find how prior sponsors structured their clinical protocols, what the FDA required in NDA pharmacology reviews, or which bioanalytical assays have been validated across approved programs. That information lives in regulatory filings, clinical trial registries, patent applications, and the primary scientific literature that Paperclip adds to Claude Code + Opus 4.6.

Virtual FDA reviewer

Three virtual FDA oncology reviewers answer four questions representative of those asked during pre-submission meetings: BRCA-targeted therapies, novel ADCs, tumor-agnostic approvals, and mCRPC comparator design. Each question was answered by Claude Code + Opus 4.6 with and without Paperclip.

TL;DR: We built virtual FDA oncology reviewers from Paperclip-sourced dossiers of review documents, approval histories, and publications. Compared to Claude Code as a baseline, Paperclip produced 13× more reviewer-specific references, 4.7× more direct quotes from the reviewers' own work, and cited NDA/BLA packages the websearch-only baseline could not access. Responses applied documented regulatory history to novel questions rather than restating prior positions.

Method

Subjects. Three FDA oncology reviewers were selected for distinct regulatory track records:

  • Suparna Wedam, MD. Breast cancer, CDK4/6 inhibitors, ADC approvals (ENHERTU tumor-agnostic, IBRANCE, PERJETA). Persona dossier: 14 drug reviews, 10 publications, 8 presentations, 45 cited sources.
  • Michael Brave, MD. Prostate cancer, novel endpoints (MFS for nmCRPC), hematologic malignancies (SPRYCEL). Persona dossier: 8 drug reviews, 8 publications, 17 cited sources.
  • Elaine Chang, MD. GU oncology, PARP inhibitors, combination therapy (PADCEV, ANKTIVA, TRUQAP/capivasertib ODAC). Persona dossier: 5 drug reviews, 7 publications, 22 cited sources.

Four questions. Working with pharma companies and former FDA employees, we developed four questions representative of the topics that arise in pre-submission meetings with FDA oncology reviewers. These questions were chosen specifically because the FDA does not publish bright-line answers for any of them. Sponsors cannot look these up. Useful answers require the reviewer to draw on their own prior regulatory decisions and documented positions.

Q1: BRCA-Mutated Targeted Therapy. We have a novel targeted therapy with Phase 2 activity across BRCA-mutated prostate, breast, and ovarian cancers. For mCRPC, should we enroll the Phase 3 broadly (all HRR-mutated patients) or restrict to BRCA-only, and what magnitude of rPFS effect would you consider sufficient for approval? We plan to combine our agent with an androgen receptor pathway inhibitor backbone. What evidence of contribution of effect will you require? And can our cross-tumor-type ORR data from a basket trial support a tissue-agnostic accelerated approval?
Q2: Novel ADC. We have a novel ADC with strong Phase 2 response data across multiple solid tumor types but no randomized survival data yet. For our lead indication we're planning a Phase 3. Should we power for PFS or OS? For a second indication with high unmet need, is our single-arm ORR sufficient for accelerated approval? And given a class-effect pulmonary toxicity signal, how should we structure safety monitoring and labeling across indications?
Q3: Tumor-Agnostic Approval. To support tumor agnostic accelerated approval, what are the ORR, DOR, number of distinct histologies, and level of consistency across histologies you are expecting?
Q4: mCRPC Comparator Design. For a global registrational study in taxane-naive mCRPC, the comparator arm is investigator's choice. The standard of care in the US is docetaxel. However, some investigators may still elect for ARPI switch. What percentage of ARPI switch would be too much, so as to be non-reflective of US standard of care?

Dossier creation. For each reviewer, we used Paperclip to search for and retrieve FDA review documents, approval summaries, ODAC transcripts, published regulatory papers, and workshop presentations authored by or attributed to that reviewer. These sources were synthesized into a structured persona dossier covering: biographical details, career trajectory, every documented drug review with the reviewer's stated conclusions and direct quotes, published papers, public speaking positions, and recurring regulatory philosophy themes. The resulting dossiers ranged from 367 to 527 lines and contained verbatim quotes from FDA review packages, published approval summaries, and ODAC remarks.

Experiment setup. Both conditions used Claude Code with Opus 4.6 as the underlying model. In the Paperclip condition (Claude Code + Opus 4.6 + Paperclip), the reviewer's full dossier was injected into the system prompt, and the model had access to Paperclip (biomedical literature + FDA documents) along with web search and web fetch. In the Baseline condition (Claude Code + Opus 4.6 + websearch), the model received only the reviewer's name and title with no dossier, and had access to web search and web fetch only, with no Paperclip and no local files. The only difference between conditions is access to Paperclip and the reviewer dossier it produces.
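The setup can be written down as a small configuration sketch. It assumes that every reviewer answered every question in each condition (3 reviewers × 4 questions = 12 responses per condition); the tool labels and class names are illustrative and are not the identifiers used in the actual harness.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple


@dataclass(frozen=True)
class Condition:
    name: str
    tools: Tuple[str, ...]          # illustrative tool labels
    dossier_in_system_prompt: bool  # whether the persona dossier is injected


PAPERCLIP = Condition(
    name="Claude Code + Opus 4.6 + Paperclip",
    tools=("paperclip", "web_search", "web_fetch"),
    dossier_in_system_prompt=True,
)
BASELINE = Condition(
    name="Claude Code + Opus 4.6 + websearch",
    tools=("web_search", "web_fetch"),
    dossier_in_system_prompt=False,
)

REVIEWERS = ("Suparna Wedam, MD", "Michael Brave, MD", "Elaine Chang, MD")
QUESTIONS = (
    "Q1: BRCA-Mutated Targeted Therapy",
    "Q2: Novel ADC",
    "Q3: Tumor-Agnostic Approval",
    "Q4: mCRPC Comparator Design",
)


def runs_for(condition):
    """Enumerate every (condition, reviewer, question) response to generate."""
    return [(condition.name, r, q) for r, q in product(REVIEWERS, QUESTIONS)]


for cond in (PAPERCLIP, BASELINE):
    print(cond.name, len(runs_for(cond)))
```

Writing the conditions this way makes the single controlled difference explicit: the Paperclip tool and the dossier it produces.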


Results

  • 13× reviewer-specific references
  • 4.7× direct quotes from reviewer's own work
  • 1.5× named drugs referenced
  • 2.8 vs. 0 NDA/BLA packages cited (Paperclip vs. websearch)

Metric (avg per response, n=12) | Opus 4.6 + Paperclip | Opus 4.6 + Websearch (no Paperclip)
Word count | 3,596 | 2,574
Direct quotes from reviewer's own work | 27.4 | 5.8
Reviewer-specific terms (avg across reviewers) | 7.8 | 0.6
Hazard ratios cited | 8.5 | 4.8
Named drugs | 12.3 | 8.2
Named clinical trials | 6.0 | 4.2
Quantitative data points (%) | 35.6 | 26.1
NDA/BLA review packages cited | 2.8 | 0

Reviewer-specific terms are a checklist of 11–14 terms per reviewer unique to their career (e.g., drugs they reviewed, NDA/BLA numbers they authored, endpoint innovations they pioneered). A generic FDA response would contain none of them. The websearch-only condition averaged 0.6 such terms per response; Paperclip averaged 7.8, a 13× increase.
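A checklist metric of this kind can be computed with a few lines of code. The sketch below is an assumption about how such scoring might work (case-insensitive substring matching, each term counted once); the six-term checklist is illustrative, assembled from drug names and application numbers mentioned in this post, and is not the actual 11–14 term list used in the evaluation.

```python
def count_checklist_terms(response, checklist):
    """Count how many distinct checklist terms appear in a response."""
    text = response.lower()
    return sum(1 for term in checklist if term.lower() in text)


# Illustrative checklist only; the real lists contain 11-14 terms per reviewer.
wedam_checklist = ["ENHERTU", "IBRANCE", "PERJETA", "TUKYSA",
                   "NDA 213411", "BLA 761139"]

with_paperclip = ("When I review combination therapies, as I did for TUKYSA "
                  "(NDA 213411) and IBRANCE (NDA 207103), the fundamental "
                  "question is whether each drug contributes meaningfully.")
baseline = ("Factorial randomized trials are the gold standard for proving "
            "each drug's contribution to the treatment effect.")

print(count_checklist_terms(with_paperclip, wedam_checklist))
print(count_checklist_terms(baseline, wedam_checklist))

# Headline ratios recomputed from the averages in the table above.
print(f"reviewer-specific terms: {7.8 / 0.6:.0f}x")
print(f"direct quotes: {27.4 / 5.8:.1f}x")
print(f"named drugs: {12.3 / 8.2:.1f}x")
```

Counting each term at most once keeps a response from inflating its score by repeating a single drug name.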


Examples

These questions were chosen because the FDA intentionally does not publish bright-line answers for any of them. The Paperclip-equipped reviewers filled this gap by drawing on their own documented regulatory actions and extrapolating to new scenarios. Excerpts below show the same reviewer answering the same question under both conditions.

Q1: BRCA-Mutated Targeted Therapy

Dr. Wedam, on combination contribution of effect

Opus 4.6 + Paperclip
"When I review combination therapies, as I did for TUKYSA (tucatinib + trastuzumab + capecitabine, NDA 213411) and IBRANCE (palbociclib + fulvestrant or aromatase inhibitor, NDA 207103), the fundamental question is whether each drug in the combination contributes meaningfully to the overall clinical benefit."
Novel reasoning: Cites two of her own NDA reviews as personal precedent, then applies the framework she developed in those reviews to a hypothetical new combination she has never reviewed.
Opus 4.6 + Websearch (no Paperclip)
"The FDA's July 2025 draft guidance on development of cancer drugs for novel combinations directly applies to your proposed program. Factorial randomized trials are the gold standard for proving each drug's contribution to the treatment effect."
Missing: Cites the correct guidance document but cannot connect it to any personal review experience. No NDA numbers, no prior decisions to draw from.

Q2: Novel ADC

Dr. Chang, on duration of response

Opus 4.6 + Paperclip
"As I documented in the ANKTIVA review, when DOR data are immature, I have described median DOR as 'an unreliable estimate' and have instead relied on landmark DOR analyses, specifically the percentage of responders maintaining response at 12 and 24 months."
Novel reasoning: Transfers her ANKTIVA review methodology (BLA 761336) to an entirely different drug class. During that review she developed the landmark DOR framework; here she recommends it for a novel ADC she has never seen.
Opus 4.6 + Websearch (no Paperclip)
"For accelerated approval based on ORR, the durability of response will be a critical factor. The Agency will want to see a median duration of response that is clinically meaningful, supported by a Kaplan-Meier analysis with adequate follow-up."
Missing: Recommends median DOR (the standard approach), unaware that this specific reviewer has published that median DOR is "unreliable" when data are immature and prefers landmark analyses instead.

Q3: Tumor-Agnostic Approval

Dr. Wedam, on her ENHERTU tumor-agnostic review

Opus 4.6 + Paperclip
"Having served as the primary clinical reviewer for the ENHERTU (fam-trastuzumab deruxtecan-nxki) tumor-agnostic accelerated approval (BLA 761139 S-028) in April 2024, the first antibody-drug conjugate ever granted a tissue-agnostic indication, and as a panelist at the 2019 FDA-ASCO-Friends of Cancer Research Workshop on Tissue-Agnostic Biomarker-Based Indications, I have had the opportunity to think carefully about what constitutes adequate evidence for this regulatory paradigm."
Novel reasoning: Establishes personal authority via two specific regulatory actions (the ENHERTU review and the 2019 Workshop), then uses her firsthand experience approving the first ADC with a tissue-agnostic indication to set the bar for a new one.
Opus 4.6 + Websearch (no Paperclip)
"The sponsor's question pertains to our expectations for key efficacy parameters to support a tumor-agnostic accelerated approval under Section 506(c) of the FD&C Act. As outlined in our Tissue Agnostic Drug Development in Oncology Guidance for Industry (October 2022), accelerated approval in this setting relies on demonstration of an effect on a surrogate endpoint, typically Overall Response Rate."
Missing: Correctly cites the guidance document but has no knowledge that this reviewer personally approved the most relevant precedent (ENHERTU tissue-agnostic) or participated in the FDA workshop that shaped the policy.

Q4: mCRPC Comparator Design

Dr. Brave, on ARPI switch informed by his nmCRPC framework

Opus 4.6 + Paperclip
"I have had extensive experience with this therapeutic area. I led the landmark FDA review of the three ARPI approvals for nonmetastatic CRPC (enzalutamide/PROSPER, apalutamide/SPARTAN, darolutamide/ARAMIS), where I served as corresponding author on the comprehensive regulatory review published in Clinical Cancer Research [Brave M, et al., Clin Cancer Res 26(18):4717–4722]."
Novel reasoning: Draws on his landmark nmCRPC review (citing his own publication by journal, volume, and page) to reason about when an investigator's choice comparator arm stops reflecting US standard of care. He applies his position that "each approval was supported by an international, randomized, placebo-controlled trial" to a new trial design question.
Opus 4.6 + Websearch (no Paperclip)
"The Sponsor proposes a global registrational trial in taxane-naïve mCRPC with an investigator's choice comparator arm. The question at issue is what proportion of androgen receptor pathway inhibitor switching in the comparator arm would render the control insufficiently reflective of U.S. standard of care."
Missing: Correctly frames the regulatory question but cannot draw on any personal authority. Does not know this reviewer wrote the defining regulatory framework for the nmCRPC treatment landscape.

The pattern is consistent: Paperclip-equipped reviewers apply documented personal precedent to generate novel regulatory advice. The websearch-only baseline produces accurate but generic answers that could have come from any FDA reviewer.

Key takeaway: Access to a reviewer's actual regulatory documents transforms generic FDA guidance into reviewer-specific advice. The same model produces qualitatively different outputs depending on whether it can retrieve NDA/BLA review packages, ODAC transcripts, and published positions that web search cannot surface.