Representing Biomedical Literature as a Filesystem through Agent-Native Indexing

The GXL Team

TL;DR Instead of moving data to the agent, we send the agent to the data. Biomedical preprints are exposed as a virtual filesystem that agents explore using the same bash tools they use on codebases. We develop Sy, our research agent built on top of this filesystem. The result is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across deep text Q&A, experiment novelty checking, and cross-paper synthesis.
Average performance of Sy vs other agents on difficult bioRxiv questions:

                          GXL Sy (Ours)   Claude Code + bioRxiv MCP   FutureHouse Edison
Accuracy / Completeness   91%             57%                         31%
Avg Time                  1.9m            4.5m                        14.9m
Avg Cost                  $0.37           $1.32                       $1.00

Let's say you want to retrieve information from a remote source, like a large corpus of paper preprints. Today, most MCP-style integrations behave like structured communication channels between distant systems. You define the fields ahead of time, send a request, and receive a structured response. That works well when the question is clearly defined and the answer fits neatly into those predefined slots.

But many interesting questions aren't like that. In practice, discovery is often exploratory: you need to move through the data itself, following threads and context rather than issuing a single precise query. Coding agents like Claude Code, Codex, and Cursor already have deep intuitions for exactly this style of navigation. They carry strong priors for working through codebases with bash (ls, grep, find, cat): which commands to reach for, how to compose them with pipes, and how to move from a broad directory structure down to the exact file and line they need. The problem is that scientific literature gives them nothing to navigate.

The Problem with MCP Tools

Today's biomedical research agents are stuck on the wrong side of this divide. Common LLM-based tools rely on MCP servers and search APIs that behave as structured channels: a search_papers tool with a handful of parameters that returns massive payloads of abstracts, with no way to browse, no sense of what's nearby, and no ability to refine by navigating rather than re-querying. The agent throws a query into a void and catches whatever comes back.

The agent can't ls the literature to see what's in a research area. It can't grep across methods sections to find how experiments were actually done. It can't follow a citation trail by reading a file. The scientific literature has no geography that the agent can navigate, so it can't orient itself, and all those deeply trained intuitions about filesystem navigation go entirely unused.

Current Literature Agents

Many current literature agents share a similar architecture: wrappers over PubMed, Semantic Scholar, or bioRxiv APIs, accessed through MCP connectors or function-calling schemas. The agent calls search_papers(query), gets back a list of abstracts, and summarizes them. This works when the question is clearly defined, but it breaks down on anything that requires reading the content of a paper in detail: methods sections, supplemental tables, figure captions, appendix data. These are precisely the parts that contain what researchers actually need — the specific protocol, the sample sizes, the failure modes, and the caveats buried in extended data.

The traditional alternative is to bring the data to you: downloading terabytes of papers, building indexing pipelines, and running search infrastructure locally, all before any real investigation can begin. MCP connectors sit somewhere in between but still fall short — they return metadata and abstracts. Getting full-text content requires fetching a PDF at query time, dumping the entire document into context, and hoping the relevant sentence surfaces in a ~40K token blob. There is no way to search within the document, no way to navigate to the Methods section without reading everything before it, and no way to run the same extraction across 50 papers without exhausting your token budget on the first three.

Our approach takes the opposite direction: rather than moving the data to the agent, we move the agent to the data.

Biomedical Preprints as a Filesystem

You can think of it as opening a small portal between your system and the corpus. The remote data source is exposed as a virtual filesystem, and the heavy lifting has already been done. The corpus is indexed, structured, and optimized for search so agents can explore efficiently without downloading or managing the data themselves.

Scientific papers are born as rich collections of structured artifacts: tables, figures, methods sections, supplementary spreadsheets, appendices, and code. Then authors compress all of that structure by flattening everything into a single PDF at submission time. The format is optimized for print and human readability, but not for AI agents. Our core idea is to reverse this compression. We re-expand each paper back into a filesystem where every paper is a folder, every figure is a file, and every section (Methods, Results, Discussion) is individually addressable. Each paper becomes a directory that an agent can enter, inspect, and traverse as deeply as needed.

Paper-as-a-Filesystem: the paper's original structure (abstract, introduction, results, discussion, methods, tables, figures) is parsed into a file tree the agent can access directly.

File structure:

/paper/
  /sections/
    abstract.md
    introduction.md
    related_works.md
    results.md
    discussion.md
  /tables/
    table1.csv
    table2.csv
  /images/
    figure1.png
    figure2.png
    figure3.png
  /supplements/
    supp_figure1.png
    ...

Access from the terminal:

$ cd /paper/
$ ls
sections/ tables/ images/ supplements/
$ cat sections/abstract.md
Background: Recent work...

# Programmatic access
$ python -c "
import pandas as pd
df = pd.read_csv('tables/table1.csv')
"

An agent goes to the filesystem with a task, navigates the relevant parts of the corpus, gathers the necessary context, and returns with the answer and supporting evidence. When it wants to replicate an experiment, it reads sections/Methods.lines. When it wants to compare results across studies, it reads sections/Results.lines. When it's trying to understand how the field interprets a finding, it reads sections/Discussion.lines. This mirrors how scientists actually use papers: not as monolithic blobs, but as collections of structurally distinct knowledge types.

Each line of text has a unique numeric identifier that traces back to the source, and every figure is individually addressable. The agent can cd into a paper, grep for a term, cat specific sections, and head the first 50 lines, using the same bash workflow it uses to navigate any codebase. Instead of hauling terabytes of data across the network and rebuilding indexing infrastructure locally, the exploration happens where the information already lives.
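As a rough illustration of this addressing scheme, here is a small Python sketch of what a keyword scan over a .lines file could look like. The on-disk format, helper name, and neighboring lines are assumptions for illustration; only the quoted sentence and its block ID come from the trace shown later in this post.

import re

# Mocked-up contents of a sections/Results.lines file: each line carries a block ID
# that traces back to the source paper. The surrounding lines are placeholders.
SAMPLE = """\
[4170682] (previous sentence of the Results section)
[4170683] the proteostasis network of tau comprises 16 proteins
[4170684] (next sentence of the Results section)
"""

def scan(text: str, pattern: str):
    # Return (line number, block ID, line text) for every line matching the pattern.
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if re.search(pattern, line, flags=re.IGNORECASE):
            block_id = line.split("]")[0].lstrip("[")
            hits.append((lineno, block_id, line))
    return hits

for lineno, block_id, line in scan(SAMPLE, r"proteostasis network"):
    print(f"L{lineno} [{block_id}]: {line}")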

How It Works

The agent runs a command, for example:

$ search "CRISPR base editing efficiency"
$ grep "IC50" /papers/*/sections/Results.lines
$ cat /papers/a7f3e2/supplements/table_s1.csv

A virtual filesystem layer, hidden from the agent, translates these shell commands into parallel queries across the indexed corpus, then assembles the results as ordinary files and directories. Under the hood it combines four components:

  • Document processing: PDF parsing, XML extraction, section segmentation
  • SQL storage: metadata, full text, content blocks
  • Hybrid indexing: BM25 + semantic KNN, block-level retrieval
  • Cache: query results, embeddings, hot paths

The agent sees individual papers as local directories. Each of the 450K+ papers appears as its own folder:

/paper_a7f3e2/
  sections/
  supplements/
  figures/
  meta.json
/paper_b8c4d1/
  sections/
  supplements/
  figures/
  meta.json
...
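To make the translation step concrete, here is a minimal Python sketch of the routing idea: a filesystem-style path is parsed and dispatched to the right backend query. The function names and path layout are hypothetical stand-ins, not the production interface.

from typing import Callable, Dict

def fetch_section(paper_id: str, name: str) -> str:
    # Hypothetical: would query the block store for this paper's section.
    return f"<blocks of {name} for paper {paper_id}>"

def fetch_supplement(paper_id: str, filename: str) -> str:
    # Hypothetical: would return the normalized supplement content.
    return f"<content of {filename} for paper {paper_id}>"

def read_virtual_path(path: str) -> str:
    """Resolve a virtual path like /papers/<id>/sections/results.lines."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "papers":
        raise FileNotFoundError(path)
    _, paper_id, kind, name = parts
    handlers: Dict[str, Callable[[str, str], str]] = {
        "sections": fetch_section,
        "supplements": fetch_supplement,
    }
    if kind not in handlers:
        raise FileNotFoundError(path)
    return handlers[kind](paper_id, name)

print(read_virtual_path("/papers/a7f3e2/sections/results.lines"))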

Building the Index

Making half a million preprints navigable as directories requires solving a document engineering problem at scale. The raw corpus is a mix of JATS XML and supplementary materials in diverse formats: PDFs, Excel spreadsheets, Word documents, CSV files, even PowerPoint slides. The main text pipeline parses JATS XML into individual content blocks, where each paragraph, table, figure caption, section header, and formula becomes a separately addressable unit. This decomposition is what allows an agent to grep for a term and land on the exact block rather than ingesting a 40K-token blob.
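As a rough sketch of that decomposition step (element handling simplified, namespaces ignored; the real pipeline covers far more JATS structures), the idea is to walk the JATS body and emit one typed block per paragraph, figure caption, or table:

import xml.etree.ElementTree as ET

def jats_to_blocks(jats_xml: str) -> list:
    """Split a JATS article into typed, individually addressable blocks."""
    root = ET.fromstring(jats_xml)
    blocks = []
    for sec in root.iter("sec"):
        section_title = sec.findtext("title", default="")
        for child in sec:
            if child.tag == "p":
                blocks.append({"type": "paragraph", "section": section_title,
                               "text": "".join(child.itertext()).strip()})
            elif child.tag == "fig":
                caption = child.find("caption")
                text = "".join(caption.itertext()).strip() if caption is not None else ""
                blocks.append({"type": "figure_caption", "section": section_title,
                               "text": text})
            elif child.tag == "table-wrap":
                blocks.append({"type": "table", "section": section_title,
                               "xml": ET.tostring(child, encoding="unicode")})
    return blocks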

Paper supplements present a challenge as they are inherently heterogeneous and unstructured. A single paper might attach a PDF with 15 supplementary figures, an Excel file of raw assay data, and a Word document with extended methods. To convert these to LLM-native text, we run each supplement through OCR models that perform document segmentation, table recognition, and formula extraction, producing the same block-level format: typed content with bounding boxes and page coordinates. The output is normalized into the same schema as the XML-derived blocks, so the agent sees a uniform supplements/ directory regardless of whether the underlying source was a scanned PDF or native XML.
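A minimal sketch of what such a normalized block record could look like; the field names here are illustrative, not the actual schema:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ContentBlock:
    # One individually addressable unit, whether it came from JATS XML or an OCR'd supplement.
    document_id: str
    block_id: int
    block_type: str                      # "paragraph", "table", "figure_caption", "formula", ...
    text: str
    section: Optional[str] = None        # "Methods", "Results", ...
    source_file: Optional[str] = None    # e.g. a supplement filename, or None for main text
    page: Optional[int] = None           # page number for OCR'd sources
    bbox: Optional[Tuple[float, float, float, float]] = None  # bounding box on that page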

The resulting ~70 million content blocks are stored in PostgreSQL with per-block JSONB metadata linking each block back to its source file and XPath, and dual-indexed in Elasticsearch through a hybrid retrieval layer that combines BM25 keyword scoring with dense vector embeddings. When the agent runs search "CRISPR base editing", both indices fire in parallel and results are merged via reciprocal rank fusion. When it runs grep "IC50" inside a paper, the query hits the block-level index filtered to that document. Figures are individually addressable and can be routed to a vision model on demand. The entire layer is invisible to the agent—it sees files and directories.
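Reciprocal rank fusion itself is a small amount of code. Below is a generic sketch of the merge step; the constant k = 60 is the conventional default from the RRF literature, not necessarily what our retrieval layer uses.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of block IDs into one fused ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, block_id in enumerate(ranking, start=1):
            scores[block_id] = scores.get(block_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse BM25 and dense-vector results for the same query.
bm25_hits = ["blk_12", "blk_7", "blk_99"]
knn_hits = ["blk_7", "blk_42", "blk_12"]
print(reciprocal_rank_fusion([bm25_hits, knn_hits]))  # blk_7 and blk_12 rise to the top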

Why Bash

LLMs trained on code have encountered ls, grep, find, cat, wc, diff, head, tail, and pipe composition billions of times. They don't treat these as abstract API calls. They know grep -r recurses, they know wc -l counts lines, they know how to pipe output between commands, and they know what an empty directory means versus a missing one.

When the research filesystem responds to standard bash, the agent doesn't need to learn a new tool schema through in-context examples. It applies the same skills it uses to navigate a codebase, now pointed at scientific literature. A custom API with search_preprints() and get_preprint() means the model learns your interface from scratch on every invocation. It will use it, but it won't compose tools in ways you didn't anticipate.

Map-Reduce Over Papers

By treating each paper as a directory, we unlock another powerful pattern: map-reduce over papers. A map operation dispatches a lightweight subagent to every paper in parallel, each with filesystem access to its paper's directory. A reduce operation then synthesizes the extractions into a unified answer. This mirrors how scientists do literature reviews (asking the same question across many documents) but runs in minutes across dozens of papers, extracting structured data from full text, not abstracts. Each subagent navigates only the relevant parts of its paper and returns with a precise extraction.
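A minimal sketch of the pattern, with the subagent call as a hypothetical stand-in for the actual model invocation and the paths and prompts purely illustrative:

from concurrent.futures import ThreadPoolExecutor

def run_subagent(paper_dir: str, prompt: str) -> dict:
    # Hypothetical stand-in for a lightweight agent with filesystem access
    # scoped to one paper's directory, returning a structured extraction.
    return {"paper": paper_dir, "extraction": f"<answer to: {prompt}>"}

def map_over_papers(paper_dirs, prompt, max_workers=25):
    # Map: dispatch one subagent per paper directory, in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda d: run_subagent(d, prompt), paper_dirs))

def reduce_extractions(extractions, synthesis_prompt):
    # Reduce: a single synthesis pass over the structured per-paper results
    # (another hypothetical stand-in for a final model call).
    return {"prompt": synthesis_prompt, "evidence": extractions}

papers = ["/papers/a7f3e2", "/papers/b8c4d1", "/papers/c2e9a0"]
extractions = map_over_papers(papers, "Extract the main spatial claim and its validation status.")
report = reduce_extractions(extractions, "Which claim types are validated experimentally?")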

MCP Tool Server

1. Call search_preprints(category, date_range). No keyword search; browse by category + date only. Get recent titles + abstracts.
2. Call get_preprint(doi). The entire paper (~40K tokens) is dumped into context as a blob.
3. Repeat for 1–2 more papers. Context is now ~100K tokens of undifferentiated text.
4. Synthesize from memory. No section boundaries. Citations unreliable.

Result: 2–3 papers, vague summaries.
vs

GXL Sy (Ours)

1. Search 450K papers in the filesystem. Get paths to the top papers sorted by relevance.
2. Each subagent navigates its paper: grep Results, head Methods, cat a figure.
3. Each reads ~200 tokens (not 40K) and returns a structured extraction with block-level citations.
4. 25–100 subagents run in parallel. Reduce into a synthesis.

Result: 100 papers, specific data, every claim cited.

GXL Sy is our research agent built on top of this filesystem. It navigates over 450,000 bioRxiv and medRxiv preprints, using the full depth of the virtual filesystem to answer questions that require reading specific passages, checking experimental novelty across the literature, and synthesizing findings from multiple papers. Rather than wrapping search results in a prompt, GXL Sy enters the corpus, follows leads across papers, and returns with grounded, citable answers.

bioRxiv Bench

To measure whether focused exploration outperforms rigid querying, we introduce bioRxiv Bench, a benchmark of 140 questions drawn from real research workflows over bioRxiv and medRxiv preprints. The benchmark spans three task types: Deep Paper Q&A (N=50), Experiment Novelty Check (N=50), and Multi-Paper Synthesis (N=40).

We compare GXL Sy (Ours) against two baselines: Claude Code with the Claude bioRxiv MCP connector, and the FutureHouse Edison Platform, which provides AI-powered biomedical literature search. For FutureHouse Edison, we used the Precedent agent on Experiment Novelty Check and the Literature agent on Paper Q&A and Multi-Paper Synthesis.

Deep Paper Q&A (N=50)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

                     GXL Sy (Ours)   Claude Code + MCP   FutureHouse Edison
Accuracy             100%            86%                 4%
Avg Time per Query   1m6s            3m42s               9m29s
Avg Cost per Query   $0.21           $1.07               $1.00

FutureHouse Edison charges per credit used.

50 supplement-grounded questions across 50 bioRxiv preprints, each requiring data from supplemental tables, PDFs, or DOCX files that cannot be answered from the paper's main text alone.

Dataset Construction

50 single-document questions drawn from 50 distinct bioRxiv preprints published in 2025. Each question was generated by a model given access to the full paper, including supplements, then manually reviewed and filtered for clarity, accuracy, and relevance.

Questions were explicitly designed to be unanswerable from the abstract or main text alone, requiring data from supplemental tables, PDFs, or DOCX files. Each answer is accompanied by the specific supplement file path, step-by-step reasoning procedure, and executable code used to derive the answer.

Example Questions

Exact Count"In the paper with DOI 10.1101/2025.03.28.646065, how many proteins comprise the proteostasis network of tau according to the paper?"
Cohort Breakdown"In the paper with DOI 10.1101/2025.05.30.657099, in the RNA-seq analysis of neutrophils, how many healthy controls and SLE patients were compared, and how were the SLE patients subdivided based on ISG levels?"
Dataset Parameter"In the paper with DOI 10.1101/2025.04.10.648169, how many HLA-I alleles are represented in the benchmarking validation dataset?"

Scoring Protocol

Questions were generated by giving a model access to the full filesystem and prompting it to produce grounded questions from the source material, followed by a round of human review for clarity, accuracy, and relevance.

Responses are scored by an LLM judge against ground truth. Numeric answers are accepted within a 2% relative tolerance or up to 5 decimal places, and semantically equivalent formats are treated as equal (e.g., "95%" == "0.95"). String and categorical answers are matched case-insensitively on core meaning. Each response scores 1.0 (correct) or 0.0 (incorrect).
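A sketch of how the numeric-equivalence rule could be implemented; the percent normalization is one plausible reading of the "95%" == "0.95" example, not the judge's exact code:

def numeric_match(predicted: float, truth: float, rel_tol: float = 0.02, decimals: int = 5) -> bool:
    """Accept answers within 2% relative tolerance, or equal when rounded to 5 decimal places."""
    if truth != 0 and abs(predicted - truth) / abs(truth) <= rel_tol:
        return True
    return round(predicted, decimals) == round(truth, decimals)

def normalize(value: str) -> float:
    # Treat semantically equivalent formats as equal, e.g. "95%" == "0.95".
    value = value.strip()
    if value.endswith("%"):
        return float(value[:-1]) / 100.0
    return float(value)

assert numeric_match(normalize("95%"), normalize("0.95"))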

Experiment Novelty Check (N=50)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

                     GXL Sy (Ours)   CC + bioRxiv connector   FutureHouse Edison
Accuracy             80%             28%                      20%
Avg Time per Query   2m22s           12m13s                   2m54s
Avg Cost per Query   $0.36           $0.93                    $1.00

FutureHouse Edison charges per credit used.

50 questions were constructed by starting from real bioRxiv and medRxiv papers with distinctive quantitative findings, then reverse-engineering the natural-language experiment novelty query a researcher might ask before attempting similar work.

Dataset Construction

Questions were authored by first identifying papers in our indexed bioRxiv/medRxiv corpus that contain distinctive quantitative findings — binding affinities, enzyme kinetics, production titers, dose-response values — then reverse-engineering a natural language query a researcher might pose before starting analogous work.

Each question has a ground truth record containing the target paper's document ID, title, DOI, authors, and the location of the experiments or results within the paper (body text, supplement table, or figure).

Example Questions

Inverted Quorum Sensing Toggle: "What happens if you reverse which quorum sensing system controls which state in an intercellular genetic toggle? Would it still be bistable?"
TX-TL Cosolute Concentration Mapping: "Has anyone systematically mapped how magnesium glutamate and potassium glutamate concentrations in cell-free TX-TL affect expression timing (lag phase) and total protein yield?"
Riboswitch-Dependent Synthetic Auxotrophs: "Has anyone systematically built riboswitch-dependent synthetic auxotrophs in E. coli where theophylline-responsive riboswitches control essential genes, so cells only survive in the presence of the small molecule?"

Scoring Protocol

We use a 3-criteria LLM-as-a-judge system (Claude Sonnet 4.6) to evaluate each agent response:

Criterion 1 (Verdict): Did the agent correctly conclude that the finding is not novel? All 50 questions describe real findings from real papers; the correct answer is always that prior art exists.
Criterion 2 (Paper Identification): Did the agent find the correct source paper? Evaluated by matching the reported DOI, title, or author list against the ground-truth paper.
Criterion 3 (Specific Finding): Did the agent extract the correct quantitative value or experimental detail? Agreement to 3 significant figures and equivalent unit conversions (e.g. 0.9 µM = 900 nM) is accepted, as sketched below.
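A sketch of how the significant-figure and unit-conversion check could be implemented; the unit table and helper names are illustrative:

import math

def round_sig(x: float, n: int = 3) -> float:
    """Round x to n significant figures."""
    if x == 0:
        return 0.0
    return round(x, -int(math.floor(math.log10(abs(x)))) + (n - 1))

UNIT_SCALE = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "µM": 1e-6, "nM": 1e-9}

def values_agree(a: float, unit_a: str, b: float, unit_b: str, sig_figs: int = 3) -> bool:
    # Convert both values to a common base unit, then compare at 3 significant figures.
    va = round_sig(a * UNIT_SCALE[unit_a], sig_figs)
    vb = round_sig(b * UNIT_SCALE[unit_b], sig_figs)
    return math.isclose(va, vb, rel_tol=1e-9)

assert values_agree(0.9, "µM", 900, "nM")  # 0.9 µM == 900 nM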

Multi-Paper Synthesis (N=40)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

                     GXL Sy (Ours)   Claude Code + MCP   FutureHouse Edison
Completeness         92%             58%                 70%
Avg Time per Query   2m6s            6m48s               23m6s
Avg Cost per Query   $0.53           $1.96               $1.00
Avg Citations        27.9            10.6                19.5

FutureHouse Edison charges per credit used.

40 cross-paper synthesis questions across 8 categories (relevant retrieval, exact quotation, abstract vs. substance, emerging directions, cross-paper contradiction, cross-paper synthesis, quantitative extraction, method comparison), each requiring evidence drawn from a minimum of 5 papers. Scored with literal per-question criteria. FutureHouse Edison and CC + bioRxiv MCP were re-judged on the same 40-question set.

Dataset Construction

40 questions synthetically generated across 8 subfields: spatial transcriptomics, foundation models, gene therapy safety, single-cell genomics, neuroscience, immunology, cancer biology, and synthetic biology. Each question was designed to require synthesis across a minimum of 5 papers and to be unanswerable from any single abstract.

Questions were generated by prompting a model with subfield descriptions and example queries, then filtered for quality through human review — removing questions that were too narrow, too broad, or answerable without cross-paper synthesis.

Example Questions

Spatial Validation Rigor: "Across recent spatial transcriptomics or spatial proteomics preprints, extract each paper's main claimed cell-cell interaction, niche, or microenvironment finding and whether it includes orthogonal validation such as immunostaining, RNAscope, perturbation, or functional follow-up. Which classes of spatial claims are usually supported only by computational figures, and which are routinely validated experimentally?"
Foundation Model Leakage: "Across recent bioRxiv preprints on foundation models for genomics, proteins, pathology, or multimodal biology, extract the pretraining corpus description, train-validation-test split strategy, homolog or cluster leakage controls, external benchmarks, and key ablations. Which reported performance gains may be inflated by weak leakage control or benchmark overlap?"
AAV/LNP Toxicity Signals: "Across preprints on in vivo AAV or LNP gene delivery from the last 3 years, extract the species, dose, capsid or formulation, promoter, target tissue, and any reported liver injury, dorsal root ganglia toxicity, neuropathology, deaths, or severe adverse findings from figures, pathology supplements, and extended data. Which toxicity signals recur across independent groups even when they are not emphasized in the abstract?"

Scoring Protocol

Each question is scored by an LLM judge against a checklist of criteria written specifically for that query. A response passes only if every criterion is met. Completeness is the fraction of questions that fully pass.

Criteria are written literally against each question — they check exactly what the question asks for, not a generalized rubric. For example, for the query "Across recent spatial transcriptomics or spatial proteomics preprints, extract each paper's main claimed cell-cell interaction, niche, or microenvironment finding and whether it includes orthogonal validation such as immunostaining, RNAscope, perturbation, or functional follow-up. Which classes of spatial claims are usually supported only by computational figures, and which are routinely validated experimentally?", the criteria are:

  1. All returned papers are spatial transcriptomics or spatial proteomics preprints (not bulk RNA-seq or non-spatial single-cell studies)
  2. For each paper, the main claimed cell-cell interaction, niche, or microenvironment finding is extracted
  3. For each paper, states specifically whether it includes orthogonal validation (immunostaining, RNAscope, perturbation, or functional follow-up) or is computational only
  4. Identifies which classes of spatial claims (e.g. ligand-receptor predictions, niche composition, cell co-localization) are typically computational-only vs. routinely validated experimentally across the returned papers

A response that returns non-spatial papers, summarizes findings without per-paper validation status, or omits the classification of claim types fails. Partial credit is not given.
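The all-or-nothing scoring itself is simple; a short sketch, with the per-criterion verdicts standing in for the LLM judge's outputs:

def question_passes(criterion_verdicts):
    # A response passes only if every criterion written for that question is met.
    return all(criterion_verdicts)

def completeness(per_question_verdicts):
    # Fraction of questions that fully pass; no partial credit.
    return sum(question_passes(v) for v in per_question_verdicts) / len(per_question_verdicts)

# Three questions: one fully passes, two each fail a single criterion.
print(completeness([
    [True, True, True, True],
    [True, True, False, True],
    [False, True, True, True],
]))  # -> 0.333...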

Category Breakdown (5 questions each)

Category GXL Sy (Ours) CC + MCP FutureHouse Edison
Relevant Retrieval 4/5 1/5 4/5
Exact Quotation 5/5 3/5 1/5
Abstract vs. Substance 5/5 2/5 5/5
Emerging Directions 5/5 2/5 3/5
Cross-Paper Contradiction 5/5 2/5 4/5
Cross-Paper Synthesis 5/5 5/5 2/5
Quantitative Extraction 3/5 4/5 4/5
Method Comparison 5/5 4/5 5/5
Total 37/40 (92%) 23/40 (58%) 28/40 (70%)

Across all task types the pattern holds. On Deep Paper Q&A, GXL Sy (Ours) scores 100% vs 86%, runs 3.4× faster (1m6s vs 3m42s), and costs 5.1× less ($0.21 vs $1.07). On Experiment Novelty Check, it scores 80%, 2.9× more accurate than CC + bioRxiv connector (28%) and 4.0× more accurate than FutureHouse Edison (20%), while running 5.2× faster than CC + bioRxiv connector (2m22s vs 12m13s) and 1.2× faster than FutureHouse Edison (2m22s vs 2m54s), with lower average cost than CC + bioRxiv connector ($0.36 vs $0.93); FutureHouse Edison charges per credit used. On Multi-Paper Synthesis, it is 59% more complete (92% vs 58%), 3.2× faster (2m6s vs 6m48s), produces 2.6× more citations (27.9 vs 10.6), and costs 3.7× less ($0.53 vs $1.96). The efficiency comes from targeted access: a grep into sections/Results.lines consumes ~200 tokens versus ~40,000 for loading a full paper through MCP. The accuracy comes from section-level precision: questions about methods go to Methods, questions about limitations go to Discussion.

Below, we walk through specific benchmark examples comparing the two approaches side by side.

Case Studies

Deep Paper Q&A

Questions that can only be answered from supplemental figures, tables, or methods: content that is invisible to abstract-level search and requires navigating directly into supplement files. Every supplement is a first-class path in the filesystem.

"In the paper with DOI 10.1101/2025.03.28.646065, how many proteins comprise the proteostasis network of tau according to the paper?"

GXL Sy (Ours)

TOOL: lookup doi
lookup doi 10.1101/2025.03.28.646065 → doc_id: 71ef2289-6e94-1014-a914-83fbd6d5bac1 "Linking Aβ and tau in the amyloid cascade through the intersection of their proteostasis networks"
TOOL: scan content.lines
scan content.lines "proteostasis network of tau" "number of proteins" → L41 [4170683]: "the proteostasis network of tau comprises 16 proteins" Network constructed from Reactome functional interactions involving proteostasis proteins + tau (MAPT), retaining bi-directional interactions.
FINAL RESPONSE
The proteostasis network of tau comprises 16 proteins {{"block_id": 4170683}}. The 16 proteins span two main pathways: • Ubiquitin-proteasome: MARK2, BRSK1, BRSK2 • Autophagy (4 routes): AMPK signalling (PRKAA1/A2, PRKAB1/B2, PRKAG1/G2/G3), mTOR (MAPK1, MAPK3, EP300), autophagosome (CDK5, CDK5R1, GSK-3β) {{"document_id": "71ef2289-6e94-1014-a914-83fbd6d5bac1"}} {{"block_id": 4170713}}

Why it succeeded

  • Two tool calls: lookup doi to find the paper, then scan content.lines to find the exact count
  • Returned block-level citations pointing to the exact sentence in the paper

Claude Code + bioRxiv MCP

TOOL: get_preprint(doi)
get_preprint("10.1101/2025.03.28.646065") → title, authors, abstract, full text blob ~38K tokens dumped into context
FINAL RESPONSE
The paper describes a proteostasis network for tau involving kinases and autophagy regulators including MARK2, CDK5, and GSK-3β. ⚠ NO COUNT ⚠ PRIOR KNOWLEDGE BLEED

Why it fell short

  • The paper was fetched successfully, but the specific count ("16 proteins") is buried in the Results section of a ~38K token blob with no way to search for it
  • The answer drifted toward named proteins from prior knowledge rather than the explicit number stated in the text
"In the paper with DOI 10.1101/2025.05.30.657099, in the RNA-seq analysis of neutrophils (PMN), how many healthy controls and SLE patients were compared, and how were the SLE patients subdivided based on ISG levels?"

GXL Sy (Ours)

TOOL: lookup doi
lookup doi 10.1101/2025.05.30.657099 → doc_id: de7f4443-7f6d-1014-b9f8-c6603e23e448 "Transposable element expression in SLE neutrophils"
TOOL: scan content.lines
scan content.lines "RNA-seq analysis" "healthy controls" "SLE patients" "ISG levels" "subdivided" → L28 [32974354]: "10 healthy controls (HC) and 23 SLE patients: IFNpos (n=12) and IFNneg (n=11)"
FINAL RESPONSE
10 HC and 23 SLE patients subdivided by ISG expression: • IFNpos: n=12 (detectable IFN/ISG signature) • IFNneg: n=11 (no signature) IFNpos PMN showed elevated ISG and transposable element family expression vs HC and IFNneg. {{"document_id": "de7f4443-7f6d-1014-b9f8-c6603e23e448"}} {{"block_id": 32974354}}

Why it succeeded

  • Two tool calls: DOI lookup + keyword scan of content
  • Extracted exact counts (10, 23, 12, 11) and the subdivision logic directly from the text

Claude Code + bioRxiv MCP

TOOL: get_preprint(doi)
get_preprint("10.1101/2025.05.30.657099") → title, authors, abstract, full text blob ~41K tokens dumped into context
FINAL RESPONSE
The study analyzed neutrophils from SLE patients and healthy controls. SLE patients were grouped by interferon status (IFN-positive and IFN-negative). ⚠ NO COUNTS (10, 23, 12, 11) ⚠ NO SUBDIVISION SIZES

Why it fell short

  • Paper fetched successfully but specific cohort counts are in a Methods subsection buried in ~41K tokens — impossible to locate without text search
  • Answered descriptively ("IFN-positive and IFN-negative") without the exact numbers the question required
"In the paper with DOI 10.1101/2025.04.10.648169, how many HLA-I alleles are represented in the benchmarking validation dataset used to evaluate the peptide binding prediction tools?"

GXL Sy (Ours)

TOOL: lookup doi
lookup doi 10.1101/2025.04.10.648169 → doc_id: a2f49e40-7983-1014-9d82-ab93563e2d80 "Comprehensive evaluation and interpretative insights of peptide-HLA binding prediction tools using XAI" Wu et al., 2025
TOOL: scan content.lines
scan content.lines "HLA-I alleles" "benchmarking validation dataset" "number of alleles" → L12 [4340732]: "rigorously curated benchmarking validation dataset...over 290,000 peptides across 44 HLA-I alleles"
FINAL RESPONSE
The benchmarking validation dataset represents 44 HLA-I alleles, comprising over 290,000 peptides in total. {{"document_id": "a2f49e40-7983-1014-9d82-ab93563e2d80"}} {{"block_id": 4340732}}

Why it succeeded

  • Two tool calls total: lookup DOI to resolve the paper, then scan for the specific figure
  • Answer found in the abstract-level content, with block-level citation to the exact line

Claude Code + bioRxiv MCP

TOOL: get_preprint(doi)
get_preprint("10.1101/2025.04.10.648169") → title, authors, abstract, full text blob ~35K tokens dumped into context
FINAL RESPONSE (WRONG)
The benchmarking validation dataset includes data for 29 HLA-I alleles. ⚠ WRONG ANSWER (29 ≠ 44) ⚠ HALLUCINATED FROM BLOB

Why it fell short

  • Paper was retrieved, but 44 appears once in a dense Methods paragraph inside ~35K tokens — the model hallucinated 29 instead
  • No way to search within the full-text blob means precision extraction fails for specific numbers

Experiment Novelty Check

Experiment Novelty Check: "Has this been done before?"

A researcher describes a specific experimental idea and asks whether it exists in the literature. These questions require precise extraction from Methods and Results, sections the filesystem exposes as individual files.

"What happens if you reverse which quorum sensing system controls which state in an intercellular genetic toggle? Would it still be bistable?"

GXL Sy (Ours)

TOOL: search → 83 papers
search "rhlR cinR quorum sensing toggle switch bistability synthetic" search "quorum sensing toggle switch bistability bacterial population inverted"
TOP HIT
doc_id: 8ece2be1-6e81-1014-9171-bf02e9568d20 "Pattern formation and bistability in a synthetic intercellular genetic toggle" Magalhães, Fan, Sontag, Josić, Bennett — bioRxiv 2022
TOOL: scan content.lines
scan content.lines "rhlR" "cinR" "inverted" "swap" "toggle" "bistab" → 8 patterns, 121 total matches → L29 [6711340]: "Inverted QS toggle pattern..." → L23 [6711334]: "QS toggle...intercellular signaling..."
TOOL: scan supplements/502488_file02
scan supplements/502488_file02.content.md.lines "Inverted QS" "S2I" "S2J" "bistab" "state prefer" → L214 [71721041]: "I | Inverted QS toggle" → L215 [71721042]: "J | Inverted QS toggle" → L216 [71721043]: "We obtained the Inverted QS toggle by using the same lower strength promoters but reversing the QS network connected to each state: now, the rhlR/I network activates the yellow state, and cinR/I the blue state."
FINAL RESPONSE
NOT NOVEL. Tested explicitly. Magalhães et al. (bioRxiv 2022) doc_id: 8ece2be1-6e81-1014-9171-bf02e9568d20 Original: cinR/I → yellow state, rhlR/I → blue state. Inverted: rhlR/I → yellow, cinR/I → blue. Same circuit, QS arms swapped. Still bistable? Yes — but state preference shifts. L29: "exhibited changes in state preference, agreeing with the prediction of the model." Figs S2I–J: induction curves for inverted circuit. Colony patterning (Fig 7D): radial segregation retained, but only in specific conditions. Why: Bistability governed by relative promoter strengths + QS network identity. Swapping arms preserves bistability but shifts which state is preferred — traceable to asymmetric signal strength of C4-HSL vs C14-HSL.

Why it succeeded

  • Searched the filesystem directly for "inverted" + "cinR" + "rhlR" — found the supplement figure legend (L216) describing the exact swap
  • Supplement text is indexed as a file; the key sentence was retrievable by keyword scan in milliseconds
  • Real block IDs from the actual paper; answer grounded in primary source

Claude Code + bioRxiv MCP

TOOL: WebSearch ×3
"rhlR cinR quorum sensing toggle bistability" "inverted quorum sensing toggle rhl cin bistability" "rhlR/I cinR/I synthetic toggle swapped inverted"
FOUND PAPER IN SEARCH SNIPPETS
Magalhães et al. 2022 appears in results. Snippet mentions "Inverted QS toggle variant." Cannot read the paper — no full-text access.
TOOL: WebFetch ×3 (all blocked)
GET biorxiv.org/10.1101/2022.08.02.502488v1.full → 403 GET biorxiv.org/10.1101/2022.08.02.502488 → 403 GET pubs.acs.org/doi/10.1021/acssynbio.2c00332 → 403
TOOL: WebSearch ×3 (digging)
"inverted QS toggle" cinR rhlR bistability results "inverted QS toggle" bistable pattern formation Bruder Elowitz "intercellular genetic toggle" inverted → No accessible full text found
NO FINAL RESPONSE
Session ended without producing a verdict. Last tool call: pubs.acs.org → 403. ⚠ BLOCKED BY PAYWALL/RATE LIMIT ⚠ NO ANSWER PRODUCED

Why it fell short

  • Found the right paper via web search but couldn't read it — bioRxiv and ACS both returned 403
  • The key result is in the supplement (Fig S2I–J legend), not the abstract, so even a successful fetch of the abstract wouldn't have answered the question
  • Session terminated without a verdict
"Has anyone systematically mapped how magnesium glutamate and potassium glutamate concentrations in cell-free TX-TL affect expression timing (lag phase) and total protein yield?"

GXL Sy (Ours)

TOOL: search → 82 papers
search "cell-free TX-TL magnesium potassium glutamate expression timing lag" search "cell-free composition cosolute lag time yield systematic"
TOP HIT
doc_id: 73a5f4a8-6e8a-1014-9e06-8ad03644452d "Tuning Cell-free Composition Controls the Time-delay, Dynamics, and Productivity of TX-TL Expression" Grace E. Vezeau, Howard M. Salis — bioRxiv April 2021
TOOL: scan + cat content.lines
scan content.lines "magnesium" "potassium" "lag" "time-delay" "yield" → 21 total matches cat content.lines L18–L40 (Results & Discussion) → L19: "Kinetic Characterization of Genetic Systems in TX-TL with Varied Cosolute Compositions" → L20: "PEG-8000, Ficoll-400, and magnesium glutamate (Mg-glut) as three cosolutes..." → L31: "systematically vary cosolute composition and quantify their effects on the magnitude, timing, and tunability across all 14 genetic constructs"
FINAL RESPONSE
NOT NOVEL. Direct match. Vezeau & Salis, Penn State — bioRxiv 2021 doc_id: 73a5f4a8-6e8a-1014-9e06-8ad03644452d Systematically varied PEG-8000, Ficoll-400, and Mg-glut. Measured kinetic mRNA (RT-qPCR hourly) + protein fluorescence (every 10 min) over 10 hrs, 14 constructs. • PEG-8000 0→4% w/v: +2.5 hr delay, +27.6× yield • Ficoll-400 0→4%: +1.3 hr delay, +3.6× yield • Mg-glut 8.67→16.67 mM: no timing change, +1.9× yield Mg-glut is a clean yield dial; crowders shift the lag. Built Markov + thermodynamic biophysical model. Genuine gap: K-glut not independently varied — your specific angle may still be open.

Why it succeeded

  • First search hit was the exact paper. The filesystem indexes 2021 preprints alongside 2025 ones — no recency bias.
  • Scanned the Results section for "lag", "time-delay", "Mg-glut" and extracted exact numbers from text
  • Identified the K-glut gap as a genuine novelty angle by reading what the paper explicitly did and didn't vary

Claude Code + bioRxiv MCP

TOOL: WebSearch ×3
"cell-free TX-TL Mg K glutamate concentration lag time" "cell-free expression timing magnesium glutamate systematic" "TXTL composition optimization lag phase yield 2023–2025"
FOUND PAPER
Vezeau & Salis 2021 found in search results. Also surfaced a ResearchGate figure of an Mg/K-glut expression grid (possibly Noireaux lab, source unclear).
TOOL: WebFetch ×2
GET biorxiv.org/2021.04.02.438196v1.full → 403 GET pubs.acs.org/doi/acssynbio.1c00136 → partial
TOOL: WebSearch ×3 (chasing authorship)
Vezeau Salis "tuning cell-free composition" authors "tuning cell-free composition controls time delay" ACS 2021 researchgate publication 44798238 authors title → Session ended trying to identify ResearchGate figure
NO FINAL RESPONSE
Session ended without producing a verdict. Spent remaining turns on ResearchGate metadata. ⚠ BLOCKED BY PAYWALL ⚠ NO ANSWER PRODUCED

Why it fell short

  • Found the right paper via web search but couldn't read it — bioRxiv returned 403
  • Got distracted chasing a ResearchGate figure for authorship metadata; session ended without a verdict
  • The key data (exact lag time per cosolute, concentrations tested) is buried in the Results section — inaccessible without full-text
"Has anyone systematically built riboswitch-dependent synthetic auxotrophs in E. coli where theophylline-responsive riboswitches control essential genes, so cells only survive in the presence of the small molecule?"

GXL Sy (Ours)

TOOL: search → 77 papers
search "riboswitch essential gene biocontainment synthetic auxotroph E. coli theophylline" search "riboswitch-dependent auxotroph essential gene small molecule survival"
TOP HIT
doc_id: 8a5dde6a-6f18-1014-be07-d8e347a0806e "A Modular Toolkit For Theophylline-Dependent Synthetic Auxotrophs Via Riboswitch-Gated Essential Genes" Gonzalez-Lopez, Overly, Singh, Huang, Lopez Synvivia, Inc. — bioRxiv November 2025 DOI: 10.1101/2025.11.20.689564
TOOL: cat content.lines
cat content.lines L1–100 → L29: Fig 1: "Engineering and Mechanism of Theophylline-Dependent Synthetic Auxotrophs" → modular kanR-riboA cassette design → 29 essential genes screened → 18/29 gave stringent live-die phenotypes → escape frequency: 1×10⁻⁵ to 1×10⁻⁶
FINAL RESPONSE
NOT NOVEL. Done comprehensively, Nov 2025. Gonzalez-Lopez et al. — bioRxiv 2025 doc_id: 8a5dde6a-6f18-1014-be07-d8e347a0806e • 29 essential genes tested in E. coli • theophylline riboswitch (riboA) upstream of each • 18/29 gave stringent live-die phenotypes • Escape freq: 1×10⁻⁵ to 1×10⁻⁶ • Modular kanR-riboA cassette; <1 week construction What remains open: Other organisms (Pseudomonas, Bacillus), other ligands for dual orthogonal containment, stacked dual auxotrophs for multiplicative escape reduction, in vivo / gut colonization stability.

Why it succeeded

  • Nov 2025 preprint was already indexed in the filesystem — returned as top hit immediately
  • Read full paper text to extract exact gene count (29), success rate (18/29), and escape frequencies
  • Identified genuine novelty angles from what the paper explicitly does and does not cover

Claude Code + bioRxiv MCP

TOOL: WebSearch ×3
"riboswitch essential gene biocontainment E. coli theophylline" "riboswitch-dependent essential gene control biocontainment" "synthetic auxotrophy riboswitch kill switch E. coli"
FOUND OLDER PAPERS ONLY
Desai & Gallivan 2004 — riboswitch on antibiotic resistance gene (not an essential gene) Jin et al. 2009 — riboswitch on csrA (single gene; cells survived without ligand — not lethal) Lopez & Anderson 2015 SLiDE — protein stability switch on 5 essential genes (not riboswitches)
TOOL: WebSearch ×3 (targeted)
site:biorxiv.org riboswitch systematic essential gene biocontainment theophylline 2025 "translational control" "synthetic auxotrophy" riboswitch essential gene 2025 biorxiv.org 2025 681377 riboswitch biocontainment → Hint of preprint 10.1101/2025.10.09.681377; content inaccessible
FINAL RESPONSE: NOVEL (WRONG)
Based on my search, this appears novel. No paper combines: theophylline riboswitch + multiple essential genes + systematic screen + lethality + explicit biocontainment goal. Recommend proceeding. ⚠ WRONG VERDICT ⚠ MISSED NOV 2025 PREPRINT

Why it fell short

  • The Gonzalez-Lopez et al. preprint (Nov 2025) exists but wasn't surfaced by web search — too recent for search index coverage at time of query
  • Spotted a DOI hint (2025.10.09.681377) but couldn't access the content
  • Concluded NOVEL — the opposite of the correct answer. A researcher acting on this would waste months.

Multi-Paper Synthesis

Idea Discovery: "What's new that I'm not aware of?"

Discovering convergent signals across the literature: patterns only visible when you analyze 25–50 papers in parallel. This is where map-reduce over filesystems is decisive.

"Across recent spatial transcriptomics or spatial proteomics preprints, extract each paper's main claimed cell-cell interaction, niche, or microenvironment finding and whether it includes orthogonal validation such as immunostaining, RNAscope, perturbation, or functional follow-up. Which classes of spatial claims are usually supported only by computational figures, and which are routinely validated experimentally?"

GXL Sy (Ours)

TOOL: searches → 50 papers
searches --quiet --tag spatial -n 50 \ "spatial transcriptomics cell-cell interaction niche" \ "spatial proteomics tumor microenvironment" \ "cell communication spatial ligand receptor" \ "spatial omics microenvironment validation immunostaining" \ "MERFISH Visium CODEX spatial cell interaction" \ "spatial transcriptomics orthogonal validation RNAscope"
TOOL: map → 42 subagents
map --from s_7c92a373 --limit 50 \ "Extract: (1) MAIN SPATIAL CLAIM. (2) CLAIM TYPE: cell-cell interaction / niche composition / ligand-receptor / gradient / microenvironment state. (3) COMPUTATIONAL EVIDENCE. (4) ORTHOGONAL VALIDATION: IF/IHC, RNAscope/smFISH, perturbation, functional assay, independent cohort, or None. (5) VALIDATION DEPTH: None / Minimal / Moderate / Extensive."
42 SUBAGENTS → FILESYSTEM
NicheCompass {document_id: e194e652} → SPP1+ macrophage niche in breast cancer. Computational only (graph deep learning). No IHC. Glioma TAM paper {document_id: 6015d9cc} → Ligand-receptor: tumor↔TAM co-expression. RNA-ISH confirms spatial co-expression. ✓ Validated. CRC liver metastases {document_id: 62060478} → CODEX multiplexed IF confirms L-R proximity at protein level. ✓ Validated. Haruka framework {document_id: 835ee34d} → Microenvironment remodeling domains. Computational only. No experimental readout.
TOOL: reduce
reduce --from m_7371efe2 \ "Group by claim type. For each: fraction with orthogonal validation vs computational-only. What predicts whether validation is included?"
FINAL RESPONSE
Claim type — not platform or disease — predicts validation. Routinely computational-only: Niche/neighborhood composition: ~0% validated. (NicheCompass, CellTrek, SColoc, Haruka — all graph embeddings, no IHC/RNAscope.) Microenvironment state/composition: ~0% validated. (SKNY, DeepSpaceDB, Visium+scRNA prostate TME.) Spatial expression gradients: ~0% validated. (Moran's I / spatially variable gene tests.) Routinely validated: Ligand-receptor signaling: ~67% include at least RNA-ISH or IF. Mechanistically falsifiable → reviewers demand it. Cell co-localization tied to a functional claim: IHC + independent cohort replication common. Platform doesn't predict: MERFISH, Visium, CODEX show similar rates within each claim type. RNAscope underused vs IF despite being more direct. {document_id: e194e652} {document_id: 6015d9cc} {document_id: 62060478} {document_id: 835ee34d}

Why it succeeded

  • 42 subagents each extracted claim type + validation depth from sections/Results.lines and supplements/
  • Pattern invisible from abstracts: abstracts describe the finding, not whether it was validated
  • All citations link to real document IDs from the filesystem

Claude Code + bioRxiv MCP

TOOL: search_preprints(category) ×2 + get_preprint (×2)
category: "bioinformatics" ×2, recent: 20 each. ~20 abstracts + 2 full papers. ~89K tokens.
FINAL RESPONSE
Spatial transcriptomics papers often validate findings with immunostaining or RNAscope. Cell-cell interaction studies tend to use CellChat or NicheNet computationally. ⚠ NO PER-PAPER EXTRACTION ⚠ NO VALIDATION RATES ⚠ PRIOR KNOWLEDGE

Why it fell short

  • Abstracts don't state whether validation was done — this requires reading Methods and Results of each paper
  • Only 2 papers loaded in full; couldn't extract per-claim validation status at scale
  • Answer reflects general knowledge, not evidence from actual preprints
"Across recent bioRxiv preprints on foundation models for genomics, proteins, pathology, or multimodal biology, extract the pretraining corpus description, train-validation-test split strategy, homolog or cluster leakage controls, external benchmarks, and key ablations. Which reported performance gains may be inflated by weak leakage control or benchmark overlap?"

GXL Sy (Ours)

TOOL: searches → 40 papers
searches --quiet --tag fm_bio -n 40 \ "foundation model genomics pretraining" \ "protein language model benchmark evaluation" \ "pathology foundation model self-supervised" \ "biological foundation model leakage homolog split" \ "single-cell foundation model pretraining ablation" \ "DNA language model benchmark generalization"
TOOL: map → 25 subagents
map --from s_6b4617d8 --limit 25 \ "Extract: (1) PRETRAIN_CORPUS: databases, size. (2) SPLIT_STRATEGY: random/time/chromosome/cluster. (3) LEAKAGE_CONTROLS: tools (MMseqs2, CD-HIT), identity thresholds, or 'none reported'. (4) EXTERNAL_BENCHMARKS + performance numbers. (5) ABLATIONS: what was ablated, key results. (6) MAIN_CLAIMED_GAINS: headline improvement, exact numbers. (7) LEAKAGE_RISK."
20 SUBAGENTS → FILESYSTEM
Agro NT {document_id: f40d7c92} Corpus: 48 plant genomes. Split: not described. Leakage controls: NONE. Syntenic genomes → ⚠️ HIGH RISK. Pathology benchmark (Virchow2) {document_id: 34c93689} Pretrained on TCGA. Evaluated on TCGA-derived benchmarks. Leakage controls: none reported. ⚠️ HIGH RISK — direct benchmark overlap. CAPTAIN {document_id: 57ac822b} 4.8M cells (CITE-seq). Dataset-level splits: entire datasets held out. ✅ LOW RISK. scGPT-spatial {document_id: 4ca5ef24} Slide-level splits — handles spatial autocorrelation. ✅ LOW RISK. LAMAR {document_id: 92112205} Chromosome splits for splice (rigorous) but random splits for mRNA stability. ⚠️ MODERATE.
FINAL RESPONSE
Explicit leakage controls are the exception. Most papers: random splits or splits not described. ⚠️ HIGH RISK (gains likely inflated): Agro NT — no homolog filtering, syntenic genomes. Virchow2 benchmark — pretrained + evaluated on TCGA. BAITSAO — no compound-cluster splits, drug overlap. ProTDyn — no MMseqs2/CD-HIT, PDB redundancy. AIDO.Tissue — zero methodological transparency. ✅ LOW RISK (rigorous splits): CAPTAIN — dataset-level holdout. scGPT-spatial — slide-level splits. EHR+PRS model — participant-level splits. Pattern: the type of claim matters. Models claiming "state-of-the-art" without reporting split strategy or leakage controls are universally in the high-risk category. {document_id: f40d7c92} {document_id: 34c93689} {document_id: 57ac822b} {document_id: 4ca5ef24}

Why it succeeded

  • Subagents read sections/Methods.lines to extract split strategy — this is never in the abstract
  • Extracted leakage controls and benchmarks from 20 papers, a task impossible within a single context window
  • Identified which papers share train/test datasets with their own benchmarks

Claude Code + bioRxiv MCP

TOOL: search_preprints(category) ×2 + get_preprint (×2)
category: "bioinformatics" ×2, recent: 20 each. ~20 abstracts + 2 full papers. ~92K tokens.
FINAL RESPONSE
Foundation models for biology often use large pretraining corpora. Leakage can be an issue in protein models when homologous sequences appear in train and test sets. Benchmark selection is important for fair evaluation. ⚠ NO PER-PAPER EXTRACTION ⚠ NO LEAKAGE ASSESSMENT ⚠ GENERIC

Why it fell short

  • Split strategy and leakage controls are buried in Methods — not accessible from abstracts
  • Couldn't load enough papers to compare practices across 20+ groups
  • No ability to identify which specific papers have benchmark overlap with their own pretraining data
"Across preprints on in vivo AAV or LNP gene delivery from the last 3 years, extract the species, dose, capsid or formulation, promoter, target tissue, and any reported liver injury, dorsal root ganglia toxicity, neuropathology, deaths, or severe adverse findings from figures, pathology supplements, and extended data. Which toxicity signals recur across independent groups even when they are not emphasized in the abstract?"

GXL Sy (Ours)

TOOL: searches → 40 papers
searches --quiet --tag aav_lnp_tox -n 40 --since 3y \ "AAV gene therapy toxicity liver dorsal root ganglia" \ "LNP lipid nanoparticle in vivo liver injury" \ "AAV capsid neuropathology DRG sensory neuron toxicity" \ "adeno-associated virus adverse effects pathology dose" \ "gene therapy in vivo safety systemic delivery"
TOOL: map → 29 subagents
map --from s_ef80c5de --limit 30 \ "From Results, figures, pathology data, supplements: (1) Species + strain. (2) Dose (exact vg/kg or mg/kg). (3) Capsid serotype or LNP formulation. (4) Promoter. (5) Route (IV, IT, IM, etc.). (6) Liver injury: ALT/AST, necrosis, inflammation. (7) DRG toxicity: neuronal degeneration, vacuolation. (8) Other neuropathology: spinal cord, brain. (9) Deaths / humane endpoints. (10) Are findings in the abstract? yes/no/partial."
29 SUBAGENTS → FILESYSTEM
NF-L biomarker paper {document_id: b0f35b7d} Rat + NHP. DRG neuronal degeneration, nerve fiber loss, trigeminal + autonomic ganglia. In abstract ✓ Cockayne syndrome AAV9 {document_id: daf6c195} Mouse. 1×10¹¹ vg/pup IV. AAV9 CBA promoter. Liver: normal ALT/AST but life-limiting transcriptomic alterations. Brain: hypomyelination, astrocytosis, neuronal loss. Deaths: median 58d (vs 22d untreated). In abstract: partial. PDHA1 AAV9 {document_id: 17751b49} Mouse C57BL/6. 1×10¹⁰–1×10¹¹ vg. Normal ALT/AST, but mild spinal cord gliosis. Dose-dep deaths (35d vs 90d median). Not in abstract. Retinal AAV {document_id: 122188650} Mouse. ~4×10⁸ gc subretinal (P0–P2). RPE toxicity via interferon/ER stress. Photoreceptors spared. Not in abstract.
FINAL RESPONSE
Three signals recur across independent groups but are absent or minimized in abstracts: 1. CNS gliosis (≥3 groups) AAV9 + scAAV2/8 in mice. Astrocytosis, microgliosis, spinal cord gliosis. Appears in pathology figures; absent from abstracts. 2. Liver transcriptomic damage without ALT/AST elevation (≥2 groups) AAV9 CBA, scAAV2/8 CMV in mice. Molecular stress present; standard clinical chemistry misses it. Implies routine safety readouts underdetect vector-related hepatic injury. 3. DRG neuronal degeneration (≥2 groups) Multiple AAV serotypes in rat + NHP. Nerve fiber degeneration, trigeminal involvement. Class effect, not construct-specific. Deaths framed as "survival benefit" obscure dose- dependent vector toxicity in disease models. NF-L plasma as non-invasive DRG biomarker is the key translational advance for catching buried signal. {document_id: b0f35b7d} {document_id: daf6c195} {document_id: 17751b49} {document_id: 122188650}

Why it succeeded

  • Subagents read pathology figures and supplements/ — where toxicity data lives, not abstracts
  • Identified the abstract-vs-pathology mismatch pattern only visible when reading 29 papers in parallel
  • Extracted exact doses, species, and timing; real document IDs throughout

Claude Code + bioRxiv MCP

TOOL: search_preprints(category) ×2 + get_preprint (×2)
category: "biochemistry" + "molecular biology", recent: 20 each. ~20 abstracts + 2 full papers. ~87K tokens.
FINAL RESPONSE
AAV gene therapy can cause liver toxicity and DRG toxicity at high doses. LNP delivery may cause inflammatory responses. Dose optimization is important for safety. ⚠ PRIOR KNOWLEDGE ⚠ NO SUPPLEMENT ACCESS ⚠ NO ABSTRACT MISMATCH DETECTION

Why it fell short

  • Toxicity data is in pathology supplements and figures — not accessible via MCP text extraction
  • The key finding (abstract-vs-pathology mismatch) requires reading the abstract and the supplement of each paper — impossible at scale without a filesystem
  • Answer is general knowledge about AAV safety, not evidence from these specific preprints

Conclusion

Instead of moving data to the agent, we bring the agent to the data. By exposing 450K bioRxiv and medRxiv preprints as a virtual filesystem, we place the agent inside the corpus rather than behind a query interface. This is a necessary shift to get past shallow search: when paper content is structured as directories with individually addressable sections, supplements, and figures, the agent can make targeted, efficient reads at whatever granularity the question demands rather than ingesting entire documents and hoping the answer surfaces.

This replicates the paradigm that has already proven immensely successful with coding agents. Tools like Claude Code and Cursor are effective precisely because they inhabit the codebase — navigating with ls, searching with grep, reading with cat — rather than querying it through an abstract API. Sy applies the same model to scientific literature, and the same bash-trained intuitions that make coding agents powerful transfer directly.

On bioRxiv Bench, Sy is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across 140 questions spanning Deep Paper Q&A, Experiment Novelty Check, and Multi-Paper Synthesis.

Try Sy yourself at sy.gxl.ai!