Representing Biomedical Literature as a Filesystem through Agent-Native Indexing

By The GXL Team

TL;DR

Instead of moving data to the agent, we send the agent to the data. Biomedical preprints are exposed as a virtual filesystem that agents explore using the same bash tools they use on codebases. On top of this filesystem we built Sy, our research agent. Sy is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across deep paper Q&A, experiment novelty checking, and cross-paper synthesis.

Average performance of Sy vs. other agents on difficult bioRxiv questions:

| Metric | GXL Sy (Ours) | Claude Code + bioRxiv MCP | FutureHouse Edison |
|---|---|---|---|
| Accuracy / Completeness | 91% | 57% | 31% |
| Avg Time | 1.9m | 4.5m | 14.9m |
| Avg Cost | $0.37 | $1.32 | $1.00 |

Let’s say you want to retrieve information from a remote source, like a large corpus of paper preprints. Today, most MCP-style integrations behave like structured communication channels between distant systems. You define the fields ahead of time, send a request, and receive a structured response. That works well when the question is clearly defined and the answer fits neatly into those predefined slots.

But many interesting questions aren’t like that. In practice, discovery is exploratory: you need to move through the data itself, following threads and context rather than issuing a single precise query. Coding agents like Claude Code, Codex, and Cursor already have deep intuitions for exactly this style of navigation. They carry strong priors from navigating codebases through bash (ls, grep, find, cat): which commands to reach for, how to compose them with pipes, and how to move from a broad directory structure down to the exact file and line they need. The problem is that scientific literature gives them nothing to navigate.

The Problem with MCP Tools

Today’s biomedical research agents are stuck on the wrong side of this divide. Common LLM-based tools rely on MCP servers and search APIs that behave as structured channels: a search_papers tool with a handful of parameters that returns massive payloads of abstracts, with no way to browse, no sense of what’s nearby, and no ability to refine by navigating rather than re-querying. The agent throws a query into a void and catches whatever comes back.

The agent can’t ls the literature to see what’s in a research area. It can’t grep across methods sections to find how experiments were actually done. It can’t follow a citation trail by reading a file. The scientific literature has no geography that the agent can navigate, so it can’t orient itself, and all those deeply trained intuitions about filesystem navigation go entirely unused.

Current Literature Agents

Many current literature agents share a similar architecture: wrappers over PubMed, Semantic Scholar, or bioRxiv APIs, accessed through MCP connectors or function-calling schemas. The agent calls search_papers(query), gets back a list of abstracts, and summarizes them. This works when the question is clearly defined, but it breaks down on anything that requires reading the content of a paper in detail: methods sections, supplemental tables, figure captions, appendix data. These are precisely the parts that contain what researchers actually need — the specific protocol, the sample sizes, the failure modes, and the caveats buried in extended data.

The traditional alternative is to bring the data to you: downloading terabytes of papers, building indexing pipelines, and running search infrastructure locally, all before any real investigation can begin. MCP connectors sit somewhere in between but still fall short — they return metadata and abstracts. Getting full-text content requires fetching a PDF at query time, dumping the entire document into context, and hoping the relevant sentence surfaces in a ~40K token blob. There is no way to search within the document, no way to navigate to the Methods section without reading everything before it, and no way to run the same extraction across 50 papers without exhausting your token budget on the first three.

Our approach takes the opposite direction: rather than moving the data to the agent, we move the agent to the data.

Biomedical Preprints as a Filesystem

You can think of it as opening a small portal between your system and the corpus. The remote data source is exposed as a virtual filesystem, and the heavy lifting has already been done. The corpus is indexed, structured, and optimized for search so agents can explore efficiently without downloading or managing the data themselves.

Scientific papers are born as rich collections of structured artifacts: tables, figures, methods sections, supplementary spreadsheets, appendices, and code. Then authors compress all of that structure by flattening everything into a single PDF at submission time. The format is optimized for print and human readability, but not for AI agents. Our core idea is to reverse this compression. We re-expand each paper back into a filesystem where every paper is a folder, every figure is a file, and every section (Methods, Results, Discussion) is individually addressable. Each paper becomes a directory that an agent can enter, inspect, and traverse as deeply as needed.
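As a concrete sketch of this re-expansion, the following Python writes a parsed paper out as the directory layout described here. This is illustrative only: the `paper` dict, its field names, and `expand_paper` are hypothetical stand-ins, not the production pipeline.

```python
import json
import tempfile
from pathlib import Path

def expand_paper(paper: dict, root: Path) -> None:
    """Write a parsed paper out as a directory: one markdown file
    per section, one CSV per table, plus a metadata file."""
    (root / "sections").mkdir(parents=True, exist_ok=True)
    (root / "tables").mkdir(exist_ok=True)
    for name, text in paper["sections"].items():
        (root / "sections" / f"{name}.md").write_text(text)
    for name, rows in paper["tables"].items():
        csv = "\n".join(",".join(map(str, row)) for row in rows)
        (root / "tables" / f"{name}.csv").write_text(csv)
    (root / "meta.json").write_text(json.dumps(paper["meta"]))

# Illustrative stand-in for one parsed paper.
paper = {
    "meta": {"doi": "10.1101/example"},
    "sections": {"abstract": "Background: Recent work...",
                 "methods": "Cells were cultured..."},
    "tables": {"table1": [["gene", "fold_change"], ["TP53", 2.1]]},
}
root = Path(tempfile.mkdtemp()) / "paper"
expand_paper(paper, root)
```

Once the paper exists as a directory, every bash tool the agent already knows applies to it unchanged.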

Paper-as-a-Filesystem

A paper (authors, abstract, introduction, results, discussion, methods, tables) is parsed into a file structure:

/paper/
  sections/
    abstract.md
    introduction.md
    related_works.md
    results.md
    discussion.md
  tables/
    table1.csv
    table2.csv
  images/
    figure1.png
    figure2.png
    figure3.png
  supplements/
    supp_figure1.png
    ...

which the agent accesses from a terminal:

$ cd /paper/
$ ls
sections/  tables/  images/  supplements/
$ cat sections/abstract.md
Background: Recent work...

# Programmatic access
$ python -c "
import pandas as pd
df = pd.read_csv('tables/table1.csv')"

An agent goes to the filesystem with a task, navigates the relevant parts of the corpus, gathers the necessary context, and returns with the answer and supporting evidence. When it wants to replicate an experiment, it reads sections/Methods.lines. When it wants to compare results across studies, it reads sections/Results.lines. When it’s trying to understand how the field interprets a finding, it reads sections/Discussion.lines. This mirrors how scientists actually use papers: not as monolithic blobs, but as collections of structurally distinct knowledge types.

Each line of text has a unique numeric identifier that traces back to the source, and every figure is individually addressable. The agent can cd into a paper, grep for a term, cat specific sections, and head the first 50 lines, using the same bash workflow it uses to navigate any codebase. Instead of hauling terabytes of data across the network and rebuilding indexing infrastructure locally, the exploration happens where the information already lives.

How It Works

The agent runs a command:

$ search "CRISPR base editing efficiency"
$ grep "IC50" /papers/*/sections/Results.lines
$ cat /papers/a7f3e2/supplements/table_s1.csv

A virtual filesystem layer, hidden from the agent, translates shell commands into parallel queries across the indexed corpus, then assembles the results as ordinary files and directories. Under the hood it consists of:

• Document processing: PDF parsing, XML extraction, section segmentation
• SQL storage: metadata, full text, content blocks
• Hybrid indexing: BM25 + semantic KNN, block-level retrieval
• Cache: query results, embeddings, hot paths

The agent sees individual papers as local directories, 450K+ of them:

/paper_a7f3e2/
  sections/  supplements/  figures/  meta.json
/paper_b8c4d1/
  sections/  supplements/  figures/  meta.json
...

Building the Index

Making half a million preprints navigable as directories requires solving a document engineering problem at scale. The raw corpus is a mix of JATS XML and supplementary materials in diverse formats: PDFs, Excel spreadsheets, Word documents, CSV files, even PowerPoint slides. The main text pipeline parses JATS XML into individual content blocks, where each paragraph, table, figure caption, section header, and formula becomes a separately addressable unit. This decomposition is what allows an agent to grep for a term and land on the exact block rather than ingesting a 40K-token blob.
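The decomposition step can be sketched with Python’s standard `xml.etree` on a toy JATS fragment. Real JATS is far richer (figures, tables, formulas, nested sections), and the block schema below is illustrative, not the production one:

```python
import xml.etree.ElementTree as ET

JATS = """<article>
  <body>
    <sec><title>Methods</title>
      <p>Cells were cultured at 37C.</p>
      <p>IC50 was measured by dose-response assay.</p>
    </sec>
    <sec><title>Results</title>
      <p>The proteostasis network comprises 16 proteins.</p>
    </sec>
  </body>
</article>"""

def to_blocks(jats_xml: str) -> list[dict]:
    """Flatten JATS into separately addressable blocks: one dict per
    paragraph, tagged with its section title and a stable id."""
    root = ET.fromstring(jats_xml)
    blocks = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        for p in sec.findall("p"):
            blocks.append({"id": len(blocks), "section": title,
                           "type": "paragraph", "text": p.text})
    return blocks

blocks = to_blocks(JATS)
```

A grep for "16 proteins" can now land on a single block carrying its section label, rather than on an offset inside an undifferentiated blob.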

Paper supplements present a challenge as they are inherently heterogeneous and unstructured. A single paper might attach a PDF with 15 supplementary figures, an Excel file of raw assay data, and a Word document with extended methods. To convert these to LLM-native text, we run each supplement through OCR models that perform document segmentation, table recognition, and formula extraction, producing the same block-level format: typed content with bounding boxes and page coordinates. The output is normalized into the same schema as the XML-derived blocks, so the agent sees a uniform supplements/ directory regardless of whether the underlying source was a scanned PDF or native XML.
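A sketch of that normalization step, assuming a hypothetical OCR output shape (real OCR models emit richer structures): the point is that OCR-derived blocks land in the same typed-content-plus-provenance schema as the XML-derived ones.

```python
# Hypothetical OCR output for one supplement page.
ocr_page = [
    {"kind": "table", "bbox": [40, 80, 520, 300], "page": 1,
     "cells": [["sample", "titer"], ["S1", "4.2"]]},
    {"kind": "text", "bbox": [40, 320, 520, 360], "page": 1,
     "content": "Table S1. Production titers."},
]

def normalize(ocr_blocks: list[dict], source_file: str) -> list[dict]:
    """Map OCR output onto the same block schema as XML-derived
    blocks: typed content plus provenance (file, page, bbox)."""
    out = []
    for i, b in enumerate(ocr_blocks):
        text = b.get("content") or "\n".join(
            ",".join(row) for row in b.get("cells", []))
        out.append({
            "id": i,
            "type": b["kind"],
            "text": text,
            "source": {"file": source_file, "page": b["page"],
                       "bbox": b["bbox"]},
        })
    return out

blocks = normalize(ocr_page, "supp_table_s1.pdf")
```

Because both pipelines emit this one schema, the agent never needs to know whether a supplements/ file started life as a scanned PDF or native XML.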

The resulting ~70 million content blocks are stored in PostgreSQL with per-block JSONB metadata linking each block back to its source file and XPath, and dual-indexed in Elasticsearch through a hybrid retrieval layer that combines BM25 keyword scoring with dense vector embeddings. When the agent runs search "CRISPR base editing", both indices fire in parallel and results are merged via reciprocal rank fusion. When it runs grep "IC50" inside a paper, the query hits the block-level index filtered to that document. Figures are individually addressable and can be routed to a vision model on demand. The entire layer is invisible to the agent — it sees files and directories.
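Reciprocal rank fusion itself is simple to sketch. This is a generic textbook implementation (with the conventional k = 60 smoothing constant), not GXL’s production merge:

```python
def rrf_merge(bm25_ids: list[str], knn_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc by the sum of
    1 / (k + rank) over every ranked list it appears in."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ids, knn_ids):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["a", "b", "c", "d"]   # keyword ranking
knn  = ["c", "a", "e"]        # dense-vector ranking
merged = rrf_merge(bm25, knn)  # "a" and "c" rise: ranked well by both
```

Documents ranked highly by both indices float to the top without any score calibration between BM25 and cosine similarity, which is why RRF is a common default for hybrid retrieval.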

Why Bash

LLMs trained on code have encountered ls, grep, find, cat, wc, diff, head, tail, and pipe composition billions of times. They don’t treat these as abstract API calls. They know grep -r recurses, they know wc -l counts lines, they know how to pipe output between commands, and they know what an empty directory means versus a missing one.

When the research filesystem responds to standard bash, the agent doesn’t need to learn a new tool schema through in-context examples. It applies the same skills it uses to navigate a codebase, now pointed at scientific literature. A custom API with search_preprints() and get_preprint() means the model learns your interface from scratch on every invocation. It will use it, but it won’t compose tools in ways you didn’t anticipate.

Map-Reduce Over Papers

By treating each paper as a directory, we unlock another powerful pattern: map-reduce over papers. A map operation dispatches a lightweight subagent to every paper in parallel, each with filesystem access to its paper’s directory. A reduce operation then synthesizes the extractions into a unified answer. This mirrors how scientists do literature reviews (asking the same question across many documents) but runs in minutes across dozens of papers, extracting structured data from full text, not abstracts. Each subagent navigates only the relevant parts of its paper and returns with a precise extraction.
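The pattern can be sketched with a thread pool standing in for subagents. `map_one` below is a hypothetical extraction step over a paper’s Results text, not the real agent:

```python
from concurrent.futures import ThreadPoolExecutor

def map_one(paper: dict, term: str) -> dict:
    """Stand-in for a subagent: scan one paper's Results section
    and return the lines that mention the query term."""
    hits = [line for line in paper["results"].splitlines() if term in line]
    return {"paper": paper["id"], "hits": hits}

def map_reduce(papers: list[dict], term: str) -> list[dict]:
    # Map: one lightweight worker per paper, run in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        extractions = list(pool.map(lambda p: map_one(p, term), papers))
    # Reduce: keep only papers that returned evidence.
    return [e for e in extractions if e["hits"]]

papers = [
    {"id": "a7f3e2", "results": "IC50 = 12 nM\nNo toxicity observed"},
    {"id": "b8c4d1", "results": "Binding affinity improved 3-fold"},
]
evidence = map_reduce(papers, "IC50")
```

In the real system each worker is an LLM subagent with filesystem access to its paper’s directory, but the shape of the computation is the same: parallel per-paper extraction, then a single synthesis pass.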

MCP Tool Server

1. Call search_preprints(category, date_range). No keyword search; browse by category + date only. Get recent titles + abstracts.
2. Call get_preprint(doi). The entire paper (~40K tokens) is dumped into context as a blob.
3. Repeat for 1–2 more papers. Context is now ~100K tokens of undifferentiated text.
4. Synthesize from memory. No section boundaries; citations unreliable.

Result: 2–3 papers, vague summaries.

vs.

GXL Sy (Ours)

1. Search 450K papers in the filesystem. Get paths to the top papers sorted by relevance.
2. Each subagent navigates its paper: grep Results, head Methods, cat a figure.
3. Each reads ~200 tokens (not 40K) and returns a structured extraction with block-level citations.
4. 25–100 subagents run in parallel; reduce into a synthesis.

Result: 100 papers, specific data, every claim cited.

GXL Sy is our research agent built on top of this filesystem. It navigates over 450,000 bioRxiv and medRxiv preprints, using the full depth of the virtual filesystem to answer questions that require reading specific passages, checking experimental novelty across the literature, and synthesizing findings from multiple papers. Rather than wrapping search results in a prompt, GXL Sy enters the corpus, follows leads across papers, and returns with grounded, citable answers.

bioRxiv Bench

To measure whether focused exploration outperforms rigid querying, we introduce bioRxiv Bench, a benchmark of 140 questions drawn from real research workflows over bioRxiv and medRxiv preprints. The benchmark spans three task types: Deep Paper Q&A (N=50), Experiment Novelty Check (N=50), and Multi-Paper Synthesis (N=40).

We compare GXL Sy (Ours) against two baselines: Claude Code with the Claude bioRxiv MCP connector, and the FutureHouse Edison Platform, which provides AI-powered biomedical literature search. For FutureHouse Edison, we used the Precedent agent on Experiment Novelty Check and the Literature agent on Paper Q&A and Multi-Paper Synthesis.

Deep Paper Q&A (N=50)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

| Metric | GXL Sy (Ours) | Claude Code + MCP | FutureHouse Edison |
|---|---|---|---|
| Accuracy | 100% | 86% | 4% |
| Avg Time per Query | 1m6s | 3m42s | 9m29s |
| Avg Cost per Query | $0.21 | $1.07 | $1.00* |

*FutureHouse Edison charges per credit used.

50 supplement-grounded questions across 50 bioRxiv preprints, each requiring data from supplemental tables, PDFs, or DOCX files that cannot be answered from the paper’s main text alone.

Dataset Construction

50 single-document questions drawn from 50 distinct bioRxiv preprints published in 2025. Each question was generated by granting a model access to the full paper including supplements, then manually reviewed and filtered for clarity, accuracy, and relevance.

Questions were explicitly designed to be unanswerable from the abstract or main text alone, requiring data from supplemental tables, PDFs, or DOCX files. Each answer is accompanied by the specific supplement file path, step-by-step reasoning procedure, and executable code used to derive the answer.

Experiment Novelty Check (N=50)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

| Metric | GXL Sy (Ours) | Claude Code + MCP | FutureHouse Edison |
|---|---|---|---|
| Accuracy | 80% | 28% | 20% |
| Avg Time per Query | 2m22s | 2m54s | 12m13s |
| Avg Cost per Query | $0.36 | $0.93 | $1.00* |

*FutureHouse Edison charges per credit used.

50 questions were constructed by starting from real bioRxiv and medRxiv papers with distinctive quantitative findings, then reverse-engineering the natural-language experiment novelty query a researcher might ask before attempting similar work.

Dataset Construction

Questions were authored by first identifying papers in our indexed bioRxiv/medRxiv corpus that contain distinctive quantitative findings — binding affinities, enzyme kinetics, production titers, dose-response values — then reverse-engineering a natural language query a researcher might pose before starting analogous work.

Each question has a ground truth record containing the target paper's document ID, title, DOI, authors, and the location of the experiments or results within the paper (body text, supplement table, or figure).

Multi-Paper Synthesis (N=40)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

| Metric | GXL Sy (Ours) | Claude Code + MCP | FutureHouse Edison |
|---|---|---|---|
| Completeness | 92% | 58% | 70% |
| Avg Time per Query | 2m6s | 6m48s | 23m6s |
| Avg Cost per Query | $0.53 | $1.96 | $1.00* |
| Avg Citations | 27.9 | 10.6 | 19.5 |

*FutureHouse Edison charges per credit used.

40 cross-paper synthesis questions across 8 categories (relevant retrieval, exact quotation, abstract vs. substance, emerging directions, cross-paper contradiction, cross-paper synthesis, quantitative extraction, method comparison), each requiring evidence drawn from a minimum of 5 papers. Scored with literal per-question criteria. FutureHouse Edison and CC + bioRxiv connector were re-judged on the same 40-question set.

Dataset Construction

40 questions synthetically generated across 8 categories: spatial transcriptomics, foundation models, gene therapy safety, single-cell genomics, neuroscience, immunology, cancer biology, and synthetic biology. Each question was designed to require synthesis across a minimum of 5 papers and to be unanswerable from any single abstract.

Questions were generated by prompting a model with subfield descriptions and example queries, then filtered for quality through human review — removing questions that were too narrow, too broad, or answerable without cross-paper synthesis.

Across all task types the pattern holds. On Deep Paper Q&A, GXL Sy (Ours) scores 100% vs 86%, runs 3.4× faster (1m6s vs 3m42s), and costs 5.1× less ($0.21 vs $1.07). On Experiment Novelty Check, it scores 80%: 2.9× more accurate than CC + bioRxiv connector (28%) and 4.0× more accurate than FutureHouse Edison (20%). It also runs 1.2× faster than CC + bioRxiv connector (2m22s vs 2m54s) and 5.2× faster than FutureHouse Edison (2m22s vs 12m13s), at lower average cost than CC + bioRxiv connector ($0.36 vs $0.93); FutureHouse Edison charges per credit used. On Multi-Paper Synthesis, it is 1.6× more complete (92% vs 58%), 3.2× faster (2m6s vs 6m48s), produces 2.6× more citations (27.9 vs 10.6), and costs 3.7× less ($0.53 vs $1.96). The efficiency comes from targeted access: a grep into sections/Results.lines consumes ~200 tokens versus ~40,000 for loading a full paper through MCP. The accuracy comes from section-level precision: questions about methods go to Methods, questions about limitations go to Discussion.

Below, we walk through specific benchmark examples comparing the two approaches side by side.

Case Studies

Deep Paper Q&A

Deep Paper Q&A

Questions that can only be answered from supplemental figures, tables, or methods: content that is invisible to abstract-level search and requires navigating directly into supplement files. Every supplement is a first-class path in the filesystem.

“In the paper with DOI 10.1101/2025.03.28.646065, how many proteins comprise the proteostasis network of tau according to the paper?”

GXL Sy (Ours)

TOOL: lookup doi
lookup doi 10.1101/2025.03.28.646065
→ doc_id: 71ef2289-6e94-1014-a914-83fbd6d5bac1
“Linking Aβ and tau in the amyloid cascade through the intersection of their proteostasis networks”
TOOL: scan content.lines
scan content.lines “proteostasis network of tau” “number of proteins”
→ L41 [4170683]: “the proteostasis network of tau comprises 16 proteins”
Network constructed from Reactome functional interactions involving proteostasis proteins + tau (MAPT), retaining bi-directional interactions.
FINAL RESPONSE
The proteostasis network of tau comprises 16 proteins {{"block_id": 4170683}}. The 16 proteins span two main pathways:
• Ubiquitin-proteasome: MARK2, BRSK1, BRSK2
• Autophagy (4 routes): AMPK signalling (PRKAA1/A2, PRKAB1/B2, PRKAG1/G2/G3), mTOR (MAPK1, MAPK3, EP300), autophagosome (CDK5, CDK5R1, GSK-3β)
{{"document_id": "71ef2289-6e94-1014-a914-83fbd6d5bac1"}} {{"block_id": 4170713}}

Why it succeeded

  • Two tool calls: lookup doi to find the paper, then scan content.lines to find the exact count
  • Returned block-level citations pointing to the exact sentence in the paper

Claude Code + bioRxiv MCP

TOOL: get_preprint(doi)
get_preprint("10.1101/2025.03.28.646065") → title, authors, abstract, full text blob ~38K tokens dumped into context
FINAL RESPONSE
The paper describes a proteostasis network for tau involving kinases and autophagy regulators including MARK2, CDK5, and GSK-3β.
⚠ NO COUNT · ⚠ PRIOR KNOWLEDGE BLEED

Why it fell short

  • The paper was fetched successfully, but the specific count (“16 proteins”) is buried in the Results section of a ~38K token blob with no way to search for it
  • The answer drifted toward named proteins from prior knowledge rather than the explicit number stated in the text
Experiment Novelty Check

Experiment Novelty Check: “Has this been done before?”

A researcher describes a specific experimental idea and asks whether it exists in the literature. These questions require precise extraction from Methods and Results, sections the filesystem exposes as individual files.

“What happens if you reverse which quorum sensing system controls which state in an intercellular genetic toggle? Would it still be bistable?”

GXL Sy (Ours)

TOOL: search → 83 papers
search "rhlR cinR quorum sensing toggle switch bistability synthetic"
search "quorum sensing toggle switch bistability bacterial population inverted"
TOP HIT
doc_id: 8ece2be1-6e81-1014-9171-bf02e9568d20 “Pattern formation and bistability in a synthetic intercellular genetic toggle” Magalhães, Fan, Sontag, Josić, Bennett — bioRxiv 2022
TOOL: scan content.lines
scan content.lines "rhlR" "cinR" "inverted" "swap" "toggle" "bistab"
→ 8 patterns, 121 total matches
→ L29 [6711340]: "Inverted QS toggle pattern..."
→ L23 [6711334]: "QS toggle...intercellular signaling..."
TOOL: scan supplements/502488_file02
scan supplements/502488_file02.content.md.lines "Inverted QS" "S2I" "S2J" "bistab" "state prefer"
→ L214 [71721041]: "I | Inverted QS toggle"
→ L215 [71721042]: "J | Inverted QS toggle"
→ L216 [71721043]: "We obtained the Inverted QS toggle by using the same lower strength promoters but reversing the QS network connected to each state: now, the rhlR/I network activates the yellow state, and cinR/I the blue state."
FINAL RESPONSE
NOT NOVEL. Tested explicitly in Magalhães et al. (bioRxiv 2022), doc_id: 8ece2be1-6e81-1014-9171-bf02e9568d20.
Original: cinR/I → yellow state, rhlR/I → blue state. Inverted: rhlR/I → yellow, cinR/I → blue. Same circuit, QS arms swapped.
Still bistable? Yes — but state preference shifts. L29: "exhibited changes in state preference, agreeing with the prediction of the model." Figs S2I–J: induction curves for inverted circuit. Colony patterning (Fig 7D): radial segregation retained, but only in specific conditions.
Why: bistability is governed by relative promoter strengths + QS network identity. Swapping arms preserves bistability but shifts which state is preferred, traceable to the asymmetric signal strength of C4-HSL vs C14-HSL.

Why it succeeded

  • Searched the filesystem directly for "inverted" + "cinR" + "rhlR" — found the supplement figure legend (L216) describing the exact swap
  • Supplement text is indexed as a file; the key sentence was retrievable by keyword scan in milliseconds
  • Real block IDs from the actual paper; answer grounded in primary source

Claude Code + bioRxiv MCP

TOOL: WebSearch ×3
"rhlR cinR quorum sensing toggle bistability"
"inverted quorum sensing toggle rhl cin bistability"
"rhlR/I cinR/I synthetic toggle swapped inverted"
FOUND PAPER IN SEARCH SNIPPETS
Magalhães et al. 2022 appears in results. Snippet mentions "Inverted QS toggle variant." Cannot read the paper — no full-text access.
TOOL: WebFetch ×3 (all blocked)
GET biorxiv.org/10.1101/2022.08.02.502488v1.full → 403
GET biorxiv.org/10.1101/2022.08.02.502488 → 403
GET pubs.acs.org/doi/10.1021/acssynbio.2c00332 → 403
TOOL: WebSearch ×3 (digging)
"inverted QS toggle" cinR rhlR bistability results
"inverted QS toggle" bistable pattern formation Bruder Elowitz
"intercellular genetic toggle" inverted
→ No accessible full text found
NO FINAL RESPONSE
Session ended without producing a verdict. Last tool call: pubs.acs.org → 403.
⚠ BLOCKED BY PAYWALL/RATE LIMIT · ⚠ NO ANSWER PRODUCED

Why it fell short

  • Found the right paper via web search but couldn’t read it — bioRxiv and ACS both returned 403
  • The key result is in the supplement (Fig S2I–J legend), not the abstract, so even a successful fetch of the abstract wouldn’t have answered the question
  • Session terminated without a verdict
Multi-Paper Synthesis

Idea Discovery: “What’s new that I’m not aware of?”

Discovering convergent signals across the literature: patterns only visible when you analyze 25–50 papers in parallel. This is where map-reduce over filesystems is decisive.

“Across recent spatial transcriptomics or spatial proteomics preprints, extract each paper’s main claimed cell-cell interaction, niche, or microenvironment finding and whether it includes orthogonal validation such as immunostaining, RNAscope, perturbation, or functional follow-up. Which classes of spatial claims are usually supported only by computational figures, and which are routinely validated experimentally?”

GXL Sy (Ours)

TOOL: searches → 50 papers
searches --quiet --tag spatial -n 50 \
  "spatial transcriptomics cell-cell interaction niche" \
  "spatial proteomics tumor microenvironment" \
  "cell communication spatial ligand receptor" \
  "spatial omics microenvironment validation immunostaining" \
  "MERFISH Visium CODEX spatial cell interaction" \
  "spatial transcriptomics orthogonal validation RNAscope"
TOOL: map → 42 subagents
map --from s_7c92a373 --limit 50 \
  "Extract: (1) MAIN SPATIAL CLAIM. (2) CLAIM TYPE: cell-cell interaction / niche composition / ligand-receptor / gradient / microenvironment state. (3) COMPUTATIONAL EVIDENCE. (4) ORTHOGONAL VALIDATION: IF/IHC, RNAscope/smFISH, perturbation, functional assay, independent cohort, or None. (5) VALIDATION DEPTH: None / Minimal / Moderate / Extensive."
42 SUBAGENTS → FILESYSTEM
• NicheCompass {document_id: e194e652} → SPP1+ macrophage niche in breast cancer. Computational only (graph deep learning). No IHC.
• Glioma TAM paper {document_id: 6015d9cc} → Ligand-receptor: tumor↔TAM co-expression. RNA-ISH confirms spatial co-expression. ✓ Validated.
• CRC liver metastases {document_id: 62060478} → CODEX multiplexed IF confirms L-R proximity at protein level. ✓ Validated.
• Haruka framework {document_id: 835ee34d} → Microenvironment remodeling domains. Computational only. No experimental readout.
TOOL: reduce
reduce --from m_7371efe2 \
  "Group by claim type. For each: fraction with orthogonal validation vs computational-only. What predicts whether validation is included?"
FINAL RESPONSE
Claim type — not platform or disease — predicts validation.
Routinely computational-only:
• Niche/neighborhood composition: ~0% validated (NicheCompass, CellTrek, SColoc, Haruka — all graph embeddings, no IHC/RNAscope)
• Microenvironment state/composition: ~0% validated (SKNY, DeepSpaceDB, Visium+scRNA prostate TME)
• Spatial expression gradients: ~0% validated (Moran’s I / spatially variable gene tests)
Routinely validated:
• Ligand-receptor signaling: ~67% include at least RNA-ISH or IF. Mechanistically falsifiable → reviewers demand it.
• Cell co-localization tied to a functional claim: IHC + independent cohort replication common.
Platform doesn’t predict: MERFISH, Visium, CODEX show similar rates within each claim type. RNAscope underused vs IF despite being more direct.

Why it succeeded

  • 42 subagents each extracted claim type + validation depth from sections/Results.lines and supplements/
  • Pattern invisible from abstracts: abstracts describe the finding, not whether it was validated
  • All citations link to real document IDs from the filesystem

Claude Code + bioRxiv MCP

TOOL: search_preprints(category) ×2 + get_preprint (×2)
category: "bioinformatics" ×2, recent: 20 each. ~20 abstracts + 2 full papers. ~89K tokens.
FINAL RESPONSE
Spatial transcriptomics papers often validate findings with immunostaining or RNAscope. Cell-cell interaction studies tend to use CellChat or NicheNet computationally.
⚠ NO PER-PAPER EXTRACTION · ⚠ NO VALIDATION RATES · ⚠ PRIOR KNOWLEDGE

Why it fell short

  • Abstracts don’t state whether validation was done — this requires reading Methods and Results of each paper
  • Only 2 papers loaded in full; couldn’t extract per-claim validation status at scale
  • Answer reflects general knowledge, not evidence from actual preprints

Conclusion

Instead of moving data to the agent, we bring the agent to the data. By exposing 450K bioRxiv and medRxiv preprints as a virtual filesystem, we place the agent inside the corpus rather than behind a query interface. This is a necessary shift to get past shallow search: when paper content is structured as directories with individually addressable sections, supplements, and figures, the agent can make targeted, efficient reads at whatever granularity the question demands rather than ingesting entire documents and hoping the answer surfaces.

This replicates the paradigm that has already proven immensely successful with coding agents. Tools like Claude Code and Cursor are effective precisely because they inhabit the codebase — navigating with ls, searching with grep, reading with cat — rather than querying it through an abstract API. Sy applies the same model to scientific literature, and the same bash-trained intuitions that make coding agents powerful transfer directly.

On bioRxiv Bench, Sy is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across 140 questions spanning Deep Paper Q&A, Experiment Novelty Check, and Multi-Paper Synthesis.

Try Sy yourself at sy.gxl.ai!