
Representing Biomedical Literature as a Filesystem through Agent-Native Indexing

Biomedical preprints exposed as a virtual filesystem for AI agents. 1.6× more accurate, 2.4× faster, 3.6× cheaper than MCP-based approaches.

TL;DR

Instead of moving data to the agent, we send the agent to the data. Biomedical preprints are exposed as a virtual filesystem that agents explore using the same bash tools they use on codebases. We develop Sy, our research agent built on top of this filesystem. The result is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across deep text Q&A, experiment novelty checking, and cross-paper synthesis.

Average performance of Sy vs. other agents on difficult bioRxiv questions

| Metric | GXL Sy (Ours) | Claude Code + bioRxiv MCP | FutureHouse Edison |
| --- | --- | --- | --- |
| Accuracy / Completeness | 91% | 57% | 31% |
| Avg Time | 1.9m | 4.5m | 14.9m |
| Avg Cost | $0.37 | $1.32 | $1.00 |

Let's say you want to retrieve information from a remote source, like a large corpus of paper preprints. Today, most MCP-style integrations behave like structured communication channels between distant systems. You define the fields ahead of time, send a request, and receive a structured response. That works well when the question is clearly defined and the answer fits neatly into those predefined slots.

But many interesting questions aren't like that. In practice, discovery is often exploratory: you need to move through the data itself, following threads and context rather than issuing a single precise query. Coding agents like Claude Code, Codex, and Cursor already have strong priors for exactly this style of navigation, learned from exploring codebases through bash (ls, grep, find, cat). They know which commands to reach for, how to compose them with pipes, and how to move from a broad directory structure down to the exact file and line they need. The problem is that scientific literature gives them nothing to navigate.

The Problem with MCP Tools

Today's biomedical research agents are stuck on the wrong side of this divide. Common LLM-based tools rely on MCP servers and search APIs that behave as structured channels: a search_papers tool with a handful of parameters that returns massive payloads of abstracts, with no way to browse, no sense of what's nearby, and no ability to refine by navigating rather than re-querying. The agent throws a query into a void and catches whatever comes back.

The agent can't ls the literature to see what's in a research area. It can't grep across methods sections to find how experiments were actually done. It can't follow a citation trail by reading a file. The scientific literature has no geography that the agent can navigate, so it can't orient itself, and all those deeply trained intuitions about filesystem navigation go entirely unused.

Current Literature Agents

Many current literature agents share a similar architecture: wrappers over PubMed, Semantic Scholar, or bioRxiv APIs, accessed through MCP connectors or function-calling schemas. The agent calls search_papers(query), gets back a list of abstracts, and summarizes them. This works when the question is clearly defined, but it breaks down on anything that requires reading the content of a paper in detail: methods sections, supplemental tables, figure captions, appendix data. These are precisely the parts that contain what researchers actually need — the specific protocol, the sample sizes, the failure modes, and the caveats buried in extended data.

The traditional alternative is to bring the data to you: downloading terabytes of papers, building indexing pipelines, and running search infrastructure locally, all before any real investigation can begin. MCP connectors sit somewhere in between but still fall short — they return metadata and abstracts. Getting full-text content requires fetching a PDF at query time, dumping the entire document into context, and hoping the relevant sentence surfaces in a ~40K token blob. There is no way to search within the document, no way to navigate to the Methods section without reading everything before it, and no way to run the same extraction across 50 papers without exhausting your token budget on the first three.

Our approach takes the opposite direction: rather than moving the data to the agent, we move the agent to the data.

Biomedical Preprints as Filesystem

You can think of it as opening a small portal between your system and the corpus. The remote data source is exposed as a virtual filesystem, and the heavy lifting has already been done. The corpus is indexed, structured, and optimized for search so agents can explore efficiently without downloading or managing the data themselves.

Scientific papers are born as rich collections of structured artifacts: tables, figures, methods sections, supplementary spreadsheets, appendices, and code. Then authors compress all of that structure by flattening everything into a single PDF at submission time. The format is optimized for print and human readability, but not for AI agents. Our core idea is to reverse this compression. We re-expand each paper back into a filesystem where every paper is a folder, every figure is a file, and every section (Methods, Results, Discussion) is individually addressable. Each paper becomes a directory that an agent can enter, inspect, and traverse as deeply as needed.

Figure: Paper-as-a-Filesystem. A paper (authors, Abstract, Results, Discussion, Methods) is parsed into a file structure that the agent accesses from a terminal.

File structure:

    /paper/
        sections/
            abstract.md
            introduction.md
            results.md
            discussion.md
        tables/
            table1.csv
            table2.csv
        images/
            figure1.png
            figure2.png
            figure3.png
        supplements/
            supp_figure1.png
            ...

Terminal access, run from the paper's root directory:

    $ cat sections/abstract.md
    Background: Recent work...

    # Programmatic access
    $ python -c "
    import pandas as pd
    df = pd.read_csv('tables/table1.csv')
    "

An agent goes to the filesystem with a task, navigates the relevant parts of the corpus, gathers the necessary context, and returns with the answer and supporting evidence. When it wants to replicate an experiment, it reads sections/Methods.lines. When it wants to compare results across studies, it reads sections/Results.lines. When it's trying to understand how the field interprets a finding, it reads sections/Discussion.lines. This mirrors how scientists actually use papers: not as monolithic blobs, but as collections of structurally distinct knowledge types.

Each line of text has a unique numeric identifier that traces back to the source, and every figure is individually addressable. The agent can cd into a paper, grep for a term, cat specific sections, and head the first 50 lines, using the same bash workflow it uses to navigate any codebase. Instead of hauling terabytes of data across the network and rebuilding indexing infrastructure locally, the exploration happens where the information already lives.
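To make the navigation pattern concrete, here is a minimal sketch that builds a toy paper directory and greps it. The `.lines` layout used here (one `<numeric id><TAB><text>` record per line), the directory name `paper_demo`, and the sample sentences are all assumptions made for illustration; the post does not specify the actual file format.

```python
import tempfile
from pathlib import Path


def build_demo_paper(root: Path) -> Path:
    """Create a tiny, made-up paper directory (the layout is an assumption)."""
    paper = root / "paper_demo"
    sections = paper / "sections"
    sections.mkdir(parents=True)
    # Hypothetical .lines format: "<numeric id>\t<text>" on every line.
    (sections / "Methods.lines").write_text(
        "101\tCells were cultured in DMEM.\n"
        "102\tIC50 values were fit with a four-parameter model.\n"
    )
    (sections / "Results.lines").write_text(
        "201\tThe measured IC50 is reported in the main figure.\n"
        "202\tNo off-target effects were observed.\n"
    )
    return paper


def grep_lines(paper: Path, term: str):
    """Rough equivalent of `grep term sections/*.lines`.

    Returns (file name, line id, text) so each hit traces back to its source line.
    """
    hits = []
    for path in sorted((paper / "sections").glob("*.lines")):
        for raw in path.read_text().splitlines():
            line_id, _, text = raw.partition("\t")
            if term in text:
                hits.append((path.name, int(line_id), text))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    paper = build_demo_paper(Path(tmp))
    hits = grep_lines(paper, "IC50")
    for name, line_id, text in hits:
        print(f"{name}:{line_id}: {text}")
```

Each hit carries its line identifier, which is what would let an agent cite the exact source line rather than the whole paper.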

How It Works

Figure: how a command flows through the system.

The agent runs a command:

    $ search "CRISPR base editing efficiency"
    $ grep "IC50" /papers/*/sections/Results.lines
    $ cat /papers/a7f3e2/supplements/table_s1.csv

Virtual Filesystem Layer (hidden from the agent): translates shell commands into parallel queries across the indexed corpus, then assembles results as ordinary files and directories. The diagram shows four components in this layer:

Document Processing: PDF parsing, XML extraction, section segmentation
SQL Storage: metadata, full text, content blocks
Hybrid Indexing: BM25 + semantic KNN, block-level retrieval
Cache: query results, embeddings, hot paths

The agent sees individual papers as local directories (450K+ papers), each with the same layout:

    /paper_a7f3e2/
        sections/
        supplements/
        figures/
        meta.json
    /paper_b8c4d1/
    /paper_c2e9a0/
    /paper_d5f1b3/
    ...
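As a rough illustration of what "translating shell commands into queries" could look like, the sketch below parses the three example commands into a structured query object. The `Query` type, its field names, and the `translate` function are hypothetical; the post does not describe the real translation layer's interface.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    """Hypothetical backend query produced from one shell command."""
    kind: str             # "search", "grep", or "cat"
    text: Optional[str]   # search string or grep pattern, if any
    path: Optional[str]   # path or glob the command targets, if any


def translate(command: str) -> Query:
    """Map a shell command line onto a backend query (illustrative mapping only)."""
    argv = shlex.split(command)  # handles quoting; does not expand globs
    if not argv:
        raise ValueError("empty command")
    name, args = argv[0], argv[1:]
    if name == "search" and len(args) == 1:
        return Query("search", args[0], None)
    if name == "grep" and len(args) == 2:
        return Query("grep", args[0], args[1])
    if name == "cat" and len(args) == 1:
        return Query("cat", None, args[0])
    raise ValueError(f"unsupported command: {command}")


for cmd in (
    'search "CRISPR base editing efficiency"',
    'grep "IC50" /papers/*/sections/Results.lines',
    "cat /papers/a7f3e2/supplements/table_s1.csv",
):
    print(translate(cmd))
```

Keeping the glob unexpanded means the backend, not the shell, could decide how to fan a pattern like `/papers/*/...` out across the index.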

Building the Index

Making half a million preprints navigable as directories requires solving a document engineering problem at scale. The raw corpus is a mix of JATS XML and supplementary materials in diverse formats: PDFs, Excel spreadsheets, Word documents, CSV files, even PowerPoint slides. The main text pipeline parses JATS XML into individual content blocks, where each paragraph, table, figure caption, section header, and formula becomes a separately addressable unit. This decomposition is what allows an agent to grep for a term and land on the exact block rather than ingesting a 40K-token blob.
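The decomposition into content blocks can be sketched with Python's standard XML parser. The tag names (`sec`, `title`, `p`, `fig`, `table-wrap`, `disp-formula`) are standard JATS elements, but the snippet below is a simplified stand-in rather than a complete JATS article, and the block dictionary shape (`id`, `section`, `type`, `text`) is an assumption, not the pipeline's actual schema.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up JATS-like input for illustration.
JATS_SNIPPET = """
<article>
  <body>
    <sec>
      <title>Methods</title>
      <p>Cells were cultured for 48 hours.</p>
      <fig id="f1"><caption><p>Growth curves.</p></caption></fig>
    </sec>
    <sec>
      <title>Results</title>
      <p>Growth slowed after treatment.</p>
    </sec>
  </body>
</article>
"""

# Map JATS tags to block types (the type names are hypothetical).
BLOCK_TYPES = {
    "title": "section_header",
    "p": "paragraph",
    "fig": "figure_caption",
    "table-wrap": "table",
    "disp-formula": "formula",
}


def text_of(elem) -> str:
    """Collect all nested text and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def extract_blocks(xml_text: str):
    """Turn each direct child of every <sec> into a separately addressable block."""
    root = ET.fromstring(xml_text)
    blocks = []
    for sec in root.iter("sec"):
        title = sec.find("title")
        section = text_of(title) if title is not None else ""
        for child in sec:
            kind = BLOCK_TYPES.get(child.tag)
            if kind is None:
                continue  # nested <sec> elements are visited by root.iter above
            blocks.append(
                {"id": len(blocks), "section": section, "type": kind, "text": text_of(child)}
            )
    return blocks


blocks = extract_blocks(JATS_SNIPPET)
for block in blocks:
    print(block)
```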

Paper supplements present a challenge as they are inherently heterogeneous and unstructured. A single paper might attach a PDF with 15 supplementary figures, an Excel file of raw assay data, and a Word document with extended methods. To convert these to LLM-native text, we run each supplement through OCR models that perform document segmentation, table recognition, and formula extraction, producing the same block-level format: typed content with bounding boxes and page coordinates. The output is normalized into the same schema as the XML-derived blocks, so the agent sees a uniform supplements/ directory regardless of whether the underlying source was a scanned PDF or native XML.

The resulting ~70 million content blocks are stored in PostgreSQL with per-block JSONB metadata linking each block back to its source file and XPath, and dual-indexed in Elasticsearch through a hybrid retrieval layer that combines BM25 keyword scoring with dense vector embeddings. When the agent runs search "CRISPR base editing", both indices fire in parallel and results are merged via reciprocal rank fusion. When it runs grep "IC50" inside a paper, the query hits the block-level index filtered to that document. Figures are individually addressable and can be routed to a vision model on demand. The entire layer is invisible to the agent — it sees files and directories.
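Reciprocal rank fusion itself is simple enough to sketch in a few lines. Each document's fused score is the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is a commonly used default and is an assumption here, as are the toy document IDs; the post does not state the parameters used in production.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a common default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # toy keyword ranking
vector_hits = ["b", "d", "a"]  # toy embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that appear in both lists ("a" and "b") rise above those found by only one index, which is the behavior a hybrid layer wants from the merge step.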

Why Bash

LLMs trained on code have encountered ls, grep, find, cat, wc, diff, head, tail, and pipe composition billions of times. They don't treat these as abstract API calls. They know grep -r recurses, they know wc -l counts lines, they know how to pipe output between commands, and they know what an empty directory means versus a missing one.

When the research filesystem responds to standard bash, the agent doesn't need to learn a new tool schema through in-context examples. It applies the same skills it uses to navigate a codebase, now pointed at scientific literature. A custom API with search_preprints() and get_preprint() means the model learns your interface from scratch on every invocation. It will use it, but it won't compose tools in ways you didn't anticipate.

Map-Reduce Over Papers

By treating each paper as a directory, we unlock another powerful pattern: map-reduce over papers. A map operation dispatches a lightweight subagent to every paper in parallel, each with filesystem access to its paper's directory. A reduce operation then synthesizes the extractions into a unified answer. This mirrors how scientists do literature reviews (asking the same question across many documents) but runs in minutes across dozens of papers, extracting structured data from full text, not abstracts. Each subagent navigates only the relevant parts of its paper and returns with a precise extraction.
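The pattern can be sketched with a thread pool. In this illustration the "subagent" is replaced by a plain function that scans a list of strings, and the paper dictionaries, field names, and search term are all made up; a real map step would launch an agent with filesystem access to the paper's directory.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_from_paper(paper):
    """Map step stand-in: a real subagent would navigate the paper's directory."""
    hits = [line for line in paper["results_lines"] if "IC50" in line]
    return {"paper_id": paper["id"], "evidence": hits}


def reduce_extractions(extractions):
    """Reduce step: keep only papers that produced evidence, keyed by paper ID."""
    return {e["paper_id"]: e["evidence"] for e in extractions if e["evidence"]}


def map_reduce(papers, max_workers=8):
    """Run the map step over all papers in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        extractions = list(pool.map(extract_from_paper, papers))
    return reduce_extractions(extractions)


toy_papers = [
    {"id": "paper_1", "results_lines": ["IC50 was measured in triplicate.", "Cells survived."]},
    {"id": "paper_2", "results_lines": ["No dose-response data."]},
    {"id": "paper_3", "results_lines": ["The IC50 shifted after mutation."]},
]
summary = map_reduce(toy_papers)
print(summary)
```

Because each map call only touches its own paper, the work parallelizes cleanly and the reduce step sees small extractions rather than full documents.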

MCP Tool Server

1. Call search_preprints(category, date_range). No keyword search; browse by category + date only. Get recent titles + abstracts.
2. Call get_preprint(doi). Entire paper (~40K tokens) dumped into context as a blob.
3. Repeat for 1–2 more. Context now ~100K tokens of undifferentiated text.
4. Synthesize from memory. No section boundaries. Citations unreliable.

Result: 2–3 papers, vague summaries.

vs

GXL Sy (Ours)

1. Search 450K papers in filesystem. Get paths to top papers sorted by relevance.
2. Each subagent navigates its paper: grep Results, head Methods, cat a figure.
3. Each reads ~200 tokens (not 40K). Returns structured extraction with block-level citations.
4. 25–100 subagents in parallel. Reduce into synthesis.

Result: 100 papers, specific data, every claim cited.
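The token arithmetic behind this comparison is worth spelling out. The sketch below uses only the approximate per-paper figures quoted above (~40K tokens for a full paper, ~200 tokens for a targeted read); the function names are illustrative, and real reads will vary in size.

```python
TOKENS_PER_FULL_PAPER = 40_000    # "~40K tokens" per paper loaded whole
TOKENS_PER_TARGETED_READ = 200    # "~200 tokens" per targeted subagent read


def full_text_tokens(num_papers: int) -> int:
    """Tokens consumed when every paper is dumped into context whole."""
    return num_papers * TOKENS_PER_FULL_PAPER


def targeted_tokens(num_papers: int) -> int:
    """Tokens consumed when each paper is read through one targeted excerpt."""
    return num_papers * TOKENS_PER_TARGETED_READ


mcp_total = full_text_tokens(3)    # three full papers
sy_total = targeted_tokens(100)    # one hundred targeted reads
print(f"3 full papers:      {mcp_total:,} tokens")
print(f"100 targeted reads: {sy_total:,} tokens")
print(f"per-paper ratio:    {TOKENS_PER_FULL_PAPER // TOKENS_PER_TARGETED_READ}x")
```

Under these rough figures, reading 100 papers through targeted excerpts costs fewer tokens than loading three papers in full, which is consistent with the ~100K-token context described in the MCP flow.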

GXL Sy is our research agent built on top of this filesystem. It navigates over 450,000 bioRxiv and medRxiv preprints, using the full depth of the virtual filesystem to answer questions that require reading specific passages, checking experimental novelty across the literature, and synthesizing findings from multiple papers. Rather than wrapping search results in a prompt, GXL Sy enters the corpus, follows leads across papers, and returns with grounded, citable answers.

bioRxiv Bench

To measure whether focused exploration outperforms rigid querying, we introduce bioRxiv Bench, a benchmark of 140 questions drawn from real research workflows over bioRxiv and medRxiv preprints. The benchmark spans three task types: Deep Paper Q&A (N=50), Experiment Novelty Check (N=50), and Multi-Paper Synthesis (N=40).

We compare GXL Sy (Ours) against two baselines: Claude Code with the Claude bioRxiv MCP connector, and the FutureHouse Edison Platform, which provides AI-powered biomedical literature search. For FutureHouse Edison, we used the Precedent agent on Experiment Novelty Check and the Literature agent on Paper Q&A and Multi-Paper Synthesis.

Deep Paper Q&A (N=50)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

| Metric | GXL Sy (Ours) | Claude Code + MCP | FutureHouse Edison |
| --- | --- | --- | --- |
| Accuracy | 100% | 86% | 4% |
| Avg Time | 1m6s | 3m42s | 9m29s |
| Avg Cost | $0.21 | $1.07 | $1.00 |

FutureHouse Edison charges per credit used.

50 supplement-grounded questions across 50 bioRxiv preprints, each requiring data from supplemental tables, PDFs, or DOCX files that cannot be answered from the paper's main text alone.

Dataset Construction

50 single-document questions drawn from 50 distinct bioRxiv preprints published in 2025. Each question was generated by granting a model access to the full paper including supplements, then manually reviewed and filtered for clarity, accuracy, and relevance.

Questions were explicitly designed to be unanswerable from the abstract or main text alone, requiring data from supplemental tables, PDFs, or DOCX files. Each answer is accompanied by the specific supplement file path, step-by-step reasoning procedure, and executable code used to derive the answer.

Example Questions

Exact Count: "In the paper with DOI 10.1101/2025.03.28.646065, how many proteins comprise the proteostasis network of tau according to the paper?"
Cohort Breakdown: "In the paper with DOI 10.1101/2025.05.30.657099, in the RNA-seq analysis of neutrophils, how many healthy controls and SLE patients were compared, and how were the SLE patients subdivided based on ISG levels?"
Dataset Parameter: "In the paper with DOI 10.1101/2025.04.10.648169, how many HLA-I alleles are represented in the benchmarking validation dataset?"

Scoring Protocol

Questions were generated by giving a model access to the full filesystem and prompting it to produce grounded questions from the source material, followed by a round of human review for clarity, accuracy, and relevance.

Responses are scored by an LLM judge against ground truth. Numeric answers are accepted within a 2% relative tolerance or up to 5 decimal places, and semantically equivalent formats are treated as equal (e.g., "95%" == "0.95"). String and categorical answers are matched case-insensitively on core meaning. Each response scores 1.0 (correct) or 0.0 (incorrect).
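The numeric and string rules above can be approximated deterministically. The sketch below is an illustration of those rules, not the LLM judge itself: the function names are hypothetical, the relative tolerance is measured against the larger of the two values (the behavior of `math.isclose`), and exact case-insensitive string equality is a simplification of matching "on core meaning".

```python
import math


def parse_number(text: str):
    """Parse '95%', '0.95', or '1,200' into a float; return None if not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        if cleaned.endswith("%"):
            return float(cleaned[:-1]) / 100.0
        return float(cleaned)
    except ValueError:
        return None


def score(response: str, truth: str) -> float:
    """Return 1.0 for a match under the stated rules, else 0.0."""
    r, t = parse_number(response), parse_number(truth)
    if r is not None and t is not None:
        within_tolerance = math.isclose(r, t, rel_tol=0.02)  # 2% relative tolerance
        same_at_5dp = round(r, 5) == round(t, 5)             # equal to 5 decimal places
        return 1.0 if (within_tolerance or same_at_5dp) else 0.0
    # Simplification: exact case-insensitive match for non-numeric answers.
    return 1.0 if response.strip().casefold() == truth.strip().casefold() else 0.0


print(score("95%", "0.95"))
print(score("101", "100"))
print(score("110", "100"))
print(score("HeLa", "hela"))
```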

Experiment Novelty Check (N=50)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

| Metric | GXL Sy (Ours) | CC + bioRxiv connector | FutureHouse Edison |
| --- | --- | --- | --- |
| Accuracy | 80% | 28% | 20% |
| Avg Time | 2m22s | 2m54s | 12m13s |
| Avg Cost | $0.36 | $0.93 | $1.00 |

FutureHouse Edison charges per credit used.

50 questions were constructed by starting from real bioRxiv and medRxiv papers with distinctive quantitative findings, then reverse-engineering the natural-language experiment novelty query a researcher might ask before attempting similar work.

Dataset Construction

Questions were authored by first identifying papers in our indexed bioRxiv/medRxiv corpus that contain distinctive quantitative findings — binding affinities, enzyme kinetics, production titers, dose-response values — then reverse-engineering a natural language query a researcher might pose before starting analogous work.

Each question has a ground truth record containing the target paper's document ID, title, DOI, authors, and the location of the experiments or results within the paper (body text, supplement table, or figure).
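One way to picture such a record is as a small typed structure. The class name, field names, and the three allowed location strings below are assumptions derived from the description above, and the example values are placeholders rather than real benchmark entries.

```python
from dataclasses import dataclass
from typing import List

# Locations named in the description above; the exact labels are assumed.
ALLOWED_LOCATIONS = {"body text", "supplement table", "figure"}


@dataclass(frozen=True)
class NoveltyGroundTruth:
    """Hypothetical shape of one ground truth record."""
    document_id: str
    title: str
    doi: str
    authors: List[str]
    location: str  # where the experiment or result appears in the paper

    def __post_init__(self):
        if self.location not in ALLOWED_LOCATIONS:
            raise ValueError(f"unknown location: {self.location!r}")


example = NoveltyGroundTruth(
    document_id="doc-0001",        # placeholder
    title="Placeholder title",     # placeholder
    doi="10.1101/placeholder",     # placeholder, not a real DOI
    authors=["Author 1", "Author 2"],
    location="supplement table",
)
print(example)
```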

Example Questions

Inverted Quorum Sensing Toggle: "What happens if you reverse which quorum sensing system controls which state in an intercellular genetic toggle? Would it still be bistable?"
TX-TL Cosolute Concentration Mapping: "Has anyone systematically mapped how magnesium glutamate and potassium glutamate concentrations in cell-free TX-TL affect expression timing (lag phase) and total protein yield?"
Riboswitch-Dependent Synthetic Auxotrophs: "Has anyone systematically built riboswitch-dependent synthetic auxotrophs in E. coli where theophylline-responsive riboswitches control essential genes, so cells only survive in the presence of the small molecule?"

Scoring Protocol

We use a 3-criteria LLM-as-a-judge system (Claude Sonnet 4.6) to evaluate each agent response:

Criterion 1 (Verdict): Did the agent correctly conclude that the finding is not novel? All 50 questions describe real findings from real papers, so the correct answer is always that prior art exists.

Criterion 2 (Paper Identification): Did the agent find the correct source paper?

Criterion 3 (Specific Finding): Did the agent extract the correct quantitative value or experimental detail?

Multi-Paper Synthesis (N=40)

GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison

| Metric | GXL Sy (Ours) | Claude Code + MCP | FutureHouse Edison |
| --- | --- | --- | --- |
| Completeness | 92% | 58% | 70% |
| Avg Time | 2m6s | 6m48s | 23m6s |
| Avg Cost | $0.53 | $1.96 | $1.00 |
| Avg Citations | 27.9 | 10.6 | 19.5 |

FutureHouse Edison charges per credit used.

40 cross-paper synthesis questions across 8 categories, each requiring evidence drawn from a minimum of 5 papers. Scored with literal per-question criteria.

Dataset Construction

40 questions synthetically generated across 8 categories: spatial transcriptomics, foundation models, gene therapy safety, single-cell genomics, neuroscience, immunology, cancer biology, and synthetic biology. Each question was designed to require synthesis across a minimum of 5 papers and to be unanswerable from any single abstract.

Questions were generated by prompting a model with subfield descriptions and example queries, then filtered for quality through human review — removing questions that were too narrow, too broad, or answerable without cross-paper synthesis.

Example Questions

Spatial Validation Rigor: "Across recent spatial transcriptomics or spatial proteomics preprints, extract each paper's main claimed cell-cell interaction, niche, or microenvironment finding and whether it includes orthogonal validation…"
Foundation Model Leakage: "Across recent bioRxiv preprints on foundation models for genomics, proteins, pathology, or multimodal biology, extract the pretraining corpus description, train-validation-test split strategy…"
AAV/LNP Toxicity Signals: "Across preprints on in vivo AAV or LNP gene delivery from the last 3 years, extract the species, dose, capsid or formulation, target tissue, and any reported liver injury…"

Scoring Protocol

Each question is scored by an LLM judge against a checklist of criteria written specifically for that query. A response passes only if every criterion is met. Completeness is the fraction of questions that fully pass.
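The all-or-nothing rule can be written down in a few lines. In this sketch each question is represented as a list of booleans, one per checklist criterion as judged; that representation and the function name are assumptions for illustration.

```python
def completeness(results):
    """Fraction of questions whose checklist criteria all passed.

    results: list of per-question lists of booleans (one per criterion).
    A question with an empty checklist is treated as not passing.
    """
    if not results:
        raise ValueError("no questions to score")
    passed = sum(1 for criteria in results if criteria and all(criteria))
    return passed / len(results)


toy_results = [
    [True, True, True],   # passes: every criterion met
    [True, False, True],  # fails: one criterion missed
    [True],               # passes
    [False, False],       # fails
]
toy_score = completeness(toy_results)
print(f"{toy_score:.0%}")
```

Because a single missed criterion fails the whole question, this metric is stricter than averaging per-criterion pass rates.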

Category Breakdown (5 questions each)

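| Category | GXL Sy | CC + MCP | FH Edison |
| --- | --- | --- | --- |
| Relevant Retrieval | 4/5 | 1/5 | 4/5 |
| Exact Quotation | 5/5 | 3/5 | 1/5 |
| Abstract vs. Substance | 5/5 | 2/5 | 5/5 |
| Emerging Directions | 5/5 | 2/5 | 3/5 |
| Cross-Paper Contradiction | 5/5 | 2/5 | 4/5 |
| Cross-Paper Synthesis | 5/5 | 5/5 | 2/5 |
| Quantitative Extraction | 3/5 | 4/5 | 4/5 |
| Method Comparison | 5/5 | 4/5 | 5/5 |
| Total | 37/40 (92%) | 23/40 (58%) | 28/40 (70%) |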

Across all task types the pattern holds. On Deep Paper Q&A, GXL Sy (Ours) scores 100% vs 86%, runs 3.4× faster (1m6s vs 3m42s), and costs 5.1× less ($0.21 vs $1.07). On Experiment Novelty Check, it scores 80%, 2.9× more accurate than CC + bioRxiv connector (28%) and 4.0× more accurate than FutureHouse Edison (20%), while running 1.2× faster than CC + bioRxiv connector (2m22s vs 2m54s) and 5.2× faster than FutureHouse Edison (2m22s vs 12m13s), with lower average cost than CC + bioRxiv connector ($0.36 vs $0.93). On Multi-Paper Synthesis, it is 59% more complete (92% vs 58%), 3.2× faster (2m6s vs 6m48s), produces 2.6× more citations (27.9 vs 10.6), and costs 3.7× less ($0.53 vs $1.96). The efficiency comes from targeted access: a grep into sections/Results.lines consumes ~200 tokens versus ~40,000 for loading a full paper through MCP. The accuracy comes from section-level precision: questions about methods go to Methods, questions about limitations go to Discussion.
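The ratios quoted above follow directly from the per-task figures. The short sketch below recomputes several of them from the reported times, costs, and scores; the helper names are illustrative, and results are rounded to one decimal place as in the text.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'XmYs' duration into seconds."""
    return minutes * 60 + seconds


def ratio(baseline: float, ours: float) -> float:
    """How many times larger the baseline is than ours, to one decimal place."""
    return round(baseline / ours, 1)


# Deep Paper Q&A: GXL Sy vs. Claude Code + MCP
deep_qa_speedup = ratio(to_seconds(3, 42), to_seconds(1, 6))
deep_qa_cost = ratio(1.07, 0.21)

# Experiment Novelty Check: GXL Sy vs. both baselines
novelty_speedup_cc = ratio(to_seconds(2, 54), to_seconds(2, 22))
novelty_speedup_fh = ratio(to_seconds(12, 13), to_seconds(2, 22))

# Multi-Paper Synthesis: GXL Sy vs. Claude Code + MCP
synthesis_speedup = ratio(to_seconds(6, 48), to_seconds(2, 6))
synthesis_cost = ratio(1.96, 0.53)
synthesis_gain_pct = round((92 / 58 - 1) * 100)  # relative gain in completeness

print(deep_qa_speedup, deep_qa_cost)
print(novelty_speedup_cc, novelty_speedup_fh)
print(synthesis_speedup, synthesis_cost, synthesis_gain_pct)
```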

Below, we walk through specific benchmark examples comparing the two approaches side by side.

Case Studies

Conclusion

Instead of moving data to the agent, we bring the agent to the data. By exposing 450K bioRxiv and medRxiv preprints as a virtual filesystem, we place the agent inside the corpus rather than behind a query interface. This is a necessary shift to get past shallow search: when paper content is structured as directories with individually addressable sections, supplements, and figures, the agent can make targeted, efficient reads at whatever granularity the question demands rather than ingesting entire documents and hoping the answer surfaces.

This replicates the paradigm that has already proven immensely successful with coding agents. Tools like Claude Code and Cursor are effective precisely because they inhabit the codebase — navigating with ls, searching with grep, reading with cat — rather than querying it through an abstract API. Sy applies the same model to scientific literature, and the same bash-trained intuitions that make coding agents powerful transfer directly.

On bioRxiv Bench, Sy is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across 140 questions spanning Deep Paper Q&A, Experiment Novelty Check, and Multi-Paper Synthesis.

Try Sy yourself at sy.gxl.ai!