Representing Biomedical Literature as a Filesystem through Agent-Native Indexing
Let's say you want to retrieve information from a remote source, like a large corpus of paper preprints. Today, most MCP-style integrations behave like structured communication channels between distant systems. You define the fields ahead of time, send a request, and receive a structured response. That works well when the question is clearly defined and the answer fits neatly into those predefined slots.
But many interesting questions aren't like that. In practice, discovery is often exploratory: you need to
move through the data itself, following threads and context rather than issuing a single precise query.
However, coding agents like Claude Code, Codex, and Cursor already have deep intuitions for exactly this
style of navigation. They have strong priors for navigating codebases through bash (ls,
grep, find, cat): which commands to reach for, how to compose them with
pipes, and how to move from a broad directory structure down to the exact file and line they need.
The problem is that scientific literature gives them nothing to navigate.
The Problem with MCP Tools
Today's biomedical research agents are stuck on the wrong side of this divide. Common LLM-based tools rely on
MCP servers and search APIs that behave as structured channels: a search_papers tool with a
handful of parameters that returns massive payloads of abstracts, with no way to browse, no sense of what's
nearby, and no ability to refine by navigating rather than re-querying. The agent throws a query into a void
and catches whatever comes back.
The agent can't ls the literature to see what's in a research area. It can't grep
across methods sections to find how experiments were actually done. It can't follow a citation trail by
reading a file. The scientific literature has no geography that the agent can navigate, so it can't orient
itself, and all those deeply trained intuitions about filesystem navigation go entirely unused.
Current Literature Agents
Many current literature agents share a similar architecture: wrappers over PubMed, Semantic Scholar, or
bioRxiv APIs, accessed through MCP connectors or function-calling schemas. The agent calls
search_papers(query), gets back a list of abstracts, and summarizes them. This works when the
question is clearly defined, but it breaks down on anything that requires reading the content of a paper in
detail: methods sections, supplemental tables, figure captions, appendix data. These are precisely the parts
that contain what researchers actually need — the specific protocol, the sample sizes, the failure modes,
and the caveats buried in extended data.
The traditional alternative is to bring the data to you: downloading terabytes of papers, building indexing pipelines, and running search infrastructure locally, all before any real investigation can begin. MCP connectors sit somewhere in between but still fall short — they return metadata and abstracts. Getting full-text content requires fetching a PDF at query time, dumping the entire document into context, and hoping the relevant sentence surfaces in a ~40K token blob. There is no way to search within the document, no way to navigate to the Methods section without reading everything before it, and no way to run the same extraction across 50 papers without exhausting your token budget on the first three.
Our approach takes the opposite direction: rather than moving the data to the agent, we move the agent to the data.
Biomedical Preprints as a Filesystem
You can think of it as opening a small portal between your system and the corpus. The remote data source is exposed as a virtual filesystem, and the heavy lifting has already been done. The corpus is indexed, structured, and optimized for search so agents can explore efficiently without downloading or managing the data themselves.
Scientific papers are born as rich collections of structured artifacts: tables, figures, methods sections, supplementary spreadsheets, appendices, and code. Then authors compress all of that structure by flattening everything into a single PDF at submission time. The format is optimized for print and human readability, but not for AI agents. Our core idea is to reverse this compression. We re-expand each paper back into a filesystem where every paper is a folder, every figure is a file, and every section (Methods, Results, Discussion) is individually addressable. Each paper becomes a directory that an agent can enter, inspect, and traverse as deeply as needed.
An agent goes to the filesystem with a task, navigates the relevant parts of the corpus, gathers the necessary context,
and returns with the answer and supporting evidence. When it wants to replicate an experiment, it reads
sections/Methods.lines. When it wants to compare results across studies, it reads
sections/Results.lines. When it's trying to understand how the field interprets a finding, it
reads sections/Discussion.lines. This mirrors how scientists actually use papers: not as
monolithic blobs, but as collections of structurally distinct knowledge types.
Each line of text has a unique numeric identifier that traces back to the source, and every figure is
individually addressable. The agent can cd into a paper, grep for a term,
cat specific sections, and head the first 50 lines, using the same bash workflow it
uses to navigate any codebase. Instead of hauling terabytes of data across the network and rebuilding indexing
infrastructure locally, the exploration happens where the information already lives.
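To make the layout concrete, here is a minimal sketch in Python. The directory names and line contents are hypothetical, constructed to match the conventions described above (one directory per paper, `sections/*.lines` files with per-line identifiers); the real corpus layout may differ in detail.

```python
import tempfile
from pathlib import Path

# Build a tiny hypothetical paper directory to navigate.
root = Path(tempfile.mkdtemp()) / "papers" / "10.1101_2025.01.001"
(root / "sections").mkdir(parents=True)
(root / "sections" / "Methods.lines").write_text(
    "L001 Cells were cultured at 37C.\n"
    "L002 CRISPR base editing was performed with ABE8e.\n"
)
(root / "sections" / "Results.lines").write_text(
    "L001 Editing efficiency reached 62% in HEK293T cells.\n"
)

# The agent's grep-style workflow, expressed with pathlib:
hits = [
    (path.name, line)
    for path in (root / "sections").glob("*.lines")
    for line in path.read_text().splitlines()
    if "CRISPR" in line
]
print(hits)  # one hit, in Methods.lines, with its line identifier intact
```

Because each hit carries the file name and the per-line identifier, the answer arrives already cited back to its source.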
How It Works
The pipeline runs in three stages: Processing (section segmentation), Storage (content blocks), and Indexing (block-level retrieval over hot paths).
Building the Index
Making half a million preprints navigable as directories requires solving a document engineering problem
at scale. The raw corpus is a mix of JATS XML and supplementary materials in diverse formats: PDFs,
Excel spreadsheets, Word documents, CSV files, even PowerPoint slides. The main text pipeline parses
JATS XML into individual content blocks, where each paragraph, table, figure caption, section header,
and formula becomes a separately addressable unit. This decomposition is what allows an agent to
grep for a term and land on the exact block rather than ingesting a 40K-token blob.
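The decomposition step can be sketched as follows. The JATS fragment and the block schema here are simplified for illustration (real JATS uses namespaces and far richer structure, and the production schema carries XPath and source-file metadata), but the core move is the same: every paragraph becomes a separately addressable block.

```python
import xml.etree.ElementTree as ET

# Simplified JATS-like fragment (illustrative only).
jats = """
<article>
  <body>
    <sec><title>Methods</title>
      <p>Cells were treated with 10 uM drug.</p>
      <p>Sequencing was performed on a NovaSeq.</p>
    </sec>
    <sec><title>Results</title>
      <p>IC50 was 0.9 uM.</p>
    </sec>
  </body>
</article>
"""

def to_blocks(xml_text):
    """Flatten sections into separately addressable content blocks."""
    root = ET.fromstring(xml_text)
    blocks = []
    for sec in root.iter("sec"):
        section = sec.findtext("title")
        for i, p in enumerate(sec.findall("p")):
            blocks.append({
                "section": section,
                "block_id": f"{section}:{i}",
                "text": p.text.strip(),
            })
    return blocks

blocks = to_blocks(jats)
# A grep-style lookup lands on the exact block, not the whole document:
hit = [b for b in blocks if "IC50" in b["text"]]
print(hit)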
Paper supplements present a challenge as they are inherently heterogeneous and unstructured. A single
paper might attach a PDF with 15 supplementary figures, an Excel file of raw assay data, and a Word
document with extended methods. To convert these to LLM-native text, we run each supplement through OCR
models that perform document segmentation, table recognition, and formula extraction, producing the same
block-level format: typed content with bounding boxes and page coordinates. The output is normalized into
the same schema as the XML-derived blocks, so the agent sees a uniform supplements/
directory regardless of whether the underlying source was a scanned PDF or native XML.
The resulting ~70 million content blocks are stored in PostgreSQL with per-block JSONB metadata linking
each block back to its source file and XPath, and dual-indexed in Elasticsearch through a hybrid
retrieval layer that combines BM25 keyword scoring with dense vector embeddings. When the agent runs
search "CRISPR base editing", both indices fire in parallel and results are merged via
reciprocal rank fusion. When it runs grep "IC50" inside a paper, the query hits the
block-level index filtered to that document. Figures are individually addressable and can be routed to
a vision model on demand. The entire layer is invisible to the agent—it sees files and directories.
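The merge step uses the standard reciprocal rank fusion formula, where a document's score is the sum of 1/(k + rank) over each ranking it appears in (k = 60 is the conventional constant; the result lists below are hypothetical):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for: search "CRISPR base editing"
bm25 = ["paper_A", "paper_B", "paper_C"]   # keyword ranking
dense = ["paper_B", "paper_D", "paper_A"]  # embedding ranking
merged = rrf_merge([bm25, dense])
print(merged)  # ['paper_B', 'paper_A', 'paper_D', 'paper_C']
```

Documents that both indices rank highly (paper_B) rise to the top, while documents found by only one index still survive the merge.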
Why Bash
LLMs trained on code have encountered ls, grep, find, cat,
wc, diff, head, tail, and pipe composition billions of
times. They don't treat these as abstract API calls. They know grep -r recurses, they know
wc -l counts lines, they know how to pipe output between commands, and they know what an empty
directory means versus a missing one.
When the research filesystem responds to standard bash, the agent doesn't need to learn a new tool schema
through in-context examples. It applies the same skills it uses to navigate a codebase, now pointed at
scientific literature. A custom API with search_preprints() and get_preprint() means
the model learns your interface from scratch on every invocation. It will use it, but it won't compose tools
in ways you didn't anticipate.
Map-Reduce Over Papers
By treating each paper as a directory, we unlock another powerful pattern: map-reduce over papers. A map operation dispatches a lightweight subagent to every paper in parallel, each with filesystem access to its paper's directory. A reduce operation then synthesizes the extractions into a unified answer. This mirrors how scientists do literature reviews (asking the same question across many documents) but runs in minutes across dozens of papers, extracting structured data from full text, not abstracts. Each subagent navigates only the relevant parts of its paper and returns with a precise extraction.
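The pattern reduces to ordinary map-reduce once papers are directories. A minimal sketch, with a toy in-memory corpus standing in for subagents that each hold filesystem access to one paper:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-paper text; in practice each map call is a subagent
# scoped to one paper's directory.
corpus = {
    "paper_A": "Results: IC50 was 0.9 uM in HEK293T.",
    "paper_B": "Results: no IC50 reported.",
    "paper_C": "Results: IC50 was 450 nM in K562.",
}

def map_paper(item):
    """Map step: ask the same question of one paper."""
    doc_id, text = item
    return {"doc": doc_id, "reports_ic50": "IC50 was" in text}

def reduce_answers(extractions):
    """Reduce step: synthesize per-paper extractions into one answer."""
    hits = sorted(e["doc"] for e in extractions if e["reports_ic50"])
    return {"papers_with_ic50": hits, "n": len(hits)}

# Map in parallel, then reduce.
with ThreadPoolExecutor(max_workers=8) as pool:
    extractions = list(pool.map(map_paper, corpus.items()))
summary = reduce_answers(extractions)
print(summary)  # {'papers_with_ic50': ['paper_A', 'paper_C'], 'n': 2}
```

Each map call stays small and parallel; only the compact extractions, not the full papers, flow into the reduce step.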
MCP Tool Server
- search_preprints(category, date_range). No keyword search; browse by category + date only. Get recent titles + abstracts.
- get_preprint(doi). Entire paper (~40K tokens) dumped into context as a blob. Context now ~100K tokens of undifferentiated text. No section boundaries. Citations unreliable.
- Result: 2–3 papers, vague summaries.
GXL Sy (Ours)
- Get paths to top papers sorted by relevance.
- grep Results, head Methods, cat a figure. Returns structured extraction with block-level citations.
- Reduce into synthesis.
- Result: 100 papers, specific data, every claim cited.
GXL Sy is our research agent built on top of this filesystem. It navigates over 450,000 bioRxiv and medRxiv preprints, using the full depth of the virtual filesystem to answer questions that require reading specific passages, checking experimental novelty across the literature, and synthesizing findings from multiple papers. Rather than wrapping search results in a prompt, GXL Sy enters the corpus, follows leads across papers, and returns with grounded, citable answers.
bioRxiv Bench
To measure whether focused exploration outperforms rigid querying, we introduce bioRxiv Bench, a benchmark of 140 questions drawn from real research workflows over bioRxiv and medRxiv preprints. The benchmark spans three task types: Deep Paper Q&A (N=50), Experiment Novelty Check (N=50), and Multi-Paper Synthesis (N=40).
We compare GXL Sy (Ours) against two baselines: Claude Code with the Claude bioRxiv MCP connector, and the FutureHouse Edison Platform, which provides AI-powered biomedical literature search. For FutureHouse Edison, we used the Precedent agent on Experiment Novelty Check and the Literature agent on Paper Q&A and Multi-Paper Synthesis.
Deep Paper Q&A (N=50)
GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison
50 supplement-grounded questions across 50 bioRxiv preprints, each requiring data from supplemental tables, PDFs, or DOCX files that cannot be answered from the paper's main text alone.
Dataset Construction
50 single-document questions drawn from 50 distinct bioRxiv preprints published in 2025. Each question was generated by granting a model access to the full paper including supplements, then manually reviewed and filtered for clarity, accuracy, and relevance.
Questions were explicitly designed to be unanswerable from the abstract or main text alone, requiring data from supplemental tables, PDFs, or DOCX files. Each answer is accompanied by the specific supplement file path, step-by-step reasoning procedure, and executable code used to derive the answer.
Example Questions
Scoring Protocol
Responses are scored by an LLM judge against ground truth. Numeric answers are accepted within a 2% relative tolerance or up to 5 decimal places, and semantically equivalent formats are treated as equal (e.g., "95%" == "0.95"). String and categorical answers are matched case-insensitively on core meaning. Each response scores 1.0 (correct) or 0.0 (incorrect).
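The numeric-tolerance rule can be sketched as a small checker. This covers only the relative-tolerance and percent-vs-fraction equivalence described above (the actual judge also handles decimal-place matching and string/categorical answers); the function name is illustrative:

```python
def numeric_match(pred, truth, rel_tol=0.02):
    """Accept numeric answers within 2% relative tolerance, treating
    percentage and fraction forms as equivalent (e.g. "95%" == "0.95")."""
    def parse(s):
        s = str(s).strip()
        if s.endswith("%"):
            return float(s[:-1]) / 100.0
        return float(s)
    p, t = parse(pred), parse(truth)
    if t == 0:
        return p == 0
    return abs(p - t) / abs(t) <= rel_tol

print(numeric_match("95%", "0.95"))   # True: equivalent formats
print(numeric_match("0.94", "0.95"))  # True: ~1.1% off, within tolerance
print(numeric_match("0.90", "0.95"))  # False: ~5.3% off
```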
Experiment Novelty Check (N=50)
GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison
50 questions were constructed by starting from real bioRxiv and medRxiv papers with distinctive quantitative findings, then reverse-engineering the natural-language experiment novelty query a researcher might ask before attempting similar work.
Dataset Construction
Questions were authored by first identifying papers in our indexed bioRxiv/medRxiv corpus that contain distinctive quantitative findings — binding affinities, enzyme kinetics, production titers, dose-response values — then reverse-engineering a natural language query a researcher might pose before starting analogous work.
Each question has a ground truth record containing the target paper's document ID, title, DOI, authors, and the location of the experiments or results within the paper (body text, supplement table, or figure).
Example Questions
Scoring Protocol
We use a 3-criteria LLM-as-a-judge system (Claude Sonnet 4.6) to evaluate each agent response, with quantitative values matched under unit normalization (e.g., 0.9 µM = 900 nM).
Multi-Paper Synthesis (N=40)
GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison
40 cross-paper synthesis questions across 8 categories (relevant retrieval, exact quotation, abstract vs. substance, emerging directions, cross-paper contradiction, cross-paper synthesis, quantitative extraction, method comparison), each requiring evidence drawn from a minimum of 5 papers. Scored with literal per-question criteria. FutureHouse Edison and CC + bioRxiv were re-judged on the same 40-question set.
Dataset Construction
40 questions synthetically generated across 8 categories: spatial transcriptomics, foundation models, gene therapy safety, single-cell genomics, neuroscience, immunology, cancer biology, and synthetic biology. Each question was designed to require synthesis across a minimum of 5 papers and to be unanswerable from any single abstract.
Questions were generated by prompting a model with subfield descriptions and example queries, then filtered for quality through human review — removing questions that were too narrow, too broad, or answerable without cross-paper synthesis.
Example Questions
Scoring Protocol
Each question is scored by an LLM judge against a checklist of criteria written specifically for that query. A response passes only if every criterion is met. Completeness is the fraction of questions that fully pass.
Criteria are written literally against each question — they check exactly what the question asks for, not a generalized rubric. For example, for the query "Across recent spatial transcriptomics or spatial proteomics preprints, extract each paper's main claimed cell-cell interaction, niche, or microenvironment finding and whether it includes orthogonal validation such as immunostaining, RNAscope, perturbation, or functional follow-up. Which classes of spatial claims are usually supported only by computational figures, and which are routinely validated experimentally?", the criteria are:
- All returned papers are spatial transcriptomics or spatial proteomics preprints (not bulk RNA-seq or non-spatial single-cell studies)
- For each paper, the main claimed cell-cell interaction, niche, or microenvironment finding is extracted
- For each paper, states specifically whether it includes orthogonal validation (immunostaining, RNAscope, perturbation, or functional follow-up) or is computational only
- Identifies which classes of spatial claims (e.g. ligand-receptor predictions, niche composition, cell co-localization) are typically computational-only vs. routinely validated experimentally across the returned papers
A response that returns non-spatial papers, summarizes findings without per-paper validation status, or omits the classification of claim types fails. Partial credit is not given.
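The all-or-nothing rule above can be sketched directly: a response passes only if every criterion holds, and completeness is the fraction of questions that fully pass. The criteria here are hypothetical predicates over a structured response dict, standing in for the per-question checklists:

```python
def passes(response, criteria):
    """A response passes only if every criterion is met (no partial credit)."""
    return all(criterion(response) for criterion in criteria)

def completeness(responses, criteria_per_question):
    """Fraction of questions whose responses fully pass their checklist."""
    passed = sum(
        passes(r, c) for r, c in zip(responses, criteria_per_question)
    )
    return passed / len(responses)

# Hypothetical checklist for the spatial-transcriptomics query above:
criteria = [
    lambda r: all(p["spatial"] for p in r["papers"]),
    lambda r: all("validation" in p for p in r["papers"]),
    lambda r: bool(r.get("claim_class_summary")),
]

good = {
    "papers": [{"spatial": True, "validation": "RNAscope"}],
    "claim_class_summary": "ligand-receptor claims are computational-only",
}
bad = {"papers": [{"spatial": True}]}  # no validation status, no summary

print(completeness([good, bad], [criteria, criteria]))  # 0.5
```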
Category Breakdown (5 questions each)
| Category | GXL Sy (Ours) | CC + MCP | FutureHouse Edison |
|---|---|---|---|
| Relevant Retrieval | 4/5 | 1/5 | 4/5 |
| Exact Quotation | 5/5 | 3/5 | 1/5 |
| Abstract vs. Substance | 5/5 | 2/5 | 5/5 |
| Emerging Directions | 5/5 | 2/5 | 3/5 |
| Cross-Paper Contradiction | 5/5 | 2/5 | 4/5 |
| Cross-Paper Synthesis | 5/5 | 5/5 | 2/5 |
| Quantitative Extraction | 3/5 | 4/5 | 4/5 |
| Method Comparison | 5/5 | 4/5 | 5/5 |
| Total | 37/40 (92%) | 23/40 (58%) | 28/40 (70%) |
Across all task types the pattern holds. On Deep Paper Q&A, GXL Sy (Ours) scores
100% vs 86%, runs 3.4× faster (1m6s vs 3m42s), and costs
5.1× less ($0.21 vs $1.07). On Experiment Novelty Check, it
scores 80%, 2.9× more accurate than CC + bioRxiv connector (28%) and
4.0× more accurate than FutureHouse Edison (20%), while running 1.2×
faster
than CC + bioRxiv connector (2m22s vs 2m54s) and 5.2× faster than FutureHouse Edison
(2m22s vs
12m13s), with lower average cost than CC + bioRxiv connector ($0.36 vs $0.93); FutureHouse Edison charges per
credit used.
On Multi-Paper Synthesis, it is
59% more complete (92%
vs 58%), 3.2× faster (2m6s vs 6m48s), produces 2.6× more
citations (27.9 vs 10.6), and costs 3.7× less ($0.53 vs $1.96). The
efficiency comes from targeted access: a grep into sections/Results.lines
consumes ~200 tokens versus ~40,000 for loading a full paper through MCP. The accuracy comes from
section-level precision: questions about methods go to Methods, questions about limitations go to Discussion.
Below, we walk through specific benchmark examples comparing the two approaches side by side.
Case Studies
Deep Paper Q&A
Questions that can only be answered from supplemental figures, tables, or methods: content that is invisible to abstract-level search and requires navigating directly into supplement files. Every supplement is a first-class path in the filesystem.
GXL Sy (Ours)
Why it succeeded
- Two tool calls: lookup doi to find the paper, then scan content.lines to find the exact count
- Returned block-level citations pointing to the exact sentence in the paper
Claude Code + bioRxiv MCP
Why it fell short
- The paper was fetched successfully, but the specific count ("16 proteins") is buried in the Results section of a ~38K token blob with no way to search for it
- The answer drifted toward named proteins from prior knowledge rather than the explicit number stated in the text
GXL Sy (Ours)
Why it succeeded
- Two tool calls: DOI lookup + keyword scan of content
- Extracted exact counts (10, 23, 12, 11) and the subdivision logic directly from the text
Claude Code + bioRxiv MCP
Why it fell short
- Paper fetched successfully but specific cohort counts are in a Methods subsection buried in ~41K tokens — impossible to locate without text search
- Answered descriptively ("IFN-positive and IFN-negative") without the exact numbers the question required
GXL Sy (Ours)
Why it succeeded
- Two tool calls total: lookup DOI to resolve the paper, then scan for the specific figure
- Answer found in the abstract-level content, with block-level citation to the exact line
Claude Code + bioRxiv MCP
Why it fell short
- Paper was retrieved, but 44 appears once in a dense Methods paragraph inside ~35K tokens — the model hallucinated 29 instead
- No way to search within the full-text blob means precision extraction fails for specific numbers
Experiment Novelty Check: "Has this been done before?"
A researcher describes a specific experimental idea and asks whether it exists in the literature. These questions require precise extraction from Methods and Results, sections the filesystem exposes as individual files.
GXL Sy (Ours)
Why it succeeded
- Searched the filesystem directly for "inverted" + "cinR" + "rhlR" — found the supplement figure legend (L216) describing the exact swap
- Supplement text is indexed as a file; the key sentence was retrievable by keyword scan in milliseconds
- Real block IDs from the actual paper; answer grounded in primary source
Claude Code + bioRxiv MCP
Why it fell short
- Found the right paper via web search but couldn't read it — bioRxiv and ACS both returned 403
- The key result is in the supplement (Fig S2I–J legend), not the abstract, so even a successful fetch of the abstract wouldn't have answered the question
- Session terminated without a verdict
GXL Sy (Ours)
Why it succeeded
- First search hit was the exact paper. The filesystem indexes 2021 preprints alongside 2025 ones — no recency bias.
- Scanned the Results section for "lag", "time-delay", "Mg-glut" and extracted exact numbers from text
- Identified the K-glut gap as a genuine novelty angle by reading what the paper explicitly did and didn't vary
Claude Code + bioRxiv MCP
Why it fell short
- Found the right paper via web search but couldn't read it — bioRxiv returned 403
- Got distracted chasing a ResearchGate figure for authorship metadata; session ended without a verdict
- The key data (exact lag time per cosolute, concentrations tested) is buried in the Results section — inaccessible without full-text
GXL Sy (Ours)
Why it succeeded
- Nov 2025 preprint was already indexed in the filesystem — returned as top hit immediately
- Read full paper text to extract exact gene count (29), success rate (18/29), and escape frequencies
- Identified genuine novelty angles from what the paper explicitly does and does not cover
Claude Code + bioRxiv MCP
Why it fell short
- The Gonzalez-Lopez et al. preprint (Nov 2025) exists but wasn't surfaced by web search — too recent for search index coverage at time of query
- Spotted a DOI hint (2025.10.09.681377) but couldn't access the content
- Concluded NOVEL — the opposite of the correct answer. A researcher acting on this would waste months.
Idea Discovery: "What's new that I'm not aware of?"
Discovering convergent signals across the literature: patterns only visible when you analyze 25–50 papers in parallel. This is where map-reduce over filesystems is decisive.
GXL Sy (Ours)
Why it succeeded
- 42 subagents each extracted claim type + validation depth from sections/Results.lines and supplements/
- Pattern invisible from abstracts: abstracts describe the finding, not whether it was validated
- All citations link to real document IDs from the filesystem
Claude Code + bioRxiv MCP
Why it fell short
- Abstracts don't state whether validation was done — this requires reading Methods and Results of each paper
- Only 2 papers loaded in full; couldn't extract per-claim validation status at scale
- Answer reflects general knowledge, not evidence from actual preprints
GXL Sy (Ours)
Why it succeeded
- Subagents read sections/Methods.lines to extract split strategy — this is never in the abstract
- Extracted leakage controls and benchmarks from 20 papers; a task impossible within a single context window
- Identified which papers share train/test datasets with their own benchmarks
Claude Code + bioRxiv MCP
Why it fell short
- Split strategy and leakage controls are buried in Methods — not accessible from abstracts
- Couldn't load enough papers to compare practices across 20+ groups
- No ability to identify which specific papers have benchmark overlap with their own pretraining data
GXL Sy (Ours)
Why it succeeded
- Subagents read pathology figures and supplements/ — where toxicity data lives, not abstracts
- Identified the abstract-vs-pathology mismatch pattern only visible when reading 29 papers in parallel
- Extracted exact doses, species, and timing; real document IDs throughout
Claude Code + bioRxiv MCP
Why it fell short
- Toxicity data is in pathology supplements and figures — not accessible via MCP text extraction
- The key finding (abstract-vs-pathology mismatch) requires reading the abstract and the supplement of each paper — impossible at scale without a filesystem
- Answer is general knowledge about AAV safety, not evidence from these specific preprints
Conclusion
Instead of moving data to the agent, we bring the agent to the data. By exposing 450K bioRxiv and medRxiv preprints as a virtual filesystem, we place the agent inside the corpus rather than behind a query interface. This is a necessary shift to get past shallow search: when paper content is structured as directories with individually addressable sections, supplements, and figures, the agent can make targeted, efficient reads at whatever granularity the question demands rather than ingesting entire documents and hoping the answer surfaces.
This replicates the paradigm that has already proven immensely successful with coding agents. Tools like
Claude Code and Cursor are effective precisely because they inhabit the codebase — navigating with
ls, searching with grep, reading with cat — rather than querying it
through an abstract API. Sy applies the same model to scientific literature, and the same bash-trained
intuitions that make coding agents powerful transfer directly.
On bioRxiv Bench, Sy is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across 140 questions spanning Deep Paper Q&A, Experiment Novelty Check, and Multi-Paper Synthesis.
Try Sy yourself at sy.gxl.ai!