Representing Biomedical Literature as a Filesystem through Agent-Native Indexing
By The GXL Team
Instead of moving data to the agent, we send the agent to the data. Biomedical preprints are exposed as a virtual filesystem that agents explore using the same bash tools they use on codebases. We built Sy, our research agent, on top of this filesystem. Sy is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across deep paper Q&A, experiment novelty checking, and cross-paper synthesis.
Let’s say you want to retrieve information from a remote source, like a large corpus of paper preprints. Today, most MCP-style integrations behave like structured communication channels between distant systems. You define the fields ahead of time, send a request, and receive a structured response. That works well when the question is clearly defined and the answer fits neatly into those predefined slots.
But many interesting questions aren’t like that. In practice, discovery is often exploratory: you need to move through the data itself, following threads and context rather than issuing a single precise query. Coding agents like Claude Code, Codex, and Cursor already have deep intuitions for exactly this style of navigation. They carry strong priors about navigating codebases through bash (ls, grep, find, cat): which commands to reach for, how to compose them with pipes, and how to work from a broad directory structure down to the exact file and line they need. The problem is that scientific literature gives them nothing to navigate.
The Problem with MCP Tools
Today’s biomedical research agents are stuck on the wrong side of this divide. Common LLM-based tools rely on MCP servers and search APIs that behave as structured channels: a search_papers tool with a handful of parameters that returns massive payloads of abstracts, with no way to browse, no sense of what’s nearby, and no ability to refine by navigating rather than re-querying. The agent throws a query into a void and catches whatever comes back.
The agent can’t ls the literature to see what’s in a research area. It can’t grep across methods sections to find how experiments were actually done. It can’t follow a citation trail by reading a file. The scientific literature has no geography that the agent can navigate, so it can’t orient itself, and all those deeply trained intuitions about filesystem navigation go entirely unused.
Current Literature Agents
Many current literature agents share a similar architecture: wrappers over PubMed, Semantic Scholar, or bioRxiv APIs, accessed through MCP connectors or function-calling schemas. The agent calls search_papers(query), gets back a list of abstracts, and summarizes them. This works when the question is clearly defined, but it breaks down on anything that requires reading the content of a paper in detail: methods sections, supplemental tables, figure captions, appendix data. These are precisely the parts that contain what researchers actually need — the specific protocol, the sample sizes, the failure modes, and the caveats buried in extended data.
The traditional alternative is to bring the data to you: downloading terabytes of papers, building indexing pipelines, and running search infrastructure locally, all before any real investigation can begin. MCP connectors sit somewhere in between but still fall short — they return metadata and abstracts. Getting full-text content requires fetching a PDF at query time, dumping the entire document into context, and hoping the relevant sentence surfaces in a ~40K token blob. There is no way to search within the document, no way to navigate to the Methods section without reading everything before it, and no way to run the same extraction across 50 papers without exhausting your token budget on the first three.
Our approach takes the opposite direction: rather than moving the data to the agent, we move the agent to the data.
Biomedical Preprints as a Filesystem
You can think of it as opening a small portal between your system and the corpus. The remote data source is exposed as a virtual filesystem, and the heavy lifting has already been done. The corpus is indexed, structured, and optimized for search so agents can explore efficiently without downloading or managing the data themselves.
Scientific papers are born as rich collections of structured artifacts: tables, figures, methods sections, supplementary spreadsheets, appendices, and code. Then authors compress all of that structure by flattening everything into a single PDF at submission time. The format is optimized for print and human readability, but not for AI agents. Our core idea is to reverse this compression. We re-expand each paper back into a filesystem where every paper is a folder, every figure is a file, and every section (Methods, Results, Discussion) is individually addressable. Each paper becomes a directory that an agent can enter, inspect, and traverse as deeply as needed.
An agent goes to the filesystem with a task, navigates the relevant parts of the corpus, gathers the necessary context, and returns with the answer and supporting evidence. When it wants to replicate an experiment, it reads sections/Methods.lines. When it wants to compare results across studies, it reads sections/Results.lines. When it’s trying to understand how the field interprets a finding, it reads sections/Discussion.lines. This mirrors how scientists actually use papers: not as monolithic blobs, but as collections of structurally distinct knowledge types.
Each line of text has a unique numeric identifier that traces back to the source, and every figure is individually addressable. The agent can cd into a paper, grep for a term, cat specific sections, and head the first 50 lines, using the same bash workflow it uses to navigate any codebase. Instead of hauling terabytes of data across the network and rebuilding indexing infrastructure locally, the exploration happens where the information already lives.
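To make this concrete, here is a minimal sketch of that workflow. The directory layout (`sections/*.lines` with numbered blocks) follows the description above, but the paper ID, file contents, and block-ID format are invented mock data built locally for illustration, not the production corpus:

```shell
# Build a mock paper directory mirroring the layout described in the post.
# Paper ID and contents are illustrative assumptions.
root=$(mktemp -d)
paper="$root/corpus/biorxiv-2025-000001"
mkdir -p "$paper/sections" "$paper/supplements"

# Each .lines file holds numbered, individually addressable blocks.
cat > "$paper/sections/Methods.lines" <<'EOF'
L001 Cells were transfected with the base editor at 0.5 ug per well.
L002 Editing efficiency was quantified by amplicon sequencing.
EOF
cat > "$paper/sections/Results.lines" <<'EOF'
L001 Base editing reached 61% efficiency at the target locus.
EOF

# The same moves an agent makes in a codebase:
ls "$paper/sections"                                  # orient: which sections exist?
grep -n "efficiency" "$paper/sections/Results.lines"  # land on the exact block
head -n 1 "$paper/sections/Methods.lines"             # read only the first block
```

The point is that nothing here is a new tool: each command reads a few hundred bytes of exactly the section the question is about.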
How It Works
Building the Index
Making half a million preprints navigable as directories requires solving a document engineering problem at scale. The raw corpus is a mix of JATS XML and supplementary materials in diverse formats: PDFs, Excel spreadsheets, Word documents, CSV files, even PowerPoint slides. The main text pipeline parses JATS XML into individual content blocks, where each paragraph, table, figure caption, section header, and formula becomes a separately addressable unit. This decomposition is what allows an agent to grep for a term and land on the exact block rather than ingesting a 40K-token blob.
Paper supplements present a challenge as they are inherently heterogeneous and unstructured. A single paper might attach a PDF with 15 supplementary figures, an Excel file of raw assay data, and a Word document with extended methods. To convert these to LLM-native text, we run each supplement through OCR models that perform document segmentation, table recognition, and formula extraction, producing the same block-level format: typed content with bounding boxes and page coordinates. The output is normalized into the same schema as the XML-derived blocks, so the agent sees a uniform supplements/ directory regardless of whether the underlying source was a scanned PDF or native XML.
The resulting ~70 million content blocks are stored in PostgreSQL with per-block JSONB metadata linking each block back to its source file and XPath, and dual-indexed in Elasticsearch through a hybrid retrieval layer that combines BM25 keyword scoring with dense vector embeddings. When the agent runs search "CRISPR base editing", both indices fire in parallel and results are merged via reciprocal rank fusion. When it runs grep "IC50" inside a paper, the query hits the block-level index filtered to that document. Figures are individually addressable and can be routed to a vision model on demand. The entire layer is invisible to the agent — it sees files and directories.
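The rank-fusion step can be sketched in a few lines of awk. Reciprocal rank fusion scores each document as the sum of 1/(k + rank) over the result lists it appears in; the doc IDs, toy rankings, and the common k = 60 constant below are assumptions for illustration, not the production configuration:

```shell
# Minimal reciprocal-rank-fusion sketch over two ranked result lists.
work=$(mktemp -d)
printf 'docA\ndocB\ndocC\n' > "$work/bm25.txt"    # BM25 keyword ranking
printf 'docC\ndocA\ndocD\n' > "$work/dense.txt"   # dense-embedding ranking

# score(d) = sum over lists of 1 / (k + rank of d in that list).
# FNR is awk's per-file line number, i.e. the rank within each list.
awk -v k=60 '{ score[$1] += 1 / (k + FNR) }
             END { for (d in score) printf "%s %.5f\n", d, score[d] }' \
    "$work/bm25.txt" "$work/dense.txt" | sort -k2 -rn > "$work/fused.txt"

cat "$work/fused.txt"   # docA ranks first: it is near the top of both lists
```

Documents that appear high in both rankings accumulate score from both terms, which is why a hit that is merely decent in each index can outrank the top hit of either one alone.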
Why Bash
LLMs trained on code have encountered ls, grep, find, cat, wc, diff, head, tail, and pipe composition billions of times. They don’t treat these as abstract API calls. They know grep -r recurses, they know wc -l counts lines, they know how to pipe output between commands, and they know what an empty directory means versus a missing one.
When the research filesystem responds to standard bash, the agent doesn’t need to learn a new tool schema through in-context examples. It applies the same skills it uses to navigate a codebase, now pointed at scientific literature. A custom API with search_preprints() and get_preprint() means the model learns your interface from scratch on every invocation. It will use it, but it won’t compose tools in ways you didn’t anticipate.
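A small example of the kind of unplanned composition this enables: counting how many papers in a corpus mention a term in their Methods, with no bespoke endpoint for that question. The corpus layout and contents are mock data assumed for illustration:

```shell
# Mock three-paper corpus, same sections/ layout as above (illustrative only).
root=$(mktemp -d)
for i in 1 2 3; do
  mkdir -p "$root/corpus/paper-$i/sections"
  echo "standard lysis protocol" > "$root/corpus/paper-$i/sections/Methods.lines"
done
echo "lentiviral transduction at MOI 5" >> "$root/corpus/paper-2/sections/Methods.lines"

# Compose grep + wc instead of calling a purpose-built API:
grep -l "lentiviral" "$root"/corpus/*/sections/Methods.lines | wc -l   # → 1
```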
Map-Reduce Over Papers
By treating each paper as a directory, we unlock another powerful pattern: map-reduce over papers. A map operation dispatches a lightweight subagent to every paper in parallel, each with filesystem access to its paper’s directory. A reduce operation then synthesizes the extractions into a unified answer. This mirrors how scientists do literature reviews (asking the same question across many documents) but runs in minutes across dozens of papers, extracting structured data from full text, not abstracts. Each subagent navigates only the relevant parts of its paper and returns with a precise extraction.
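The pattern can be sketched with plain xargs parallelism: the "map" runs one extraction per paper directory concurrently, and the "reduce" merges the per-paper results. The corpus layout and the IC50 values are mock data assumed for illustration, and xargs stands in for the real subagent dispatch:

```shell
# Mock corpus of four papers, each with a Results section (illustrative only).
root=$(mktemp -d)
for i in 1 2 3 4; do
  mkdir -p "$root/corpus/paper-$i/sections"
  echo "IC50 = ${i}0 nM" > "$root/corpus/paper-$i/sections/Results.lines"
done

# Map: in parallel, pull the IC50 line out of every paper's Results section.
ls -d "$root"/corpus/paper-* | \
  xargs -P 4 -I{} sh -c 'printf "%s\t%s\n" "$(basename {})" "$(grep IC50 {}/sections/Results.lines)"' \
  > "$root/extractions.tsv"

# Reduce: combine the per-paper extractions into one summary table.
sort "$root/extractions.tsv"
```

Each parallel worker touches only its own paper's files, so the context cost per paper is a single grepped line rather than a full document.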
MCP Tool Server
- search_preprints(category, date_range): no keyword search; browse by category + date only. Returns recent titles + abstracts.
- get_preprint(doi): the entire paper (~40K tokens) dumped into context as a blob. Context grows to ~100K tokens of undifferentiated text, with no section boundaries and unreliable citations.
- Result: 2–3 papers, vague summaries.
GXL Sy (Ours)
- search: get paths to the top papers sorted by relevance.
- Map over papers: grep Results, head Methods, cat a figure. Each extraction returns structured, with block-level citations.
- Reduce into a synthesis.
- Result: 100 papers, specific data, every claim cited.
GXL Sy is our research agent built on top of this filesystem. It navigates over 450,000 bioRxiv and medRxiv preprints, using the full depth of the virtual filesystem to answer questions that require reading specific passages, checking experimental novelty across the literature, and synthesizing findings from multiple papers. Rather than wrapping search results in a prompt, GXL Sy enters the corpus, follows leads across papers, and returns with grounded, citable answers.
bioRxiv Bench
To measure whether focused exploration outperforms rigid querying, we introduce bioRxiv Bench, a benchmark of 140 questions drawn from real research workflows over bioRxiv and medRxiv preprints. The benchmark spans three task types: Deep Paper Q&A (N=50), Experiment Novelty Check (N=50), and Multi-Paper Synthesis (N=40).
We compare GXL Sy (Ours) against two baselines: Claude Code with the Claude bioRxiv MCP connector, and the FutureHouse Edison Platform, which provides AI-powered biomedical literature search. For FutureHouse Edison, we used the Precedent agent on Experiment Novelty Check and the Literature agent on Paper Q&A and Multi-Paper Synthesis.
Deep Paper Q&A (N=50)
GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison
50 supplement-grounded questions across 50 bioRxiv preprints, each requiring data from supplemental tables, PDFs, or DOCX files; none can be answered from the paper’s main text alone.
Dataset Construction
50 single-document questions drawn from 50 distinct bioRxiv preprints published in 2025. Each question was generated by granting a model access to the full paper including supplements, then manually reviewed and filtered for clarity, accuracy, and relevance.
Questions were explicitly designed to be unanswerable from the abstract or main text alone, requiring data from supplemental tables, PDFs, or DOCX files. Each answer is accompanied by the specific supplement file path, step-by-step reasoning procedure, and executable code used to derive the answer.
Experiment Novelty Check (N=50)
GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison
50 questions were constructed by starting from real bioRxiv and medRxiv papers with distinctive quantitative findings, then reverse-engineering the natural-language experiment novelty query a researcher might ask before attempting similar work.
Dataset Construction
Questions were authored by first identifying papers in our indexed bioRxiv/medRxiv corpus that contain distinctive quantitative findings — binding affinities, enzyme kinetics, production titers, dose-response values — then reverse-engineering a natural language query a researcher might pose before starting analogous work.
Each question has a ground truth record containing the target paper's document ID, title, DOI, authors, and the location of the experiments or results within the paper (body text, supplement table, or figure).
Multi-Paper Synthesis (N=40)
GXL Sy (Ours) vs. Claude Code + bioRxiv MCP connector vs. FutureHouse Edison
40 cross-paper synthesis questions across 8 question categories (relevant retrieval, exact quotation, abstract vs. substance, emerging directions, cross-paper contradiction, cross-paper synthesis, quantitative extraction, method comparison), each requiring evidence drawn from a minimum of 5 papers. Scored with literal per-question criteria. FutureHouse Edison and CC + bioRxiv were re-judged on the same 40-question set.
Dataset Construction
40 questions synthetically generated across 8 subfields: spatial transcriptomics, foundation models, gene therapy safety, single-cell genomics, neuroscience, immunology, cancer biology, and synthetic biology. Each question was designed to require synthesis across a minimum of 5 papers and to be unanswerable from any single abstract.
Questions were generated by prompting a model with subfield descriptions and example queries, then filtered for quality through human review — removing questions that were too narrow, too broad, or answerable without cross-paper synthesis.
Across all task types the pattern holds.
- Deep Paper Q&A: GXL Sy (Ours) scores 100% vs 86%, runs 3.4× faster (1m6s vs 3m42s), and costs 5.1× less ($0.21 vs $1.07).
- Experiment Novelty Check: it scores 80%, 2.9× more accurate than the CC + bioRxiv connector (28%) and 4.0× more accurate than FutureHouse Edison (20%). It also runs 1.2× faster than the CC + bioRxiv connector (2m22s vs 2m54s) and 5.2× faster than FutureHouse Edison (2m22s vs 12m13s), at a lower average cost than the CC + bioRxiv connector ($0.36 vs $0.93; FutureHouse Edison charges per credit used).
- Multi-Paper Synthesis: it is 59% more complete (92% vs 58%), 3.2× faster (2m6s vs 6m48s), produces 2.6× more citations (27.9 vs 10.6), and costs 3.7× less ($0.53 vs $1.96).
The efficiency comes from targeted access: a grep into sections/Results.lines consumes ~200 tokens versus ~40,000 for loading a full paper through MCP. The accuracy comes from section-level precision: questions about methods go to Methods, questions about limitations go to Discussion.
Below, we walk through specific benchmark examples comparing the two approaches side by side.
Case Studies
Deep Paper Q&A
Questions that can only be answered from supplemental figures, tables, or methods: content that is invisible to abstract-level search and requires navigating directly into supplement files. Every supplement is a first-class path in the filesystem.
GXL Sy (Ours)
Why it succeeded
- Two tool calls: lookup doi to find the paper, then scan content.lines to find the exact count
- Returned block-level citations pointing to the exact sentence in the paper
Claude Code + bioRxiv MCP
Why it fell short
- The paper was fetched successfully, but the specific count (“16 proteins”) is buried in the Results section of a ~38K token blob with no way to search for it
- The answer drifted toward named proteins from prior knowledge rather than the explicit number stated in the text
Experiment Novelty Check: “Has this been done before?”
A researcher describes a specific experimental idea and asks whether it exists in the literature. These questions require precise extraction from Methods and Results, sections the filesystem exposes as individual files.
GXL Sy (Ours)
Why it succeeded
- Searched the filesystem directly for "inverted" + "cinR" + "rhlR" — found the supplement figure legend (L216) describing the exact swap
- Supplement text is indexed as a file; the key sentence was retrievable by keyword scan in milliseconds
- Real block IDs from the actual paper; answer grounded in primary source
Claude Code + bioRxiv MCP
Why it fell short
- Found the right paper via web search but couldn’t read it — bioRxiv and ACS both returned 403
- The key result is in the supplement (Fig S2I–J legend), not the abstract, so even a successful fetch of the abstract wouldn’t have answered the question
- Session terminated without a verdict
Idea Discovery: “What’s new that I’m not aware of?”
Discovering convergent signals across the literature: patterns only visible when you analyze 25–50 papers in parallel. This is where map-reduce over filesystems is decisive.
GXL Sy (Ours)
Why it succeeded
- 42 subagents each extracted claim type + validation depth from sections/Results.lines and supplements/
- Pattern invisible from abstracts: abstracts describe the finding, not whether it was validated
- All citations link to real document IDs from the filesystem
Claude Code + bioRxiv MCP
Why it fell short
- Abstracts don’t state whether validation was done — this requires reading Methods and Results of each paper
- Only 2 papers loaded in full; couldn’t extract per-claim validation status at scale
- Answer reflects general knowledge, not evidence from actual preprints
Conclusion
Instead of moving data to the agent, we bring the agent to the data. By exposing 450K bioRxiv and medRxiv preprints as a virtual filesystem, we place the agent inside the corpus rather than behind a query interface. This is a necessary shift to get past shallow search: when paper content is structured as directories with individually addressable sections, supplements, and figures, the agent can make targeted, efficient reads at whatever granularity the question demands rather than ingesting entire documents and hoping the answer surfaces.
This replicates the paradigm that has already proven immensely successful with coding agents. Tools like Claude Code and Cursor are effective precisely because they inhabit the codebase — navigating with ls, searching with grep, reading with cat — rather than querying it through an abstract API. Sy applies the same model to scientific literature, and the same bash-trained intuitions that make coding agents powerful transfer directly.
On bioRxiv Bench, Sy is 1.6× more accurate, 2.4× faster, and 3.6× cheaper than MCP-based approaches across 140 questions spanning Deep Paper Q&A, Experiment Novelty Check, and Multi-Paper Synthesis.
Try Sy yourself at sy.gxl.ai!