Adding arXiv and 150M+ abstracts to Paperclip

We’ve added 3M arXiv articles and 150M+ abstracts to Paperclip, complete with all the agent-native indexing your agents need to search, read, and synthesize them. Available in Paperclip now!
PMC: 7.5M full text
bioRxiv: 388K full text
medRxiv: 82K full text
arXiv: 3.0M full text (new)
Abstracts: 150M (new)
More coming soon

Agent-native indexing of arXiv

We’re really excited to add arXiv to Paperclip. While GXL’s focus has been on the biomedical sphere, we’ve gotten feedback from lots of scientists who live in the world in between arXiv and bioRxiv preprints. They might start in bioRxiv to find an interesting biological problem, and end up in arXiv to find the computational solution.

We’ve found that while there have been other efforts to build MCPs around arXiv, they are limited by the depth of the content they index and a lack of agent-native abstractions. Paperclip doesn’t just give you title and abstract — it gives your agent the full text of every paper, structured into sections, with search, grep, cat, map, and sql all working across the entire corpus.

Adding arXiv has been a unique technical challenge. PDFs on arXiv are rendered from LaTeX and are notoriously non-uniform in structure — multi-column layouts, inline math, oddly-formatted tables — and they break common PDF parsers. While amazing OCR models exist for this kind of task, parsing a PDF each time an agent wants to read a paper is slow and doesn’t allow for deep horizontal searches across papers.

The obvious solution was a huge lift that others have (understandably) avoided: pre-process all 3M arXiv papers using state-of-the-art OCR, then index everything for instant retrieval. We’re happy to announce today that we’ve done exactly that, and we’ve applied all our agent-native indexing to these preprints too. The full text of 3M arXiv articles, including tables, figures, and section structure, is now searchable, greppable, and mappable using Paperclip.

$ paperclip cat arxiv_2501.12948 --section Methods
/papers/arxiv_2501.12948/sections/Methods.lines
L1: We train DeepSeek-R1-Zero using GRPO (Shao et al., 2024),
L2: which foregoes the critic model by estimating the baseline from
L3: group scores. The learning rate is set to 1e-6 with a batch size
L4: of 512 and a KL penalty coefficient of 0.01…
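
The line-numbered output is easy for downstream code to consume. Here is a minimal Python sketch that splits such output into (line number, text) pairs. The "L<n>: text" format is inferred only from the sample above, and the helper name parse_numbered_lines is hypothetical, not part of Paperclip.

```python
import re

# Assumed format, inferred from the sample output above: "L<number>: <text>".
LINE_RE = re.compile(r"^L(\d+):\s?(.*)$")


def parse_numbered_lines(output: str) -> list[tuple[int, str]]:
    """Return (line_number, text) pairs; lines without an L<n>: prefix are skipped."""
    parsed = []
    for raw in output.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            parsed.append((int(match.group(1)), match.group(2)))
    return parsed


sample = """/papers/arxiv_2501.12948/sections/Methods.lines
L1: We train DeepSeek-R1-Zero using GRPO (Shao et al., 2024),
L2: which foregoes the critic model by estimating the baseline from
"""
pairs = parse_numbered_lines(sample)
```

The path header is skipped because it does not match the prefix, so only the two numbered lines are returned.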

If you haven’t read it yet, take a look at our previous blog post to learn more about the full set of Paperclip commands.

Adding abstracts

While the Paperclip experience centers on giving agents deep access to papers, we’ve also found that agents do better when they can see the broader paper universe. We’ve included 150M abstracts from OpenAlex, an open catalog of research works. We’ve applied hybrid indexing (BM25 + vector embeddings) to these abstracts, and they show up in search results when the agent asks for them. We’ve found them very useful for high-level literature reviews and for mapping out the landscape around a topic area. At the same time, surfacing every abstract in search results can dilute them. To balance this, we added only a 50M-abstract subset to the search index; the remaining abstracts are still accessible via sql queries.
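
For readers new to hybrid search, here is a minimal Python sketch of one common way to merge a keyword ranking with an embedding ranking: reciprocal rank fusion. This post does not say how Paperclip combines its BM25 and vector results, so the fusion method, the constant k = 60, and the paper IDs are all assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    k = 60 is a conventional default, not a Paperclip setting.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a keyword index and an embedding index.
bm25_ranking = ["paper_a", "paper_b", "paper_c"]
vector_ranking = ["paper_b", "paper_d", "paper_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that rank well in both lists (here paper_b and paper_a) rise above documents that appear in only one.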

$ paperclip search "protein language model fitness prediction" -s abstracts -n 5
Found 5 papers [s_a3f8c12d]
  1. Protein Fitness Prediction Is Impacted by the Interplay…
  2. Learning protein fitness models from evolutionary and…
  3. Tranception: protein fitness prediction with autoregressive…
  4. ProteinNPT: Improving Protein Property Prediction and Design…
  5. Deep generative models of genetic variation capture the effects…

We’ve loved using Paperclip with arXiv papers. While the most useful queries are still to be written by you, here are a few examples of queries that used to take weeks and can now be done with just a few tool calls.

Example 1: arXiv trends

Trends within the AI/ML community are hard to quantify sometimes — you may hear that certain models are more popular, or that nobody uses “X” anymore, but is there a way to measure these trends? Using the grep command, you can figure these out in seconds.

Rate = unique papers mentioning the term per 1,000 arXiv papers per month, 3-month rolling average. Each concept uses 2-11 regex variants (e.g., “A100 GPU”, “NVIDIA A100”, “DGX A100”, “8× A100”, etc.) to catch all phrasings. 109 grep calls across 2,964,720 papers, 80 seconds wall time.
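
The rate definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the chart: the grep matches are assumed to arrive as (month, paper ID) pairs, months are "YYYY-MM" strings, and the 3-month average is taken as a trailing window (the post does not say whether it is trailing or centered).

```python
from collections import defaultdict


def monthly_rate_per_1000(matches, totals):
    """Unique matching papers per 1,000 papers, for each month in `totals`.

    matches: iterable of (month, paper_id); repeated hits in one paper count once.
    totals:  {month: number of papers published that month}.
    """
    unique = defaultdict(set)
    for month, paper_id in matches:
        unique[month].add(paper_id)
    return {m: 1000.0 * len(unique.get(m, ())) / totals[m] for m in sorted(totals)}


def rolling_mean(rates, window=3):
    """Trailing rolling average; early months use however many months exist."""
    months = sorted(rates)
    smoothed = {}
    for i, month in enumerate(months):
        span = months[max(0, i - window + 1): i + 1]
        smoothed[month] = sum(rates[m] for m in span) / len(span)
    return smoothed


# Hypothetical data: p1 matches twice in January but is counted once.
matches = [("2024-01", "p1"), ("2024-01", "p1"), ("2024-01", "p2"),
           ("2024-02", "p3"),
           ("2024-03", "p4"), ("2024-03", "p5"), ("2024-03", "p6")]
totals = {"2024-01": 1000, "2024-02": 1000, "2024-03": 1000}
rates = monthly_rate_per_1000(matches, totals)
smoothed = rolling_mean(rates)
```

Deduplicating by paper ID matters because several regex variants of the same concept can hit the same paper.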

Example 2: Choosing hyperparameters for GRPO

Picking hyperparameters is a bit of a vibe-science. Researchers often have their favorite learning rate, but it’s hard to tell what works — especially if you’ve changed other parameters like batch size. Using the map/reduce functions in Paperclip, your agent can produce a heatmap of choices other researchers have used and use it as a good prior.

“What are common batch size × learning rate combinations used for GRPO?”

[Heatmap. Rows: batch size (≤8, 16, 32, 64, 128, 256, 512, >512). Columns: learning rate (1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2).]
Number of GRPO papers that use the particular combination of hyperparameters in the methods section.

First grep finds 6,477 GRPO papers across the full corpus. Second grep intersects those with hyperparameter patterns to extract 2,079 paragraphs from 1,145 papers. 10 total queries, 3 seconds wall time.
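
The reduce step that turns extracted hyperparameters into heatmap cells might look like the Python sketch below. The bucket labels follow the heatmap axes, but the binning rules are assumptions: batch sizes are rounded up to the next labelled size, and learning rates are snapped to the nearest power of ten. The input pairs are made up.

```python
import bisect
import math
from collections import Counter

BATCH_EDGES = [8, 16, 32, 64, 128, 256, 512]
BATCH_LABELS = ["≤8", "16", "32", "64", "128", "256", "512", ">512"]
LR_EXPONENTS = [-7, -6, -5, -4, -3, -2]


def batch_bucket(batch_size: int) -> str:
    """Round a batch size up to the next labelled bucket (assumed rule)."""
    return BATCH_LABELS[bisect.bisect_left(BATCH_EDGES, batch_size)]


def lr_bucket(learning_rate: float) -> str:
    """Snap a learning rate to the nearest power of ten, clamped to the axis."""
    exponent = round(math.log10(learning_rate))
    exponent = min(max(exponent, LR_EXPONENTS[0]), LR_EXPONENTS[-1])
    return f"1e{exponent}"


def count_combinations(pairs):
    """Count how many (batch size, learning rate) pairs fall in each cell."""
    return Counter((batch_bucket(b), lr_bucket(lr)) for b, lr in pairs)


# Hypothetical extracted (batch size, learning rate) pairs, one per paper.
pairs = [(512, 1e-6), (512, 1e-6), (64, 3e-6), (1024, 1e-5)]
counts = count_combinations(pairs)
```

Counting one pair per paper, rather than one per matching paragraph, keeps a single verbose paper from dominating a cell.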

Example 3: bioRxiv limitation → arXiv solution

With both bioRxiv and arXiv indexed, an agent can do something that neither corpus supports alone: find a specific technical limitation in a biology paper, then search the CS/ML literature for methods that might address it. We gave an agent two bioRxiv papers and asked:

“Given the limitation in this paper, look through arXiv and find a potential solution.”

6 paperclip calls (search + cat per case), 1.4 seconds end-to-end.

“9 out of 10 agents love using Paperclip!”

In our previous blog post, we showed how Paperclip is faster, cheaper, and more accurate than common alternatives. For this blog post, we thought we’d show something more fun: how well do the agents themselves like Paperclip? To test this, we gave each agent two related computational questions and randomized which tool (Paperclip or an alternative) it used for each one. At the end, we simply asked the agent which tool it preferred using.

“Which tool did you prefer using?”

Claude Code (Claude Sonnet 4.6)

Paperclip 28 – 2 WebSearch: 93% prefer Paperclip (84–97% CI)
Paperclip 29 – 1 Tavily: 98% prefer Paperclip (91–100% CI)
Paperclip 29 – 1 Consensus: 98% prefer Paperclip (91–100% CI)

Codex (GPT-5.3)

Paperclip 25 – 5 WebSearch: 83% prefer Paperclip (66–93% CI)
Paperclip 30 – 0 Tavily: 100% prefer Paperclip (89–100% CI)
Paperclip 24 – 5 Consensus: 83% prefer Paperclip (65–92% CI)

Each head-to-head run presents two related scientific questions to the same agent session. The agent uses Tool A for question 1 and Tool B for question 2 (order randomized). After both phases, the agent states its overall preference, confidence (1–5), and free-text reasoning.
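
This post does not say how the confidence intervals above were computed. As one common choice for win/loss counts this small, here is a Python sketch of a 95% Wilson score interval. The method and the z value are assumptions. For the 30 – 0 result it gives a lower bound of about 89%, but the other rows may have been computed differently and need not match.

```python
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(30, 30)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when one tool wins every run.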

What the agents said

Since we asked for free-text reasoning, we got some colorful feedback. Here’s a sample.

Codex · Paperclip → WebSearch · preferred Paperclip

Paperclip was significantly faster and more precise. A single search query returned a ranked, well-summarized list of directly relevant papers with metadata immediately accessible via paperclip cat. I could retrieve abstracts in under 200ms each with no noise from unrelated pages. WebSearch required more queries to triangulate the same coverage, returned mixed-relevance marketing text alongside real results.

Claude · Paperclip → Tavily · preferred Paperclip

Paperclip felt more like a real literature search tool: it returned ranked paper lists with titles, authors, dates, abstracts, and document IDs in a consistent format, and the search quality was genuinely good — relevant papers appeared in the first 5-10 results for every query. The ability to drill into a paper via cat meta.json or grep content.lines gave structured access to actual paper content.

Claude · Paperclip → Consensus · preferred Paperclip

Paperclip’s hybrid search (combining semantic embeddings with keyword matching) surfaced niche but highly relevant papers like SALSA (fragment combinatorial spaces) and the Bellamy noisy BO paper that directly matched my query’s requirements. The meta.json fetch gave clean, structured abstracts on demand. Consensus felt faster for building a taxonomy but search results were noisier and more iterative to steer.

Codex · WebSearch → Paperclip · preferred WebSearch

WebSearch provided broader and more immediately actionable results for this question about time-series forecasting architectures. The web results included blog posts, tutorials, and recent benchmark comparisons that gave practical implementation context beyond just paper abstracts. Paperclip’s results were more narrowly academic.

Our survey shows that agents like getting structured, high-relevance results in a single call: ranked papers with clean IDs, authors, dates, and concise abstracts that are immediately citable without further disambiguation. The ability to drill into any paper with cat and grep for specific sections, figures, or methods was consistently cited as a differentiator. Of course, we didn’t win them all. WebSearch, for example, can surface blog posts, tutorials, and benchmark comparisons that supplement our academic corpus, which can be valuable for practical implementation context. That said, using Paperclip never rules out using all the other amazing tools out there, so hopefully you can get the best of all worlds.