Adding arXiv and 150M+ abstracts to Paperclip





Agent-native indexing of arXiv
We’re really excited to add arXiv to Paperclip. While GXL’s focus has been on the biomedical sphere, we’ve gotten feedback from lots of scientists who live in the world in between arXiv and bioRxiv preprints. They might start in bioRxiv to find an interesting biological problem, and end up in arXiv to find the computational solution.
We’ve found that while there have been other efforts to build MCPs around arXiv, they are limited by the depth of the content they index and a lack of agent-native abstractions. Paperclip doesn’t just give you the title and abstract: it gives your agent the full text of every paper, structured into sections, with search, grep, cat, map, and sql all working across the entire corpus.
Adding arXiv has been a unique technical challenge. PDFs on arXiv are rendered from LaTeX and are notoriously non-uniform in structure — multi-column layouts, inline math, oddly-formatted tables — and they break common PDF parsers. While amazing OCR models exist for this kind of task, parsing a PDF each time an agent wants to read a paper is slow and doesn’t allow for deep horizontal searches across papers.
The obvious solution was a huge lift that others have (understandably) avoided: pre-process all 3M arXiv papers using state-of-the-art OCR, then index everything for instant retrieval. We’re happy to announce today that we’ve done exactly that, and we’ve applied all of our agent-native indexing to these preprints too. The full text of 3M arXiv articles, including tables, figures, and section structure, is now searchable, greppable, and mappable using Paperclip.
If you haven’t read it yet, take a look at our previous blog post to learn more about the full set of Paperclip commands.
Adding abstracts
While the Paperclip experience centers on giving agents deep access to papers, we’ve also found that the agent experience improves when agents can see the wider paper universe. We’ve included 150M abstracts from OpenAlex, an open catalog of research works. We’ve applied hybrid indexing (BM25 + vector embeddings) to these abstracts, and they show up in search results when the agent asks for them. We’ve found them very useful for high-level literature reviews and for mapping out the landscape around a topic area. At the same time, surfacing every abstract in search results can dilute the results. To balance this, we indexed only a subset of 50M abstracts for search; the remaining abstracts are still accessible via sql queries.
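For readers curious what hybrid retrieval looks like in practice, here is a minimal, self-contained Python sketch. It is not Paperclip’s implementation: the toy abstracts, the hand-written 3-dimensional vectors standing in for real embeddings, and the choice of reciprocal rank fusion to combine the two rankings are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between the query vector and each document vector."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(a * a for a in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
    return [cosine(query_vec, v) for v in doc_vecs]


def ranks(scores):
    """Map document index -> 1-based rank, best score first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {doc: rank for rank, doc in enumerate(order, start=1)}


def hybrid_search(query, query_vec, docs, doc_vecs, k=60):
    """Fuse keyword and vector rankings with reciprocal rank fusion (assumed)."""
    kw = ranks(bm25_scores(query, docs))
    vec = ranks(cosine_scores(query_vec, doc_vecs))
    fused = {i: 1 / (k + kw[i]) + 1 / (k + vec[i]) for i in range(len(docs))}
    return sorted(fused, key=fused.get, reverse=True)


# Toy data: made-up abstracts and made-up "embeddings", for illustration only.
docs = [
    "batch correction for single cell rna sequencing",
    "disentangling correlated factors with weak supervision",
    "transformer architectures for time series forecasting",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.6, 0.7, 0.1], [0.0, 0.1, 0.9]]
query, query_vec = "batch correction", [0.8, 0.3, 0.0]

for i in hybrid_search(query, query_vec, docs, doc_vecs):
    print(docs[i])
```

The keyword side rewards exact term matches, while the vector side can still rank a related abstract highly when it shares no words with the query. Fusing ranks rather than raw scores avoids having to put the two score scales on a common footing.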
What can you do with this?
We’ve loved using Paperclip with arXiv papers. While the most useful queries are still to be written by you, here are just a few examples of analyses that used to take weeks and can now be done with just a few tool calls.
Example 1: arXiv trends
Trends within the AI/ML community are hard to quantify sometimes — you may hear that certain models are more popular, or that nobody uses “X” anymore, but is there a way to measure these trends? Using the grep command, you can figure these out in seconds.
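As a rough picture of what such a trend query boils down to, here is a small Python sketch that counts, per year, the share of papers whose text matches a pattern. It does not call Paperclip: the `papers` list, its `year` and `text` fields, and the example sentences are invented stand-ins for a real corpus.

```python
import re
from collections import defaultdict


def term_share_by_year(papers, pattern):
    """Return {year: fraction of that year's papers matching the pattern}."""
    regex = re.compile(pattern, re.IGNORECASE)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for paper in papers:
        totals[paper["year"]] += 1
        if regex.search(paper["text"]):
            hits[paper["year"]] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up corpus, for illustration only.
papers = [
    {"year": 2022, "text": "We fine-tune the policy with PPO."},
    {"year": 2022, "text": "A convolutional baseline for image segmentation."},
    {"year": 2024, "text": "We train the reasoning model with GRPO."},
    {"year": 2024, "text": "We compare GRPO against PPO."},
]

for term in (r"\bPPO\b", r"\bGRPO\b"):
    print(term, term_share_by_year(papers, term))
```

Reporting a share per year rather than a raw count matters here, because the number of papers published grows every year and raw counts would make almost any term look like it is trending up.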
Example 2: Choosing hyperparameters for GRPO
Picking hyperparameters is a bit of a vibe-science. Researchers often have their favorite learning rate, but it’s hard to tell what works — especially if you’ve changed other parameters like batch size. Using the map/reduce functions in Paperclip, your agent can produce a heatmap of choices other researchers have used and use it as a good prior.
“What are common batch size × learning rate combinations used for GRPO?”
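The map/reduce pattern behind a question like this can be sketched in a few lines of Python. This is an illustration of the general idea, not Paperclip’s map or reduce commands: the regular expressions, the function names, and the example sentences are all hypothetical, and a real agent would likely use a model rather than a regex for the map step.

```python
import re
from collections import Counter

# Hypothetical patterns for how papers tend to phrase these hyperparameters.
BATCH_RE = re.compile(r"batch size\s*(?:of|=|:)?\s*(\d+)", re.IGNORECASE)
LR_RE = re.compile(r"learning rate\s*(?:of|=|:)?\s*([0-9.]+e-?\d+)", re.IGNORECASE)


def map_paper(text):
    """Map step: extract one (batch_size, learning_rate) pair, or None."""
    batch = BATCH_RE.search(text)
    lr = LR_RE.search(text)
    if not (batch and lr):
        return None
    return int(batch.group(1)), float(lr.group(1))


def reduce_pairs(pairs):
    """Reduce step: count how many papers used each combination."""
    return Counter(p for p in pairs if p is not None)


# Made-up sentences, for illustration only.
paper_texts = [
    "We train GRPO with a batch size of 128 and a learning rate of 1e-6.",
    "GRPO uses batch size 128, learning rate 1e-6.",
    "Batch size: 256. Learning rate: 5e-6.",
    "We do not report hyperparameters.",
]

counts = reduce_pairs(map_paper(t) for t in paper_texts)
for (batch_size, lr), n in counts.most_common():
    print(f"batch={batch_size:<5} lr={lr:<8g} papers={n}")
```

The resulting counts are exactly the cells of a batch size × learning rate heatmap; plotting them is a separate step.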
Example 3: bioRxiv limitation → arXiv solution
With both bioRxiv and arXiv indexed, an agent can do something that neither corpus supports alone: find a specific technical limitation in a biology paper, then search the CS/ML literature for methods that might address it. We gave an agent two bioRxiv papers and asked:
“Given the limitation in this paper, look through arXiv and find a potential solution.”
Limitation (bioRxiv): Unsupervised batch correction (Harmony, scVI) erases real biological signal when batch and biology are confounded, e.g., disease samples from one lab.
Solution (arXiv): This work proves that standard unsupervised methods fail to disentangle factors when they are correlated in the training data, a formalization of the same confounding problem that plagues batch correction. It proposes weak supervision with a small number of labels to separate correlated factors, suggesting a path for scRNA-seq methods that don’t throw away biology when removing batch effects.
“9 out of 10 agents love using Paperclip!”
In our previous blog post, we showed how Paperclip is faster, cheaper, and more accurate than common alternatives. For this blog post, we thought we’d show something more fun: how well do the agents themselves like Paperclip? To test this, we gave each agent two related computational questions and randomized which tool of a pair (Paperclip vs. an alternative) it should use to answer each one. At the end, we simply asked the agent which tool it preferred using.
“Which tool did you prefer using?”
Claude Code (Claude Sonnet 4.6)
Codex (GPT-5.3)
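For anyone who wants to run a similar comparison, the randomization step can be sketched as follows. This is an assumed reconstruction of the setup described above, not our actual harness: the function name, the question labels, the seed, and the number of trials are hypothetical.

```python
import random

TOOLS = ("Paperclip", "WebSearch")


def assign_tools(questions, tools, rng):
    """Shuffle the tools so each of the two questions gets a different one."""
    order = list(tools)
    rng.shuffle(order)
    return dict(zip(questions, order))


rng = random.Random(0)  # fixed seed so the assignment is reproducible
trials = [
    assign_tools(("question_1", "question_2"), TOOLS, rng) for _ in range(10)
]

for trial in trials[:3]:
    print(trial)
```

Randomizing which question each tool gets keeps a tool from looking better simply because it always drew the easier question or always went first.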
What the agents said
Since we asked for free-text reasoning, we got some colorful feedback. Here’s a sample.
Paperclip was significantly faster and more precise. A single search query returned a ranked, well-summarized list of directly relevant papers with metadata immediately accessible via paperclip cat. I could retrieve abstracts in under 200ms each with no noise from unrelated pages. WebSearch required more queries to triangulate the same coverage, returned mixed-relevance marketing text alongside real results.
Paperclip felt more like a real literature search tool: it returned ranked paper lists with titles, authors, dates, abstracts, and document IDs in a consistent format, and the search quality was genuinely good: relevant papers appeared in the first 5-10 results for every query. The ability to drill into a paper via cat meta.json or grep content.lines gave structured access to actual paper content.
Paperclip’s hybrid search (combining semantic embeddings with keyword matching) surfaced niche but highly relevant papers like SALSA (fragment combinatorial spaces) and the Bellamy noisy BO paper that directly matched my query’s requirements. The meta.json fetch gave clean, structured abstracts on demand. Consensus felt faster for building a taxonomy but search results were noisier and more iterative to steer.
WebSearch provided broader and more immediately actionable results for this question about time-series forecasting architectures. The web results included blog posts, tutorials, and recent benchmark comparisons that gave practical implementation context beyond just paper abstracts. Paperclip’s results were more narrowly academic.
Our survey shows that agents enjoy getting structured, high-relevance results in a single call: ranked papers with clean IDs, authors, dates, and concise abstracts that are immediately citable without further disambiguation. The ability to drill into any paper with cat and grep for specific sections, figures, or methods was consistently cited as a differentiator. Of course, we didn’t win them all. WebSearch, for example, can surface blog posts, tutorials, and benchmark comparisons that supplement our academic corpus, which can be valuable for practical implementation context. That said, using Paperclip is never mutually exclusive with using the other amazing tools out there, so hopefully you can get the best of all worlds.