Build a LangChain Research Assistant for ArXiv and PubMed
Build an AI research assistant that searches ArXiv and PubMed, synthesizes findings, and formats citations automatically. Full Python code included.
Literature review used to mean weeks of manual searching, reading, and note-taking. With LangChain, you can build an agent that searches ArXiv and PubMed, reads paper abstracts, synthesizes findings across dozens of papers, and formats citations, all in minutes.
This guide builds a complete academic research assistant from scratch. You'll get working code for the search → read → synthesize → cite pipeline and an agent you can wrap in an API for deployment.
If you want to understand the general agent architecture first, start with Build AI agent with LangChain and the AI research agent build.
What This Agent Does
The research assistant follows a four-stage pipeline for every query:
- Search: Query ArXiv and PubMed for relevant papers
- Read: Extract key findings from paper abstracts
- Synthesize: Identify themes, contradictions, and research gaps across papers
- Cite: Format academic citations (APA or BibTeX)
By the end of this guide, you'll have an agent that can answer questions like:
- "What are the latest approaches to protein folding prediction?"
- "Summarize recent research on transformer efficiency improvements"
- "What does the literature say about RAG vs fine-tuning for domain adaptation?"
Installation
pip install langchain langchain-openai langchain-community arxiv xmltodict requests python-dotenv biopython chromadb
import os
from dotenv import load_dotenv
load_dotenv()
# Required:
# OPENAI_API_KEY=your-openai-api-key
Setting Up the ArXiv Tool
LangChain's ArxivQueryRun wraps the ArXiv API:
from langchain_community.tools.arxiv.tool import ArxivQueryRun
from langchain_community.utilities.arxiv import ArxivAPIWrapper
# Configure ArXiv wrapper
arxiv_wrapper = ArxivAPIWrapper(
    top_k_results=5,               # Number of papers to return
    load_max_docs=5,               # Max documents to load
    load_all_available_meta=True,  # Include metadata (authors, date, etc.)
    doc_content_chars_max=4000     # Max chars per document
)

arxiv_tool = ArxivQueryRun(
    api_wrapper=arxiv_wrapper,
    description="Search ArXiv for scientific papers. Returns abstracts and metadata. Use for physics, mathematics, computer science, and related fields."
)
# Test the tool
result = arxiv_tool.invoke("transformer architecture attention mechanism 2024")
print(result[:500])
Setting Up the PubMed Tool
from langchain_community.tools.pubmed.tool import PubmedQueryRun
from langchain_community.utilities.pubmed import PubMedAPIWrapper
# PubMed configuration
pubmed_wrapper = PubMedAPIWrapper(
    top_k_results=5,
    load_max_docs=5,
    doc_content_chars_max=4000
)

# Identify yourself to NCBI (recommended). Bio.Entrez settings are used by
# the direct Biopython calls later in this guide.
from Bio import Entrez
Entrez.email = "your-email@example.com"

pubmed_tool = PubmedQueryRun(
    api_wrapper=pubmed_wrapper,
    description="Search PubMed for biomedical and life science research papers. Returns abstracts and metadata. Use for medicine, biology, pharmacology, and clinical research."
)
# Test the tool
result = pubmed_tool.invoke("CRISPR cancer treatment clinical trials 2024")
print(result[:500])
Note: If the Bio package isn't installed, run pip install biopython. NCBI's E-utilities allow 3 requests/second without an API key and 10 with one. Setting Entrez.email is still recommended so NCBI can contact you before blocking heavy traffic.
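The request-rate limits in the note are easy to respect with a small client-side throttle. The sketch below is a minimal illustration, not part of LangChain or Biopython: `Throttle` and its `clock` and `sleep` parameters are hypothetical names introduced here, and the rate of 3 per second is the keyless limit.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls at least 1/rate seconds apart (hypothetical helper)."""

    def __init__(self, rate_per_sec: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
            if delay > 0:
                self._sleep(delay)
        self._last_call = now + delay
        return delay


# Stay under the keyless limit of 3 requests per second
ncbi_throttle = Throttle(rate_per_sec=3)
```

Calling `ncbi_throttle.wait()` immediately before each Entrez request keeps a loop of searches under the limit without hard-coding sleeps.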
Building Custom Citation Extraction
The default LangChain tools return text, but for academic work you need structured citation data:
import arxiv
import requests
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional
import json
import re

@dataclass
class Paper:
    title: str
    authors: List[str]
    abstract: str
    year: int
    source: str    # "arxiv" or "pubmed"
    paper_id: str  # ArXiv ID or PMID
    url: str
    journal: Optional[str] = None
    doi: Optional[str] = None

    def to_apa(self) -> str:
        """Format as APA citation."""
        author_str = self._format_authors_apa()
        if self.source == "arxiv":
            return f"{author_str} ({self.year}). {self.title}. arXiv:{self.paper_id}. {self.url}"
        else:
            journal_part = f" {self.journal}." if self.journal else ""
            doi_part = f" https://doi.org/{self.doi}" if self.doi else f" {self.url}"
            return f"{author_str} ({self.year}). {self.title}.{journal_part}{doi_part}"

    def to_bibtex(self) -> str:
        """Format as BibTeX entry."""
        # Key is first author's last name plus year; fall back if no authors
        first_author = self.authors[0] if self.authors else "unknown"
        key = f"{first_author.split()[-1].lower()}{self.year}"
        author_bibtex = " and ".join(self.authors[:3])
        if len(self.authors) > 3:
            author_bibtex += " and others"
        if self.source == "arxiv":
            return f"""@misc{{{key},
title={{{self.title}}},
author={{{author_bibtex}}},
year={{{self.year}}},
eprint={{{self.paper_id}}},
archivePrefix={{arXiv}},
url={{{self.url}}}
}}"""
        else:
            return f"""@article{{{key},
title={{{self.title}}},
author={{{author_bibtex}}},
year={{{self.year}}},
journal={{{self.journal or "Unknown"}}},
note={{PMID: {self.paper_id}}},
url={{{self.url}}}
}}"""

    def _format_authors_apa(self) -> str:
        if not self.authors:
            return "Unknown Author"

        def fmt(author: str) -> str:
            # "John A Doe" -> "Doe, J. A."; single-word names pass through
            parts = author.strip().split()
            if len(parts) < 2:
                return author
            initials = ". ".join(p[0] for p in parts[:-1]) + "."
            return f"{parts[-1]}, {initials}"

        if len(self.authors) > 6:
            # More than six authors: first six, an ellipsis, then the last author
            first_six = ", ".join(fmt(a) for a in self.authors[:6])
            return f"{first_six}, ... {fmt(self.authors[-1])}"
        formatted = [fmt(a) for a in self.authors]
        if len(formatted) == 1:
            return formatted[0]
        elif len(formatted) == 2:
            return f"{formatted[0]}, & {formatted[1]}"
        else:
            return ", ".join(formatted[:-1]) + f", & {formatted[-1]}"
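One thing to watch with the BibTeX key scheme above (first author's last name plus year): two papers by the same first author in the same year get the same key, and BibTeX rejects duplicate keys. A minimal sketch of one way to disambiguate, using a hypothetical `unique_keys` helper that is not part of the Paper class:

```python
from collections import Counter
from string import ascii_lowercase
from typing import List


def unique_keys(keys: List[str]) -> List[str]:
    """Append a, b, c... to keys that occur more than once (hypothetical helper).

    Keys that are already unique are returned unchanged.
    """
    totals = Counter(keys)
    seen: Counter = Counter()
    result = []
    for key in keys:
        if totals[key] == 1:
            result.append(key)
        else:
            # Assumes at most 26 collisions per key
            result.append(key + ascii_lowercase[seen[key]])
            seen[key] += 1
    return result


print(unique_keys(["vaswani2017", "smith2023", "smith2023"]))
# ['vaswani2017', 'smith2023a', 'smith2023b']
```

The suffixed keys would then replace the generated key in each entry before the entries are written to a .bib file.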
ArXiv Search Function with Structured Output
def search_arxiv_structured(query: str, max_results: int = 5) -> List[Paper]:
    """Search ArXiv and return structured Paper objects."""
    import arxiv
    client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance
    )
    papers = []
    for result in client.results(search):
        paper = Paper(
            title=result.title,
            authors=[str(a) for a in result.authors],
            abstract=result.summary[:2000],
            year=result.published.year,
            source="arxiv",
            paper_id=result.entry_id.split("/abs/")[-1],
            url=result.entry_id,
            doi=result.doi
        )
        papers.append(paper)
    return papers

def search_pubmed_structured(query: str, max_results: int = 5) -> List[Paper]:
    """Search PubMed and return structured Paper objects."""
    from Bio import Entrez, Medline
    Entrez.email = "researcher@example.com"

    # Search for IDs
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    ids = record["IdList"]
    if not ids:
        return []

    # Fetch details
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="medline", retmode="text")
    records = list(Medline.parse(handle))
    handle.close()

    papers = []
    for rec in records:
        # Medline FAU entries look like "Doe, John A"; convert to "John A Doe"
        # so the citation formatters find the last name at the end
        authors = [
            " ".join(reversed(name.split(", ", 1)))
            for name in rec.get("FAU", [])
        ] or ["Unknown"]
        paper = Paper(
            title=rec.get("TI", "No title"),
            authors=authors,
            abstract=rec.get("AB", "No abstract available")[:2000],
            year=int(rec.get("DP", "2024")[:4]),
            source="pubmed",
            paper_id=rec.get("PMID", ""),
            url=f"https://pubmed.ncbi.nlm.nih.gov/{rec.get('PMID', '')}",
            journal=rec.get("JT", None),
            doi=rec.get("LID", "").replace(" [doi]", "") if "[doi]" in rec.get("LID", "") else None
        )
        papers.append(paper)
    return papers
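Because the same paper can come back from more than one query, or from both databases, it may be worth deduplicating before synthesis. The sketch below is an assumption about how you might do that, not something the search functions above do: `dedupe_papers` is a hypothetical helper, and `PaperStub` is a stand-in with only the fields it needs so the example runs on its own. It matches on DOI when one is present and on a normalized title otherwise.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PaperStub:
    """Stand-in for the Paper class, with only the fields used here."""
    title: str
    doi: Optional[str] = None


def _dedupe_key(paper) -> str:
    # Prefer the DOI; fall back to the title with case and punctuation removed
    if paper.doi:
        return "doi:" + paper.doi.lower()
    return "title:" + re.sub(r"[^a-z0-9]+", " ", paper.title.lower()).strip()


def dedupe_papers(papers: List) -> List:
    """Keep the first occurrence of each paper, preserving order."""
    seen = set()
    unique = []
    for paper in papers:
        key = _dedupe_key(paper)
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique
```

A limitation of this approach: a record that carries a DOI in one database and none in the other will not be matched, so title matching across sources would need a separate pass.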
The Research Synthesis Chain
The core of the assistant is an LCEL chain that synthesizes findings across multiple papers:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableLambda
def papers_to_context(papers: List[Paper]) -> str:
    """Convert list of papers to LLM-readable context."""
    sections = []
    for i, paper in enumerate(papers, 1):
        sections.append(f"""
Paper {i}: {paper.title}
Authors: {', '.join(paper.authors[:3])}{'...' if len(paper.authors) > 3 else ''}
Year: {paper.year}
Source: {paper.source.upper()} ({paper.paper_id})
Abstract: {paper.abstract}
""")
    return "\n---\n".join(sections)

synthesis_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert research analyst. Your task is to synthesize findings from academic papers.
When synthesizing:
1. Identify the main themes and findings across papers
2. Note agreements and contradictions between studies
3. Identify research gaps and future directions
4. Be specific: cite paper numbers when referencing specific findings
5. Maintain academic tone throughout"""),
    ("human", """Research Question: {question}
Papers to Synthesize:
{context}
Please provide:
1. **Executive Summary** (2-3 sentences)
2. **Key Findings** (bullet points, cite papers by number)
3. **Consensus and Contradictions** (where papers agree/disagree)
4. **Research Gaps** (what's missing from the literature)
5. **Recommended Next Steps** (for further research)""")
])

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

synthesis_chain = (
    synthesis_prompt
    | llm
    | StrOutputParser()
)
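With many papers, the context string passed to the chain can grow past what the model accepts. A rough sketch of one way to cap it, assuming a simple character budget rather than real token counting; `trim_abstracts` and its default budget are hypothetical and not part of the chain above:

```python
from typing import List


def trim_abstracts(abstracts: List[str], max_total_chars: int = 12000) -> List[str]:
    """Give each abstract an equal share of a total character budget.

    Character counts are only a proxy for tokens; that is an assumption
    made to keep this sketch dependency-free.
    """
    if not abstracts:
        return []
    share = max_total_chars // len(abstracts)
    trimmed = []
    for text in abstracts:
        if len(text) <= share:
            trimmed.append(text)
        else:
            # Cut at the per-abstract share and mark the truncation
            trimmed.append(text[:share].rstrip() + " [truncated]")
    return trimmed
```

Applying this to each paper's abstract before building the context keeps the prompt size roughly constant as the number of papers grows, at the cost of losing the tail of long abstracts.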
The Complete Research Agent
Now combine search, structured retrieval, and synthesis into one agent:
from langchain_core.tools import tool, Tool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import SystemMessage
import json

# Research state (shared across agent tool calls)
research_state = {
    "papers": [],
    "citations": []
}

@tool
def search_arxiv(query: str) -> str:
    """Search ArXiv for computer science, physics, and mathematics papers. Returns paper titles, abstracts, and metadata."""
    papers = search_arxiv_structured(query, max_results=5)
    research_state["papers"].extend(papers)
    output = []
    for i, p in enumerate(papers, 1):
        output.append(f"[ArXiv-{i}] {p.title} ({p.year})\nAuthors: {', '.join(p.authors[:2])}\nAbstract: {p.abstract[:500]}...")
    return "\n\n".join(output) if output else "No ArXiv papers found for this query."

@tool
def search_pubmed(query: str) -> str:
    """Search PubMed for biomedical, clinical, and life science papers. Returns paper titles, abstracts, and metadata."""
    papers = search_pubmed_structured(query, max_results=5)
    research_state["papers"].extend(papers)
    output = []
    for i, p in enumerate(papers, 1):
        output.append(f"[PubMed-{i}] {p.title} ({p.year})\nJournal: {p.journal or 'N/A'}\nAbstract: {p.abstract[:500]}...")
    return "\n\n".join(output) if output else "No PubMed papers found for this query."

@tool
def synthesize_findings(question: str) -> str:
    """Synthesize findings from all retrieved papers into a coherent research summary. Call this after searching both databases."""
    if not research_state["papers"]:
        return "No papers found yet. Please search ArXiv and/or PubMed first."
    context = papers_to_context(research_state["papers"])
    synthesis = synthesis_chain.invoke({
        "question": question,
        "context": context
    })
    return synthesis

@tool
def generate_citations(format: str = "apa") -> str:
    """Generate formatted citations for all retrieved papers. Format options: 'apa', 'bibtex'."""
    if not research_state["papers"]:
        return "No papers to cite. Search for papers first."
    citations = []
    for paper in research_state["papers"]:
        if format.lower() == "bibtex":
            citations.append(paper.to_bibtex())
        else:
            citations.append(paper.to_apa())
    return "\n\n".join(citations)

@tool
def clear_research_state() -> str:
    """Clear all retrieved papers to start a new research session."""
    research_state["papers"].clear()
    research_state["citations"].clear()
    return "Research state cleared. Ready for a new research session."

# Build the agent
tools = [search_arxiv, search_pubmed, synthesize_findings, generate_citations, clear_research_state]

research_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert academic research assistant with access to ArXiv and PubMed databases.
Research Protocol:
1. For computer science, AI, physics, or math questions: search ArXiv
2. For medical, biological, or clinical questions: search PubMed
3. For interdisciplinary topics: search BOTH databases
4. After gathering papers (at least 3-5), call synthesize_findings
5. Always generate citations at the end
Be thorough. If the first search returns irrelevant results, try different search terms.
Cite specific papers when making claims in your synthesis."""),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)

agent = create_tool_calling_agent(llm, tools, research_prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True
)
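The module-level `research_state` dictionary is fine for a single script, but if the agent sits behind an API, concurrent requests would all write to the same list. A minimal sketch of one alternative, keyed by a session identifier; `SessionStore` is a hypothetical class and how you obtain a session ID is left as an assumption:

```python
from collections import defaultdict
from typing import Any, DefaultDict, List


class SessionStore:
    """Keep retrieved papers separate per session (hypothetical helper)."""

    def __init__(self) -> None:
        self._papers: DefaultDict[str, List[Any]] = defaultdict(list)

    def add(self, session_id: str, papers: List[Any]) -> None:
        """Append papers to the given session's list."""
        self._papers[session_id].extend(papers)

    def get(self, session_id: str) -> List[Any]:
        # Return a copy so callers cannot mutate the stored list;
        # .get avoids creating an entry for unknown sessions
        return list(self._papers.get(session_id, []))

    def clear(self, session_id: str) -> None:
        """Drop everything stored for one session."""
        self._papers.pop(session_id, None)
```

The tools would then read and write through a store like this instead of the shared dictionary, with the session ID supplied by whatever layer handles the request.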
Running a Research Session
def run_research_query(question: str) -> dict:
    """Run a complete research query and return structured results."""
    # Clear previous state
    research_state["papers"].clear()
    print(f"\nResearching: {question}")
    print("=" * 60)
    result = agent_executor.invoke({
        "input": question,
        "chat_history": []
    })
    # Compile final report
    report = {
        "question": question,
        "synthesis": result["output"],
        "papers_found": len(research_state["papers"]),
        "papers": [
            {
                "title": p.title,
                "year": p.year,
                "source": p.source,
                "url": p.url,
                "citation_apa": p.to_apa()
            }
            for p in research_state["papers"]
        ]
    }
    return report

# Example research queries
queries = [
    "What are the most effective approaches to reducing hallucination in large language models?",
    "Summarize recent advances in mRNA vaccine technology post-COVID",
    "What does recent research say about the relationship between sleep and memory consolidation?"
]

report = run_research_query(queries[0])
print(f"\nPapers found: {report['papers_found']}")
print("\nSynthesis:")
print(report['synthesis'])
Saving Research Reports
import os
import re
from datetime import datetime

def save_report(report: dict, output_dir: str = "./research_reports") -> str:
    """Save research report as a formatted Markdown file."""
    os.makedirs(output_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Build a filesystem-safe slug from the first 50 characters of the question
    safe_title = re.sub(r'[^a-z0-9]+', '_', report["question"][:50].lower())
    filename = f"{timestamp}_{safe_title}.md"
    filepath = os.path.join(output_dir, filename)
    content = f"""# Research Report: {report["question"]}
Generated: {datetime.now().strftime("%B %d, %Y at %H:%M")}
Papers Analyzed: {report["papers_found"]}
---
## Synthesis
{report["synthesis"]}
---
## References
"""
    for i, paper in enumerate(report["papers"], 1):
        content += f"{i}. {paper['citation_apa']}\n\n"
    content += f"\n---\n*Report generated by LangChain Research Assistant*\n"
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)
    print(f"Report saved to: {filepath}")
    return filepath

# Save the report
report_path = save_report(report)
Streaming Research Progress
For a better user experience, stream the agent's progress:
from langchain_core.callbacks import StreamingStdOutCallbackHandler

async def stream_research(question: str):
    """Stream research agent output in real-time."""
    research_state["papers"].clear()
    async for event in agent_executor.astream_events(
        {"input": question, "chat_history": []},
        version="v1"
    ):
        kind = event["event"]
        if kind == "on_tool_start":
            tool_name = event["name"]
            print(f"\n[Calling tool: {tool_name}]")
        elif kind == "on_tool_end":
            tool_name = event["name"]
            output_preview = str(event["data"].get("output", ""))[:100]
            print(f"[Tool {tool_name} returned: {output_preview}...]")
        elif kind == "on_chat_model_stream":
            chunk = event["data"]["chunk"]
            if hasattr(chunk, "content") and chunk.content:
                print(chunk.content, end="", flush=True)

import asyncio
asyncio.run(stream_research("Latest approaches to efficient transformer inference"))
Adding a RAG Layer for Deep Paper Reading
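For targeted questions about specific papers, add a RAG layer. The version below indexes titles and abstracts; the same pattern applies to full text if you load it into the documents: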
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def build_paper_rag(papers: List[Paper]) -> object:
    """Build a RAG system from retrieved papers for deep Q&A."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
    docs = []
    for paper in papers:
        # Create document from abstract
        doc = Document(
            page_content=f"Title: {paper.title}\n\nAbstract: {paper.abstract}",
            metadata={
                "source": paper.source,
                "paper_id": paper.paper_id,
                "year": paper.year,
                "title": paper.title,
                "authors": ", ".join(paper.authors[:3])
            }
        )
        docs.append(doc)
    chunks = splitter.split_documents(docs)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name="research_papers"
    )
    return vectorstore.as_retriever(search_kwargs={"k": 5})

@tool
def deep_query_papers(question: str) -> str:
    """Query the retrieved papers in depth using semantic search. More accurate than synthesis for specific factual questions."""
    if not research_state["papers"]:
        return "No papers loaded. Search first."
    retriever = build_paper_rag(research_state["papers"])
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer the question based only on the provided research paper abstracts. Cite the paper title when referencing specific claims."),
        ("human", "Context:\n{context}\n\nQuestion: {question}")
    ])

    def format_retrieval(docs):
        return "\n\n".join(
            f"[{doc.metadata['title']} ({doc.metadata['year']})]\n{doc.page_content}"
            for doc in docs
        )

    chain = (
        {"context": retriever | format_retrieval, "question": RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model="gpt-4o")
        | StrOutputParser()
    )
    return chain.invoke(question)
For the RAG foundations behind this pattern, see RAG system tutorial and Vector database guide.
Performance and Cost Benchmarks
| Configuration | Papers Processed | Time | Cost per Query |
|---|---|---|---|
| ArXiv only, 5 papers | 5 | ~8s | ~$0.04 |
| PubMed only, 5 papers | 5 | ~12s | ~$0.04 |
| Both databases, 10 papers | 10 | ~20s | ~$0.08 |
| Both + synthesis + citations | 10 | ~35s | ~$0.15 |
| Both + RAG deep query | 10 | ~45s | ~$0.20 |
Costs are estimates using GPT-4o at $5/M input, $15/M output. Switch to gpt-4o-mini for synthesis to cut costs by ~80% with minor quality reduction.
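The per-query figures follow from simple token arithmetic. As a worked example using the prices stated above, with token counts that are illustrative assumptions rather than measurements:

```python
INPUT_PRICE_PER_M = 5.00    # USD per million input tokens (from the text above)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (from the text above)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one LLM call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Assumed sizes: 5,000 prompt tokens and 1,000 completion tokens
print(f"${estimate_cost(5_000, 1_000):.2f}")  # $0.04
```

An agent run makes several model calls (tool selection, synthesis, final answer), so the total for a query is the sum of this estimate over each call.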
A research assistant that saves hours
The complete pipeline (search, read, synthesize, cite) takes under a minute per research question and produces output that would take a human researcher 2 to 4 hours. The structured Paper class makes citations trivially easy, and the RAG layer enables deep factual queries that go beyond surface-level summaries.
For production deployment, add Redis-based caching for repeated queries, rate limiting for the API, and a FastAPI wrapper. The Deploy AI model to production guide covers the deployment patterns, and the LangChain tutorial 2025 has more agent architectures you can adapt for research workflows.
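The caching idea can be prototyped in memory before introducing Redis. The sketch below is a minimal stand-in under that assumption: `TTLCache` is a hypothetical class, not a Redis client, and a real deployment would replace the dictionary with Redis reads and writes that carry an expiry.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds (hypothetical)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def _normalize(question: str) -> str:
        # Treat questions that differ only in case or spacing as the same key
        return " ".join(question.lower().split())

    def get(self, question: str) -> Optional[Any]:
        """Return the cached value, or None if missing or expired."""
        key = self._normalize(question)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl:
            del self._store[key]  # expired
            return None
        return value

    def set(self, question: str, value: Any) -> None:
        """Store a value under the normalized question."""
        self._store[self._normalize(question)] = (self._clock(), value)
```

In use, you would check `cache.get(question)` first and only call run_research_query on a miss, storing the resulting report with `cache.set(question, report)`.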
Frequently Asked Questions
Does the ArXiv tool in LangChain require an API key? No. The ArXiv API is free and does not require authentication. The LangChain ArxivQueryRun tool queries it directly. PubMed through the Entrez API also has a free tier, though adding your email in the Entrez.email field is recommended to avoid rate limiting.
How many papers can the research assistant process at once? By default, the ArXiv and PubMed tools return 3 to 5 results per query. You can increase this with the top_k_results parameter. Processing all papers with an LLM is limited by context length; for large literature reviews, use the embedding-based RAG approach shown in this guide.
Can I save the research output to a file automatically? Yes. This guide includes a save_report() function that writes Markdown output with formatted APA citations. You can extend it to export PDF via pandoc or upload to Notion using the Notion API.
AiTechWorlds Team