# arxiv2md.org

> Clean, LLM-friendly Markdown versions of arXiv papers. Parses arXiv's structured HTML (not PDFs) for reliable sections, math (MathML → LaTeX), and tables.

For programmatic / agent use, call the REST API below. No auth, no API key, no SDK — just a GET request.

## Quickstart

```bash
# Raw markdown
curl "https://arxiv2md.org/api/markdown?url=1706.03762"

# JSON with metadata (title, arxiv_id, source_url, content)
curl "https://arxiv2md.org/api/json?url=1706.03762"
```

`url` accepts a bare arXiv ID (`1706.03762`, `2501.11120v1`) or a full arXiv URL (`https://arxiv.org/abs/1706.03762`).

## Endpoints

- `GET /api/markdown?url=<id-or-url>` — returns raw Markdown as `text/plain`.
- `GET /api/json?url=<id-or-url>` — returns JSON: `{ "arxiv_id", "title", "source_url", "content" }`.
- `GET /api` — OpenAPI schema (JSON).
- `GET /health` — `{ "status": "healthy" }`.
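
As a rough illustration of consuming `/api/json` from Python, the sketch below maps the four documented fields onto a small dataclass. The names `Paper`, `parse_paper`, and `fetch_paper` are hypothetical helpers invented for this example and are not part of the service or the `arxiv2md` package; only the endpoint path and field names come from the list above.

```python
import json
from dataclasses import dataclass
from urllib.parse import urlencode
from urllib.request import urlopen


@dataclass
class Paper:
    """The four fields documented for the /api/json response."""
    arxiv_id: str
    title: str
    source_url: str
    content: str


def parse_paper(payload):
    """Turn a JSON response body (str) into a Paper (hypothetical helper)."""
    data = json.loads(payload)
    return Paper(
        arxiv_id=data["arxiv_id"],
        title=data["title"],
        source_url=data["source_url"],
        content=data["content"],
    )


def fetch_paper(paper, timeout=30.0):
    """GET /api/json for an arXiv ID or URL and parse the result."""
    query = urlencode({"url": paper})
    with urlopen(f"https://arxiv2md.org/api/json?{query}", timeout=timeout) as resp:
        return parse_paper(resp.read().decode("utf-8"))
```

Calling `fetch_paper("1706.03762")` would perform the network request; `parse_paper` can be exercised offline against a saved response.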

### Query parameters

| Param | Default | Applies to | Description |
|-------|---------|------------|-------------|
| `url` | required | both | arXiv ID or URL |
| `remove_refs` | `true` | both | Drop bibliography/references section |
| `remove_toc` | `true` | both | Drop table of contents |
| `remove_citations` | `true` | both | Strip inline citations (e.g. "(Smith et al., 2023)") |
| `frontmatter` | `false` | `/api/markdown` only | Prepend YAML frontmatter with paper metadata |
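
The parameters in the table can be assembled into a request URL programmatically. This is a minimal sketch: `build_api_url` and `BASE_URL` are hypothetical names, and serialising booleans as lowercase `true`/`false` is an assumption based on the style of the curl examples in this document.

```python
from urllib.parse import urlencode

BASE_URL = "https://arxiv2md.org"


def build_api_url(paper, endpoint="markdown", **options):
    """Build a request URL for /api/markdown or /api/json (hypothetical helper).

    `paper` is a bare arXiv ID or a full arXiv URL; `options` are the
    query parameters from the table (remove_refs, remove_toc, ...).
    """
    if endpoint not in ("markdown", "json"):
        raise ValueError("endpoint must be 'markdown' or 'json'")
    if "frontmatter" in options and endpoint != "markdown":
        raise ValueError("frontmatter applies to /api/markdown only")
    params = {"url": paper}
    for name, value in options.items():
        # Assumption: booleans are sent as lowercase "true"/"false".
        params[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE_URL}/api/{endpoint}?{urlencode(params)}"


print(build_api_url("2312.00752", remove_refs=False, remove_citations=False))
```

`urlencode` also percent-encodes a full arXiv URL passed as `paper`, so either accepted form of `url` can go through the same helper.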

## Examples

```bash
# Keep references and citations
curl "https://arxiv2md.org/api/markdown?url=2312.00752&remove_refs=false&remove_citations=false"

# Markdown with YAML frontmatter, piped into an LLM
curl -s "https://arxiv2md.org/api/markdown?url=2501.11120&frontmatter=true" | your-llm
```
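
When `frontmatter=true` is used, a caller may want the metadata block separated from the paper body before passing the text on. The sketch below assumes the frontmatter follows the common convention of a block fenced by `---` lines at the very top of the document; the exact output format is not specified here, so treat that as an assumption. `split_frontmatter` is a hypothetical helper, and it returns the YAML as raw text rather than parsing it.

```python
def split_frontmatter(markdown):
    """Return (frontmatter_text, body); frontmatter_text is "" if absent.

    Assumes frontmatter is delimited by a leading and a closing '---' line.
    """
    lines = markdown.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", markdown
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    # No closing fence found: treat the whole input as plain Markdown.
    return "", markdown
```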

## Notes

- **The URL-swap trick returns HTML, not Markdown.** Visiting `https://arxiv2md.org/abs/1706.03762` (i.e. replacing `arxiv.org` with `arxiv2md.org`) loads the human web app with the URL pre-filled. Agents should use `/api/markdown` or `/api/json` instead.
- Works for arXiv papers that have a structured HTML version (most newer papers).
- Rate limit: 30 requests/minute per IP.
- Results are cached server-side for 24 hours, so repeated requests for the same paper are fast.
- Errors return HTTP 400 (invalid URL or processing error) or HTTP 500, each with an `error` message.
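
For batch jobs, one way to stay under the 30 requests/minute limit is to space requests out on the client side. The sketch below is an illustration only: `Throttle` is a hypothetical class, and even spacing (one request every 2 seconds) is a simple approximation of the limit, not a description of how the server enforces it.

```python
import time


class Throttle:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval
```

Call `throttle.wait()` immediately before each request. Because results are cached server-side, repeated requests for the same paper are cheap, but each one is still a request from the client's point of view, so this sketch throttles them as well.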

## CLI & Python library

For local use, install the package (PyPI name `arxiv2markdown`, import name `arxiv2md`):

```bash
pip install arxiv2markdown

# CLI: write markdown to stdout
arxiv2md 2501.11120v1 --remove-refs --remove-toc -o -

# Only specific sections
arxiv2md 2501.11120v1 --section-filter-mode include --sections "Abstract,Introduction" -o -
```

```python
from arxiv2md import ingest_paper_sync  # or: ingest_paper (async)

# Optional kwargs: remove_refs, remove_toc, remove_inline_citations,
# section_filter_mode, sections, include_frontmatter
result = ingest_paper_sync("2501.11120v1")
print(result.content)
```

## Links

- Web app: https://arxiv2md.org
- Source: https://github.com/timf34/arxiv2md
- PyPI: https://pypi.org/project/arxiv2markdown/