---
name: spider
description: >-
  Set up Spider for fast web crawling, scraping, and search. Covers Spider
  Cloud (hosted REST API + Python/JS/Go/Rust SDKs and CLI) and the
  open-source Spider engine for self-hosting, including an MCP server so AI
  tools can call Spider directly. Use it to pick the right path and get a
  working request in minutes.
---

# Spider

Spider is a web crawler and scraper built in Rust. It turns any site into
clean Markdown/HTML/text for LLMs and pipelines, runs headless Chrome when a
page needs JavaScript, rotates proxies, and streams results so large crawls
stay fast and cheap.

There are two ways to run it:

- **Spider Cloud** — the hosted API at `https://api.spider.cloud`. Nothing to
  run. Bring an API key. This is the right choice for almost everyone.
- **Open-source engine** — the `spider` Rust crate you host yourself. Choose
  this only if you need to run the crawler inside your own infrastructure.

## Get a key (Spider Cloud)

1. Create a key at https://spider.cloud (Account → API Keys).
2. Export it. Every SDK and the CLI read this variable; the REST API expects
   it as a bearer token.

```bash
export SPIDER_API_KEY="sk-..."
```

Check it works:

```bash
curl https://api.spider.cloud/credits \
  -H "Authorization: Bearer $SPIDER_API_KEY"
```

## Pick a path

| You want to…                                   | Go to |
| ---------------------------------------------- | ----- |
| Call the API directly (any language, curl)     | **A — REST API** |
| Use an SDK in Python / JS / Go / Rust          | **B — Client SDK** |
| Run the crawler in your own infrastructure     | **C — Self-host** |
| Let Claude / Cursor / an agent call Spider     | **D — MCP server** |

Most integrations are **A** or **B**. Reach for **C/D** only when you have a
specific reason (data residency, an MCP-native agent).

---

## A — REST API

Base URL `https://api.spider.cloud`. Send JSON, authenticate with the bearer
token. Every endpoint takes a `url` (or `query` for search) plus optional
parameters.

```bash
curl -X POST https://api.spider.cloud/scrape \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "return_format": "markdown"
  }'
```

Endpoints:

- `POST /scrape` — one page, returned as Markdown/HTML/text.
- `POST /crawl` — a whole site. Set `limit` and `request` (see below).
- `POST /links` — collect links from a site without returning content.
- `POST /search` — web search; returns ranked result URLs.
- `POST /screenshot` — render the page and return an image.
- `POST /transform` — convert raw HTML you already have into clean Markdown.
- `POST /unblocker` — fetch one page through Spider's anti-bot
  bypass (proxy rotation, fingerprinting, JS challenge solving). Use
  this for sites that block normal `/scrape` requests.
- `POST /credits` — remaining account credits.

**AI (prompt-guided) endpoints.** Each takes a natural-language `prompt`
(and a `url`, except `ai/search`) and returns extracted/structured results
instead of raw pages — use these when you want an answer, not a document:

- `POST /ai/scrape` — `{ url, prompt }` — extract exactly what the prompt asks from one page.
- `POST /ai/crawl` — `{ url, prompt }` — crawl a site and extract per the prompt.
- `POST /ai/links` — `{ url, prompt }` — return only the links the prompt is after.
- `POST /ai/search` — `{ prompt }` — search the web and return an extracted answer.
- `POST /ai/browser` — `{ url, prompt }` — drive a headless browser to complete the prompt.

```bash
curl -X POST https://api.spider.cloud/ai/scrape \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/pricing", "prompt": "Return each plan name and monthly price as JSON" }'
```

Key parameters that matter on most calls:

- `request`: `"http"` (fastest, no JS), `"chrome"` (headless browser, runs
  JS), or `"smart_mode"` (HTTP first, escalates to Chrome only when the page
  needs it — the usual default for crawls).
- `limit`: max pages for `/crawl`.
- `return_format`: `"markdown"` (default for LLMs), `"raw"`, `"text"`, `"html"`.
- `remote_proxy`: route the request through your own proxy, e.g.
  `"http://user:pass@host:port"`. Leave it unset to use Spider's network.
- `css_extraction_map`: pull specific fields from each page. Top-level
  keys are URL-path patterns (`"/"` matches everything). Each value is
  an array of `{ "name": "<field>", "selectors": ["<css>", ...] }`
  entries — list multiple selectors as fallbacks. Results come back
  under `css_extracted`.
- `wait_for`: control when Chrome considers a page "ready". Object with
  any of `selector`, `idle_network`, `idle_network0`,
  `almost_idle_network0`, `dom`, `delay`, `page_navigations`. Each
  timeout is a Rust `Duration`: `{ "secs": <n>, "nanos": <n> }`.
  `selector` and `dom` also take a `selector` string. Only applies when
  `request` is `"chrome"` or `"smart_mode"`.

```bash
curl -X POST https://api.spider.cloud/scrape \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product/123",
    "request": "chrome",
    "wait_for": {
      "selector": { "selector": "h1.product-title", "timeout": { "secs": 10, "nanos": 0 } },
      "idle_network": { "timeout": { "secs": 5, "nanos": 0 } }
    },
    "css_extraction_map": {
      "/": [
        { "name": "title",  "selectors": ["h1.product-title"] },
        { "name": "price",  "selectors": [".price-value"] },
        { "name": "images", "selectors": ["img.hero-image", "img.gallery"] }
      ]
    }
  }'
```

Routing a scrape through your own proxy:

```bash
curl -X POST https://api.spider.cloud/scrape \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "remote_proxy": "http://user:pass@proxy.example.com:8080"
  }'
```

Full reference: https://spider.cloud/docs/api

---

## B — Client SDK

All SDKs read `SPIDER_API_KEY` from the environment (or accept the key in the
constructor) and mirror the endpoints above: `scrape`, `crawl`, `links`,
`search`, `screenshot`, `transform`.

**Python** — `pip install spider_client`

```python
from spider import Spider

app = Spider()  # reads SPIDER_API_KEY

# One page
page = app.scrape_url("https://example.com")

# A whole site, escalating to Chrome only when needed
result = app.crawl_url(
    "https://example.com",
    params={"limit": 200, "request": "smart_mode"},
)

# Prompt-guided extraction — returns structured results, not raw pages.
# Also: ai_crawl, ai_links, ai_search, ai_browser
data = app.ai_scrape(
    "https://example.com/pricing",
    "Return each plan name and monthly price as JSON",
)
```

For large crawls, stream instead of buffering — pass `stream=True` to
`crawl_url` and handle pages as they arrive. The same AI methods exist in
the JS client as `aiScrape`, `aiCrawl`, `aiLinks`, `aiSearch`, `aiBrowser`.

**JavaScript / TypeScript** — `npm install @spider-cloud/spider-client`

```ts
import { Spider } from "@spider-cloud/spider-client";

const app = new Spider(); // reads SPIDER_API_KEY
const page = await app.scrapeUrl("https://example.com");
const site = await app.crawlUrl("https://example.com", { limit: 200, request: "smart_mode" });
```

**Go** — `go get github.com/spider-rs/spider-clients/go`

**Rust** — add `spider-client` to `Cargo.toml` (`cargo add spider-client`).

**CLI** — `cargo install spider-cloud-cli`

```bash
spider-cloud-cli auth --api-key $SPIDER_API_KEY
spider-cloud-cli scrape --url https://example.com
spider-cloud-cli crawl  --url https://example.com --limit 50
spider-cloud-cli search --query "rust web crawler"
```

SDK docs and examples: https://spider.cloud/docs/libraries

---

## C — Self-host (open-source engine)

Use the `spider` crate when the crawl must run inside your own infrastructure.
This is the engine itself, not the Cloud client — no API key, no
`api.spider.cloud`.

```toml
[dependencies]
spider = "2"
```

```rust
use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(16);

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{}  {}", page.status_code, page.get_url());
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
```

Pages stream as they arrive; the crawl stops when nothing is left to fetch.
Prefer a command-line tool instead of a library? `cargo install spider_cli`.

Source and full docs: https://github.com/spider-rs/spider

---

## D — MCP server (Claude, Cursor, agents)

Spider ships a Model Context Protocol server so an AI tool can call the
crawler directly as a tool.

```bash
cargo install spider_mcp
```

Then register it with your MCP client (Claude Desktop, Cursor, etc.) per that
client's MCP configuration. The agent can then crawl, scrape, and search the
web without you writing glue code. Details:
https://github.com/spider-rs/spider/tree/main/spider_mcp

---

## Choosing an endpoint

- A single known page → `scrape`.
- An entire site / many pages → `crawl` with a `limit`.
- Just the URLs, not content → `links`.
- "Find pages about X on the web" → `search`, or `ai/search` to get an
  extracted answer.
- HTML you already fetched, want clean Markdown → `transform`.
- Site renders content with JavaScript → set `request: "chrome"`, or
  `"smart_mode"` to let Spider decide.
- Page blocks normal requests (bot wall, CAPTCHA, fingerprinting) →
  `unblocker`.
- You need specific fields, not the whole page → `css_extraction_map`
  with selectors you already know. For natural-language extraction
  instead, use the `ai/*` endpoints (`ai_scrape`, `ai_crawl`, `ai_links`,
  `ai_search`, `ai_browser`): pass a `prompt` and get structured results
  back.
- Chrome page isn't ready when it returns → add `wait_for` with a
  `selector` or `idle_network`.

## Links

- API reference: https://spider.cloud/docs/api
- SDKs & quickstart: https://spider.cloud/docs/libraries
- Open source: https://github.com/spider-rs/spider
- Clients: https://github.com/spider-rs/spider-clients
