All job types use the /api/v1/search/live POST endpoint and follow the same structure: a top-level type that identifies the data source, and an arguments object containing the parameters of the job.
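
As a sketch of how such a job can be submitted, the example below uses Python with the requests library; the base URL, API key, and Authorization header are placeholders and assumptions, since authentication is not described in this section.

import requests

# Hypothetical base URL and API key; substitute your deployment's values.
BASE_URL = "https://example-gateway.com"
API_KEY = "YOUR_API_KEY"

def submit_job(payload: dict) -> dict:
    """POST a job to the live search endpoint and return the parsed JSON body."""
    response = requests.post(
        f"{BASE_URL}/api/v1/search/live",
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},  # auth scheme is an assumption
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

# A minimal scraper job: the top-level "type" selects the data source,
# and "arguments" carries the job parameters.
result = submit_job({
    "type": "web",
    "arguments": {
        "type": "scraper",
        "url": "https://docs.learnbittensor.org",
        "max_depth": 0,
        "max_pages": 1,
    },
})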

  • URL Scraping: Extract content from specific web pages as markdown
  • LLM Processing: Get LLM-processed text from the scraped content
  • Multi-Page Crawling: Crawl multiple pages with configurable depth
  • Content Extraction: Extract structured content and metadata

Overview

The Web API provides comprehensive web scraping capabilities for extracting content from websites. It supports single-page scraping as well as multi-page crawling with configurable depth and page limits.

scraper

Scrapes content from web pages and returns metadata, markdown, and LLM-processed text. This is the primary job type for web data extraction.

Use Cases:
  • Content extraction and analysis
  • Website documentation scraping
  • Data collection from web sources
  • Multi-page site crawling
  • Content preprocessing for LLM analysis

Parameters

Common Parameters

Parameter  Type     Required  Default  Description
type       string   Yes       -        The job type: scraper
url        string   Yes       -        The URL to scrape (must include http:// or https://)
max_depth  integer  No        0        Maximum crawl depth from the starting URL
max_pages  integer  No        1        Maximum number of pages to scrape
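
These parameters and their defaults can be captured in a small request builder; the Python sketch below uses an illustrative helper name that is not part of the API and simply mirrors the table above.

def build_scraper_job(url: str, max_depth: int = 0, max_pages: int = 1) -> dict:
    """Build a scraper job body using the documented defaults."""
    return {
        "type": "web",
        "arguments": {
            "type": "scraper",
            "url": url,
            "max_depth": max_depth,
            "max_pages": max_pages,
        },
    }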

URL Requirements

  • Scheme Required: URLs must include the http:// or https:// scheme
  • Valid Format: Must be a properly formatted URL
  • Accessible: The URL must be accessible and return valid content
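
A lightweight client-side check along these lines can catch scheme and format problems before a job is submitted (a sketch only; the server performs its own validation and the helper name is illustrative):

from urllib.parse import urlparse

def validate_scrape_url(url: str) -> None:
    """Raise ValueError if the URL lacks an http(s) scheme or a host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("url must include a scheme (http:// or https://)")
    if not parsed.netloc:
        raise ValueError("invalid URL format: missing host")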

Crawling Behavior

  • Depth Control: max_depth controls how many levels deep to crawl from the starting URL
  • Page Limits: max_pages prevents excessive scraping by limiting total pages processed
  • Robots.txt: The scraper respects robots.txt by default (configurable)
  • Markdown Output: Content is automatically converted to markdown format

Examples

Basic Single Page Scraping

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://docs.learnbittensor.org",
    "max_depth": 0,
    "max_pages": 1
  }
}

Multi-Page Crawling

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://docs.learnbittensor.org",
    "max_depth": 2,
    "max_pages": 10
  }
}

Deep Site Crawling

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://example.com/sitemap",
    "max_depth": 3,
    "max_pages": 50
  }
}

Documentation Site Scraping

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://api.example.com/docs",
    "max_depth": 1,
    "max_pages": 25
  }
}

Use Case Examples

API Documentation Extraction

Extract complete API documentation from a documentation site:
{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://docs.api.example.com",
    "max_depth": 2,
    "max_pages": 100
  }
}

News Article Collection

Scrape multiple articles from a news site:
{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://news.example.com/tech",
    "max_depth": 1,
    "max_pages": 20
  }
}

Product Catalog Scraping

Extract product information from an e-commerce site:
{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://shop.example.com/products",
    "max_depth": 2,
    "max_pages": 50
  }
}

Best Practices

1. URL Validation

  • Always include the full URL with http:// or https://
  • Test URLs manually before scraping to ensure accessibility
  • Use specific URLs rather than redirects when possible

2. Depth and Page Limits

  • Start with max_depth: 0 for single pages
  • Use max_depth: 1-2 for most multi-page scenarios
  • Set appropriate max_pages to avoid excessive resource usage
  • Consider the site structure when setting depth limits

3. Performance Considerations

  • Use max_pages to control resource usage
  • Start with smaller limits and increase as needed
  • Consider the target site’s size and structure
  • Be respectful of server resources
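
One way to apply the "start small and increase as needed" advice is to script a few runs with growing limits and compare the results before committing to a large crawl. The minimal sketch below reuses the hypothetical submit_job and build_scraper_job helpers shown earlier on this page.

# Re-run the same crawl with progressively larger page limits, keeping each
# result so the output can be compared before widening the crawl further.
results = {}
for max_pages in (1, 10, 50):
    job = build_scraper_job("https://docs.learnbittensor.org",
                            max_depth=1, max_pages=max_pages)
    results[max_pages] = submit_job(job)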

4. Content Quality

  • Target specific sections or pages when possible
  • Use appropriate depth to capture related content
  • Consider the site’s navigation structure

Response Data

The Web API returns structured content including:

Basic Information

  • URL: The scraped page URL
  • Title: Page title and metadata
  • Content: Extracted text content
  • Markdown: Formatted markdown version of the content

Technical Details

  • HTTP Status: Response status codes
  • Content Type: MIME type of the content
  • Encoding: Character encoding information
  • Links: Internal and external links found

Crawling Information

  • Depth: How deep the page was found from the starting URL
  • Crawl Order: Order in which pages were processed
  • Parent URLs: Links that led to each page
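
The exact JSON keys of the response are not listed in this section, so the field names in the sketch below (results, url, title, depth, markdown) are assumptions for illustration; inspect a real response before depending on them. The sketch reuses the hypothetical helpers from earlier on this page.

# Field names here are assumptions; check an actual response for the real keys.
result = submit_job(build_scraper_job("https://docs.learnbittensor.org",
                                      max_depth=1, max_pages=10))
for page in result.get("results", []):
    print(page.get("url"), "|", page.get("title"), "| depth:", page.get("depth"))
    print(page.get("markdown", "")[:200])  # preview of the markdown content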

Limitations

  • Maximum crawl depth is limited by the max_depth parameter
  • Page limits are enforced by the max_pages parameter
  • Some sites may block automated scraping
  • JavaScript-heavy sites may not be fully scraped
  • Rate limiting may apply to prevent excessive requests

Error Handling

Common Error Scenarios

Invalid URL Format

{
  "error": "url must include a scheme (http:// or https://)"
}

URL Not Accessible

{
  "error": "invalid URL format: [specific error details]"
}

Invalid Parameters

  • max_depth must be non-negative
  • max_pages must be at least 1
  • url is required and cannot be empty

Missing Required Parameters

  • url parameter is required for all scraping operations
  • type must be "scraper"
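
When a request is rejected, the response body carries an error field as shown in the examples above. The sketch below surfaces that message, building on the hypothetical submit_job helper from the first example.

import requests

def submit_job_safe(payload: dict):
    """Submit a job and print the API's error message on failure."""
    try:
        return submit_job(payload)
    except requests.HTTPError as exc:
        try:
            detail = exc.response.json().get("error", exc.response.text)
        except ValueError:
            detail = exc.response.text
        print(f"Request failed ({exc.response.status_code}): {detail}")
        return None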

Notes

  • All scraped content is automatically converted to markdown format
  • The scraper respects robots.txt by default
  • Multi-page crawling follows internal links found on pages
  • External links are not followed during crawling
  • Content extraction focuses on main page content, not navigation or ads