All job types use the /api/v1/search/live POST endpoint and follow the same structure: a top-level type that identifies the data source, and an arguments object containing the parameters of the job.
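
As a sketch of how such a job can be submitted, the example below uses Python with the requests library; the base URL, API key, and Authorization header are placeholders and assumptions, since authentication is not described in this section.

import requests

# Hypothetical base URL and API key; substitute your deployment's values.
BASE_URL = "https://example-gateway.com"
API_KEY = "YOUR_API_KEY"

def submit_job(payload: dict) -> dict:
    """POST a job to the live search endpoint and return the parsed JSON body."""
    response = requests.post(
        f"{BASE_URL}/api/v1/search/live",
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},  # auth scheme is an assumption
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

# A minimal scraper job: the top-level "type" selects the data source,
# and "arguments" carries the job parameters.
result = submit_job({
    "type": "web",
    "arguments": {
        "type": "scraper",
        "url": "https://docs.learnbittensor.org",
        "max_depth": 0,
        "max_pages": 1,
    },
})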

  • URL Scraping: Extract content from specific web pages as markdown
  • LLM Processing: Get LLM-processed text from the scraped content
  • Multi-Page Crawling: Crawl multiple pages with configurable depth
  • Content Extraction: Extract structured content and metadata

Overview

The Web API provides comprehensive web scraping capabilities for extracting content from websites. It supports single-page scraping as well as multi-page crawling with configurable depth and page limits.

scraper

Scrapes content from web pages and returns metadata, markdown, and LLM-processed text. This is the primary job type for web data extraction.

Use Cases:
  • Content extraction and analysis
  • Website documentation scraping
  • Data collection from web sources
  • Multi-page site crawling
  • Content preprocessing for LLM analysis

Parameters

Common Parameters

Parameter  Type     Required  Default  Description
type       string   Yes       -        The job type: scraper
url        string   Yes       -        The URL to scrape (must include http:// or https://)
max_depth  integer  No        0        Maximum crawl depth from the starting URL
max_pages  integer  No        1        Maximum number of pages to scrape
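
These parameters and their defaults can be captured in a small request builder; the Python sketch below uses an illustrative helper name that is not part of the API and simply mirrors the table above.

def build_scraper_job(url: str, max_depth: int = 0, max_pages: int = 1) -> dict:
    """Build a scraper job body using the documented defaults."""
    return {
        "type": "web",
        "arguments": {
            "type": "scraper",
            "url": url,
            "max_depth": max_depth,
            "max_pages": max_pages,
        },
    }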

URL Requirements

  • Scheme Required: URLs must include the http:// or https:// scheme
  • Valid Format: Must be a properly formatted URL
  • Accessible: The URL must be accessible and return valid content
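
A lightweight client-side check along these lines can catch scheme and format problems before a job is submitted (a sketch only; the server performs its own validation and the helper name is illustrative):

from urllib.parse import urlparse

def validate_scrape_url(url: str) -> None:
    """Raise ValueError if the URL lacks an http(s) scheme or a host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("url must include a scheme (http:// or https://)")
    if not parsed.netloc:
        raise ValueError("invalid URL format: missing host")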

Crawling Behavior

  • Depth Control: max_depth controls how many levels deep to crawl from the starting URL
  • Page Limits: max_pages prevents excessive scraping by limiting total pages processed
  • Robots.txt: The scraper respects robots.txt by default (configurable)
  • Markdown Output: Content is automatically converted to markdown format

Examples

Basic Single Page Scraping

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://docs.learnbittensor.org",
    "max_depth": 0,
    "max_pages": 1
  }
}

Multi-Page Crawling

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://docs.learnbittensor.org",
    "max_depth": 2,
    "max_pages": 10
  }
}

Deep Site Crawling

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://example.com/sitemap",
    "max_depth": 3,
    "max_pages": 50
  }
}

Documentation Site Scraping

{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://api.example.com/docs",
    "max_depth": 1,
    "max_pages": 25
  }
}

Use Case Examples

API Documentation Extraction

Extract complete API documentation from a documentation site:
{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://docs.api.example.com",
    "max_depth": 2,
    "max_pages": 100
  }
}

News Article Collection

Scrape multiple articles from a news site:
{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://news.example.com/tech",
    "max_depth": 1,
    "max_pages": 20
  }
}

Product Catalog Scraping

Extract product information from an e-commerce site:
{
  "type": "web",
  "arguments": {
    "type": "scraper",
    "url": "https://shop.example.com/products",
    "max_depth": 2,
    "max_pages": 50
  }
}

Best Practices

1. URL Validation

  • Always include the full URL with http:// or https://
  • Test URLs manually before scraping to ensure accessibility
  • Use specific URLs rather than redirects when possible

2. Depth and Page Limits

  • Start with max_depth: 0 for single pages
  • Use max_depth: 1-2 for most multi-page scenarios
  • Set appropriate max_pages to avoid excessive resource usage
  • Consider the site structure when setting depth limits

3. Performance Considerations

  • Use max_pages to control resource usage
  • Start with smaller limits and increase as needed
  • Consider the target site’s size and structure
  • Be respectful of server resources
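
One way to apply the "start small and increase as needed" advice is to script a few runs with growing limits and compare the results before committing to a large crawl. The minimal sketch below reuses the hypothetical submit_job and build_scraper_job helpers shown earlier on this page.

# Re-run the same crawl with progressively larger page limits, keeping each
# result so the output can be compared before widening the crawl further.
results = {}
for max_pages in (1, 10, 50):
    job = build_scraper_job("https://docs.learnbittensor.org",
                            max_depth=1, max_pages=max_pages)
    results[max_pages] = submit_job(job)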

4. Content Quality

  • Target specific sections or pages when possible
  • Use appropriate depth to capture related content
  • Consider the site’s navigation structure

Response Data

The Web API returns structured content including:

Basic Information

  • URL: The scraped page URL
  • Title: Page title and metadata
  • Content: Extracted text content
  • Markdown: Formatted markdown version of the content

Technical Details

  • HTTP Status: Response status codes
  • Content Type: MIME type of the content
  • Encoding: Character encoding information
  • Links: Internal and external links found

Crawling Information

  • Depth: How deep the page was found from the starting URL
  • Crawl Order: Order in which pages were processed
  • Parent URLs: Links that led to each page
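
The exact JSON keys of the response are not listed in this section, so the field names in the sketch below (results, url, title, depth, markdown) are assumptions for illustration; inspect a real response before depending on them. The sketch reuses the hypothetical helpers from earlier on this page.

# Field names here are assumptions; check an actual response for the real keys.
result = submit_job(build_scraper_job("https://docs.learnbittensor.org",
                                      max_depth=1, max_pages=10))
for page in result.get("results", []):
    print(page.get("url"), "|", page.get("title"), "| depth:", page.get("depth"))
    print(page.get("markdown", "")[:200])  # preview of the markdown content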

Limitations

  • Maximum crawl depth is limited by the max_depth parameter
  • Page limits are enforced by the max_pages parameter
  • Some sites may block automated scraping
  • JavaScript-heavy sites may not be fully scraped
  • Rate limiting may apply to prevent excessive requests

Error Handling

Common Error Scenarios

Invalid URL Format

{
  "error": "url must include a scheme (http:// or https://)"
}

URL Not Accessible

{
  "error": "invalid URL format: [specific error details]"
}

Invalid Parameters

  • max_depth must be non-negative
  • max_pages must be at least 1
  • url is required and cannot be empty

Missing Required Parameters

  • url parameter is required for all scraping operations
  • type must be "scraper"
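
When a request is rejected, the response body carries an error field as shown in the examples above. The sketch below surfaces that message, building on the hypothetical submit_job helper from the first example.

import requests

def submit_job_safe(payload: dict):
    """Submit a job and print the API's error message on failure."""
    try:
        return submit_job(payload)
    except requests.HTTPError as exc:
        try:
            detail = exc.response.json().get("error", exc.response.text)
        except ValueError:
            detail = exc.response.text
        print(f"Request failed ({exc.response.status_code}): {detail}")
        return None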

Notes

  • All scraped content is automatically converted to markdown format
  • The scraper respects robots.txt by default
  • Multi-page crawling follows internal links found on pages
  • External links are not followed during crawling
  • Content extraction focuses on main page content, not navigation or ads