Web scraping jobs are submitted to the /api/v1/search/live POST endpoint and follow the same structure as other job types: a top-level type to define the data source, and a set of arguments defining the parameters of the job.
- URL Scraping: Extract content from specific web pages as markdown
- LLM Processing: Get LLM-processed text from the scraped content
- Multi-Page Crawling: Crawl multiple pages with configurable depth
- Content Extraction: Extract structured content and metadata
Overview
The Web API provides comprehensive web scraping capabilities for extracting content from websites. It supports single-page scraping as well as multi-page crawling with configurable depth and page limits.
scraper
Scrapes content from web pages and returns metadata, markdown, and LLM-processed text. This is the primary job type for web data extraction.
Use Cases:
- Content extraction and analysis
- Website documentation scraping
- Data collection from web sources
- Multi-page site crawling
- Content preprocessing for LLM analysis
Parameters
Common Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| type | string | Yes | - | The job type: scraper |
| url | string | Yes | - | The URL to scrape (must include http:// or https://) |
| max_depth | integer | No | 0 | Maximum crawl depth from the starting URL |
| max_pages | integer | No | 1 | Maximum number of pages to scrape |
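As a point of reference, here is a minimal Python sketch of how these parameters might be submitted. The base URL is a placeholder, authentication headers are omitted, and the flat payload shape (parameters sitting alongside type) is an assumption to verify against your deployment.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder; substitute your deployment's host

# Assumed flat payload: the job type plus the scraper parameters from the table above.
payload = {
    "type": "scraper",
    "url": "https://example.com",
    "max_depth": 0,   # optional, defaults to 0
    "max_pages": 1,   # optional, defaults to 1
}

response = requests.post(f"{BASE_URL}/api/v1/search/live", json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```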
URL Requirements
- Scheme Required: URLs must include an http:// or https:// scheme
- Valid Format: Must be a properly formatted URL
- Accessible: The URL must be accessible and return valid content
Crawling Behavior
- Depth Control: max_depth controls how many levels deep to crawl from the starting URL
- Page Limits: max_pages prevents excessive scraping by limiting total pages processed
- Robots.txt: Respects robots.txt by default (configurable)
- Markdown Output: Content is automatically converted to markdown format
Examples
Basic Single Page Scraping
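A sketch of a single-page job, reusing the request pattern shown in the Parameters section; with max_depth of 0 and max_pages of 1 (the defaults), only the starting URL is scraped. The target URL is a placeholder.

```python
# Single page only: no link following, exactly one page processed.
payload = {
    "type": "scraper",
    "url": "https://example.com/article",  # placeholder URL
    "max_depth": 0,
    "max_pages": 1,
}
```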
Multi-Page Crawling
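A sketch of a shallow crawl: follow internal links one level from the starting URL while capping the total page count. The URL and limits are illustrative.

```python
# Follow internal links one level deep, up to 10 pages in total.
payload = {
    "type": "scraper",
    "url": "https://example.com/blog",  # placeholder starting URL
    "max_depth": 1,
    "max_pages": 10,
}
```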
Deep Site Crawling
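A sketch of a deeper crawl; the depth and page cap here are illustrative and should be tuned to the target site's size.

```python
# Crawl several levels of internal links with a hard cap on total pages.
payload = {
    "type": "scraper",
    "url": "https://example.com",  # placeholder site root
    "max_depth": 3,
    "max_pages": 50,
}
```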
Documentation Site Scraping
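A sketch for a documentation site, where content typically sits a level or two below the docs root; the URL and limits are placeholders.

```python
# Documentation sections usually sit one or two levels below the docs root.
payload = {
    "type": "scraper",
    "url": "https://docs.example.com",  # placeholder documentation root
    "max_depth": 2,
    "max_pages": 30,
}
```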
Use Case Examples
API Documentation Extraction
Extract complete API documentation from a documentation site:
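A hedged payload sketch for this use case, following the same request pattern as the earlier examples; the URL and limits are placeholders.

```python
# Crawl an API reference section and its subpages.
payload = {
    "type": "scraper",
    "url": "https://docs.example.com/api",  # placeholder API reference root
    "max_depth": 2,
    "max_pages": 40,
}
```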
News Article Collection
Scrape multiple articles from a news site:
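An illustrative payload for collecting articles linked from a section front page; the starting URL and limits are placeholders.

```python
# Start from a section front page and follow links to individual articles.
payload = {
    "type": "scraper",
    "url": "https://news.example.com/technology",  # placeholder section URL
    "max_depth": 1,
    "max_pages": 20,
}
```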
Product Catalog Scraping
Extract product information from an e-commerce site:
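An illustrative payload for a product category; category pages typically link to product pages one level down. All values are placeholders.

```python
# Crawl a category page and the product pages it links to.
payload = {
    "type": "scraper",
    "url": "https://shop.example.com/category/widgets",  # placeholder category URL
    "max_depth": 2,
    "max_pages": 100,
}
```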
Best Practices
1. URL Validation
- Always include the full URL with http:// or https://
- Test URLs manually before scraping to ensure accessibility
- Use specific URLs rather than redirects when possible
2. Depth and Page Limits
- Start with max_depth: 0 for single pages
- Use max_depth: 1-2 for most multi-page scenarios
- Set appropriate max_pages to avoid excessive resource usage
- Consider the site structure when setting depth limits
3. Performance Considerations
- Use max_pages to control resource usage
- Start with smaller limits and increase as needed
- Consider the target site’s size and structure
- Be respectful of server resources
4. Content Quality
- Target specific sections or pages when possible
- Use appropriate depth to capture related content
- Consider the site’s navigation structure
Response Data
The Web API returns structured content including:
Basic Information
- URL: The scraped page URL
- Title: Page title and metadata
- Content: Extracted text content
- Markdown: Formatted markdown version of the content
Technical Details
- HTTP Status: Response status codes
- Content Type: MIME type of the content
- Encoding: Character encoding information
- Links: Internal and external links found
Crawling Information
- Depth: How deep the page was found from the starting URL
- Crawl Order: Order in which pages were processed
- Parent URLs: Links that led to each page
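A sketch of reading these fields from a job result in Python. The exact response schema is not shown on this page, so the key names below (results, url, title, depth, links, markdown) are illustrative assumptions; check them against an actual response from your deployment.

```python
import requests

resp = requests.post(
    "https://api.example.com/api/v1/search/live",  # placeholder host
    json={"type": "scraper", "url": "https://example.com", "max_depth": 1, "max_pages": 5},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

# Key names are illustrative guesses; verify against the real response schema.
for page in data.get("results", []):
    print(page.get("url"), "-", page.get("title"))
    print("  depth:", page.get("depth"), "| links found:", len(page.get("links", [])))
    print("  markdown preview:", page.get("markdown", "")[:200])
```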
Limitations
- Maximum crawl depth is limited by the max_depth parameter
- Page limits are enforced by the max_pages parameter
- Some sites may block automated scraping
- JavaScript-heavy sites may not be fully scraped
- Rate limiting may apply to prevent excessive requests
Error Handling
Common Error Scenarios
Invalid URL Format
URL Not Accessible
Invalid Parameters
- max_depth must be non-negative
- max_pages must be at least 1
- url is required and cannot be empty
Missing Required Parameters
- url parameter is required for all scraping operations
- type must be "scraper"
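To avoid round-trips for the parameter errors above, requests can be checked client-side before submission. A minimal sketch; the checks mirror the documented rules and are not the API's own validation code.

```python
from urllib.parse import urlparse

def validate_scraper_job(payload: dict) -> list[str]:
    """Return a list of problems mirroring the documented error scenarios."""
    errors = []
    if payload.get("type") != "scraper":
        errors.append('type must be "scraper"')
    url = payload.get("url", "")
    if not url:
        errors.append("url is required and cannot be empty")
    elif urlparse(url).scheme not in ("http", "https"):
        errors.append("url must include an http:// or https:// scheme")
    if payload.get("max_depth", 0) < 0:
        errors.append("max_depth must be non-negative")
    if payload.get("max_pages", 1) < 1:
        errors.append("max_pages must be at least 1")
    return errors

print(validate_scraper_job({"type": "scraper", "url": "example.com"}))
# ['url must include an http:// or https:// scheme']
```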
Notes
- All scraped content is automatically converted to markdown format
- The scraper respects robots.txt by default
- Multi-page crawling follows internal links found on pages
- External links are not followed during crawling
- Content extraction focuses on main page content, not navigation or ads