Web Scraper Module
Extract structured data from websites for AI training and analysis.
Overview
The Web Scraper extracts content from web pages:
- Article text and metadata
- Product information
- Tables and lists
- Images and links
Getting Started
- Navigate to Dashboard > Web Scraper
- Enter the URL(s) to scrape
- Configure extraction options
- Click Scrape
- Review and export results
Scraping Modes
Single Page
Scrape one URL with detailed extraction:
- Full page content
- All links and images
- Structured data (JSON-LD, microdata)
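To make the "structured data" item concrete, here is a minimal sketch of how JSON-LD blocks can be pulled out of a page's HTML using only the Python standard library. It illustrates the general technique and is not the Web Scraper's own implementation; the names `JsonLdParser` and `extract_json_ld` are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block instead of failing the whole page


def extract_json_ld(html):
    """Return a list with one parsed object per valid JSON-LD block."""
    parser = JsonLdParser()
    parser.feed(html)
    return parser.items
```

Microdata (`itemscope`/`itemprop` attributes) needs a different parser and is not covered by this sketch.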
Multi-Page
Scrape multiple URLs at once:
- Paste multiple URLs (one per line)
- Or use a sitemap URL
- Batch processing with progress tracking
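Both input styles reduce to a plain list of URLs. The sketch below shows one way to build that list, assuming the sitemap follows the standard sitemaps.org XML format and has already been downloaded as text. The function names are hypothetical and do not come from the product.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_pasted_text(text):
    """One URL per line; blank lines and surrounding whitespace are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def urls_from_sitemap(xml_text):
    """Return every <loc> value in a sitemap document.

    For a sitemap index file, the returned values are the URLs of
    child sitemaps rather than of pages.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```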
Crawl Mode
Start from one page and follow links:
- Set depth limit (1-3 levels recommended)
- Filter by URL pattern
- Great for documentation sites
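The depth limit and URL pattern filter can be pictured as a breadth-first traversal. The sketch below is an illustration under assumptions, not the product's crawler: `crawl` is a hypothetical name, `get_links` is a caller-supplied function that returns the links found on a page, and the pattern is treated as a regular expression.

```python
import re
from collections import deque


def crawl(start_url, get_links, max_depth=2, url_pattern=None):
    """Return URLs in visit order, following links at most max_depth levels deep."""
    pattern = re.compile(url_pattern) if url_pattern else None
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, do not follow its links
        for link in get_links(url):
            if link in seen:
                continue  # never queue the same URL twice
            if pattern and not pattern.search(link):
                continue  # outside the URL pattern filter
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Usage with an in-memory link graph standing in for real pages.
site = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/blog/x",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/c"],
    "https://example.com/docs/c": ["https://example.com/docs/d"],
}
pages = crawl(
    "https://example.com/docs/",
    lambda url: site.get(url, []),
    max_depth=2,
    url_pattern=r"/docs/",
)
# pages holds the start page, docs/a, docs/b and docs/c;
# blog/x fails the filter and docs/d lies beyond the depth limit.
```

Each extra level can multiply the number of pages, which is why a small depth limit is the safer default.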
Extraction Options
| Option | Description |
|---|---|
| Main content | Extract article/main text only |
| Full page | All text including navigation |
| Tables | Extract tables as structured data |
| Links | Capture all links on page |
| Images | Download images with alt text |
| Metadata | Page title, description, author |
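As an illustration of what "tables as structured data" can mean, the following sketch converts each HTML table into a list of row dictionaries keyed by the first row's cells. It assumes the first row is the header and does not handle nested tables or `colspan`/`rowspan`. `TableParser` and `extract_tables` are hypothetical names, and the product's actual output shape may differ.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html):
    """Return one list of {header: value} dicts per non-empty table."""
    parser = TableParser()
    parser.feed(html)
    return [
        [dict(zip(rows[0], row)) for row in rows[1:]]
        for rows in parser.tables
        if rows
    ]
```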
Output Formats
| Format | Best For |
|---|---|
| JSON | Structured data with all metadata |
| JSONL | Training data (one object per page) |
| Markdown | Readable documentation |
| CSV | Tabular data from multiple pages |
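The difference between JSON and JSONL is easiest to see in code. JSONL stores one complete JSON object per line, so a file can be streamed or split without parsing it whole. This is a minimal sketch; the `url` and `text` field names in the usage example are illustrative assumptions, not the product's documented schema.

```python
import json


def to_jsonl(pages):
    """Serialize an iterable of dicts as JSONL: one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each object stays on one line.
    return "".join(json.dumps(page, ensure_ascii=False) + "\n" for page in pages)


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Usage: round-trip two page records (field names are hypothetical).
pages = [
    {"url": "https://example.com/a", "text": "Line one\nline two"},
    {"url": "https://example.com/b", "text": "café"},
]
restored = from_jsonl(to_jsonl(pages))
```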
Session Output
Each scrape creates a session containing:
- content.json - All extracted data
- content.jsonl - Line-delimited for training
- content.md - Markdown version
- images/ - Downloaded images (if enabled)
Best Practices
- Respect robots.txt: don't scrape pages the site has disallowed
- Add delays: space out requests so you don't overload target servers
- Check terms of service: make sure the site permits scraping
- Verify output: preview the extracted content before using it
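The first two practices can be sketched with Python's standard `urllib.robotparser` module. The helpers below are illustrative rather than part of the product: `allowed_urls` filters a URL list against the text of a robots.txt file, and `polite_fetch_all` waits between requests. `fetch` is a caller-supplied function, and the one-second default delay is an assumption, not a recommended value.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Keep only the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL in order, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)  # no delay before the first request
        results.append(fetch(url))
    return results
```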
Example: Create Training Data
- Scrape documentation pages with Crawl Mode
- Select JSONL output format
- Share the JSONL file with Full Access
- Use in Custom GPT for documentation Q&A
JavaScript-Rendered Content
For pages that require JavaScript:
- Enable the JavaScript rendering option
- Processing takes longer but captures dynamic content
- Works with React, Vue, and Angular sites
Handling Authentication
For sites requiring login:
1. Use browser cookies (exported from your browser)
2. Configure custom headers

Note: Use this only on sites you have permission to access.
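At the HTTP level, exported cookies and custom headers both end up as request headers. The sketch below builds such a request with the standard `urllib.request.Request` class without sending it. `build_authenticated_request`, the cookie names, and the User-Agent string are hypothetical, and how the Web Scraper applies these settings internally is not shown here.

```python
from urllib.request import Request


def build_authenticated_request(url, cookies=None, extra_headers=None):
    """Build (but do not send) a request carrying cookies and custom headers."""
    headers = {"User-Agent": "docs-scraper-example/0.1"}  # placeholder value
    headers.update(extra_headers or {})
    if cookies:
        # Cookies travel in a single header as "name=value" pairs joined by "; ".
        headers["Cookie"] = "; ".join(
            f"{name}={value}" for name, value in cookies.items()
        )
    return Request(url, headers=headers)


# Usage: cookie names and values here are made up for illustration.
request = build_authenticated_request(
    "https://example.com/account",
    cookies={"sessionid": "abc123", "theme": "dark"},
    extra_headers={"Accept-Language": "en"},
)
```

Session cookies are credentials, so exported cookie files should be stored and shared with the same care as passwords.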
Troubleshooting
"Page blocked" or 403 errors
- The site may be blocking automated requests
- Try adding a User-Agent header
- Check whether login is required
"Content not extracted"
- Enable JavaScript rendering
- Check whether the content is inside iframes
- Verify that the URL is accessible
"Slow scraping"
- Multi-page scrapes take time
- Enable concurrent requests (where appropriate)
- Reduce crawl depth
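Concurrent requests speed up a batch by fetching several pages at once. Here is a minimal sketch using the standard `concurrent.futures` module. `fetch_concurrently` is a hypothetical helper, `fetch` is a caller-supplied function, and the default of four workers is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_concurrently(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

More workers also means more simultaneous load on the target server, so keep the number small for sites you do not control.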
Scraped content appears in My Files > Sessions