Web Scraper Module

Extract structured data from websites for AI training and analysis.

Overview

The Web Scraper extracts content from web pages:
- Article text and metadata
- Product information
- Tables and lists
- Images and links

Getting Started

1. Navigate to Dashboard > Web Scraper
2. Enter the URL(s) to scrape
3. Configure extraction options
4. Click Scrape
5. Review and export results

Scraping Modes

Single Page

Scrape one URL with detailed extraction:
- Full page content
- All links and images
- Structured data (JSON-LD, microdata)
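
To make the "structured data" item concrete, here is a minimal sketch of how JSON-LD can be pulled out of a page using only the Python standard library. It is not the Web Scraper's own implementation; the class name `JsonLdExtractor`, the function `extract_json_ld`, and the sample HTML are assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than failing the whole page.


def extract_json_ld(html):
    """Return every JSON-LD object found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items


# Hypothetical sample page used only to demonstrate the function.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Hello"}
</script>
</head><body><p>Text</p></body></html>
"""
json_ld_items = extract_json_ld(sample_html)
print(json_ld_items)
```

Malformed JSON-LD blocks are skipped silently in this sketch; a production extractor would more likely log them.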

Multi-Page

Scrape multiple URLs at once:
- Paste multiple URLs (one per line)
- Or use a sitemap URL
- Batch processing with progress tracking
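
The two input styles above both reduce to a list of URLs. The sketch below shows one plausible way to build that list, assuming the sitemap follows the standard sitemaps.org XML schema. The function names `urls_from_sitemap` and `urls_from_pasted_text` and the sample data are hypothetical, not part of the product.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def urls_from_pasted_text(text):
    """Split pasted input into URLs: one per line, blanks and duplicates dropped."""
    unique = dict.fromkeys(line.strip() for line in text.splitlines() if line.strip())
    return list(unique)


# Hypothetical sitemap used only for demonstration.
sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>
""".strip()

sitemap_urls = urls_from_sitemap(sample_sitemap)
pasted_urls = urls_from_pasted_text("https://example.com/a\n\nhttps://example.com/b\nhttps://example.com/a\n")
print(sitemap_urls)
print(pasted_urls)
```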

Crawl Mode

Start from one page and follow links:
- Set depth limit (1-3 levels recommended)
- Filter by URL pattern
- Great for documentation sites
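
The depth limit and URL pattern filter can be illustrated with a small breadth-first crawl. This is a sketch under stated assumptions, not the module's actual crawler: the `crawl` function, its `fetch` callback, and the in-memory `FAKE_SITE` are all hypothetical, and the filter is assumed to be a regular expression.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_depth=2, url_pattern=r".*"):
    """Breadth-first crawl. `fetch(url)` returns HTML, or None on failure."""
    pattern = re.compile(url_pattern)
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # Depth limit reached: scrape the page but follow no links.
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen and pattern.search(absolute):
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


# In-memory stand-in for a website so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/docs/": '<a href="intro">Intro</a> <a href="/blog/post">Blog</a>',
    "https://example.com/docs/intro": '<a href="advanced#setup">Advanced</a>',
    "https://example.com/docs/advanced": '<a href="deep">Deep</a>',
    "https://example.com/docs/deep": "",
    "https://example.com/blog/post": "",
}
crawled = crawl("https://example.com/docs/", FAKE_SITE.get, max_depth=2, url_pattern=r"/docs/")
print(crawled)
```

With a depth limit of 2, the page two links away from the start is scraped but its own links are not followed. The `/docs/` pattern keeps the crawl out of the blog.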

Extraction Options

Option	Description
Main content	Extract article/main text only
Full page	All text including navigation
Tables	Extract tables as structured data
Links	Capture all links on page
Images	Download images with alt text
Metadata	Page title, description, author

Output Formats

Format	Best For
JSON	Structured data with all metadata
JSONL	Training data (one object per page)
Markdown	Readable documentation
CSV	Tabular data from multiple pages
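
The difference between these formats is easiest to see on the same data. The sketch below serializes two page records three ways with the Python standard library. The field names (`url`, `title`, `text`) are assumptions for illustration and may not match the fields the Web Scraper actually emits.

```python
import csv
import io
import json

# Hypothetical page records; real exports may use different field names.
records = [
    {"url": "https://example.com/a", "title": "Page A", "text": "First page."},
    {"url": "https://example.com/b", "title": "Page B", "text": "Second page."},
]

# JSON: a single document holding every record.
as_json = json.dumps(records, indent=2)

# JSONL: one self-contained JSON object per line, convenient for streaming.
as_jsonl = "\n".join(json.dumps(record) for record in records)

# CSV: one row per page, with a header row naming the columns.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["url", "title", "text"])
writer.writeheader()
writer.writerows(records)
as_csv = csv_buffer.getvalue()

print(as_jsonl)
```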

Session Output

Each scrape creates a session containing:
- content.json - All extracted data
- content.jsonl - Line-delimited for training
- content.md - Markdown version
- images/ - Downloaded images (if enabled)
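
Once a session has been downloaded, content.jsonl can be read one page at a time. The following is a minimal sketch: the helper `load_session_pages`, the temporary folder standing in for a session, and the `url` field are assumptions, since the exact record layout is not specified here.

```python
import json
import tempfile
from pathlib import Path


def load_session_pages(session_dir):
    """Yield one dict per non-empty line of the session's content.jsonl."""
    with open(Path(session_dir) / "content.jsonl", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demonstration with a temporary folder standing in for a downloaded session.
with tempfile.TemporaryDirectory() as session_dir:
    sample_file = Path(session_dir) / "content.jsonl"
    sample_file.write_text(
        '{"url": "https://example.com/a"}\n{"url": "https://example.com/b"}\n',
        encoding="utf-8",
    )
    loaded_pages = list(load_session_pages(session_dir))

print(len(loaded_pages))
```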

Best Practices

Respect robots.txt - Don't scrape restricted pages
Add delays - Don't overload target servers
Check terms of service - Ensure scraping is allowed
Verify output - Preview content before using
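
For readers scripting their own requests alongside the module, the first two practices can be sketched with Python's `urllib.robotparser`. The user agent name, the sample robots.txt, and the `scrape_politely` helper are hypothetical. Whether the Web Scraper performs these checks automatically is not stated here.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # Hypothetical user agent name.

# Stand-in robots.txt; a real run would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Use the delay the site asks for when present, else fall back to 1 second.
delay_seconds = robots.crawl_delay(USER_AGENT) or 1


def scrape_politely(urls, fetch, delay=delay_seconds):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    results = {}
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # Restricted by robots.txt, so skip it.
        results[url] = fetch(url)
        time.sleep(delay)
    return results


print(robots.can_fetch(USER_AGENT, "https://example.com/private/data"))
```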

Example: Create Training Data

1. Scrape documentation pages with Crawl Mode
2. Select JSONL output format
3. Share the JSONL file with Full Access
4. Use in Custom GPT for documentation Q&A

JavaScript-Rendered Content

For pages that require JavaScript:
- Enable JavaScript rendering option
- Processing takes longer but captures dynamic content
- Works with React, Vue, Angular sites

Handling Authentication

For sites requiring login:
1. Use browser cookies (export from browser)
2. Configure custom headers
3. Note: Use only on sites you have permission to access
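
As an illustration of what cookies and custom headers amount to at the HTTP level, the sketch below builds a request object with the standard library without sending it. The cookie names, the placeholder token, and the helper `build_authenticated_request` are hypothetical. The Web Scraper's own configuration fields may look different.

```python
from urllib.request import Request


def build_authenticated_request(url, cookies, extra_headers=None):
    """Attach exported cookies and any custom headers to a request."""
    headers = {"Cookie": "; ".join(f"{name}={value}" for name, value in cookies.items())}
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


# Placeholder values; substitute cookies exported from your own logged-in browser.
auth_request = build_authenticated_request(
    "https://example.com/account",
    cookies={"sessionid": "abc123", "theme": "dark"},
    extra_headers={"Authorization": "Bearer <token>"},
)
print(auth_request.get_header("Cookie"))
```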

Troubleshooting

"Page blocked" or 403 errors

The site may be blocking automated requests
Try adding a User-Agent header
Check whether login is required

"Content not extracted"

Enable JavaScript rendering
Check if content is in iframes
Verify URL is accessible
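
A quick way to check the iframe case is to list the iframe sources in the page's HTML and scrape those URLs directly. This is a minimal sketch using the standard library; `IframeFinder`, `find_iframes`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class IframeFinder(HTMLParser):
    """Record the src attribute of every <iframe> on the page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.sources.append(dict(attrs).get("src"))


def find_iframes(html):
    finder = IframeFinder()
    finder.feed(html)
    return finder.sources


# Hypothetical page whose main content lives in an embedded frame.
iframe_sources = find_iframes('<p>Intro</p><iframe src="https://example.com/embed/1"></iframe>')
print(iframe_sources)
```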

"Slow scraping"

Multi-page scrapes take time
Enable concurrent requests (where appropriate)
Reduce crawl depth
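
Concurrent requests speed up a batch by overlapping the time spent waiting on each server. The sketch below shows the idea with a small thread pool. The `fetch_all` helper and the stand-in `fake_fetch` function are assumptions, and the worker count is deliberately low so a single site is not overloaded.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Fetch URLs concurrently; results are paired with URLs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return f"<html>{url}</html>"


batch = fetch_all(["https://example.com/a", "https://example.com/b"], fake_fetch)
print(len(batch))
```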

Scraped content appears in My Files > Sessions