Web Scraper Module

Extract structured data from websites for AI training and analysis.

Overview

The Web Scraper extracts content from web pages:

  • Article text and metadata
  • Product information
  • Tables and lists
  • Images and links

Getting Started

  1. Navigate to Dashboard > Web Scraper
  2. Enter the URL(s) to scrape
  3. Configure extraction options
  4. Click Scrape
  5. Review and export results

Scraping Modes

Single Page

Scrape one URL with detailed extraction:

  • Full page content
  • All links and images
  • Structured data (JSON-LD, microdata)
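
To illustrate what "structured data (JSON-LD)" refers to, here is a minimal sketch that pulls JSON-LD blocks out of raw HTML using only the Python standard library. It is not the Web Scraper's own implementation; the names `JsonLdParser` and `extract_json_ld` are hypothetical and exist only for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block instead of failing the whole page


def extract_json_ld(html):
    """Return a list with one parsed object per valid JSON-LD block in the page."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_json_ld(page_html)` on a page with one valid JSON-LD block returns a one-element list; ordinary `<script>` tags are ignored.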

Multi-Page

Scrape multiple URLs at once:

  • Paste multiple URLs (one per line)
  • Or use a sitemap URL
  • Batch processing with progress tracking
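
The two input styles above both reduce to a list of URLs. The sketch below shows one way that could look, assuming a standard sitemaps.org XML sitemap; the function names are hypothetical and this is not the module's actual code.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap document.

    In a sitemap index file the <loc> values point at child sitemaps
    rather than pages, so those would need a second pass.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def urls_from_pasted_text(text):
    """Split pasted input into URLs: one per line, blanks and duplicates dropped."""
    seen = set()
    urls = []
    for line in text.splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```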

Crawl Mode

Start from one page and follow links:

  • Set depth limit (1-3 levels recommended)
  • Filter by URL pattern
  • Great for documentation sites
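
How a depth limit and a URL pattern interact can be sketched as a breadth-first traversal. This is an illustration under assumptions, not the module's crawler: `crawl` and its parameters are hypothetical, and `get_links` stands in for whatever fetches a page and returns the links found on it.

```python
import re
from collections import deque


def crawl(start_url, get_links, max_depth=2, url_pattern=None):
    """Breadth-first crawl from start_url.

    get_links(url) must return the links found on that page.
    Links are followed at most max_depth levels away from the start page,
    and only if they match url_pattern (a regular expression), when given.
    Returns the URLs in the order they were discovered.
    """
    pattern = re.compile(url_pattern) if url_pattern else None
    discovered = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # page is at the depth limit: keep it, but do not follow its links
        for link in get_links(url):
            if link in seen:
                continue
            if pattern and not pattern.search(link):
                continue
            seen.add(link)
            discovered.append(link)
            queue.append((link, depth + 1))
    return discovered
```

With `url_pattern=r"/docs/"`, links to other parts of the site are never queued, which is why a pattern filter tends to work well on documentation sites. Each extra level of depth can multiply the number of pages, so a small limit keeps the crawl bounded.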

Extraction Options

Option          Description
Main content    Extract article/main text only
Full page       All text including navigation
Tables          Extract tables as structured data
Links           Capture all links on page
Images          Download images with alt text
Metadata        Page title, description, author

Output Formats

Format      Best For
JSON        Structured data with all metadata
JSONL       Training data (one object per page)
Markdown    Readable documentation
CSV         Tabular data from multiple pages

Session Output

Each scrape creates a session containing:

  • content.json - All extracted data
  • content.jsonl - Line-delimited for training
  • content.md - Markdown version
  • images/ - Downloaded images (if enabled)
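
"Line-delimited" means each line of content.jsonl is a complete JSON object, so the file can be read or written one page at a time. A minimal sketch follows; the helper names are hypothetical, and the field names inside each object are not documented here, so the example makes no assumption about them.

```python
import json


def read_jsonl(path):
    """Load a JSONL file into a list of objects, one per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def write_jsonl(records, path):
    """Write each record as a single JSON line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, `read_jsonl("content.jsonl")` returns one object per scraped page, which can then be filtered and passed to `write_jsonl` to produce a smaller training file.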

Best Practices

  1. Respect robots.txt - Don't scrape restricted pages
  2. Add delays - Don't overload target servers
  3. Check terms of service - Ensure scraping is allowed
  4. Verify output - Preview content before using
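
The first two practices can be sketched with Python's standard `urllib.robotparser` module. This illustrates the idea only and is not how the Web Scraper enforces it: `build_robots_checker`, `polite_fetch_all`, and the `fetch` callable are hypothetical names, and the one-second default delay is an arbitrary assumption.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether robots.txt allows a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


def polite_fetch_all(urls, fetch, allowed, delay_seconds=1.0, sleep=time.sleep):
    """Fetch allowed URLs one at a time, pausing between requests.

    fetch(url) is supplied by the caller; disallowed URLs are skipped.
    """
    results = {}
    first = True
    for url in urls:
        if not allowed(url):
            continue
        if not first:
            sleep(delay_seconds)  # spread requests out so the server is not overloaded
        first = False
        results[url] = fetch(url)
    return results
```

Passing `sleep` as a parameter keeps the pacing logic easy to test without real waiting.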

Example: Create Training Data

  1. Scrape documentation pages with Crawl Mode
  2. Select JSONL output format
  3. Share the JSONL file with Full Access
  4. Use in Custom GPT for documentation Q&A

JavaScript-Rendered Content

For pages that require JavaScript:

  • Enable JavaScript rendering option
  • Processing takes longer but captures dynamic content
  • Works with React, Vue, Angular sites

Handling Authentication

For sites requiring login:

  1. Use browser cookies (export from browser)
  2. Configure custom headers
  3. Note: Use only on sites you have permission to access
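
At the HTTP level, exported cookies and custom headers end up as request headers. The sketch below builds such a request with the standard library without sending it; `build_authenticated_request`, the cookie names, and the User-Agent string are all hypothetical, and the module's own header configuration may look different.

```python
from urllib.request import Request


def build_authenticated_request(url, cookies=None, extra_headers=None,
                                user_agent="example-scraper/1.0"):
    """Build (but do not send) a request carrying cookies and custom headers.

    cookies is a dict of name -> value, e.g. copied from a browser export.
    """
    headers = {"User-Agent": user_agent}
    if cookies:
        # Cookies travel in a single "Cookie" header as name=value pairs.
        headers["Cookie"] = "; ".join(f"{name}={value}" for name, value in cookies.items())
    if extra_headers:
        headers.update(extra_headers)
    return Request(url, headers=headers)
```

The resulting object can be passed to `urllib.request.urlopen`. Setting an explicit User-Agent, as done here, is also the kind of header some sites expect before they will respond without a 403.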

Troubleshooting

"Page blocked" or 403 errors

  • Site may block scraping
  • Try adding User-Agent header
  • Check if login is required

"Content not extracted"

  • Enable JavaScript rendering
  • Check if content is in iframes
  • Verify URL is accessible

"Slow scraping"

  • Multi-page scrapes take time
  • Enable concurrent requests (where appropriate)
  • Reduce crawl depth
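
As a rough picture of what concurrent requests do, the sketch below fetches several URLs in parallel with a small thread pool. It is an assumption-laden illustration, not the module's implementation: `fetch_concurrently` and the `fetch` callable are hypothetical, and the worker count of 4 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_concurrently(urls, fetch, max_workers=4):
    """Run fetch(url) for each URL on a small thread pool.

    Returns a dict mapping each URL to its result. A low max_workers keeps
    the load on the target server modest.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Concurrency shortens total time because most of a scrape is spent waiting on the network, but it also raises the request rate, which is why it suits only sites that tolerate it.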

Scraped content appears under My Files > Sessions.
