# Web Scraping
Scrape websites and index content into the knowledge base.
## Quick Start
1. **Dashboard → Scrape tab → Enter URL**
2. Configure scope and depth
3. Click "Start Scrape"
4. Content auto-indexes when complete
## Configuration
| Setting | Values | Default | Description |
|---------|--------|---------|-------------|
| URL | string | - | Starting URL to scrape |
| Max Pages | 1-1000 | 100 | Maximum number of pages to scrape |
| Max Depth | 0-10 | 3 | Link depth from the start URL (0 = start page only) |
| Scope | SUBPAGES, HOSTNAME, DOMAIN | HOSTNAME | Which linked URLs are eligible to crawl |
| Include Patterns | glob patterns | - | Only scrape URLs matching these patterns |
| Exclude Patterns | glob patterns | - | Skip URLs matching these patterns |
| Scrape Mode | AUTO, FAST, FULL | AUTO | How pages are fetched |
| Cookies | string | - | Cookie string sent with requests (for authenticated sites) |
| Force Rescrape | boolean | false | Re-scrape pages even if their content is unchanged |
**Scope values:**
- `SUBPAGES` - Only pages under the starting path
- `HOSTNAME` - All pages on the same hostname
- `DOMAIN` - All pages on the same domain, including subdomains
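For example, with a start URL of `https://docs.example.com/guides/` (illustrative URLs only):
```text
SUBPAGES → only URLs under https://docs.example.com/guides/
HOSTNAME → any URL on docs.example.com
DOMAIN   → any URL on example.com or its subdomains (e.g. api.example.com)
```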
**Scrape Mode values:**
- `AUTO` - Try fast mode, fall back to full for SPAs
- `FAST` - Plain HTTP fetch; faster, but may miss JavaScript-rendered content
- `FULL` - Uses headless browser, handles all JavaScript
## GraphQL API
Start a scrape job programmatically:
```graphql
mutation StartScrape($input: StartScrapeInput!) {
  startScrape(input: $input) {
    jobId
    baseUrl
    status
  }
}
```
Variables:
```json
{
"input": {
"url": "https://docs.example.com",
"maxPages": 100,
"maxDepth": 3,
"scope": "HOSTNAME",
"includePatterns": ["/docs/*", "/api/*"],
"excludePatterns": ["/blog/*", "/changelog/*"],
"scrapeMode": "AUTO",
"cookies": "session=abc123; auth=xyz789",
"forceRescrape": false
}
}
```
Check job status:
```graphql
query GetScrapeJob($jobId: ID!) {
  getScrapeJob(jobId: $jobId) {
    job {
      jobId
      status
      totalUrls
      processedCount
      failedCount
    }
  }
}
```
List jobs:
```graphql
query ListScrapeJobs($limit: Int) {
  listScrapeJobs(limit: $limit) {
    items {
      jobId
      baseUrl
      status
      processedCount
      totalUrls
    }
  }
}
```
Cancel a job:
```graphql
mutation CancelScrape($jobId: ID!) {
  cancelScrape(jobId: $jobId) {
    jobId
    status
  }
}
```
### Authentication
Include your API key in the request headers:
```
x-api-key: da2-xxxxxxxxxxxx
```
Get your API key from **Dashboard → Settings → API Key**.
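Putting the pieces together, a script can start a scrape and poll its status over GraphQL-over-HTTP. The sketch below is illustrative: the endpoint URL, the terminal status names, and the `gql` helper are placeholders/assumptions, not part of the documented API, so adjust them to your deployment and schema.
```typescript
// Minimal sketch: start a scrape job, then poll it until it finishes.
// GRAPHQL_ENDPOINT, API_KEY, and the terminal status names are assumptions.
const GRAPHQL_ENDPOINT = "https://<your-graphql-endpoint>/graphql"; // replace with your endpoint
const API_KEY = "da2-xxxxxxxxxxxx";                                 // Dashboard → Settings → API Key

async function gql<T>(query: string, variables: object): Promise<T> {
  const res = await fetch(GRAPHQL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": API_KEY },
    body: JSON.stringify({ query, variables }),
  });
  const { data, errors } = await res.json();
  if (errors) throw new Error(JSON.stringify(errors));
  return data as T;
}

async function main() {
  // Start the scrape (same mutation and variables as shown above).
  const start = await gql<{ startScrape: { jobId: string; status: string } }>(
    `mutation StartScrape($input: StartScrapeInput!) {
       startScrape(input: $input) { jobId baseUrl status }
     }`,
    { input: { url: "https://docs.example.com", maxPages: 100, maxDepth: 3, scope: "HOSTNAME" } }
  );
  const jobId = start.startScrape.jobId;

  // Poll the job until it reaches a terminal state (status names assumed).
  for (;;) {
    const { getScrapeJob } = await gql<{
      getScrapeJob: { job: { status: string; processedCount: number; totalUrls: number } };
    }>(
      `query GetScrapeJob($jobId: ID!) {
         getScrapeJob(jobId: $jobId) { job { jobId status totalUrls processedCount failedCount } }
       }`,
      { jobId }
    );
    const { status, processedCount, totalUrls } = getScrapeJob.job;
    console.log(`${status}: ${processedCount}/${totalUrls}`);
    if (["COMPLETED", "FAILED", "CANCELLED"].includes(status)) break;
    await new Promise((r) => setTimeout(r, 5000)); // wait 5 seconds between polls
  }
}

main().catch(console.error);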
## How It Works
```text
Start URL → Discovery Queue → Process Queue → S3 → Knowledge Base
```
1. **ScrapeStart** - Creates job, queues initial URL
2. **ScrapeDiscover** - Finds links, respects scope and depth limits, queues new URLs (see the sketch below)
3. **ScrapeProcess** - Fetches content, converts to markdown, saves to S3
4. **ProcessDocument** - Standard pipeline indexes the markdown
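Conceptually, the discovery step decides whether each found link gets queued. The sketch below is illustrative only (it is not the actual ScrapeDiscover code); it just expresses the scope and depth rules from the configuration table as a function:
```typescript
// Illustrative scope/depth filtering during link discovery (not the real implementation).
type Scope = "SUBPAGES" | "HOSTNAME" | "DOMAIN";

function shouldQueue(link: string, startUrl: string, scope: Scope, depth: number, maxDepth: number): boolean {
  if (depth > maxDepth) return false; // beyond the configured link depth

  const target = new URL(link, startUrl); // resolve relative links against the start URL
  const start = new URL(startUrl);
  if (!["http:", "https:"].includes(target.protocol)) return false;

  if (scope === "SUBPAGES") {
    // Only pages under the starting path
    return target.hostname === start.hostname && target.pathname.startsWith(start.pathname);
  }
  if (scope === "HOSTNAME") {
    // All pages on the same hostname
    return target.hostname === start.hostname;
  }
  // DOMAIN: same registrable domain, including subdomains (simplified check)
  const root = rootDomain(start.hostname);
  return target.hostname === root || target.hostname.endsWith("." + root);
}

// Naive root-domain helper (real implementations use a public-suffix list).
function rootDomain(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}
```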
## Deduplication
Content is hashed using SHA-256. Re-scraping skips unchanged pages (hash match) unless "Force Rescrape" is enabled.
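As an illustration, the check amounts to comparing a fresh content hash with the hash recorded on the previous scrape. The sketch below uses Node's built-in `crypto` module; the stored-hash lookup and function name are placeholders:
```typescript
import { createHash } from "node:crypto";

// Sketch of the dedup decision: re-index a page only if its SHA-256 hash
// differs from the previously stored hash, or if Force Rescrape is set.
// (previousHash lookup and this helper are placeholders, not the real code.)
function shouldReindex(content: string, previousHash: string | undefined, forceRescrape: boolean): boolean {
  const hash = createHash("sha256").update(content).digest("hex");
  return forceRescrape || hash !== previousHash;
}
```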
## Real-time Updates
Progress is published via GraphQL subscriptions, and the UI updates automatically as pages are processed.
## Troubleshooting
### Scrape stuck at 0%
- Check ScrapeDiscover Lambda logs
- Verify the start URL is accessible
### Pages missing
- Check the scope setting (`SUBPAGES` is the most restrictive)
- Increase Max Depth
- Some SPAs need `FULL` mode
### Content garbled
- Try "full" mode for JavaScript-heavy sites