Overview
Knowledge base connectors allow you to automatically sync content from external sources into your AI agent’s knowledge base. Instead of manually uploading files, connectors can fetch and update content programmatically, ensuring your agents always have access to the latest information.
Available Connectors
Website Scraper Connector
The Website Scraper connector fetches and extracts content from web pages, making it easy to keep your agent informed about documentation, help articles, or any web-based content.
Use Cases
- Documentation Sites: Keep your agent updated with the latest product documentation
- Help Centers: Sync FAQ pages and support articles
- Blog Posts: Include recent blog content in your agent’s knowledge
- Company Pages: Pull content from About Us, Terms of Service, or other key pages
Configuration
To use the Website Scraper connector, you need to configure it with the following parameters:
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | The web page URL to scrape (must be HTTP or HTTPS) |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| timeout | integer | 30 | Request timeout in seconds |
| selectors | object | null | Custom CSS or tag selectors for targeted content extraction |
Credentials (Optional)
| Field | Type | Description |
|---|---|---|
| headers | object | Custom HTTP headers for authenticated requests (e.g., API keys, auth tokens) |
Basic Example
```json
{
  "connector_type": "website",
  "config": {
    "url": "https://docs.example.com/api-guide"
  },
  "credentials": {}
}
```
Advanced Example with Selectors
For more control over what content gets extracted, you can specify custom selectors:
```json
{
  "connector_type": "website",
  "config": {
    "url": "https://docs.example.com/api-guide",
    "timeout": 60,
    "selectors": {
      "Main Content": {
        "selector": "article.documentation",
        "type": "css"
      },
      "Code Examples": {
        "selector": "pre.code-block",
        "type": "css"
      },
      "Headers": {
        "selector": "h2",
        "type": "tag"
      }
    }
  },
  "credentials": {}
}
```
Authenticated Requests Example
If the website requires authentication or custom headers:
```json
{
  "connector_type": "website",
  "config": {
    "url": "https://internal-docs.example.com/guide"
  },
  "credentials": {
    "headers": {
      "Authorization": "Bearer your-api-token",
      "X-Custom-Header": "custom-value"
    }
  }
}
```
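When credentials are supplied, the connector sends these headers with its outgoing HTTP request. A minimal sketch of the idea in Python, using only the standard library (illustrative only; the connector's actual request handling may differ, and no request is actually sent here):

```python
from urllib.request import Request

# Hypothetical illustration: merge connector credentials into an HTTP request.
config = {"url": "https://internal-docs.example.com/guide"}
credentials = {
    "headers": {
        "Authorization": "Bearer your-api-token",
        "X-Custom-Header": "custom-value",
    }
}

# Build the request with the custom headers attached (no network call is made).
request = Request(config["url"], headers=credentials.get("headers", {}))

print(request.get_header("Authorization"))  # Bearer your-api-token
```

Header names and values are passed through as-is, so any scheme the target site accepts (bearer tokens, API keys, cookies) can be expressed this way.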
The Website Scraper automatically:
- Validates URLs: Only HTTP/HTTPS schemes are allowed, and private IPs/localhost are blocked for security
- Cleans Content: Removes script tags, styles, navigation, footers, and other non-content elements
- Formats Text: Extracts clean, readable text with proper line breaks
Default Extraction Strategy (when no selectors are provided):
- First tries to find an `<article>` tag
- Falls back to `<main>` or `<div class="content">`
- If neither exists, extracts all text from `<body>`
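The fallback chain above can be sketched with Python's standard-library HTML parser. This is a simplified illustration, not the connector's implementation: the `<div class="content">` fallback and the content-cleaning steps are omitted for brevity.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect text found inside <article>, <main>, and <body> separately."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.text = {"article": [], "main": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.text:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Text is attributed to every tracked container currently open.
        for tag in self._stack:
            self.text[tag].append(data)

def extract_main_text(html: str) -> str:
    """Prefer <article>, then <main>, then fall back to all <body> text."""
    parser = ContentExtractor()
    parser.feed(html)
    for tag in ("article", "main", "body"):
        text = " ".join(part.strip() for part in parser.text[tag] if part.strip())
        if text:
            return text
    return ""
```

For example, `extract_main_text("<body><nav>Menu</nav><article>Guide text.</article></body>")` would prefer the article text over the navigation text.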
Custom Selectors (when provided):
- Extracts content matching each selector
- Supports both CSS selectors and HTML tag names
- Each section is labeled with the selector name
Security Features
The Website Scraper includes built-in protections against SSRF (Server-Side Request Forgery) attacks:
- Blocks requests to private IP ranges (10.x.x.x, 172.16.x.x, 192.168.x.x)
- Blocks localhost and loopback addresses
- Blocks link-local addresses (e.g., AWS metadata service at 169.254.169.254)
- Only allows HTTP and HTTPS protocols
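These checks amount to validating the URL before any request is made. A sketch of that kind of guard in Python (illustrative only: it checks IP literals and `localhost`, but omits the DNS-resolution step a production SSRF guard would also need, since a public hostname can resolve to a private address):

```python
import ipaddress
from urllib.parse import urlparse

def is_url_allowed(url: str) -> bool:
    """Reject URLs that could be used for SSRF (simplified sketch)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # only HTTP/HTTPS protocols are allowed
    host = parsed.hostname
    if host is None or host == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a real guard would resolve the name and
        # re-check every resulting address. Omitted in this sketch.
        return True
    # Block private, loopback, and link-local ranges
    # (the last covers the AWS metadata service at 169.254.169.254).
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)

print(is_url_allowed("http://169.254.169.254/latest/meta-data"))  # False
```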
Limitations
- Minimum content length: 100 characters (pages yielding less extracted text will fail)
- Does not execute JavaScript (static HTML only)
- Cannot handle pages requiring complex authentication flows
- Cannot scrape content behind CAPTCHAs or bot protection
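The minimum-length rule is essentially a guard like the following sketch (the function name and error message are illustrative, not the connector's actual API):

```python
MIN_CONTENT_LENGTH = 100  # characters of extracted text required for a sync

def check_content_length(text: str) -> None:
    """Raise if the scraped text is too short to be useful."""
    cleaned = text.strip()
    if len(cleaned) < MIN_CONTENT_LENGTH:
        raise ValueError(
            f"Scraped content too short: {len(cleaned)} characters "
            f"(minimum {MIN_CONTENT_LENGTH})"
        )
```

A JavaScript-rendered page often fails this check because its static HTML contains little or no text, which is why the troubleshooting steps below suggest checking for JavaScript dependence.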
BigQuery Connector
Documentation for the BigQuery connector coming soon.
When to Use Connectors vs. File Uploads
| Scenario | Recommended Approach |
|---|---|
| Content changes frequently | Use connectors with scheduled syncing |
| Static documents (PDFs, docs) | Direct file upload |
| Web-based documentation | Website Scraper connector |
| Database queries | BigQuery connector |
| One-time knowledge addition | Direct file upload |
| Multiple related web pages | Website Scraper with multiple configurations |
Best Practices
- Start Simple: Begin with basic URL configuration, then add selectors if needed
- Test Selectors: Use browser dev tools to test CSS selectors before configuring
- Set Appropriate Timeouts: Increase timeout for slow-loading pages
- Monitor Content Length: Ensure scraped content meets the 100-character minimum
- Schedule Regular Syncs: Keep knowledge base fresh by scheduling periodic syncs
- Use Specific Selectors: Target main content areas to avoid extracting navigation and footers
Troubleshooting
“Scraped content appears empty or too short”
- Check if the URL is correct and publicly accessible
- Verify selectors are matching the expected elements
- Try removing custom selectors to use default extraction
- Check if the page requires JavaScript (not supported)
“URL validation failed”
- Ensure the URL uses HTTP or HTTPS
- Check that the URL doesn’t point to a private IP or localhost
- Verify the hostname can be resolved
“Network error while fetching URL”
- Increase the timeout value for slow-loading pages
- Check if the website requires authentication headers
- Verify the URL is accessible from your network
API Reference
For programmatic access to knowledge base management, see:
Next Steps
- Set up your first connector
- Schedule automated syncs
- Monitor sync status and errors
- Combine multiple connectors for comprehensive knowledge bases