
Overview

Knowledge base connectors allow you to automatically sync content from external sources into your AI agent’s knowledge base. Instead of manually uploading files, connectors can fetch and update content programmatically, ensuring your agents always have access to the latest information.

Available Connectors

Website Scraper Connector

The Website Scraper connector fetches and extracts content from web pages, making it easy to keep your agent informed about documentation, help articles, or any web-based content.

Use Cases

  • Documentation Sites: Keep your agent updated with the latest product documentation
  • Help Centers: Sync FAQ pages and support articles
  • Blog Posts: Include recent blog content in your agent’s knowledge
  • Company Pages: Pull content from About Us, Terms of Service, or other key pages

Configuration

To use the Website Scraper connector, you need to configure it with the following parameters:
Required Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| url | string | The web page URL to scrape (must be HTTP or HTTPS) |
Optional Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| timeout | integer | 30 | Request timeout in seconds |
| selectors | object | null | Custom CSS or tag selectors for targeted content extraction |
Credentials (Optional)
| Field | Type | Description |
|-------|------|-------------|
| headers | object | Custom HTTP headers for authenticated requests (e.g., API keys, auth tokens) |

Basic Example

```json
{
  "connector_type": "website",
  "config": {
    "url": "https://docs.example.com/api-guide"
  },
  "credentials": {}
}
```

Advanced Example with Selectors

For more control over what content gets extracted, you can specify custom selectors:
```json
{
  "connector_type": "website",
  "config": {
    "url": "https://docs.example.com/api-guide",
    "timeout": 60,
    "selectors": {
      "Main Content": {
        "selector": "article.documentation",
        "type": "css"
      },
      "Code Examples": {
        "selector": "pre.code-block",
        "type": "css"
      },
      "Headers": {
        "selector": "h2",
        "type": "tag"
      }
    }
  },
  "credentials": {}
}
```

Authenticated Requests Example

If the website requires authentication or custom headers:
```json
{
  "connector_type": "website",
  "config": {
    "url": "https://internal-docs.example.com/guide"
  },
  "credentials": {
    "headers": {
      "Authorization": "Bearer your-api-token",
      "X-Custom-Header": "custom-value"
    }
  }
}
```

Content Extraction Behavior

The Website Scraper automatically:
  1. Validates URLs: Only HTTP/HTTPS schemes are allowed, and private IPs/localhost are blocked for security
  2. Cleans Content: Removes script tags, styles, navigation, footers, and other non-content elements
  3. Formats Text: Extracts clean, readable text with proper line breaks
Default Extraction Strategy (when no selectors are provided):
  • First tries to find an <article> tag
  • Falls back to <main> or <div class="content">
  • If neither exists, extracts all text from <body>
Custom Selectors (when provided):
  • Extracts content matching each selector
  • Supports both CSS selectors and HTML tag names
  • Each section is labeled with the selector name
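The default fallback chain can be sketched with Python's standard-library HTML parser. This is an illustrative approximation, not the connector's actual code: for brevity it omits the `<div class="content">` step and the removal of script, style, and navigation elements.

```python
from html.parser import HTMLParser

class SectionText(HTMLParser):
    """Collect the text that appears inside occurrences of one target tag."""
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0       # > 0 while we are inside the target tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract(html):
    # Mirror the documented fallback chain: <article>, then <main>, then <body>.
    for tag in ("article", "main", "body"):
        parser = SectionText(tag)
        parser.feed(html)
        text = " ".join(" ".join(parser.chunks).split())
        if text:
            return text
    return ""
```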

Security Features

The Website Scraper includes built-in protections against SSRF (Server-Side Request Forgery) attacks:
  • Blocks requests to private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Blocks localhost and loopback addresses
  • Blocks link-local addresses (e.g., AWS metadata service at 169.254.169.254)
  • Only allows HTTP and HTTPS protocols
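The checks above can be approximated in a few lines of standard-library Python. This is a sketch of the idea, not the connector's actual validator:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_url_allowed(url):
    """Approximate the documented SSRF checks: an HTTP/HTTPS scheme allow-list
    plus rejection of private, loopback, and link-local addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # Resolve the hostname so DNS-based tricks cannot hide a private IP.
        addr = ipaddress.ip_address(socket.gethostbyname(parsed.hostname))
    except (OSError, ValueError):
        return False
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```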

Limitations

  • Minimum content length: 100 characters (pages whose extracted text is shorter will fail to sync)
  • Does not execute JavaScript (static HTML only)
  • Cannot handle pages requiring complex authentication flows
  • Cannot scrape content behind CAPTCHAs or bot protection
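A quick way to pre-check whether a page's extracted text will clear the documented 100-character minimum. This is a hypothetical helper; whether the limit is measured before or after whitespace is collapsed is an assumption here:

```python
MIN_CONTENT_LENGTH = 100  # documented minimum for scraped pages

def content_long_enough(text, minimum=MIN_CONTENT_LENGTH):
    # Collapse runs of whitespace before measuring (assumed to mirror
    # the cleaned text the scraper produces).
    return len(" ".join(text.split())) >= minimum
```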

BigQuery Connector

Documentation for the BigQuery connector coming soon.

When to Use Connectors vs. File Uploads

| Scenario | Recommended Approach |
|----------|----------------------|
| Content changes frequently | Use connectors with scheduled syncing |
| Static documents (PDFs, docs) | Direct file upload |
| Web-based documentation | Website Scraper connector |
| Database queries | BigQuery connector |
| One-time knowledge addition | Direct file upload |
| Multiple related web pages | Website Scraper with multiple configurations |

Best Practices

  1. Start Simple: Begin with basic URL configuration, then add selectors if needed
  2. Test Selectors: Use browser dev tools to test CSS selectors before configuring
  3. Set Appropriate Timeouts: Increase timeout for slow-loading pages
  4. Monitor Content Length: Ensure scraped content meets the 100-character minimum
  5. Schedule Regular Syncs: Keep knowledge base fresh by scheduling periodic syncs
  6. Use Specific Selectors: Target main content areas to avoid extracting navigation and footers

Troubleshooting

“Scraped content appears empty or too short”

  • Check if the URL is correct and publicly accessible
  • Verify selectors are matching the expected elements
  • Try removing custom selectors to use default extraction
  • Check if the page requires JavaScript (not supported)

“URL validation failed”

  • Ensure the URL uses HTTP or HTTPS
  • Check that the URL doesn’t point to a private IP or localhost
  • Verify the hostname can be resolved

“Network error while fetching URL”

  • Increase the timeout value for slow-loading pages
  • Check if the website requires authentication headers
  • Verify the URL is accessible from your network

API Reference

For programmatic access to knowledge base management, see the knowledge base API reference.

Next Steps

  • Set up your first connector
  • Schedule automated syncs
  • Monitor sync status and errors
  • Combine multiple connectors for comprehensive knowledge bases