Web Crawler
The Web Crawler connects to and crawls URLs you specify for use in your managed
knowledge base. The Web Crawler traverses HTML pages starting from your seed URLs,
following child links according to your crawl scope and limits. You can also provide
sitemap URLs as starting points. The Web Crawler respects robots.txt in accordance with
RFC 9309.
Important
When you select websites to crawl, you must adhere to the Amazon Acceptable Use Policy.
Note
The Web Crawler does not support document-level access control (ACLs). All indexed content is accessible to any user who has access to the knowledge base. If you need ACL filtering, use a connector that supports it (for example, Amazon S3, SharePoint, or OneDrive).
Supported features
- Crawl multiple seed URLs and sitemap URLs
- Configurable crawl depth, rate limit, and links-per-URL limit
- Crawl scope control: same host and path, host only, or host and subdomains
- URL pattern filters (inclusion and exclusion regular expressions)
- Crawl attachments linked from web pages (PDFs, documents, and so on)
- Authentication for protected sites: basic, form-based, or SAML
- Incremental content syncs for added, updated, and deleted content
Authentication methods
The Web Crawler supports four authentication methods. Choose the method that
matches how the target site authenticates users. For public sites with no sign-in,
use NO_AUTH.
| Method | How it authenticates | When to use |
|---|---|---|
| No authentication (NO_AUTH) | The crawler sends requests without credentials. | Public websites that do not require sign-in. |
| Basic authentication (BASIC_AUTH) | The crawler sends an HTTP Authorization: Basic header with a user name and password from your secret. | Sites protected by HTTP Basic Authentication (the browser-style user name and password dialog). |
| Form authentication (FORM) | The crawler signs in by submitting an HTML form. You provide the login URL, credentials, and XPath expressions that locate the form fields. | Sites that use an HTML form for sign-in. |
| SAML authentication (SAML) | The crawler signs in through a SAML identity provider's login form. You provide the IdP login URL, credentials, and XPath expressions that locate the form fields. | Sites that use SAML-based single sign-on. |
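As an illustration of what BASIC_AUTH means on the wire, the following minimal sketch builds the standard HTTP Basic header value from a user name and password. It shows only the generic HTTP scheme; the function name is hypothetical, and the crawler's internal implementation is not documented here.

```python
import base64


def basic_auth_header(user_name: str, password: str) -> dict:
    """Return the HTTP Basic Authorization header for the given credentials."""
    # HTTP Basic encodes "user:password" as Base64; it is not encryption,
    # so the target site should be served over HTTPS.
    token = base64.b64encode(f"{user_name}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


if __name__ == "__main__":
    print(basic_auth_header("your-username", "your-password"))
```

If a site rejects a request that carries this header, it most likely uses form-based or SAML sign-in rather than HTTP Basic Authentication.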
Prerequisites
For the website you want to crawl, make sure you:
- Have permission to crawl the website and its content.
- Confirm that robots.txt for the site does not disallow the URLs you want to crawl. The Web Crawler defaults to disallow if a robots.txt file is not found.
- If the site requires sign-in, identify the authentication method (basic, form, or SAML). For form and SAML, locate the XPath expressions for the user name field, password field, and submit button on the login page. To find an XPath, right-click the form element in your browser and choose Inspect, then copy the XPath from the developer tools.
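One way to run the robots.txt check from the list above before you configure a sync is with Python's standard urllib.robotparser module. This is a rough pre-check only: the function name is hypothetical, the "*" user agent is an assumption because the crawler's user-agent token is not stated here, and Python's parser does not implement every RFC 9309 rule (for example, wildcard matching), so results can differ from the crawler's.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, urls, user_agent: str = "*") -> dict:
    """Map each URL to True if the given robots.txt text allows it."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the text can come from any source
    # (a saved copy, or a download you perform separately).
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\n"
    for url, allowed in check_robots(sample, [
        "https://example.com/docs/",
        "https://example.com/private/report.html",
    ]).items():
        print(url, "allowed" if allowed else "disallowed")
```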
In your AWS account, make sure you:
- If your site requires authentication, store your credentials in an AWS Secrets Manager secret and note its Amazon Resource Name (ARN). For the exact key-value pairs, see Authentication credentials.
- Include the necessary permissions to connect to your data source in your AWS Identity and Access Management (IAM) role/permissions policy for your knowledge base. For information on the required permissions, see Permissions to access your data sources.
How to set up a Web Crawler data source
Setting up a Web Crawler data source involves the following steps:
- (If your site requires sign-in) Prepare credentials. Store the credentials for your authentication method in an AWS Secrets Manager secret. See Authentication credentials.
- Connect the data source. Create the Web Crawler data source in the knowledge base using the AWS Management Console or the API. See Create the data source.
Create the data source
Connector parameters
The data source configuration uses the following connector parameters. To use the
Web Crawler, specify WEB as the connector type in
connectorParameters. For the fields that wrap
connectorParameters (such as
deletionProtectionConfiguration and
mediaExtractionConfiguration), see Connect a data source.
| Field | Required | Description |
|---|---|---|
| seedUrls | Conditional | List of seed URLs to start crawling from. Maximum of 10. Required unless you provide siteMapUrls. |
| siteMapUrls | Conditional | List of sitemap URLs. Maximum of 3. Required unless you provide seedUrls. |
| authType | Yes | The authentication type: NO_AUTH, BASIC_AUTH, FORM, or SAML. See Authentication methods. |
| secretArn | Conditional | The ARN of the AWS Secrets Manager secret containing your credentials. Required when authType is not NO_AUTH. |
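The conditional rules in the table above can be checked locally before you submit a configuration. The following sketch assumes the four fields sit together in one flat dictionary, which may not match the exact request shape described in Connect a data source; the function name and error messages are hypothetical.

```python
AUTH_TYPES = {"NO_AUTH", "BASIC_AUTH", "FORM", "SAML"}


def validate_core_fields(params: dict) -> list:
    """Return a list of problems found in the core connector fields."""
    errors = []
    seeds = params.get("seedUrls") or []
    sitemaps = params.get("siteMapUrls") or []
    # At least one starting point is required.
    if not seeds and not sitemaps:
        errors.append("provide seedUrls or siteMapUrls")
    if len(seeds) > 10:
        errors.append("seedUrls allows at most 10 entries")
    if len(sitemaps) > 3:
        errors.append("siteMapUrls allows at most 3 entries")
    auth_type = params.get("authType")
    if auth_type not in AUTH_TYPES:
        errors.append("authType must be NO_AUTH, BASIC_AUTH, FORM, or SAML")
    elif auth_type != "NO_AUTH" and not params.get("secretArn"):
        errors.append("secretArn is required when authType is not NO_AUTH")
    return errors


if __name__ == "__main__":
    print(validate_core_fields({
        "seedUrls": ["https://example.com/docs/"],
        "authType": "BASIC_AUTH",
    }))
```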
| Field | Required | Description |
|---|---|---|
| crawlDepth | No | Maximum crawl depth. Range 0–10. 0 crawls only the specified URLs; higher values follow links deeper into the site. Defaults to 2. |
| maxLinksPerUrl | No | Maximum links to follow per URL. Range 1–1000. Defaults to 100. |
| maxCrawledUrlsPerMinute | No | Maximum URLs crawled per minute (rate limit). Range 1–300. |
| implicitWaitInSeconds | No | Wait time, in seconds, after a page reaches a ready state before the crawler reads it. Increase this for pages with dynamic JavaScript content that loads after the main template. |
| syncScope | No | The scope of links to follow. One of PATH_SPECIFIC (same host and same initial URL path as the seed URL), DOMAINS_ONLY (same host as the seed URL, any path), or SUB_DOMAINS (same primary domain, including subdomains). When omitted, the crawler crawls only the same host and the same initial URL path as the seed URL. |
| crawlAttachments | No | Whether to crawl files and attachments linked from web pages (such as PDFs and other documents). |
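To make the interaction of crawlDepth, maxLinksPerUrl, and syncScope concrete, the following sketch plans a crawl over an in-memory link graph instead of fetching pages. It is a model under stated assumptions, not the crawler's actual algorithm: the function names are hypothetical, "initial URL path" is taken to mean the seed path up to its last slash, "primary domain" is approximated as the last two host labels, and the links-per-URL limit is applied before scope filtering.

```python
from collections import deque
from urllib.parse import urlsplit


def in_scope(seed: str, url: str, sync_scope: str = "PATH_SPECIFIC") -> bool:
    """Decide whether url falls inside the scope defined by the seed URL."""
    s, u = urlsplit(seed), urlsplit(url)
    s_host, u_host = (s.hostname or "").lower(), (u.hostname or "").lower()
    if sync_scope == "SUB_DOMAINS":
        # Assumption: last two labels form the primary domain (wrong for
        # suffixes such as co.uk).
        primary = ".".join(s_host.split(".")[-2:])
        return u_host == primary or u_host.endswith("." + primary)
    if u_host != s_host:
        return False
    if sync_scope == "DOMAINS_ONLY":
        return True
    if sync_scope == "PATH_SPECIFIC":
        # Assumption: the path prefix is everything up to the last slash.
        prefix = s.path.rsplit("/", 1)[0] + "/"
        return (u.path or "/").startswith(prefix)
    raise ValueError("unknown syncScope: " + sync_scope)


def plan_crawl(seed, links, crawl_depth=2, max_links_per_url=100,
               sync_scope="PATH_SPECIFIC"):
    """Breadth-first visit order; links maps each URL to its child links."""
    seen, order = {seed}, []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= crawl_depth:
            continue  # depth 0 means only the seed itself is crawled
        for child in links.get(url, [])[:max_links_per_url]:
            if child not in seen and in_scope(seed, child, sync_scope):
                seen.add(child)
                queue.append((child, depth + 1))
    return order
```

With a graph in which https://example.com/docs/ links to one page under /docs/ and one under /blog/, the default scope keeps only the /docs/ page, while DOMAINS_ONLY keeps both.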
| Field | Required | Description |
|---|---|---|
| inclusionPatterns | No | List of regular expressions. Only URLs that match at least one pattern are crawled and indexed. |
| exclusionPatterns | No | List of regular expressions. URLs that match any pattern are not crawled or indexed. |
| maxFileSizeInMegaBytes | No | Maximum size, in megabytes, of any single file the crawler ingests. Provide as a numeric string (for example, "500"). Defaults to "500". |
Authentication credentials
If your website requires authentication, store your credentials in an AWS Secrets Manager secret. The secret format depends on the authentication type you choose.
Basic authentication (BASIC_AUTH)
{ "userName": "your-username", "password": "your-password", "authentication": "BASIC_AUTH" }
Form authentication (FORM)
For form-based authentication, provide XPath expressions that identify the user name field, password field, and submit button on the login page.
{ "authentication": "FORM", "loginPageUrl": "https://example.com/login", "userName": "your-username", "password": "your-password", "userNameFieldXpath": "//input[@name='username']", "passwordFieldXpath": "//input[@name='password']", "userNameButtonXpath": "//button[@type='submit']", "passwordButtonXpath": "//button[@type='submit']" }
SAML authentication (SAML)
For SAML authentication, provide the SAML identity provider's login page URL and XPath expressions for the form fields.
{ "authentication": "SAML", "loginPageUrl": "https://your-idp.example.com/login", "userName": "your-username", "password": "your-password", "userNameFieldXpath": "//input[@name='username']", "passwordFieldXpath": "//input[@name='password']", "userNameButtonXpath": "//button[@type='submit']", "passwordButtonXpath": "//button[@type='submit']" }
Note
To find an XPath in your browser, right-click the form element on the login page and choose Inspect. In the developer tools, right-click the highlighted HTML, choose Copy, and then choose Copy XPath.
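The secret formats above can be generated rather than typed by hand, which helps avoid misspelled keys. The sketch below builds the JSON string for each authentication type using the key names from the examples in this section. The function and parameter names are hypothetical, and reusing one submit-button XPath for both userNameButtonXpath and passwordButtonXpath mirrors the examples; a two-step login page may need different values.

```python
import json


def build_secret_string(authentication, user_name, password,
                        login_page_url=None, user_name_field_xpath=None,
                        password_field_xpath=None, submit_button_xpath=None):
    """Return the secret JSON string for BASIC_AUTH, FORM, or SAML."""
    secret = {
        "authentication": authentication,
        "userName": user_name,
        "password": password,
    }
    if authentication in ("FORM", "SAML"):
        form_fields = {
            "loginPageUrl": login_page_url,
            "userNameFieldXpath": user_name_field_xpath,
            "passwordFieldXpath": password_field_xpath,
            # Assumption: one submit button serves both steps.
            "userNameButtonXpath": submit_button_xpath,
            "passwordButtonXpath": submit_button_xpath,
        }
        missing = [key for key, value in form_fields.items() if not value]
        if missing:
            raise ValueError("missing values for: " + ", ".join(missing))
        secret.update(form_fields)
    elif authentication != "BASIC_AUTH":
        raise ValueError("authentication must be BASIC_AUTH, FORM, or SAML")
    return json.dumps(secret)


if __name__ == "__main__":
    print(build_secret_string(
        "FORM", "your-username", "your-password",
        login_page_url="https://example.com/login",
        user_name_field_xpath="//input[@name='username']",
        password_field_xpath="//input[@name='password']",
        submit_button_xpath="//button[@type='submit']",
    ))
```

The resulting string is what you would store as the secret value in AWS Secrets Manager; the ARN of that secret is the value for secretArn.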
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Sync completes successfully but only the seed URL is indexed. | Site navigation links are wired through JavaScript event handlers (click, scroll, dynamic menus) instead of standard <a href="..."> elements. The crawler renders JavaScript but does not simulate user interactions, so it cannot discover those links. | Provide additional seed URLs for the pages you want to crawl, or provide a sitemap URL that lists every URL to crawl. If content can be exported as files, consider using the Amazon S3 connector instead. |
| Sync returns no content or fewer pages than expected. | The site's robots.txt file disallows the URLs you want to crawl, or pages have a noindex meta tag. | Update robots.txt for the host so it allows the paths you want crawled, or remove the noindex meta tag from pages you want indexed. Do not block the page in robots.txt if you also want meta tag detection, because the crawler must access the page to read meta tags. |
| Authentication fails (HTTP 401 or 403, login redirect loop, or session timeout). | Credentials are incorrect or expired, or the XPath expressions do not match the login page elements. | Verify the credentials in your secret. For FORM or SAML auth, validate each XPath in your browser's developer tools, and verify loginPageUrl. |
| Sync fails with rate limiting (HTTP 429) or incomplete content. | The crawler is fetching pages faster than the site allows. | Lower maxCrawledUrlsPerMinute, or increase implicitWaitInSeconds for sites with dynamic content that loads after the page becomes ready. |
| Pages are missing because they are larger than expected. | The page or attachment exceeds maxFileSizeInMegaBytes. | Increase maxFileSizeInMegaBytes, or accept that files larger than the limit are not ingested. |
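To diagnose the first two symptoms in the table, it can help to inspect a saved copy of a page and see which links appear as standard anchor elements and whether a noindex meta tag is present. The sketch below does this with Python's standard html.parser module on static HTML only; it does not render JavaScript, so it approximates what a crawler can discover rather than reproducing the Web Crawler's behavior. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkAudit(HTMLParser):
    """Collect <a href> links and detect a robots noindex meta tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            # Skip javascript:, mailto:, and other non-web schemes.
            if urlsplit(url).scheme in ("http", "https"):
                self.links.append(url)
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


def audit_page(html, base_url):
    """Return (discoverable links, noindex flag) for a page's HTML."""
    audit = LinkAudit(base_url)
    audit.feed(html)
    return audit.links, audit.noindex
```

If the returned list is empty for a page that visibly has navigation, the links are probably attached through event handlers, and additional seed URLs or a sitemap are the practical workaround.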