redirection to a forbidden domain happened without slash suffix character in the web crawler #738

@msmygit

Setup

% langstream -V
LangStream CLI 0.5.0 (8162f382)

Web crawler configuration

pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      forbidden-paths: []
      ...

When we run the following command,

langstream docker run test -app examples/docker-chatbot -s ./secrets.yaml

we get the following warning in the log,

15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO  a.l.a.webcrawler.WebCrawlerSource -- The last cycle didn't produce any new documents
15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO  a.l.a.webcrawler.crawler.WebCrawler -- Crawling url: https://aws.amazon.com/about-aws/whats-new/2023/11
15:23:57.086 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] WARN  a.l.a.webcrawler.crawler.WebCrawler -- A redirection to a forbidden domain happened (from https://aws.amazon.com/about-aws/whats-new/2023/11 to /about-aws/whats-new/2023/11/)
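One possible reading of this log line: the redirect target is reported as a relative reference (`/about-aws/whats-new/2023/11/`), which HTTP permits in a `Location` header, and a relative reference can never match an absolute `allowed-domains` entry. The Python sketch below only illustrates that hypothesis. The `is_allowed` prefix check is an assumption made for illustration, not LangStream's actual implementation.

```python
from urllib.parse import urljoin


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Hypothetical prefix check, assumed for illustration only."""
    return any(url.startswith(prefix) for prefix in allowed_domains)


# Values taken from the configuration and the WARN line in this issue.
allowed = ["https://aws.amazon.com/about-aws/whats-new/2023/11"]
source = "https://aws.amazon.com/about-aws/whats-new/2023/11"
location = "/about-aws/whats-new/2023/11/"  # redirect target as logged

# Checking the raw relative target: it does not start with any absolute prefix.
raw_ok = is_allowed(location, allowed)

# Resolving the target against the source URL first gives an absolute URL
# that can be compared meaningfully with the allowed prefixes.
resolved = urljoin(source, location)
resolved_ok = is_allowed(resolved, allowed)

print(raw_ok, resolved, resolved_ok)
```

Under this assumption, resolving the redirect target against the requesting URL before the domain check would let the redirect through even when the configured URLs have no trailing slash.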

Workaround

Appending a trailing slash (/) to the URLs in both seed-urls and allowed-domains resolved the issue.

pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      forbidden-paths: []
      ...
