feat: Apply saved workflow settings to current crawl #2514


Merged
merged 6 commits into main on Apr 29, 2025

Conversation

@SuaYoo (Member) commented Mar 24, 2025

Resolves #2366

Changes

Allows users to update the current crawl with newly saved workflow settings.

Manual testing

  1. Log in as crawler
  2. Start a crawl
  3. Go to the workflow editor. Verify that the "Update Crawl" button is shown
  4. Click "Update Crawl". Verify that the crawl is updated with the new settings

Screenshots

| Page | Image/video |
| --- | --- |
| Edit Workflow | [Screenshot, 2025-03-24 at 3:51 PM] |

@SuaYoo SuaYoo requested a review from ikreymer March 25, 2025 04:45
@SuaYoo SuaYoo marked this pull request as ready for review April 2, 2025 07:31
@SuaYoo SuaYoo requested review from tw4l and emma-sg April 2, 2025 07:31
@tw4l (Member) left a comment


I think this could use some documentation and/or help text to help explain what updates will actually be applied to a running crawl.

Not knowing what was supported, I started a large domain crawl of a site with one hop out and then attempted to set a lower page limit while the crawl was running, which was not applied and did not prevent additional pages from being added to the queue.

I can imagine other users might have a similar experience and think the feature isn't working as intended.

@SuaYoo (Member, Author) commented Apr 8, 2025

> Not knowing what was supported, I started a large domain crawl of a site with one hop out and then attempted to set a lower page limit while the crawl was running, which was not applied and did not prevent additional pages from being added to the queue.

@ikreymer did you get a chance to look into this?

@ikreymer (Member) commented Apr 8, 2025

> @ikreymer did you get a chance to look into this?

Not yet, will try tomorrow, but removed it from the milestone in case we don't get to it.

@ikreymer (Member) commented

> I think this could use some documentation and/or help text to help explain what updates will actually be applied to a running crawl.
>
> Not knowing what was supported, I started a large domain crawl of a site with one hop out and then attempted to set a lower page limit while the crawl was running, which was not applied and did not prevent additional pages from being added to the queue.
>
> I can imagine other users might have a similar experience and think the feature isn't working as intended.

Made a change that will cause the crawler to restart when Update Crawl is selected. However, it looks like even though the change in limit is applied to the config, the crawler doesn't actually remove URLs from the queue if the page limit is lowered; that may require a crawler change (on startup, check whether the size of the queue exceeds the limit).
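For reference, a minimal sketch of the kind of startup check described above, assuming the crawl queue lives in a Redis sorted set with companion sets tracking finished and failed URLs. The key names, client wiring, and function are hypothetical illustrations, not the actual browsertrix-crawler code:

```ts
import Redis from "ioredis";

// Hypothetical startup check: if the configured page limit was lowered,
// drop excess queued URLs so the crawl stops at the new limit.
// Key names and the finished/failed accounting are assumptions.
async function trimQueueToLimit(
  redis: Redis,
  crawlId: string,
  pageLimit: number,
): Promise<void> {
  const done = await redis.scard(`${crawlId}:done`); // finished URLs
  const failed = await redis.scard(`${crawlId}:failed`); // failed URLs
  const remaining = Math.max(pageLimit - done - failed, 0);

  const queueKey = `${crawlId}:queue`;
  const queued = await redis.zcard(queueKey);
  if (queued > remaining) {
    // Keep the first `remaining` entries (lowest rank = next to crawl)
    // and remove everything ranked after them.
    await redis.zremrangebyrank(queueKey, remaining, -1);
  }
}
```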

ikreymer added a commit to webrecorder/browsertrix-crawler that referenced this pull request Apr 29, 2025
…ew limit, taking into account finished/failed URLs

useful to support dynamically lowering pageLimit when restarting a crawl
fixes issue raised in webrecorder/browsertrix#2514
@ikreymer (Member) commented

webrecorder/browsertrix-crawler#821 adds support for lowering pageLimit / removing URLs already queued when a shorter limit is set.

ikreymer and others added 5 commits April 29, 2025 10:29
… that the running crawl, if any, should be updated

the response includes an 'updatedRunning' boolean, which is set to true if a running crawl has been updated
the option is ignored if there is no running crawl
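As a rough illustration of how a client might use this option, here is a hedged sketch; the endpoint path and the `updateRunning` flag name are assumptions made for the example, and only the 'updatedRunning' response field comes from the commit message above:

```ts
// Hypothetical client call: save workflow settings and ask the backend to
// apply them to the running crawl, if any. The path and `updateRunning`
// flag name are assumptions; `updatedRunning` in the response is from this PR.
async function saveWorkflowAndUpdateCrawl(
  orgId: string,
  workflowId: string,
  patch: Record<string, unknown>,
): Promise<void> {
  const resp = await fetch(
    `/api/orgs/${orgId}/crawlconfigs/${workflowId}?updateRunning=true`,
    {
      method: "PATCH",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(patch),
    },
  );
  const { updatedRunning } = await resp.json();
  if (updatedRunning) {
    console.log("New settings were applied to the running crawl");
  } else {
    console.log("No running crawl; settings saved for future crawls");
  }
}
```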
@tw4l tw4l force-pushed the issue-2366-update-running-crawl-option branch from 6c91ab0 to 2fd2799 Compare April 29, 2025 14:29
@tw4l (Member) left a comment


Tested and now working as expected, including lowering the limit with the webrecorder/browsertrix-crawler#821 crawler patch. Nice work and thank you!

@ikreymer ikreymer merged commit 1fa4333 into main Apr 29, 2025
27 checks passed
@ikreymer ikreymer deleted the issue-2366-update-running-crawl-option branch April 29, 2025 18:43
ikreymer added a commit to webrecorder/browsertrix-crawler that referenced this pull request Apr 30, 2025
useful to support dynamically lowering pageLimit when restarting a crawl
fixes issue raised in webrecorder/browsertrix#2514
Development

Successfully merging this pull request may close these issues.

[Feature]: Add option to update running crawl when workflow changes are made.