Skip to content

Conversation

@deldrid1
Copy link

These changes prevent puppeteer-cluster from thinking that the browser has closed with Protocol error (Runtime.callFunctionOn): Target closed. or Navigation failed because browser has disconnected! and issuing retries when cluster.close() is called. Since this method should only be used after cluster.idle() I think this is much more reasonable behavior (at least it fits my use case below!).

The added test case demonstrates this well. Before the change the new test case fails in this manor:

$ npx jest -t 'retries stop after close called'                                                                                                                                                                                                                                                                                                   
 FAIL  test/Cluster.test.ts (45.014s)
  ● options › concurrency: 1 › retries stop after close called

    expect.assertions(1)

    Expected one assertion to be called but received four assertion calls.

      161 | 
      162 |             test('retries stop after close called', async () => {
    > 163 |                 expect.assertions(1); // 3 retries will be 4 times called (just like test above), unless we actually close the cluster and cancel retries
          |                        ^
      164 | 
      165 |                 const cluster = await Cluster.launch({
      166 |                     concurrency,

      at test/Cluster.test.ts:163:24
      at test/Cluster.test.ts:8:71
      at Object.<anonymous>.__awaiter (test/Cluster.test.ts:4:12)
      at Object.test (test/Cluster.test.ts:162:64)

  ● options › concurrency: 2 › retries stop after close called

    expect.assertions(1)

    Expected one assertion to be called but received four assertion calls.

      161 | 
      162 |             test('retries stop after close called', async () => {
    > 163 |                 expect.assertions(1); // 3 retries will be 4 times called (just like test above), unless we actually close the cluster and cancel retries
          |                        ^
      164 | 
      165 |                 const cluster = await Cluster.launch({
      166 |                     concurrency,

      at test/Cluster.test.ts:163:24
      at test/Cluster.test.ts:8:71
      at Object.<anonymous>.__awaiter (test/Cluster.test.ts:4:12)
      at Object.test (test/Cluster.test.ts:162:64)

  ● options › concurrency: 3 › retries stop after close called

    expect.assertions(1)

    Expected one assertion to be called but received four assertion calls.

      161 | 
      162 |             test('retries stop after close called', async () => {
    > 163 |                 expect.assertions(1); // 3 retries will be 4 times called (just like test above), unless we actually close the cluster and cancel retries
          |                        ^
      164 | 
      165 |                 const cluster = await Cluster.launch({
      166 |                     concurrency,

      at test/Cluster.test.ts:163:24
      at test/Cluster.test.ts:8:71
      at Object.<anonymous>.__awaiter (test/Cluster.test.ts:4:12)
      at Object.test (test/Cluster.test.ts:162:64)

Test Suites: 1 failed, 3 skipped, 1 of 4 total
Tests:       3 failed, 80 skipped, 83 total
Snapshots:   0 total
Time:        46.225s, estimated 57s
Ran all test suites with tests matching "retries stop after close called".

After the change, all tests pass!

The use case for this is aborting long running scrape jobs. My particular issue is that my cloud environment may kill an HTTP request and issue its own retries (kicking off duplicate scrapes), and my long running scrape job needs to abort. With an http.IncomingRequest from node.js this is pretty straightforward

  cluster.on('taskerror', (err, job, willRetry) => {
      if (willRetry) {
        console.log(`Encountered an error while crawling.  This job will be retried.`, err.message)
      } else {
        console.error(`Failed to crawl`, err.message)
      }
  });


  request.on("aborted", () => {
    console.log(`HTTP Request ended unexpectedly after ${elapsedTime(requestStart)}s.  Signaling scraping to shut down`)
    
    return cluster.close()
      .then(() => {
        log('Cluster IS CLOSED!')
      })
      .catch((error) => {
        logError("UNABLE TO CLOSE CLUSTER", error)
      })
  })

Unfortunately, what I was finding was that puppeteer-cluster wasn't aborting like I wanted it to! Instead it was detecting a browser crash and restarting.

HTTP Request ended unexpectedly after 2.368s.  Signaling scraping to shut down
Cluster IS CLOSED!
Encountered an error while crawling.  This job will be retried. Protocol error (Runtime.callFunctionOn): Target closed.
Encountered an error while crawling.  This job will be retried. Navigation failed because browser has disconnected!
Encountered an error while crawling.  This job will be retried. Navigation failed because browser has disconnected!

The current implementation leaves the taskerror event being called (with jobWillRetry now set to false) even after isClosed has resolved and returned true. I don't have a preference for if this is right or not (after cluster.close() I could argue that no events should be emitted), but worth calling out as to being how the implementation currently works

@DemGiran
Copy link

DemGiran commented Aug 26, 2020

+1 for it to be merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants