Skip to content

Conversation

BlueGreenMagick
Copy link
Contributor

@BlueGreenMagick BlueGreenMagick commented Sep 19, 2024

Fixes #533, fixes #590, fixes #801

@codecov-commenter
Copy link

codecov-commenter commented Sep 19, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.17%. Comparing base (7558577) to head (3ebe221).
Report is 96 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #802      +/-   ##
==========================================
- Coverage   61.81%   60.17%   -1.65%     
==========================================
  Files          41       41              
  Lines       16798    15958     -840     
==========================================
- Hits        10384     9603     -781     
+ Misses       6414     6355      -59     
Flag Coverage Δ
unittests 60.17% <100.00%> (-1.65%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@BlueGreenMagick
Copy link
Contributor Author

Modified the code so that balance_buf is computed only once.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That PR will not fix the problem entirely, because correct DTD can contain unbalanced number of < and >, but I think that is rare case, so until proper fix will be implemented, a partial one also acceptable. The implementation, however, still may be improved. The problem here that we do not store state between invocations of BangType::parse. So the solution is to store balance inside BangType::DocType, then, I think, it will be unnecessary to calculate it over buf (because everything in buf previously was in chunk).

I also see the problem in the external for i in memchr::memchr_iter(b'>', chunk) loop. It will skip <s inside chunks that does not contain >, so the balance will be calculated incorrectly. I think, if create a test carefully, it will catch such situation.

So I ask you to do the following changes:

  • convert BangType::DocType -> BangType::DocType(i32)
  • start with balance == 1 (it is < in <!DOCTYPE) here
    Some(b'D') | Some(b'd') => Self::DocType,
  • when calculate balance take into account saved balance
  • emit event only when BangType::DocType(0) here
    BangType::DocType if uncased_starts_with(buf, b"!DOCTYPE") => {
  • add tests for the mentioned case (chunk with < + chunk with > which closes < from the first chunk). It is enough just craft such a string and use appropriate buffer size to slice string to desired chunks
  • also add regression test from #590 (comment)
  • try to check if this fixes #533. I think it should because it seems to be a duplicate of these issues, but need to check

@BlueGreenMagick
Copy link
Contributor Author

BlueGreenMagick commented Sep 19, 2024

Thanks for the detailed feedback! I adjusted the PR.

  • start with balance == 1 (it is < in <!DOCTYPE) here

As the balance is calculated in up to chunk excluding current found >, I did not add 1 to balance initially.

add tests for the mentioned case (chunk with < + chunk with > which closes < from the first chunk). It is enough just craft such a string and use appropriate buffer size to slice string to desired chunks

Are regression tests for issues 533, 590, 801 enough? Or are additional tests desired?

try to check if this fixes #533. I think it should because it seems to be a duplicate of these issues, but need to check

This PR does indeed fixes the issue. Added a regression test for the case, and modified the initial post so the issue will be closed when this PR is merged.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are regression tests for issues 533, 590, 801 enough? Or are additional tests desired?

I'm not sure that tests still cover the case when we get four chunks with roughly such content:

  1. <!DOCTYPE
  2. < without any >
  3. >, which would close < from point 2, not the doctype
  4. > which should close doctype

If that case already covered, fine, although I would prefer to have test which explicitly guaranties such conditions (can tweak one of added regression tests for that).

And please add changelog entry to Changelog.md that people know that bug (#533) fixed. I'll close other as duplicate.

@Mingun
Copy link
Collaborator

Mingun commented Sep 19, 2024

You also can squash all your changes if you wish

@BlueGreenMagick BlueGreenMagick changed the title Fix DocType closing tag recognition in BufRead Fix DocType closing bracket recognition in buffered reader Sep 20, 2024
@BlueGreenMagick
Copy link
Contributor Author

BlueGreenMagick commented Sep 20, 2024

Done!

I adjusted regression test for issue801 so that all angle brackets are explicitly in different buffer (by lowering buffer size to 2 bytes).

I think it might be better (and easier with GitHub UI) to squash merge on your end, so others can still follow the conversation in this PR in the future.

@Mingun Mingun force-pushed the fix-bufread-doctype branch from d66842a to 3ebe221 Compare September 20, 2024 16:41
@Mingun Mingun merged commit 51d9e23 into tafia:master Sep 20, 2024
7 checks passed
@Mingun
Copy link
Collaborator

Mingun commented Sep 20, 2024

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when parsing !DOCTYPE longer than buffer of BufRead Unexpected Bang Since 0.23.1 Deserialization of a doctype with very long content fails
3 participants