Skip to content

Draft: CONJ-1238 rewriteBatchStatements #202

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

knoxg
Copy link

@knoxg knoxg commented Feb 22, 2025

This PR reinstates the rewriteBatchStatements connection property, which is needed to preserve LAST_INSERT_ID() behaviour when connecting to non-MariaDB MySQL servers. It also improves performance of the connector back to the level we enjoyed with the old 2.x connector, when connecting to non-MariaDB MySQL servers.

Ticket in CONJ issue tracker: https://jira.mariadb.org/browse/CONJ-1238

I've based the changes on this branch on the code that was removed in commit 6c2213a ( on Mar 26, 2021 )

The sourcecode has diverged quite a bit since that commit.

The code in that commit used to perform rewriting by breaking the supplied SQL into multiple queryParts[]
e.g. the SQL
INSERT INTO TABLE(col1,col2,col3,col4, col5) VALUES (9, ?, 5, ?, 8) ON DUPLICATE KEY UPDATE col2=col2+10
would be split into 5 queryParts:

INSERT INTO TABLE(col1,col2,col3,col4, col5) VALUES (9, ?, 5, ?, 8) ON DUPLICATE KEY UPDATE col2=col2+10
---------------------------------------------------                                                       <- up to and including 'values'
                                                   ----                                                   <- values block, not including positional parameters
                                                         ----                                             <- : continued
                                                               ----                                       <- : continued
                                                                    ------------------------------------  <- last segment

which allowed it to repeat the VALUES() sections, substituting parameters as it went.

The current 3.x code does not keep a queryParts[] array. Instead, it converts the supplied SQL String into a byte[] of it's UTF-8 representation, and stores a List<Integer> of paramPositions which are the indexes of any ? positional placeholders within that byte array.

To support rewriteBatchStatements functionality, this branch extends ClientParser class so that when rewriteBatchStatements is set to true and the SQL is in the required format, it can also capture the indexes of the opening and closing parenthesis around the VALUES() clause of the SQL. These byte indexes are stored in valuesBracketPositions.

The QueryWithParametersPacket class, which normally sends queries containing parameters, has been extended so that instead of containing a single Parameters object ( containing a list of Parameter objects ) it contains a List<Parameters>.

The QueryWithParametersPacket encode() method has been modified so that instead of iterating over paramPositions once, it can iterate over it multiple times provided it has been supplied multiple Parameters, and has valuesBracketPositions.

This new form of QueryWithParametersPacket is created by ClientPreparedStatement if the conditions for rewriting are present.

It constructs multiple QueryWithParametersPackets in a new getBatchPackets() method, which also determines how many parameters can fit into a single packet. In order to determine how to split the parameters across packets, a new ByteCountingWriter class has been created, which performs no I/O, but only counts the number of bytes consumed by a parameter if it had been included in a packet.

This seems to be how the old code used to demarcate packets, although the old code did that demarcation at encode time, which is no longer possible. To write this code I referred back to the ComQuery class from the 2.x branch at commit 7b569cb ( the commit the 3.x code was branched from ): https://github.com/mariadb-corporation/mariadb-connector-j/blob/7b569cb5a791baf36f3a9d0b4811522db8f590cd/src/main/java/org/mariadb/jdbc/internal/com/send/ComQuery.java

I've marked this PR as Draft in order to get some feedback on whether this is the right approach, and for some guidance on what tests might be appropriate. I've reinstated the ClientPrepareResultTest from the old 6c2213a commit, but we probably need some more thorough tests than that.

@rusher
Copy link
Collaborator

rusher commented Mar 20, 2025

thanks for this PR.
Initially, 3.x was rewritten from scratch to have a small connector (and get rid of old license code), so a few option like rewriteBatchStatements has been left aside.

The only reason to do that :

  • compatibility with 2.x, but limited to MySQL servers

Con :

  • this is not compatible with galera
  • very limited. this only works for INSERT command, and not possible for "on duplicate key update" commands.
  • slower than bulk
  • having to maintain multiple implementation of batching :
    • bulk
    • rewriteBatchStatements
    • pipelining
    • basic loop

@knoxg
Copy link
Author

knoxg commented Apr 3, 2025

hi @rusher thanks for the feedback.

To address those 'con' points on the pro / con list:

  • this is not compatible with galera

I'm not too familiar with galera so not sure how this is incompatible with it, can you supply some more detail ? ( Is the 2.x connector also incompatible with galera ? )

From a quick google, it looks like galera is based on mariadb servers, which isn't the intended use-case for bringing this connection option back. This option is for users who are connecting to non-mariadb mysql servers, who therefore can't get the benefits of COM_STMT_BULK_EXECUTE.
Also, if we are connecting to a mariadb server with rewriteBatchStatements enabled, the connector will still use bulk execution ( if allowed by the connection properties ) before it will attempt to rewrite a batch statement.
If this PR is accepted, maybe something can be added to the connection property docs to make it clearer what the intended use-case for this option is, and when it should not be used.

  • very limited. this only works for INSERT command, and not possible for "on duplicate key update" commands.

yep, agree it's limited in the sense that the original rewriteBatchStatements option was limited, but the purpose of this isn't to add more rewriting capabilities, it's to bring back those existing ones.
Think some ON DUPLICATE KEY UPDATE clauses can still be rewritten as long as there are no parameters outside the VALUES() block, as was the case with the 2.x code.

  • slower than bulk

Right, but again the use-case is for users connecting to non-mariadb mysql servers where bulk is not available.
On non-mariadb servers, I would still expect the current 3.x connector ( without rewriteBatchStatements ) to be slower than 2.x connector (with rewriteBatchStatements), as it's sending a packet for each record vs sending a packet containing multiple records. I haven't done performance tests on this but could write something up.

  • having to maintain multiple implementation of batching

Yes this would be another implementation that would need to be maintained, but hopefully the code isn't too difficult to understand and conforms to existing code conventions. I intentionally kept the rewritableParts code in the ClientParser class separate to the existing parameterParts partly because that's what the old code did but also to reduce the risk of this new code affecting that existing code.

Not sure if that addressed all your concerns, let me know if there's anything else I can add that might tip the balance in favour of keeping this connection option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants