CASSANDRA-20918 Add cursor-based low allocation optimized compaction implementation #4402

nitsanw · 2025-09-29T15:22:19Z

patch by Nitsan Wakart; reviewed by TBD for CASSANDRA-TBD

EPPv1 - all new code
Cursor compaction integration
JMH benchmarks for compaction and cursor impls
EPPv1 - New tests
Existing tests tweaks for new code
[revert?] change the default partitioner to expand testing of new code
[revert?] test data used by some benchmarks
[revert?] jmh tweak GC settings for stability
[revert?] javadoc typos, marking unused params
[revert?] clarifying comment
[revert?] toString improvement
[revert?] remove spurious keywords
[revert?] marking metadata collection
[revert?] cursor verifier
Exclude SAI and counter column
Exclude BTI and legacy versions
Temporarily skip very long running test

aratno · 2025-10-08T19:21:37Z

src/java/org/apache/cassandra/db/compaction/CompactionTask.java

 {
    protected static final Logger logger = LoggerFactory.getLogger(CompactionTask.class);
+    public static final int MEGABYTE = 1024 * 1024 * 1024;
+    public static final boolean CURSOR_COMPACTION_ENABLED = SystemProperties.getBoolean("cassandra.enable_cursor_compaction", () -> true);


Could you move this to Config.java? One advantage of having it there is that the AST tests (like SingleNodeTableWalkTest) that generate random config would be able to exercise it. That's our best path to coverage with lots of different schemas and configurations.

If you can locally do a longer run of that test with cursor-compaction enabled, that would be useful too. That would be done via overriding StatefulASTBase#clusterConfig with the new config set.

Also related to testing, we need to be running all tests both with this feature enabled as well as disabled.

Let's make sure that among test, test-oa and test-latest we have at least one that is running with cursor compaction and one without.

How do I do that?

Look at the testlist-xxx definitions in build.xml. E.g. adding a line to set this flag to false to testlist-oa should be enough.

src/java/org/apache/cassandra/db/rows/Rows.java

src/java/org/apache/cassandra/io/util/RandomAccessReader.java

blambov · 2025-10-09T07:15:59Z

src/java/org/apache/cassandra/db/compaction/CompactionTask.java

 {
    protected static final Logger logger = LoggerFactory.getLogger(CompactionTask.class);
+    public static final int MEGABYTE = 1024 * 1024 * 1024;
+    public static final boolean CURSOR_COMPACTION_ENABLED = SystemProperties.getBoolean("cassandra.enable_cursor_compaction", () -> true);


Also related to testing, we need to be running all tests both with this feature enabled as well as disabled.

Let's make sure that among test, test-oa and test-latest we have at least one that is running with cursor compaction and one without.

src/java/org/apache/cassandra/db/compaction/CompactionTask.java

blambov

It must be stated that this approach that bundles all the steps of the processing in one single file will be quite difficult to maintain and keep in sync with the combination of iterators and transformations that we use in other parts of the code such as the query path. However, once we have reached a point of stability for a piece of functionality where we do not expect it to change significantly for a long time, it does make sense to unpack the code and present it in a way that makes its execution as direct as possible, and this patch is a good such representation of the compaction process.

Personally, I am very unhappy about switching to mutable, pooled and reused objects, which are significantly more unwieldy and error prone, especially in contexts where concurrent access can occur. It seems this is becoming a necessity if we need to achieve acceptable performance with the current state of our heap usage, but we still need to very carefully separate the mutable versions of concepts from the immutable ones used throughout the code base. Suddenly making a DeletionTime mutable is not an acceptable change.

First batch of targeted comments below, mainly going over CompactionCursor.java.

src/java/org/apache/cassandra/db/compaction/CompactionTask.java

blambov · 2025-10-09T13:38:53Z

src/java/org/apache/cassandra/db/compaction/CompactionTask.java

+                    else if (e instanceof CompactionInterruptedException)
+                        throw (CompactionInterruptedException) e;
+                    else
+                        throw new IllegalStateException(e);


What is this conversion addressing?

Defensively checking for incorrect exception types

I'd prefer not to change the existing behaviour (which wraps these into RuntimeException instead of IllegalStateException), as I don't know what may be relying on this wrapping (including customer tools).
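
To make the concern concrete, here is a minimal sketch of how the two wrappings differ for a consumer that inspects the exact exception class. Everything in it is hypothetical: `RethrowSketch`, `wrapLegacy`, `wrapPatched` and `classify` are illustration names, and `CompactionInterruptedException` is a local stand-in rather than the Cassandra class. It assumes, per the comment above, that the existing code wraps unexpected causes in a plain `RuntimeException`.

```java
final class RethrowSketch
{
    // Local stand-in for illustration only; not the Cassandra exception class.
    static class CompactionInterruptedException extends RuntimeException
    {
        CompactionInterruptedException(String message)
        {
            super(message);
        }
    }

    // Assumed existing behaviour: unexpected causes are wrapped in RuntimeException.
    static RuntimeException wrapLegacy(Throwable e)
    {
        if (e instanceof CompactionInterruptedException)
            return (CompactionInterruptedException) e;
        return new RuntimeException(e);
    }

    // Behaviour of the quoted hunk: unexpected causes are wrapped in IllegalStateException.
    static RuntimeException wrapPatched(Throwable e)
    {
        if (e instanceof CompactionInterruptedException)
            return (CompactionInterruptedException) e;
        return new IllegalStateException(e);
    }

    // Hypothetical external tool that keys its handling on the exact class name.
    static String classify(RuntimeException e)
    {
        return e.getClass().getName();
    }

    public static void main(String[] args)
    {
        Throwable cause = new java.io.IOException("disk error");
        System.out.println(classify(wrapLegacy(cause)));
        System.out.println(classify(wrapPatched(cause)));
    }
}
```

A `catch (RuntimeException e)` block handles both results, since `IllegalStateException` is a subclass; only consumers that match on the exact type, or on the class name in logs, would observe the change.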

blambov · 2025-10-09T14:08:28Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+ *       data that is not read often, so compaction "pro-actively" fix such index entries. This is mainly
+ *       an optimization).</li>
+ * </ul>
+ */


Please update the JavaDoc and explain the approach.

Well the unchanged part of the doc isn't quite right. This is not creating a compacted iterator, an iterator of any kind, or even a cursor of any kind.

The class pulls data from multiple cursors and pushes it into a given writer. This is fundamentally different from the way compaction works in other code. Please explain the difference, because it is critical to understanding the class and being able to work on it.
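
As a rough illustration of the distinction being asked for, the sketch below merges two sorted `int` arrays in both styles. It is a toy under stated assumptions, not the patch's code: `PullVsPushSketch`, `mergedIterator` and `mergeInto` are hypothetical names, the inputs stand in for sstable cursors, and an `IntConsumer` stands in for the writer. No reconciliation or purging is modelled.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntConsumer;

final class PullVsPushSketch
{
    // Pull model: the merge is itself an iterator, and the consumer asks for each element.
    static Iterator<Integer> mergedIterator(int[] a, int[] b)
    {
        return new Iterator<Integer>()
        {
            private int i = 0;
            private int j = 0;

            @Override
            public boolean hasNext()
            {
                return i < a.length || j < b.length;
            }

            @Override
            public Integer next()
            {
                if (!hasNext())
                    throw new NoSuchElementException();
                if (j >= b.length || (i < a.length && a[i] <= b[j]))
                    return a[i++];
                return b[j++];
            }
        };
    }

    // Push model: one driver loop advances the inputs and pushes each result
    // into the writer. Nothing iterable is returned to the caller.
    static void mergeInto(int[] a, int[] b, IntConsumer writer)
    {
        int i = 0;
        int j = 0;
        while (i < a.length || j < b.length)
        {
            if (j >= b.length || (i < a.length && a[i] <= b[j]))
                writer.accept(a[i++]);
            else
                writer.accept(b[j++]);
        }
    }
}
```

In the pull version, further transformations can be layered on by wrapping the returned iterator. In the push version, any such step has to live inside the driver loop or the writer, which is the maintenance trade-off raised earlier in this review.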

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

blambov

Next batch of comments.

src/java/org/apache/cassandra/db/ClusteringPrefix.java

src/java/org/apache/cassandra/db/ClusteringComparator.java

blambov · 2025-10-13T13:30:00Z

src/java/org/apache/cassandra/io/sstable/UnfilteredDescriptor.java

+        }
+        if (!UnfilteredSerializer.hasAllColumns(flags))
+        {
+            // TODO: re-implement GC free


DataStax's branch has an implementation of it.

src/java/org/apache/cassandra/io/sstable/format/SortedTableWriter.java

src/java/org/apache/cassandra/db/compaction/writers/DefaultCompactionWriter.java

blambov · 2025-10-13T14:56:50Z

src/java/org/apache/cassandra/io/sstable/metadata/MetadataCollector.java

    }

+    public void updateClusteringValues(ClusteringDescriptor newClustering) {
+        if (newClustering == null || newClustering.clusteringKind().isBoundary())


I think you need to copy the comment from updateClusteringValuesByBoundOrBoundary to explain skipping boundaries.

Not sure which comment you mean

The reason for doing nothing on isBoundary is not obvious. This is the comment I am referring to:

// In a SSTable, every opening marker will be closed, so the start of a range tombstone marker will never be
// the maxClustering (the corresponding close might though) and there is no point in doing the comparison
// (and vice-versa for the close). By the same reasoning, a boundary will never be either the min or max
// clustering, and we can save on comparisons.

blambov · 2025-10-13T15:02:03Z

src/java/org/apache/cassandra/io/sstable/metadata/MetadataCollector.java

        Map<MetadataType, MetadataComponent> components = new EnumMap<>(MetadataType.class);
        components.put(MetadataType.VALIDATION, new ValidationMetadata(partitioner, bloomFilterFPChance));
+        Slice coveredClustering;
+        if (minClusteringDescriptor.clusteringKind() != ClusteringPrefix.Kind.EXCL_START_BOUND) // min is end only if the descriptors are unused


The minimum can certainly be EXCL_START_BOUND when it is used, if a partition starts with a range tombstone. The maximum, on the other hand, can't.

If you want to do this by a single operation (and also remove the minClusteringDescriptor.clusteringColumnsBound() == 0 check in updateClusteringValues), you can change the uninitialized min kind to SSTABLE_UPPER_BOUND, because that won't ever be given to updateClusteringValues.

This is mirroring existing code AFAICT:

if (minClustering == ClusteringBound.MAX_START) minClustering = clustering;

MAX_START also has an empty values array, while this line only checks the kind of clustering (i.e. the && minClusteringDescriptor.clusteringColumnsBound() == 0 part is missing here). The additional check is necessary, as the code as it stands will do the wrong thing for sstables whose lower limit is a range tombstone.
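
The point about the missing size check can be shown with a simplified model. This is only a sketch: `SentinelCheckSketch`, its `Descriptor` class and the `columns` field are hypothetical stand-ins for the descriptor and `clusteringColumnsBound()`, and the local `Kind` enum lists just the values needed here.

```java
final class SentinelCheckSketch
{
    // Local enum for illustration; only the kinds needed by this sketch.
    enum Kind { EXCL_START_BOUND, CLUSTERING, SSTABLE_UPPER_BOUND }

    // Hypothetical simplified descriptor: a kind plus the number of clustering columns it carries.
    static final class Descriptor
    {
        final Kind kind;
        final int columns;

        Descriptor(Kind kind, int columns)
        {
            this.kind = kind;
            this.columns = columns;
        }
    }

    // Uninitialized minimum modelled like MAX_START: exclusive start kind with no values.
    static final Descriptor UNSET = new Descriptor(Kind.EXCL_START_BOUND, 0);

    // Kind-only check: also matches a real range tombstone start.
    static boolean isUnsetKindOnly(Descriptor d)
    {
        return d.kind == Kind.EXCL_START_BOUND;
    }

    // Kind plus size check: matches only the empty sentinel.
    static boolean isUnsetKindAndSize(Descriptor d)
    {
        return d.kind == Kind.EXCL_START_BOUND && d.columns == 0;
    }

    // Alternative from the review: a sentinel kind that is never passed to the update method,
    // so a single comparison is sufficient.
    static final Descriptor UNSET_ALTERNATIVE = new Descriptor(Kind.SSTABLE_UPPER_BOUND, 0);

    static boolean isUnsetAlternative(Descriptor d)
    {
        return d.kind == Kind.SSTABLE_UPPER_BOUND;
    }
}
```

With a descriptor such as `new Descriptor(Kind.EXCL_START_BOUND, 1)`, standing for an sstable whose lower limit is a range tombstone, the kind-only check reports it as unset while the other two checks do not.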

src/java/org/apache/cassandra/db/ClusteringComparator.java

blambov

Next batch of comments.

src/java/org/apache/cassandra/utils/vint/VIntCoding.java

blambov · 2025-10-14T09:13:35Z

src/java/org/apache/cassandra/io/util/ReusableLongToken.java

+import org.apache.cassandra.dht.Murmur3Partitioner;
+import org.jctools.util.UnsafeAccess;
+
+public class ReusableLongToken extends Murmur3Partitioner.LongToken


Nit: This shouldn't need to be public.

blambov · 2025-10-14T09:25:14Z

src/java/org/apache/cassandra/io/util/ReusableDecoratedKey.java

+import org.apache.cassandra.utils.ByteArrayUtil;
+import org.apache.cassandra.utils.ByteBufferUtil;
+
+public class ReusableDecoratedKey extends DecoratedKey


This class is pretty hacky. It shouldn't be hard to move the support for reusable tokens to the partitioner (throwing exceptions for all except Murmur and local).

src/java/org/apache/cassandra/db/LivenessInfo.java

src/java/org/apache/cassandra/io/sstable/AbstractSSTableSimpleWriter.java

src/java/org/apache/cassandra/io/sstable/SSTableCursorReader.java

blambov · 2025-10-14T13:10:51Z

src/java/org/apache/cassandra/io/sstable/SSTableCursorReader.java

+        return state;
+    }
+
+    private int checkNextFlags() throws IOException


The caller of this method appears to be pretty well aware what kind of flags/state it expects this to be called in. Would it make sense to split it into checkNext(Partition|Unfiltered|Cell)Flags?

blambov · 2025-10-14T14:25:49Z

src/java/org/apache/cassandra/io/sstable/SSTableCursorWriter.java

+        appendBIGIndex(partitionKey, partitionKeyLength, partitionStart, headerLength, partitionDeletionTime, partitionEnd);
+    }
+
+    private void appendBIGIndex(byte[] key, int keyLength, long partitionStart, int headerLength, DeletionTime partitionDeletionTime, long partitionEnd) throws IOException


Is it not easy to modify and reuse the index building code from BigFormatPartitionWriter? The duplication here seems quite unnecessary.

blambov · 2025-10-14T14:26:56Z

src/java/org/apache/cassandra/io/sstable/SSTableCursorWriter.java

+    private final int indexBlockThreshold;
+
+
+    private SSTableCursorWriter(


This class should be split into a common SortedTableCursorWriter, with format-specific subclasses that instantiate the index builders it uses, and placed into the correct per-format packages.

Currently only BIG is supported, so I held off here. I think the index code split can be postponed to the time when the BTI format is supported. It may also be interesting to explore splitting the SSTable write phase and indexing phase so as to allow more flexibility in composing the phases (e.g. index on the fly/index per partition write/index at end of write, parallel index pass etc).
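
For reference, one possible shape of the split under discussion is sketched below. It is purely illustrative: every name (`WriterSplitSketch`, `PartitionIndexBuilder`, `SortedWriterSketch`, `BigIndexBuilderSketch`, `BigWriterSketch`) is hypothetical, and the string-based index entries are a placeholder rather than the BIG index layout.

```java
import java.util.ArrayList;
import java.util.List;

final class WriterSplitSketch
{
    // Format-specific part: how partition positions are recorded in the index.
    interface PartitionIndexBuilder
    {
        void addPartition(String key, long position);

        List<String> entries();
    }

    // Format-agnostic part: tracks the data file position and delegates indexing.
    static abstract class SortedWriterSketch
    {
        private final PartitionIndexBuilder index;
        private long position = 0;

        protected SortedWriterSketch(PartitionIndexBuilder index)
        {
            this.index = index;
        }

        final void appendPartition(String key, int serializedSize)
        {
            index.addPartition(key, position);
            position += serializedSize;
        }

        final List<String> indexEntries()
        {
            return index.entries();
        }
    }

    // Placeholder index builder for one format.
    static final class BigIndexBuilderSketch implements PartitionIndexBuilder
    {
        private final List<String> entries = new ArrayList<>();

        @Override
        public void addPartition(String key, long position)
        {
            entries.add(key + "@" + position);
        }

        @Override
        public List<String> entries()
        {
            return entries;
        }
    }

    // Format-specific subclass: its only job is to supply the matching index builder.
    static final class BigWriterSketch extends SortedWriterSketch
    {
        BigWriterSketch()
        {
            super(new BigIndexBuilderSketch());
        }
    }
}
```

Passing the builder through the constructor keeps the base class free of format knowledge, and it would also leave room for the alternative indexing phases mentioned in the reply, since the builder is just another collaborator.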

blambov · 2025-11-03T14:08:34Z

src/java/org/apache/cassandra/config/Config.java

    @Replaces(oldName = "enable_drop_compact_storage", converter = Converters.IDENTITY, deprecated = true)
    public volatile boolean drop_compact_storage_enabled = false;

+    public boolean enable_cursor_compaction = ENABLE_CURSOR_COMPACTION.getBoolean();


It looks like the enable_X style of naming is deprecated. Could you rename this (as well as the getter and system property) to the preferred cursor_compaction_enabled?

blambov · 2025-11-03T14:18:52Z

src/java/org/apache/cassandra/db/compaction/CompactionTask.java

+                    else if (e instanceof CompactionInterruptedException)
+                        throw (CompactionInterruptedException) e;
+                    else
+                        throw new IllegalStateException(e);


I'd prefer not to change the existing behaviour (which wraps these into RuntimeException instead of IllegalStateException), as I don't know what may be relying on this wrapping (including customer tools).

blambov · 2025-11-03T14:33:11Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+ *       data that is not read often, so compaction "pro-actively" fix such index entries. This is mainly
+ *       an optimization).</li>
+ * </ul>
+ */


Well the unchanged part of the doc isn't quite right. This is not creating a compacted iterator, an iterator of any kind, or even a cursor of any kind.

The class pulls data from multiple cursors and pushes it into a given writer. This is fundamentally different from the way compaction works in other code. Please explain the difference, because it is critical to understanding the class and being able to work on it.

blambov · 2025-11-03T15:29:18Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+    public static boolean isSupported(AbstractCompactionStrategy.ScannerList scanners, AbstractCompactionController controller)
+    {
+        TableMetadata metadata = controller.cfs.metadata();
+        if (metadata.getTableDirectoryName().contains("system") ||


Use SchemaConstants.isSystemKeyspace(table.keyspace) or some of the other variations in SchemaConstants for this check.

What makes the check necessary? Why isn't the partitioner class check sufficient?
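
A small sketch of why the substring test is fragile: a user table whose directory name merely contains "system" is treated as a system table. The class and method names are hypothetical, the keyspace set is an illustrative subset rather than the full list, and the user table and keyspace names are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SystemKeyspaceCheckSketch
{
    // Illustrative subset of system keyspace names; not a complete list.
    static final Set<String> SYSTEM_KEYSPACES =
        new HashSet<>(Arrays.asList("system", "system_schema", "system_auth"));

    // Mirrors the shape of the check in the quoted hunk: substring match on the directory name.
    static boolean looksLikeSystemBySubstring(String tableDirectoryName)
    {
        return tableDirectoryName.contains("system");
    }

    // Exact match on the keyspace name, in the spirit of the suggested helper.
    static boolean isSystemByKeyspace(String keyspace)
    {
        return SYSTEM_KEYSPACES.contains(keyspace);
    }

    public static void main(String[] args)
    {
        // Hypothetical user table "ecosystem_events" in a user keyspace "analytics".
        System.out.println(looksLikeSystemBySubstring("ecosystem_events-1a2b3c"));
        System.out.println(isSystemByKeyspace("analytics"));
    }
}
```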

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

blambov · 2025-11-03T15:44:32Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+
+            for (SSTableReader reader : scanner.getBackingSSTables()) {
+                Version version = reader.descriptor.version;
+                if (!(version.format instanceof BigFormat))


The input sstable format and the output format are not the same thing (we can be in the middle of an upgrade). This cursor's restriction is that it can't write the bti format, which we can't check by going through the source sstables -- we need to use DatabaseDescriptor.getSelectedSSTableFormat() instead.

blambov · 2025-11-03T15:45:27Z

src/java/org/apache/cassandra/db/compaction/AbstractCompactionPipeline.java

+    {
        if (DatabaseDescriptor.enableCursorCompaction()) {
-            try {
+            if (CompactionCursor.isSupported(scanners, controller))


We should still log whether we are doing cursor compaction.

blambov · 2025-11-03T15:57:26Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+     * <ul>
+     *     <li>PARTITION_START - Partition header is loaded in preparation for merge</li>
+     *     <li>begining of unfiltered/end of partition - header is loaded, list is sorted after this point</li>
+     *     <li>DONE - need to be reset</li>


It is not clear at all why the cursor "need to be reset". Also, we will keep resetting finished cursors again and again if e.g. only one scanner remains.

Could you explain both points?

Or maybe add an intermediate state that this method advances to DONE? Alternatively, only recognize the end of the file in the PARTITION_START processing, which we may need to do anyway to handle the state where a file is immediately exhausted (which is more likely to happen for partial scanners)?

blambov · 2025-11-03T16:02:26Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+     * Sorts the cursors array in preparation for partition merge. This assumes cursors are in one of 3 states:
+     * <ul>
+     *     <li>PARTITION_START - Partition header is loaded in preparation for merge</li>
+     *     <li>begining of unfiltered/end of partition - header is loaded, list is sorted after this point</li>


Could you explain why? Something in the sense of "If the cursor was used in the processing of the previous partition, its state would have advanced to PARTITION_START or DONE. Otherwise, it would remain positioned after the partition header, in one of these states."

blambov · 2025-11-03T16:03:26Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+    /**
+     * Sorts the cursors array in preparation for partition merge. This assumes cursors are in one of 3 states:
+     * <ul>
+     *     <li>PARTITION_START - Partition header is loaded in preparation for merge</li>


Shouldn't this be "needs to be loaded"?

blambov · 2025-11-04T08:58:38Z

src/java/org/apache/cassandra/db/compaction/CompactionCursor.java

+        if (!mergedDeletion.isLive() && !purger.shouldPurge(mergedDeletion))
+        {
+            toWritePartitionDeletion = mergedDeletion;
+            maybeSwitchWriter(compactionAwareWriter);


I know this may be slightly less efficient, but I would prefer the writing to be done by the caller, rechecking isLive on the purged deletion, for clarity.

Alternatively, rename the method to something that clarifies that it may also write the partition header.

Nitsan Wakart and others added 2 commits August 28, 2025 14:26

Merge branch 'trunk' into compaction-work-pr-prep

759dc2c

nitsanw changed the title ~~Add cursor based optimized compaction path (WIP)~~ CASSANDRA-20918 Add cursor-based low allocation optimized compaction implementation Sep 29, 2025

blambov self-requested a review September 30, 2025 07:40

aratno reviewed Oct 8, 2025

blambov reviewed Oct 9, 2025

Fix MEGABYTE constant

d75d59c

blambov requested changes Oct 10, 2025

blambov reviewed Oct 13, 2025

Nitsan Wakart added 6 commits October 14, 2025 14:40

Fix 0x11/0x10 should be 0b11/0b10

35ea514

Introduce ENABLE_CURSOR_COMPACTION controls via CassandraRelevantProperties/Config

33b9e47

Revert matched comments left to track stats tracking

d46511d

Revert change to RandomAccessReader and match row skipping logic from elsewhere using seek

b417274

Improve CompactionCursor javadoc

85379f1

Extract isSupported from constructor

097b995

blambov reviewed Oct 14, 2025

Nitsan Wakart added 14 commits October 15, 2025 10:51

Fix Trasnactions/Transformations

1af0f0e

Add comment and rename sortForPartitionMerge -> `prepareForPartitionMerge` and return true when partitions remain

7576864

Typo: preturbed -> perturbed

2205e48

Javadoc for bubbleInsertToPreSorted and minor refactor

dea15c9

Javadoc for bubbleInsertToPreSorted

Typo: passed -> past

5d200ef

Remove redundant TODOs

1a1c761

Revert 'unused' params

b9dc6db

Rename ElementDescriptor -> UnfilteredDescriptor (and fallout)

5accdd5

Remove unused parameter

17be5f4

Remove unused method

3b0f0a9

Fix indentation

ae65b7e

Move SSTableCursorPipeUtil to benchmarks

62837f9

Rename partitionLength back to finishResult and clarify comment

4064d32

Remove unused methods

797ace1

Nitsan Wakart added 14 commits October 21, 2025 15:26

Improve bubbleInsertElementToPreSorted, delay element insert

855a740

Remove redundant cursor status check

beef9ad

Simplify deletion merging loop, clarify partitionDeletion variable names

b19b959

Simplify deletion merging loop, and some refactoring of names

Neaten up SSTableCursorReader

f5718b3

Dead code removal

87a1c1c

Revert making classes public

ff5eff1

Transform LivenessInfo an interface

ed48fe5

Fix javadoc

a8bc083

Fix intellij warnings

be8b4e3

Explicitly split DeletionTime implementations

011213a

Rely on nextElementEquality in findMergeLimit

4d74fce

Refactor prepareAndSortForMerge code

b9f5802

Move merge limit == 0 out of mergeRows

308a3b3

Add TODO for clustering read/skip

562e437

blambov reviewed Nov 3, 2025

Simplify ClusteringComparator code and remove redundant code

37104f6

blambov reviewed Nov 4, 2025
