
Conversation


@mykaul mykaul commented Jan 22, 2026

Refactor DCAwareRoundRobinPolicy to use a Copy-On-Write (COW) strategy for managing host distances.

Key changes:

  • Introduce _remote_hosts to cache REMOTE hosts, enabling O(1) distance lookups during query planning. IGNORED hosts do not need to be stored in the cache.
    For LOCAL we do a simple datacenter comparison.
  • Add _refresh_remote_hosts to handle node changes.

This is a different attempt from #650 at caching host distances to make query planning faster.
I'm not sure it ends up being less code, but it does seem to make sense to do it per policy rather than only for TokenAware. Unsure.

The 1st commit is for DCAwareRoundRobinPolicy. Once it passes CI, I have an additional commit for the RackAware one, and perhaps more will be needed later.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@mykaul mykaul marked this pull request as draft January 22, 2026 18:14

mykaul commented Jan 23, 2026

This is interesting, my change has exposed this -

2026-01-23 18:26:45.488 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='STATUS_CHANGE', event_args={'change_type': 'DOWN', 'address': ('127.0.0.3', 9042)}, stream_id=-1, trace_id=None)>

2026-01-23 18:26:45.489 WARNING [libevreactor:376]: Host 127.0.0.3:9042 has been marked down                      <--- host .3 is marked as DOWN

2026-01-23 18:26:45.489 DEBUG [thread:73]: First connection created to 127.0.0.2:9042 for shard_id=0
2026-01-23 18:26:45.489 DEBUG [thread:73]: Finished initializing connection for host 127.0.0.2:9042
2026-01-23 18:26:45.489 DEBUG [thread:73]: Added pool for host 127.0.0.2:9042 to session
2026-01-23 18:26:45.489 DEBUG [thread:73]: Removed connection pool for <Host: 127.0.0.3:9042 dc1>
2026-01-23 18:26:45.490 DEBUG [thread:73]: Shutting down connections to 127.0.0.3:9042
2026-01-23 18:26:45.490 DEBUG [thread:73]: Closing connection (139753730215760) to 127.0.0.3:9042
2026-01-23 18:26:48.496 DEBUG [test_ip_change:35]: Change IP address for node3
2026-01-23 18:26:48.534 DEBUG [test_ip_change:40]: Start node3 again with ip address 127.0.0.33
2026-01-23 18:26:48.551 DEBUG [cluster:772]: node3: Starting scylla: args=['/home/ykaul/github/python-driver/tests/integration/ccm/test_ip_change/node3/bin/scylla', '--options-file', '/home/ykaul/github/python-driver/tests/integration/ccm/test_ip_change/node3/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.0.33', '--smp', '1', '--memory', '512M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.0.33', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=True
2026-01-23 18:26:49.947 INFO [cluster:775]: node3: Started scylla: pid: 186960
2026-01-23 18:26:49.947 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:50.164 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='TOPOLOGY_CHANGE', event_args={'change_type': 'NEW_NODE', 'address': ('127.0.0.33', 9042)}, stream_id=-1, trace_id=None)>
2026-01-23 18:26:50.165 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='STATUS_CHANGE', event_args={'change_type': 'UP', 'address': ('127.0.0.33', 9042)}, stream_id=-1, trace_id=None)>
2026-01-23 18:26:50.448 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:50.948 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:51.449 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:51.569 DEBUG [thread:73]: [control connection] Refreshing node list and token map
2026-01-23 18:26:51.570 DEBUG [thread:73]: [control connection] Updating host ip from 127.0.0.3:9042 to 127.0.0.33:9042 for (c989a851-2dcb-4b05-8a0c-fb1658a32e21)

2026-01-23 18:26:51.570 WARNING [thread:73]: Host 127.0.0.33:9042 has been marked down            <-- due to an IP change, the host is marked as down!?!

2026-01-23 18:26:51.571 DEBUG [thread:73]: [control connection] Finished fetching ring info
2026-01-23 18:26:51.949 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.33', '127.0.0.2']

Need to understand this better :-/


mykaul commented Jan 23, 2026

            if host is None:
                host = self._cluster.metadata.get_host_by_host_id(host_id)
                if host and host.endpoint != endpoint:
                    log.debug("[control connection] Updating host ip from %s to %s for (%s)", host.endpoint, endpoint, host_id)
                    old_endpoint = host.endpoint
                    host.endpoint = endpoint
                    self._cluster.metadata.update_host(host, old_endpoint)
                    reconnector = host.get_and_set_reconnection_handler(None)
                    if reconnector:
                        reconnector.cancel()
                    self._cluster.on_down(host, is_host_addition=False, expect_host_to_be_down=True)

So first we update the host with the new endpoint, then mark it as down?


mykaul commented Jan 23, 2026

This fixes it for me:

diff --git a/cassandra/cluster.py b/cassandra/cluster.py
index a9c1d00e..099043ea 100644
--- a/cassandra/cluster.py
+++ b/cassandra/cluster.py
@@ -3831,14 +3831,16 @@ class ControlConnection(object):
                 host = self._cluster.metadata.get_host_by_host_id(host_id)
                 if host and host.endpoint != endpoint:
                     log.debug("[control connection] Updating host ip from %s to %s for (%s)", host.endpoint, endpoint, host_id)
-                    old_endpoint = host.endpoint
-                    host.endpoint = endpoint
-                    self._cluster.metadata.update_host(host, old_endpoint)
                     reconnector = host.get_and_set_reconnection_handler(None)
                     if reconnector:
                         reconnector.cancel()
                     self._cluster.on_down(host, is_host_addition=False, expect_host_to_be_down=True)
 
+                    old_endpoint = host.endpoint
+                    host.endpoint = endpoint
+                    self._cluster.metadata.update_host(host, old_endpoint)
+                    self._cluster.on_up(host)
+
             if host is None:
                 log.debug("[control connection] Found new host to connect to: %s", endpoint)
                 host, _ = self._cluster.add_host(endpoint, datacenter=datacenter, rack=rack, signal=True, refresh_nodes=False, host_id=host_id)

which also makes sense to me.
@dkropachev - I think this fix should go in a separate issue and PR, no? (context - start with #651 (comment) - my changes here failed due to the wrong order of updates for a host that changed its IP)

…tate

Refactor `DCAwareRoundRobinPolicy` to simplify distance calculations and memory usage.

Key changes:
- Remove `_hosts_by_distance` and the complex caching of LOCAL hosts.
- `distance()` now checks `host.datacenter` directly for LOCAL calculation, which is correct and static.
- Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`.
- Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes).
- This removes the overhead of maintaining a redundant LOCAL cache and ensures correct behavior even if a local host is marked down.
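The "rebuild only when necessary" rule from the bullets above could be sketched like this, with a rebuild counter standing in for the real recomputation (all names here are assumptions, not the driver's actual code):

```python
from types import SimpleNamespace


class RemoteCacheRebuilder:
    """Sketch of conditional cache rebuilds in on_up/on_down."""

    def __init__(self, local_dc):
        self.local_dc = local_dc
        self._remote_hosts = frozenset()
        self.rebuilds = 0  # instrumentation for this sketch only

    def _refresh_remote_hosts(self):
        self.rebuilds += 1  # real code would recompute the frozenset here

    def on_up(self, host):
        # Only a remote host appearing can change the cached remote set.
        if host.datacenter != self.local_dc:
            self._refresh_remote_hosts()

    def on_down(self, host):
        # A local host going down never touches the cache: LOCAL distance
        # is a plain datacenter comparison, not a cache lookup.
        if host.datacenter != self.local_dc and host in self._remote_hosts:
            self._refresh_remote_hosts()
```

This is why a local host being marked down stays correct for free: it was never in the cache to begin with.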

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
When a host changes its IP address, the driver previously updated the host endpoint to the new IP before calling on_down.
This caused on_down to mistakenly target the new IP for connection cleanup.
This change reorders the operations so that on_down cleans up the old IP's resources before the host object is updated and on_up is called for the new IP.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Jan 23, 2026

I think CI failure is unrelated and is #359


mykaul commented Jan 24, 2026

By using the (not amazing) benchmark from #653, I got the following results:

For master branch as a baseline:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2309     | 433       
RackAware                      | 100000     | 0.3607     | 277       
TokenAware(DCAware)            | 100000     | 1.3262     | 75        
TokenAware(RackAware)          | 100000     | 1.4343     | 69        

This branch (with just DC aware improvements):

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1280     | 781       
RackAware                      | 100000     | 0.3572     | 279       
TokenAware(DCAware)            | 100000     | 1.1620     | 86        
TokenAware(RackAware)          | 100000     | 1.4435     | 69        

** 433 -> 781 Kops/sec improvement **

With improvement to rack aware (on top of master), I got:

=== Performance Benchmarks ===
Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2306     | 433       
RackAware                      | 100000     | 0.3084     | 324       
TokenAware(DCAware)            | 100000     | 1.3031     | 76        
TokenAware(RackAware)          | 100000     | 1.3440     | 74        

** 277 -> 324 Kops/sec improvement **

And on top of this branch:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1283     | 779       
RackAware                      | 100000     | 0.2905     | 344       
TokenAware(DCAware)            | 100000     | 1.1454     | 87        
TokenAware(RackAware)          | 100000     | 1.3293     | 75        

** 277 -> 344 Kops/sec improvement **

And finally, for #650 :

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2325     | 430       
RackAware                      | 100000     | 0.3611     | 276       
TokenAware(DCAware)            | 100000     | 1.5826     | 63        
TokenAware(RackAware)          | 100000     | 1.6927     | 59        

which kinda makes me suspect that branch is no good :-/


mykaul commented Jan 24, 2026

(Quoting the Jan 23 comment above, including the cluster.py diff that reorders the endpoint update to happen after on_down, and the question to @dkropachev about splitting the fix into a separate issue and PR.)

Sent separate PR - #654

@mykaul mykaul changed the title (improvement)Optimize DCAwareRoundRobinPolicy with host distance caching (improvement)Optimize DCAware/RackAware RoundRobinPolicy with host distance caching Jan 24, 2026

mykaul commented Jan 24, 2026

With rack aware added (3rd commit), these are the current numbers:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1235     | 809       
RackAware                      | 100000     | 0.2934     | 340       
TokenAware(DCAware)            | 100000     | 1.1371     | 87        
TokenAware(RackAware)          | 100000     | 1.3291     | 75    

…distances

Refactor `RackAwareRoundRobinPolicy` to simplify distance calculations and memory usage.

Add self._remote_hosts to cache remote hosts distance, self._non_local_rack_hosts for non-local rack host distance.
This improves the performance nicely, from ~290K query plans per second to ~600K query plans per second.

- Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`.
- Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Jan 24, 2026

(Quoting the rack-aware numbers table from the previous comment.)

Now that I also cache non-local-rack hosts, not just remote ones (duh!), performance is better:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1247     | 802       
RackAware                      | 100000     | 0.1624     | 615       
TokenAware(DCAware)            | 100000     | 1.2408     | 80        
TokenAware(RackAware)          | 100000     | 1.3087     | 76   

…y planning.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Jan 24, 2026

Added some optimization for TokenAware as well (need to improve the commit message).
Current results:

Policy                         | Ops        | Time (s)   | Kops/s     | (master)
--------------------------------------------------------------------------------
DCAware                        | 100000     | 0.1266     | 790        | 433
RackAware                      | 100000     | 0.1670     | 598        | 277
TokenAware(DCAware)            | 100000     | 0.2663     | 375        | 75
TokenAware(RackAware)          | 100000     | 0.3009     | 332        | 69

So reasonable improvement, at least in this micro-benchmark.
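For anyone wanting to reproduce a table of this shape, a minimal harness could look like the following. This is not the actual #653 benchmark; it just times repeated query-plan construction and reports Kops/s, with a dummy plan function standing in for a real policy's `make_query_plan`.

```python
import time
from itertools import islice


def bench(policy_name, make_query_plan, ops=100_000):
    """Time `ops` query-plan constructions and print one table row."""
    start = time.perf_counter()
    for _ in range(ops):
        # Materialize a few hosts from the plan, as a real request would.
        list(islice(make_query_plan(), 3))
    elapsed = time.perf_counter() - start
    print(f"{policy_name:<30} | {ops:<10} | {elapsed:<10.4f} "
          f"| {ops / elapsed / 1000:<10.0f}")
    return elapsed


# usage with any callable returning an iterable of hosts:
hosts = ["h1", "h2", "h3"]
bench("RoundRobin(dummy)", lambda: iter(hosts))
```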

