
Conversation


@mykaul mykaul commented Jan 22, 2026

Refactor DCAwareRoundRobinPolicy to use a Copy-On-Write (COW) strategy for managing host distances.

Key changes:

  • Introduce _remote_hosts to cache REMOTE hosts, enabling O(1) distance lookups during query planning. IGNORED hosts do not need to be stored in the cache.
    For LOCAL we do a simple datacenter comparison.
  • Add _refresh_remote_hosts to handle node changes.

This is a different attempt from #650 at caching host distances to make query planning faster.
I'm not sure it ends up being less code, but it does seem to make sense to do it per policy rather than only for TokenAware. Unsure.

The 1st commit is for DCAwareRoundRobinPolicy. Once it passes CI, I have an additional commit for the RackAware one, and perhaps more will be needed later.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@mykaul mykaul marked this pull request as draft January 22, 2026 18:14

mykaul commented Jan 23, 2026

This is interesting, my change has exposed this -

2026-01-23 18:26:45.488 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='STATUS_CHANGE', event_args={'change_type': 'DOWN', 'address': ('127.0.0.3', 9042)}, stream_id=-1, trace_id=None)>

2026-01-23 18:26:45.489 WARNING [libevreactor:376]: Host 127.0.0.3:9042 has been marked down                      <--- host .3 is marked as DOWN

2026-01-23 18:26:45.489 DEBUG [thread:73]: First connection created to 127.0.0.2:9042 for shard_id=0
2026-01-23 18:26:45.489 DEBUG [thread:73]: Finished initializing connection for host 127.0.0.2:9042
2026-01-23 18:26:45.489 DEBUG [thread:73]: Added pool for host 127.0.0.2:9042 to session
2026-01-23 18:26:45.489 DEBUG [thread:73]: Removed connection pool for <Host: 127.0.0.3:9042 dc1>
2026-01-23 18:26:45.490 DEBUG [thread:73]: Shutting down connections to 127.0.0.3:9042
2026-01-23 18:26:45.490 DEBUG [thread:73]: Closing connection (139753730215760) to 127.0.0.3:9042
2026-01-23 18:26:48.496 DEBUG [test_ip_change:35]: Change IP address for node3
2026-01-23 18:26:48.534 DEBUG [test_ip_change:40]: Start node3 again with ip address 127.0.0.33
2026-01-23 18:26:48.551 DEBUG [cluster:772]: node3: Starting scylla: args=['/home/ykaul/github/python-driver/tests/integration/ccm/test_ip_change/node3/bin/scylla', '--options-file', '/home/ykaul/github/python-driver/tests/integration/ccm/test_ip_change/node3/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.0.33', '--smp', '1', '--memory', '512M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.0.33', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=True
2026-01-23 18:26:49.947 INFO [cluster:775]: node3: Started scylla: pid: 186960
2026-01-23 18:26:49.947 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:50.164 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='TOPOLOGY_CHANGE', event_args={'change_type': 'NEW_NODE', 'address': ('127.0.0.33', 9042)}, stream_id=-1, trace_id=None)>
2026-01-23 18:26:50.165 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='STATUS_CHANGE', event_args={'change_type': 'UP', 'address': ('127.0.0.33', 9042)}, stream_id=-1, trace_id=None)>
2026-01-23 18:26:50.448 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:50.948 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:51.449 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:51.569 DEBUG [thread:73]: [control connection] Refreshing node list and token map
2026-01-23 18:26:51.570 DEBUG [thread:73]: [control connection] Updating host ip from 127.0.0.3:9042 to 127.0.0.33:9042 for (c989a851-2dcb-4b05-8a0c-fb1658a32e21)

2026-01-23 18:26:51.570 WARNING [thread:73]: Host 127.0.0.33:9042 has been marked down            <-- due to an IP change, the host is marked as down!?!

2026-01-23 18:26:51.571 DEBUG [thread:73]: [control connection] Finished fetching ring info
2026-01-23 18:26:51.949 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.33', '127.0.0.2']

Need to understand this better :-/


mykaul commented Jan 23, 2026

            if host is None:
                host = self._cluster.metadata.get_host_by_host_id(host_id)
                if host and host.endpoint != endpoint:
                    log.debug("[control connection] Updating host ip from %s to %s for (%s)", host.endpoint, endpoint, host_id)
                    old_endpoint = host.endpoint
                    host.endpoint = endpoint
                    self._cluster.metadata.update_host(host, old_endpoint)
                    reconnector = host.get_and_set_reconnection_handler(None)
                    if reconnector:
                        reconnector.cancel()
                    self._cluster.on_down(host, is_host_addition=False, expect_host_to_be_down=True)

So first we update the host with the new endpoint, then mark it as down?


mykaul commented Jan 23, 2026

This fixes it for me:

diff --git a/cassandra/cluster.py b/cassandra/cluster.py
index a9c1d00e..099043ea 100644
--- a/cassandra/cluster.py
+++ b/cassandra/cluster.py
@@ -3831,14 +3831,16 @@ class ControlConnection(object):
                 host = self._cluster.metadata.get_host_by_host_id(host_id)
                 if host and host.endpoint != endpoint:
                     log.debug("[control connection] Updating host ip from %s to %s for (%s)", host.endpoint, endpoint, host_id)
-                    old_endpoint = host.endpoint
-                    host.endpoint = endpoint
-                    self._cluster.metadata.update_host(host, old_endpoint)
                     reconnector = host.get_and_set_reconnection_handler(None)
                     if reconnector:
                         reconnector.cancel()
                     self._cluster.on_down(host, is_host_addition=False, expect_host_to_be_down=True)
 
+                    old_endpoint = host.endpoint
+                    host.endpoint = endpoint
+                    self._cluster.metadata.update_host(host, old_endpoint)
+                    self._cluster.on_up(host)
+
             if host is None:
                 log.debug("[control connection] Found new host to connect to: %s", endpoint)
                 host, _ = self._cluster.add_host(endpoint, datacenter=datacenter, rack=rack, signal=True, refresh_nodes=False, host_id=host_id)

which also makes sense to me.
@dkropachev - I think this fix should go in a separate issue and PR, no? (context - start with #651 (comment) - my changes here failed due to the wrong order of updates for a host that changed its IP)

…tate

Refactor `DCAwareRoundRobinPolicy` to simplify distance calculations and memory usage.

Key changes:
- Remove `_hosts_by_distance` and the complex caching of LOCAL hosts.
- `distance()` now checks `host.datacenter` directly for LOCAL calculation, which is correct and static.
- Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`.
- Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes).
- This removes the overhead of maintaining a redundant LOCAL cache and ensures correct behavior even if a local host is marked down.
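The "rebuild only when necessary" rule from the bullets above could be sketched like this, with a rebuild counter standing in for the real recomputation (all names here are assumptions, not the driver's actual code):

```python
from types import SimpleNamespace


class RemoteCacheRebuilder:
    """Sketch of conditional cache rebuilds in on_up/on_down."""

    def __init__(self, local_dc):
        self.local_dc = local_dc
        self._remote_hosts = frozenset()
        self.rebuilds = 0  # instrumentation for this sketch only

    def _refresh_remote_hosts(self):
        self.rebuilds += 1  # real code would recompute the frozenset here

    def on_up(self, host):
        # Only a remote host appearing can change the cached remote set.
        if host.datacenter != self.local_dc:
            self._refresh_remote_hosts()

    def on_down(self, host):
        # A local host going down never touches the cache: LOCAL distance
        # is a plain datacenter comparison, not a cache lookup.
        if host.datacenter != self.local_dc and host in self._remote_hosts:
            self._refresh_remote_hosts()
```

This is why a local host being marked down stays correct for free: it was never in the cache to begin with.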

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
When a host changes its IP address, the driver previously updated the host endpoint to the new IP before calling on_down.
This caused on_down to mistakenly target the new IP for connection cleanup.
This change reorders the operations so that on_down cleans up the old IP's resources before the host object is updated and on_up is called for the new IP.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Jan 23, 2026

I think CI failure is unrelated and is #359


mykaul commented Jan 24, 2026

By using the (not amazing) benchmark from #653, I got the following results:

For master branch as a baseline:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2309     | 433       
RackAware                      | 100000     | 0.3607     | 277       
TokenAware(DCAware)            | 100000     | 1.3262     | 75        
TokenAware(RackAware)          | 100000     | 1.4343     | 69        

This branch (with just DC aware improvements):

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1280     | 781       
RackAware                      | 100000     | 0.3572     | 279       
TokenAware(DCAware)            | 100000     | 1.1620     | 86        
TokenAware(RackAware)          | 100000     | 1.4435     | 69        

** 433 -> 781 Kops/sec improvement **

With improvement to rack aware (on top of master), I got:

=== Performance Benchmarks ===
Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2306     | 433       
RackAware                      | 100000     | 0.3084     | 324       
TokenAware(DCAware)            | 100000     | 1.3031     | 76        
TokenAware(RackAware)          | 100000     | 1.3440     | 74        

** 277 -> 324 Kops/sec improvement **

And on top of this branch:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1283     | 779       
RackAware                      | 100000     | 0.2905     | 344       
TokenAware(DCAware)            | 100000     | 1.1454     | 87        
TokenAware(RackAware)          | 100000     | 1.3293     | 75        

** 277 -> 344 Kops/sec improvement **

And finally, for #650 :

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2325     | 430       
RackAware                      | 100000     | 0.3611     | 276       
TokenAware(DCAware)            | 100000     | 1.5826     | 63        
TokenAware(RackAware)          | 100000     | 1.6927     | 59        

which kinda makes me suspect that branch is no good :-/


mykaul commented Jan 24, 2026

(Quoting the Jan 23 comment above, including the cluster.py diff that reorders the endpoint update to happen after on_down, and the question to @dkropachev about splitting the fix into a separate issue and PR.)

Sent separate PR - #654

@mykaul mykaul changed the title (improvement)Optimize DCAwareRoundRobinPolicy with host distance caching (improvement)Optimize DCAware/RackAware RoundRobinPolicy with host distance caching Jan 24, 2026

mykaul commented Jan 24, 2026

With rack aware added (3rd commit), these are the current numbers:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1235     | 809       
RackAware                      | 100000     | 0.2934     | 340       
TokenAware(DCAware)            | 100000     | 1.1371     | 87        
TokenAware(RackAware)          | 100000     | 1.3291     | 75    

…distances

Refactor `RackAwareRoundRobinPolicy` to simplify distance calculations and memory usage.

Add self._remote_hosts to cache remote hosts distance, self._non_local_rack_hosts for non-local rack host distance.
This improves the performance nicely, from ~290K query plans per second to ~600K query plans per second.

- Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`.
- Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Jan 24, 2026

(Quoting the rack-aware numbers table from the previous comment.)

Now that I also cache non-local-rack hosts, not just remote ones (duh!), performance is better:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1247     | 802       
RackAware                      | 100000     | 0.1624     | 615       
TokenAware(DCAware)            | 100000     | 1.2408     | 80        
TokenAware(RackAware)          | 100000     | 1.3087     | 76   

…y planning.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Jan 24, 2026

Added some optimization for TokenAware as well (need to improve the commit message).
Current results:

Policy                         | Ops        | Time (s)   | Kops/s     | (master)
--------------------------------------------------------------------------------
DCAware                        | 100000     | 0.1266     | 790        | 433
RackAware                      | 100000     | 0.1670     | 598        | 277
TokenAware(DCAware)            | 100000     | 0.2663     | 375        | 75
TokenAware(RackAware)          | 100000     | 0.3009     | 332        | 69

So reasonable improvement, at least in this micro-benchmark.
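For anyone wanting to reproduce a table of this shape, a minimal harness could look like the following. This is not the actual #653 benchmark; it just times repeated query-plan construction and reports Kops/s, with a dummy plan function standing in for a real policy's `make_query_plan`.

```python
import time
from itertools import islice


def bench(policy_name, make_query_plan, ops=100_000):
    """Time `ops` query-plan constructions and print one table row."""
    start = time.perf_counter()
    for _ in range(ops):
        # Materialize a few hosts from the plan, as a real request would.
        list(islice(make_query_plan(), 3))
    elapsed = time.perf_counter() - start
    print(f"{policy_name:<30} | {ops:<10} | {elapsed:<10.4f} "
          f"| {ops / elapsed / 1000:<10.0f}")
    return elapsed


# usage with any callable returning an iterable of hosts:
hosts = ["h1", "h2", "h3"]
bench("RoundRobin(dummy)", lambda: iter(hosts))
```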

