HDDS-14725. Display retry messages in cli when scm's are unavailable#9834

Open
Gargi-jais11 wants to merge 1 commit into apache:master from Gargi-jais11:HDDS-14725
@Gargi-jais11
Contributor

What changes were proposed in this pull request?

When all SCM instances are down or unreachable, CLI commands that query SCM (e.g. ozone admin datanode list, decommission, diskbalancer, usageinfo, maintenance, etc.) appear to hang for up to ~10–15 minutes before failing.

Current behaviour when all SCMs are down in an HA setup (this affects every command that queries SCM):

bash-5.1$ ozone admin datanode list
<----------- Appears stuck for ~15 minutes with no CLI error message ----------->

Proposed fix:
Print the retry messages from SCMFailoverProxyProviderBase to stderr so that they appear in the CLI output.
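To illustrate the idea, here is a minimal standalone sketch of printing one retry line to stderr per failover attempt while rotating through the SCM nodes. It is not the code from this patch: the class `RetryMessageSketch` and the helpers `formatRetryMessage` and `nextNode` are hypothetical names, and the message wording is simply copied from the sample output in the testing section of this description.

```java
import java.util.List;

/** Hypothetical sketch; not the actual SCMFailoverProxyProviderBase code. */
class RetryMessageSketch {

  /** Builds one retry line shaped like the sample output in this PR. */
  public static String formatRetryMessage(String exceptionSummary,
      String protocol, String nodeId, int failoverAttempts, long sleepMillis) {
    return String.format(
        "%s, while invoking %s over %s after %d failover attempt(s). "
            + "Trying to failover after sleeping for %dms.",
        exceptionSummary, protocol, nodeId, failoverAttempts, sleepMillis);
  }

  /** Picks the next node ID in round-robin order (assumed failover order). */
  public static String nextNode(List<String> nodeIds, String current) {
    int idx = nodeIds.indexOf(current);
    return nodeIds.get((idx + 1) % nodeIds.size());
  }

  public static void main(String[] args) {
    List<String> nodes = List.of("scm1", "scm2", "scm3");
    String current = "scm2";
    for (int attempt = 1; attempt <= 3; attempt++) {
      // Write to stderr so the user sees progress without polluting stdout.
      System.err.println(formatRetryMessage(
          "UnknownHostException: Invalid host name",
          "StorageContainerLocationProtocolPB", current, attempt, 2000L));
      current = nextNode(nodes, current);
    }
  }
}
```

Writing to stderr rather than stdout keeps the command's regular output clean for scripts that parse it, while still showing the user that the CLI is retrying rather than hanging.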

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14725

How was this patch tested?

Added an integration test in TestFailoverWithScmHA for commands that query SCM.
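As a rough illustration of how such a check can work, the sketch below redirects `System.err` with plain JDK classes and asserts that a retry line was written. It is not the test from this patch: `StderrCaptureSketch` and `captureStderr` are hypothetical names, and the lambda only stands in for a real CLI invocation against a cluster whose SCMs are down.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the actual TestFailoverWithScmHA code. */
class StderrCaptureSketch {

  /** Runs the action with System.err redirected; returns what was written. */
  public static String captureStderr(Runnable action) {
    PrintStream original = System.err;
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (PrintStream capture =
        new PrintStream(buffer, true, StandardCharsets.UTF_8)) {
      System.setErr(capture);
      action.run();
    } finally {
      // Always restore the real stderr, even if the action throws.
      System.setErr(original);
    }
    return buffer.toString(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Stand-in for a CLI command run while all SCMs are unreachable.
    String err = captureStderr(() -> System.err.println(
        "UnknownHostException: Invalid host name, while invoking "
            + "StorageContainerLocationProtocolPB over scm2 after 1 "
            + "failover attempt(s). Trying to failover after sleeping "
            + "for 2000ms."));
    if (!err.contains("failover attempt(s)")) {
      throw new AssertionError("expected a retry message on stderr");
    }
    System.out.println("retry message captured");
  }
}
```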
Tested locally:

bash-5.1$ ozone admin datanode list
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2 after 1 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3 after 2 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm1 after 3 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2 after 4 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3 after 5 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm1 after 6 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2 after 7 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3 after 8 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm1 after 9 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2 after 10 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3 after 11 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm1 after 12 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm2 after 13 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm3 after 14 failover attempt(s). Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking StorageContainerLocationProtocolPB over scm1 after 15 failover attempt(s). Trying to failover after sleeping for 2000ms.
Invalid host name: local host is: "om1/172.18.0.4"; destination host is: "scm1":9860; java.net.UnknownHostException: Invalid host name: local host is: "om1/172.18.0.4"; destination host is: "scm1":9860; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost; For more details see: http://wiki.apache.org/hadoop/UnknownHost

@Gargi-jais11 Gargi-jais11 marked this pull request as ready for review February 26, 2026 11:47
@Gargi-jais11
Contributor Author

@adoroszlai @ChenSammi Please review the patch.