Skip to content

HDDS-14509. Allow client to choose the read consistency level#9796

Merged
szetszwo merged 21 commits intoapache:masterfrom
ivandika3:HDDS-14509
Feb 27, 2026
Merged

HDDS-14509. Allow client to choose the read consistency level#9796
szetszwo merged 21 commits intoapache:masterfrom
ivandika3:HDDS-14509

Conversation

@ivandika3
Copy link
Contributor

@ivandika3 ivandika3 commented Feb 20, 2026

What changes were proposed in this pull request?

Currently, if OM follower read is enabled, all OM followers will serve read requests even if the client does not enable the follower read.

For leader-only client, this might be unexpected since they expect the OM follower to throw OMNotLeaderException.

We can support some kind of OM request metadata to indicate to the OM whether the client enables the follower read or not.

The implementation introduces ReadConsistencyHint field in OMRequest which allows client to hint to the OM about what the read consistency the client wants. However, for compatibility reasons, the client ReadConsistencyHint is only supported by OM in a best-effort basis. For example, if the new client sends the hint to old OM that does not support it or OM that disables linearizable read, the OM can choose not to respect the read consistency.

Currently, client can configure the default leader and follower read consistency which applies to corresponding requests. However, the long term idea is to allow client to specify per-request consistency requirements. For example in S3 use cases, client can use custom header (e.g. "x-ozone-read-consistency") to pick the desired consistency.

The idea of client-defined consistency level is inspired by other systems and literature:

Note: Since this changes require changes in protobuf definition and possibly user-facing read consistency level. Thorough reviews and suggestions are greatly appreciated. (Edit: Currently putting this to Draft first until the proto and API changes are finalized since IMO the API and proto changes are still slightly awkward).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14509

How was this patch tested?

IT.

Clean CI run: https://github.com/ivandika3/ozone/actions/runs/22216665699

@szetszwo
Copy link
Contributor

@ivandika3 , thanks for working on this!

This PR also includes some refactoring/renaming and the patch file is ~70KB. Could you separate the refactoring/renaming to another JIRA? It will be easier to review.

@ivandika3 ivandika3 self-assigned this Feb 22, 2026
@ivandika3
Copy link
Contributor Author

Thanks @szetszwo for the review. I have removed some of refactoring by reverting rename from leaderFailoverProxy to failoverProxy. Regarding the refactoring for transferLeadershipToAnotherNode to MiniOzoneHAClusterImpl, I did it to remove duplication on this method in the new tests. Let me know if you'd prefer to revert this as well.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivandika3 , thanks for the update!

I have some high level comments inlined.

Comment on lines 39 to 43
DEFAULT(ConsistencyType.NON_LINEARIZABLE, false),
STALE(ConsistencyType.STALE, true),
LINEARIZABLE_LEADER_READ(ConsistencyType.LINEARIZABLE, false),
LINEARIZABLE_FOLLOWER_READ(ConsistencyType.LINEARIZABLE, true),
LOCAL_LEASE_FOLLOWER_READ(ConsistencyType.LOCAL_LEASE, true);
Copy link
Contributor

@szetszwo szetszwo Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Stale read can be supported by local lease with infinite log lag and time limit. We may remove STALE for simplicity.
  2. I think we don't need the inner enum ConsistencyType. Just add a boolean for linearizable.
  3. isAllowFollowerRead sounds odd. Let's use nouns for the fields and add the verb to the method.

Below is my suggestion:

public enum ReadConsistency {
  DEFAULT(false, false),
  LOCAL_LEASE(false, true),
  LINEARIZABLE_LEADER_ONLY(true, false),
  LINEARIZABLE_ALLOW_FOLLOWER(true, true);

  private final boolean linearizable;
  private final boolean followerRead;

  ReadConsistency(boolean linearizable, boolean followerRead) {
    this.linearizable = linearizable;
    this.followerRead = followerRead;
  }

  public boolean isLinearizable() {
    return linearizable;
  }

  public boolean allowFollowerRead() {
    return followerRead;
  }

  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and the suggestion on streamlining the ReadConsistency. I have changed accordingly and change some namings (e.g. removed "Consistency Type" naming).

}
// The LocalLease lag is too high, trigger failover
throw createLeaderErrorException(raftServerStatus);
case LEADER_AND_NOT_READY:
Copy link
Contributor

@szetszwo szetszwo Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For LOCAL_LEASE_FOLLOWER_READ, the LEADER_AND_NOT_READY case should be treated as the same as the NOT_LEADER case, i.e. just consider it as a follower. (fixed typo "LEADER_AND_READY")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, updated.

Comment on lines 287 to 290
long localLeaseLagLimit = localLeaseContext.getLagLimit() > 0 ?
localLeaseContext.getLagLimit() : ozoneManager.getConfig().getFollowerReadLocalLeaseLagLimit();
long localLeaseLeaseTimeMs = localLeaseContext.getLeaseTimeMs() > 0 ?
localLeaseContext.getLeaseTimeMs() : ozoneManager.getConfig().getFollowerReadLocalLeaseTimeMs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use localLeaseContext.getLagLimit() and hasLeaseTimeMs(). We may use -1 for allowing infinite lag/time.

Copy link
Contributor Author

@ivandika3 ivandika3 Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, used the has... and also updated allowFollowerReadLocalLease to allow infinite lag.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivandika3 , thanks for the update! Please see the comments inlined.

Comment on lines 238 to 245
ReadConsistencyProto defaultReadConsistency = isFollowerReadEligible
? followerReadConsistencyType : leaderReadConsistencyType;
if (defaultReadConsistency != null &&
defaultReadConsistency != ReadConsistencyProto.UNKNOWN_READ_CONSISTENCY) {
omRequest = omRequest.toBuilder()
.setReadConsistencyHint(ReadConsistencyHint.newBuilder()
.setReadConsistency(defaultReadConsistency).build())
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change them to followerReadConsistencyType and leaderReadConsistencyType to ReadConsistencyHint. So it does not need to build again and again.

      if (!omRequest.hasReadConsistencyHint()) {
        final ReadConsistencyHint defaultReadConsistency = isFollowerReadEligible
            ? followerReadConsistencyType : leaderReadConsistencyType;
        if (defaultReadConsistency != null) {
          omRequest = omRequest.toBuilder()
              .setReadConsistencyHint(defaultReadConsistency)
              .build();
          args[1] = omRequest;
        }
      }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this optimization, updated.

enum ReadConsistencyProto {
// Unknown consistency, the read consistency behavior is decided
// by the OM
UNKNOWN_READ_CONSISTENCY = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good idea to have this as described in https://protobuf.dev/best-practices/dos-donts/#unspecified-enum. How about calling it UNSPECIFIED?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, updated. We might add some kind of suffix or prefix (e.g. UNSPECIFIED_CONSISTENCY) since AFAIK once one one enum uses UNSPECIFIED other enum within the scope cannot use UNSPECIFIED anymore. Please let me know what you think.


private final HadoopRpcOMFailoverProxyProvider<OzoneManagerProtocolPB> omFailoverProxyProvider;
private final boolean followerReadEnabled;
private final ReadConsistencyProto defaultLeaderReadConsistency;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ReadConsistencyHint

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

.build();
}

private RaftClientRequest.Type getRaftReadRequestType(OMRequest omRequest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add static.

}

if (leaseLagLimit == -1) {
// Allow infinite lag time, which allows unbounded stale reads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be "infinite log lag time"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, updated.

Comment on lines 367 to 369
if (leaseTimeMsLimit == -1) {
leaseTimeMsLimit = Long.MAX_VALUE;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine to the if-condition below:

    if (leaseTimeMsLimit >= 0 && leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial local lease implementation allows negative leaseTimeMsLimit which means that lease will be rejected all the time? It is tested by testAllowFollowerReadLocalLease. IMO, any negative leaseTimeMsLimit should infer infinite log and lag time? In that case, should we change the test to reflect this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make leaseTimeMsLimit >= -1 and use -1 for infinity. We should change the test.

}

boolean allowFollowerReadLocalLease(Division ratisDivision, long leaseLogLimit, long leaseTimeMsLimit) {
boolean allowFollowerReadLocalLease(Division ratisDivision, long leaseLagLimit, long leaseTimeMsLimit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are two types of lag, time lag and log lag, the original name leaseLogLimit is better.

BTW, we should change the conf

  • ozone.om.follower.read.local.lease.lag.limit

to

  • ozone.om.follower.read.local.lease.log.limit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, updated. The initial change was because of the configuration which lead me to believe it's a typo.

raftServerStatus = omRatisServer.getLeaderStatus();
switch (raftServerStatus) {
case LEADER_AND_NOT_READY:
throw createLeaderErrorException(raftServerStatus);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line since LEADER_AND_NOT_READY and NOT_LEADER should be the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. LeaderStateImpl#getReadIndex handles isReady and will wait until the leader apply the no-op entry.


private final HadoopRpcOMFailoverProxyProvider<OzoneManagerProtocolPB> omFailoverProxyProvider;
private final boolean followerReadEnabled;
private final ReadConsistencyProto defaultLeaderReadConsistency;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ReadConsistencyHint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivandika3 , thanks for the update! The change looks good. Just have two minor comments inlined.

enum ReadConsistencyProto {
// Unspecified consistency, the read consistency behavior is decided
// by the OM
UNSPECIFIED = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image Just realized that UNSPECIFIED only can be used once. Let's use READ_CONSISTENCY_UNSPECIFIED

Comment on lines 524 to 525
if (!omRequest.hasReadConsistencyHint() || !omRequest.getReadConsistencyHint().hasReadConsistency() ||
omRequest.getReadConsistencyHint().getReadConsistency() == ReadConsistencyProto.UNSPECIFIED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a util method. Use it here and also OzoneManagerProtocolServerSideTranslatorPB.

//OmUtils
  public static boolean specifiedReadConsistency(OMRequest request) {
    return request.hasReadConsistencyHint()
        && request.getReadConsistencyHint().hasReadConsistency()
        && request.getReadConsistencyHint().getReadConsistency() != UNSPECIFIED;
  }

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivandika3 , thanks for the update! Please see the comments inlined.

}

if (leaseTimeMsLimit != -1 && leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) {
if (leaseTimeMsLimit >= -1 && leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be != or > -- when setting it to -1, it should not return false here.

Comment on lines 170 to 183
@@ -178,7 +179,8 @@ public class OmConfig extends ReconfigurableConfig {
tags = {ConfigTag.OM, ConfigTag.PERFORMANCE, ConfigTag.HA, ConfigTag.RATIS},
description = " If the lag time Ms between leader OM and follower OM is larger " +
"than this number, the follower OM is not up-to-date. " +
"By default, it's set to Ratis RPC timeout value."
"By default, it's set to Ratis RPC timeout value." +
"Setting this to -1 allows read to return immediately."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are two limits, it may not "return immediately". How about below?

  • "Setting this to -1 to allow infinite lag."

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 The change looks good.

@ivandika3 ivandika3 marked this pull request as ready for review February 27, 2026 15:12
@ivandika3
Copy link
Contributor Author

@szetszwo Thank you for the thorough reviews.

@szetszwo szetszwo merged commit 1edbb6d into apache:master Feb 27, 2026
101 of 102 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants