HDDS-14509. Allow client to choose the read consistency level#9796
HDDS-14509. Allow client to choose the read consistency level#9796szetszwo merged 21 commits intoapache:masterfrom
Conversation
|
@ivandika3 , thanks for working on this! This PR also includes some refactoring/renaming and the patch file is ~70KB. Could you separate the refactoring/renaming to another JIRA? It will be easier to review. |
|
Thanks @szetszwo for the review. I have removed some of refactoring by reverting rename from |
szetszwo
left a comment
There was a problem hiding this comment.
@ivandika3 , thanks for the update!
I have some high level comments inlined.
| DEFAULT(ConsistencyType.NON_LINEARIZABLE, false), | ||
| STALE(ConsistencyType.STALE, true), | ||
| LINEARIZABLE_LEADER_READ(ConsistencyType.LINEARIZABLE, false), | ||
| LINEARIZABLE_FOLLOWER_READ(ConsistencyType.LINEARIZABLE, true), | ||
| LOCAL_LEASE_FOLLOWER_READ(ConsistencyType.LOCAL_LEASE, true); |
There was a problem hiding this comment.
- Stale read can be supported by local lease with infinite log lag and time limit. We may remove STALE for simplicity.
- I think we don't need the inner enum
ConsistencyType. Just add a boolean for linearizable. isAllowFollowerReadsounds odd. Let's use nouns for the fields and add the verb to the method.
Below is my suggestion:
public enum ReadConsistency {
DEFAULT(false, false),
LOCAL_LEASE(false, true),
LINEARIZABLE_LEADER_ONLY(true, false),
LINEARIZABLE_ALLOW_FOLLOWER(true, true);
private final boolean linearizable;
private final boolean followerRead;
ReadConsistency(boolean linearizable, boolean followerRead) {
this.linearizable = linearizable;
this.followerRead = followerRead;
}
public boolean isLinearizable() {
return linearizable;
}
public boolean allowFollowerRead() {
return followerRead;
}
...
}There was a problem hiding this comment.
Thanks for the review and the suggestion on streamlining the ReadConsistency. I have changed accordingly and change some namings (e.g. removed "Consistency Type" naming).
| } | ||
| // The LocalLease lag is too high, trigger failover | ||
| throw createLeaderErrorException(raftServerStatus); | ||
| case LEADER_AND_NOT_READY: |
There was a problem hiding this comment.
For LOCAL_LEASE_FOLLOWER_READ, the LEADER_AND_NOT_READY case should be treated as the same as the NOT_LEADER case, i.e. just consider it as a follower. (fixed typo "LEADER_AND_READY")
There was a problem hiding this comment.
Thanks for catching this, updated.
| long localLeaseLagLimit = localLeaseContext.getLagLimit() > 0 ? | ||
| localLeaseContext.getLagLimit() : ozoneManager.getConfig().getFollowerReadLocalLeaseLagLimit(); | ||
| long localLeaseLeaseTimeMs = localLeaseContext.getLeaseTimeMs() > 0 ? | ||
| localLeaseContext.getLeaseTimeMs() : ozoneManager.getConfig().getFollowerReadLocalLeaseTimeMs(); |
There was a problem hiding this comment.
Use localLeaseContext.getLagLimit() and hasLeaseTimeMs(). We may use -1 for allowing infinite lag/time.
There was a problem hiding this comment.
Thanks, used the has... and also updated allowFollowerReadLocalLease to allow infinite lag.
szetszwo
left a comment
There was a problem hiding this comment.
@ivandika3 , thanks for the update! Please see the comments inlined.
| ReadConsistencyProto defaultReadConsistency = isFollowerReadEligible | ||
| ? followerReadConsistencyType : leaderReadConsistencyType; | ||
| if (defaultReadConsistency != null && | ||
| defaultReadConsistency != ReadConsistencyProto.UNKNOWN_READ_CONSISTENCY) { | ||
| omRequest = omRequest.toBuilder() | ||
| .setReadConsistencyHint(ReadConsistencyHint.newBuilder() | ||
| .setReadConsistency(defaultReadConsistency).build()) | ||
| .build(); |
There was a problem hiding this comment.
Change them to followerReadConsistencyType and leaderReadConsistencyType to ReadConsistencyHint. So it does not need to build again and again.
if (!omRequest.hasReadConsistencyHint()) {
final ReadConsistencyHint defaultReadConsistency = isFollowerReadEligible
? followerReadConsistencyType : leaderReadConsistencyType;
if (defaultReadConsistency != null) {
omRequest = omRequest.toBuilder()
.setReadConsistencyHint(defaultReadConsistency)
.build();
args[1] = omRequest;
}
}There was a problem hiding this comment.
Thanks for this optimization, updated.
| enum ReadConsistencyProto { | ||
| // Unknown consistency, the read consistency behavior is decided | ||
| // by the OM | ||
| UNKNOWN_READ_CONSISTENCY = 0; |
There was a problem hiding this comment.
It is a good idea to have this as described in https://protobuf.dev/best-practices/dos-donts/#unspecified-enum. How about calling it UNSPECIFIED?
There was a problem hiding this comment.
Got it, updated. We might add some kind of suffix or prefix (e.g. UNSPECIFIED_CONSISTENCY) since AFAIK once one one enum uses UNSPECIFIED other enum within the scope cannot use UNSPECIFIED anymore. Please let me know what you think.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/ReadConsistency.java
Show resolved
Hide resolved
|
|
||
| private final HadoopRpcOMFailoverProxyProvider<OzoneManagerProtocolPB> omFailoverProxyProvider; | ||
| private final boolean followerReadEnabled; | ||
| private final ReadConsistencyProto defaultLeaderReadConsistency; |
| .build(); | ||
| } | ||
|
|
||
| private RaftClientRequest.Type getRaftReadRequestType(OMRequest omRequest) { |
| } | ||
|
|
||
| if (leaseLagLimit == -1) { | ||
| // Allow infinite lag time, which allows unbounded stale reads |
There was a problem hiding this comment.
It should be "infinite log lag time"
There was a problem hiding this comment.
Thanks for catching this, updated.
| if (leaseTimeMsLimit == -1) { | ||
| leaseTimeMsLimit = Long.MAX_VALUE; | ||
| } |
There was a problem hiding this comment.
Combine to the if-condition below:
if (leaseTimeMsLimit >= 0 && leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) {There was a problem hiding this comment.
The initial local lease implementation allows negative leaseTimeMsLimit which means that lease will be rejected all the time? It is tested by testAllowFollowerReadLocalLease. IMO, any negative leaseTimeMsLimit should infer infinite log and lag time? In that case, should we change the test to reflect this?
There was a problem hiding this comment.
Let's make leaseTimeMsLimit >= -1 and use -1 for infinity. We should change the test.
| } | ||
|
|
||
| boolean allowFollowerReadLocalLease(Division ratisDivision, long leaseLogLimit, long leaseTimeMsLimit) { | ||
| boolean allowFollowerReadLocalLease(Division ratisDivision, long leaseLagLimit, long leaseTimeMsLimit) { |
There was a problem hiding this comment.
Since there are two types of lag, time lag and log lag, the original name leaseLogLimit is better.
BTW, we should change the conf
ozone.om.follower.read.local.lease.lag.limit
to
ozone.om.follower.read.local.lease.log.limit
There was a problem hiding this comment.
Thanks for catching this, updated. The initial change was because of the configuration which lead me to believe it's a typo.
| raftServerStatus = omRatisServer.getLeaderStatus(); | ||
| switch (raftServerStatus) { | ||
| case LEADER_AND_NOT_READY: | ||
| throw createLeaderErrorException(raftServerStatus); |
There was a problem hiding this comment.
Remove this line since LEADER_AND_NOT_READY and NOT_LEADER should be the same.
There was a problem hiding this comment.
Good catch. LeaderStateImpl#getReadIndex handles isReady and will wait until the leader apply the no-op entry.
|
|
||
| private final HadoopRpcOMFailoverProxyProvider<OzoneManagerProtocolPB> omFailoverProxyProvider; | ||
| private final boolean followerReadEnabled; | ||
| private final ReadConsistencyProto defaultLeaderReadConsistency; |
szetszwo
left a comment
There was a problem hiding this comment.
@ivandika3 , thanks for the update! The change looks good. Just have two minor comments inlined.
| enum ReadConsistencyProto { | ||
| // Unspecified consistency, the read consistency behavior is decided | ||
| // by the OM | ||
| UNSPECIFIED = 0; |
| if (!omRequest.hasReadConsistencyHint() || !omRequest.getReadConsistencyHint().hasReadConsistency() || | ||
| omRequest.getReadConsistencyHint().getReadConsistency() == ReadConsistencyProto.UNSPECIFIED) { |
There was a problem hiding this comment.
Add a util method. Use it here and also OzoneManagerProtocolServerSideTranslatorPB.
//OmUtils
public static boolean specifiedReadConsistency(OMRequest request) {
return request.hasReadConsistencyHint()
&& request.getReadConsistencyHint().hasReadConsistency()
&& request.getReadConsistencyHint().getReadConsistency() != UNSPECIFIED;
}
szetszwo
left a comment
There was a problem hiding this comment.
@ivandika3 , thanks for the update! Please see the comments inlined.
| } | ||
|
|
||
| if (leaseTimeMsLimit != -1 && leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) { | ||
| if (leaseTimeMsLimit >= -1 && leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) { |
There was a problem hiding this comment.
It should be != or > -- when setting it to -1, it should not return false here.
| @@ -178,7 +179,8 @@ public class OmConfig extends ReconfigurableConfig { | |||
| tags = {ConfigTag.OM, ConfigTag.PERFORMANCE, ConfigTag.HA, ConfigTag.RATIS}, | |||
| description = " If the lag time Ms between leader OM and follower OM is larger " + | |||
| "than this number, the follower OM is not up-to-date. " + | |||
| "By default, it's set to Ratis RPC timeout value." | |||
| "By default, it's set to Ratis RPC timeout value." + | |||
| "Setting this to -1 allows read to return immediately." | |||
There was a problem hiding this comment.
Since there are two limits, it may not "return immediately". How about below?
- "Setting this to -1 to allow infinite lag."
szetszwo
left a comment
There was a problem hiding this comment.
+1 The change looks good.
|
@szetszwo Thank you for the thorough reviews. |

What changes were proposed in this pull request?
Currently, if OM follower read is enabled, all OM followers will serve read requests even if the client does not enable the follower read.
For leader-only client, this might be unexpected since they expect the OM follower to throw OMNotLeaderException.
We can support some kind of OM request metadata to indicate to the OM whether the client enables the follower read or not.
The implementation introduces
ReadConsistencyHintfield inOMRequestwhich allows client to hint to the OM about what the read consistency the client wants. However, for compatibility reasons, the clientReadConsistencyHintis only supported by OM in a best-effort basis. For example, if the new client sends the hint to old OM that does not support it or OM that disables linearizable read, the OM can choose not to respect the read consistency.Currently, client can configure the default leader and follower read consistency which applies to corresponding requests. However, the long term idea is to allow client to specify per-request consistency requirements. For example in S3 use cases, client can use custom header (e.g. "x-ozone-read-consistency") to pick the desired consistency.
The idea of client-defined consistency level is inspired by other systems and literature:
Through Baseball": https://www.microsoft.com/en-us/research/wp-content/uploads/2011/10/ConsistencyAndBaseballReport.pdf
Note: Since this changes require changes in protobuf definition and possibly user-facing read consistency level. Thorough reviews and suggestions are greatly appreciated. (Edit: Currently putting this to Draft first until the proto and API changes are finalized since IMO the API and proto changes are still slightly awkward).
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14509
How was this patch tested?
IT.
Clean CI run: https://github.com/ivandika3/ozone/actions/runs/22216665699