@JNSimba
Member

@JNSimba JNSimba commented Feb 3, 2026

What problem does this PR solve?

Related PR: #58898 #59461

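Some MySQL/PostgreSQL streaming jobs need to tolerate a certain amount of erroneous data instead of failing on the first bad row. This PR adds load properties to control that tolerance.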

Supported parameters:

load.strict_mode: Whether to enable strict mode. Defaults to false.

load.max_filter_ratio: The maximum allowed ratio of filtered (erroneous) rows within the sampling window. Defaults to 0 (zero tolerance). The sampling window is max_interval * 10. If erroneous rows / total rows exceeds max_filter_ratio within the sampling window, the job is paused, and manual intervention is required to investigate the data quality issue.

Example:

CREATE JOB test_streaming_mysql_job_errormsg
ON STREAMING
FROM MYSQL (
"jdbc_url" = "jdbc:mysql://127.0.0.1:3308",
......
)
TO DATABASE database (
"load.max_filter_ratio" = "1"
)
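
The pause rule described for load.max_filter_ratio can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `FilterWindow`, its method names, and the helper `sampling_window` are hypothetical, and it assumes the counters are simply reset when a new sampling window starts (the PR description does not say whether the window slides or resets).

```python
from dataclasses import dataclass


@dataclass
class FilterWindow:
    """Hypothetical per-window counters; not the actual Doris FE class."""

    max_filter_ratio: float = 0.0  # default per the PR description: zero tolerance
    total_rows: int = 0
    filtered_rows: int = 0

    def record(self, total: int, filtered: int) -> None:
        """Accumulate the row counts reported by one load in the window."""
        self.total_rows += total
        self.filtered_rows += filtered

    def should_pause(self) -> bool:
        """Return True when filtered/total exceeds the configured ratio."""
        if self.total_rows == 0:
            return False  # nothing loaded yet, so nothing to judge
        return self.filtered_rows / self.total_rows > self.max_filter_ratio

    def reset(self) -> None:
        """Start a new sampling window (assumption: counters are cleared)."""
        self.total_rows = 0
        self.filtered_rows = 0


def sampling_window(max_interval: int) -> int:
    """Sampling window length as stated in the PR: max_interval * 10."""
    return max_interval * 10


window = FilterWindow(max_filter_ratio=0.1)
window.record(total=100, filtered=5)
first_check = window.should_pause()   # 5 / 100 = 0.05, within tolerance
window.record(total=100, filtered=20)
second_check = window.should_pause()  # 25 / 200 = 0.125, exceeds 0.1
```

Under this reading, a value of "1" as in the example above never pauses the job, because the ratio of filtered rows to total rows cannot exceed 1.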

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented Feb 3, 2026

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/21) 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-H: Total hot run time: 32066 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a92766b116558900d2d541e5a74d1b476a6b0af8, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17649	5228	5047	5047
q2	2038	308	188	188
q3	10202	1355	750	750
q4	10232	896	320	320
q5	8139	2177	1952	1952
q6	227	179	149	149
q7	904	738	607	607
q8	9270	1444	1238	1238
q9	5390	4807	4832	4807
q10	6877	1921	1552	1552
q11	511	303	278	278
q12	379	377	227	227
q13	17781	4071	3201	3201
q14	256	249	226	226
q15	908	839	817	817
q16	678	677	637	637
q17	665	767	511	511
q18	6755	6542	6524	6524
q19	1448	1005	622	622
q20	408	370	229	229
q21	2638	2099	1907	1907
q22	358	328	277	277
Total cold run time: 103713 ms
Total hot run time: 32066 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5373	5311	5281	5281
q2	273	343	262	262
q3	2194	2675	2259	2259
q4	1366	1733	1303	1303
q5	4318	4253	4329	4253
q6	215	181	141	141
q7	2441	2023	1942	1942
q8	2602	2612	2474	2474
q9	7485	7782	7465	7465
q10	2798	3058	2582	2582
q11	567	475	446	446
q12	684	748	639	639
q13	3945	4966	3671	3671
q14	289	314	299	299
q15	883	839	855	839
q16	675	729	698	698
q17	1213	1370	1393	1370
q18	8039	8185	7594	7594
q19	844	848	875	848
q20	2079	2161	2012	2012
q21	4735	4165	4164	4164
q22	596	544	522	522
Total cold run time: 53614 ms
Total hot run time: 51064 ms

@doris-robot

ClickBench: Total hot run time: 28.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a92766b116558900d2d541e5a74d1b476a6b0af8, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.05	0.05	0.05
query2	0.10	0.04	0.04
query3	0.26	0.08	0.09
query4	1.62	0.11	0.11
query5	0.27	0.25	0.25
query6	1.17	0.69	0.67
query7	0.03	0.02	0.03
query8	0.06	0.03	0.04
query9	0.56	0.51	0.50
query10	0.56	0.54	0.55
query11	0.15	0.10	0.09
query12	0.14	0.11	0.11
query13	0.63	0.61	0.60
query14	1.05	1.04	1.04
query15	0.88	0.87	0.88
query16	0.42	0.39	0.40
query17	1.18	1.15	1.15
query18	0.23	0.21	0.21
query19	2.10	2.00	2.00
query20	0.02	0.01	0.02
query21	15.40	0.25	0.14
query22	5.20	0.05	0.05
query23	15.92	0.31	0.11
query24	1.07	0.63	0.79
query25	0.11	0.05	0.11
query26	0.16	0.13	0.13
query27	0.09	0.06	0.06
query28	5.00	1.14	0.97
query29	12.56	3.94	3.21
query30	0.28	0.12	0.11
query31	2.82	0.64	0.41
query32	3.25	0.59	0.49
query33	3.27	3.26	3.35
query34	16.18	5.39	4.70
query35	4.76	4.77	4.73
query36	0.67	0.50	0.50
query37	0.10	0.07	0.06
query38	0.07	0.04	0.04
query39	0.04	0.02	0.03
query40	0.20	0.18	0.16
query41	0.09	0.03	0.03
query42	0.05	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 98.81 s
Total hot run time: 28.69 s

@JNSimba JNSimba changed the title from "[Improve](StreamingJob) add stream load properties for mysql/pg streaming job" to "[Improve](StreamingJob) add max_filter_ratio and strict mode for mysql/pg streaming job" on Feb 3, 2026
@JNSimba
Member Author

JNSimba commented Feb 3, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 31554 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 856a0da5bcdcd9c316beadb076a751f95b163f82, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17652	5323	5042	5042
q2	2030	298	204	204
q3	10233	1281	748	748
q4	10200	803	324	324
q5	7519	2207	1886	1886
q6	197	179	149	149
q7	867	761	596	596
q8	9271	1421	1021	1021
q9	5081	4807	4826	4807
q10	6817	1929	1565	1565
q11	502	304	277	277
q12	337	373	232	232
q13	17797	4038	3263	3263
q14	243	236	227	227
q15	874	828	820	820
q16	676	668	624	624
q17	651	832	456	456
q18	6685	6449	6288	6288
q19	1234	997	631	631
q20	390	348	232	232
q21	2672	2015	1897	1897
q22	353	313	265	265
Total cold run time: 102281 ms
Total hot run time: 31554 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5292	5251	5235	5235
q2	259	342	265	265
q3	2187	2664	2265	2265
q4	1339	1725	1317	1317
q5	4275	4235	4298	4235
q6	228	187	139	139
q7	1948	2302	1918	1918
q8	2545	2438	2518	2438
q9	7569	7453	7581	7453
q10	2952	3135	2581	2581
q11	542	495	463	463
q12	679	727	615	615
q13	3789	4480	3576	3576
q14	310	348	288	288
q15	896	837	856	837
q16	684	728	694	694
q17	1203	1374	1382	1374
q18	8016	7965	7849	7849
q19	890	847	849	847
q20	2091	2205	2104	2104
q21	4751	4229	4147	4147
q22	568	545	496	496
Total cold run time: 53013 ms
Total hot run time: 51136 ms

@doris-robot

ClickBench: Total hot run time: 28.33 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 856a0da5bcdcd9c316beadb076a751f95b163f82, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.05	0.05	0.05
query2	0.09	0.05	0.04
query3	0.26	0.09	0.08
query4	1.60	0.11	0.11
query5	0.27	0.26	0.24
query6	1.16	0.67	0.67
query7	0.03	0.02	0.02
query8	0.05	0.04	0.04
query9	0.56	0.51	0.49
query10	0.54	0.54	0.55
query11	0.15	0.10	0.10
query12	0.14	0.11	0.10
query13	0.64	0.61	0.61
query14	1.06	1.08	1.07
query15	0.87	0.84	0.87
query16	0.44	0.39	0.41
query17	1.17	1.14	1.08
query18	0.23	0.21	0.21
query19	2.02	1.95	1.99
query20	0.02	0.01	0.01
query21	15.39	0.27	0.15
query22	4.97	0.05	0.05
query23	15.83	0.28	0.10
query24	0.93	0.93	0.32
query25	0.11	0.08	0.06
query26	0.14	0.14	0.14
query27	0.10	0.04	0.06
query28	3.68	1.13	0.96
query29	12.58	3.93	3.17
query30	0.28	0.13	0.12
query31	2.81	0.64	0.43
query32	3.23	0.60	0.49
query33	3.26	3.25	3.28
query34	16.42	5.44	4.74
query35	4.79	4.80	4.78
query36	0.64	0.50	0.50
query37	0.11	0.08	0.07
query38	0.07	0.05	0.04
query39	0.05	0.04	0.04
query40	0.19	0.16	0.15
query41	0.08	0.03	0.03
query42	0.04	0.03	0.03
query43	0.06	0.03	0.03
Total cold run time: 97.11 s
Total hot run time: 28.33 s
