[feat](spill) spill and reserve #47462

Merged
merged 3 commits into from
Feb 25, 2025

mrhhsg
Member

@mrhhsg mrhhsg commented Jan 26, 2025

What problem does this PR solve?

Problem Summary:

This PR introduces a new strategy for triggering spilling:

  1. Use workload groups to control/manage the memory usage of queries.
  2. Trigger spilling when the memory reservation attempt fails.
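
The two points above can be sketched as follows. This is a minimal illustration only, not the code from this PR: `WorkloadGroup`, `SpillableOperator`, `try_reserve`, `release`, `spill`, and `add_block` are hypothetical names, and the byte sizes are made up. It assumes that each operator asks its workload group to reserve memory before buffering a block, and that a failed reservation causes the operator to spill and retry.

```python
from dataclasses import dataclass


@dataclass
class WorkloadGroup:
    """Hypothetical memory budget shared by the queries in one group."""
    limit_bytes: int
    reserved_bytes: int = 0

    def try_reserve(self, nbytes: int) -> bool:
        # Grant the reservation only if it still fits in the group's budget.
        if self.reserved_bytes + nbytes > self.limit_bytes:
            return False
        self.reserved_bytes += nbytes
        return True

    def release(self, nbytes: int) -> None:
        self.reserved_bytes = max(0, self.reserved_bytes - nbytes)


@dataclass
class SpillableOperator:
    """Hypothetical operator that buffers data in memory and can spill it."""
    group: WorkloadGroup
    in_memory_bytes: int = 0
    spilled_bytes: int = 0
    spill_count: int = 0

    def spill(self) -> None:
        # Stand-in for writing the buffered data to disk and freeing its memory.
        self.group.release(self.in_memory_bytes)
        self.spilled_bytes += self.in_memory_bytes
        self.in_memory_bytes = 0
        self.spill_count += 1

    def add_block(self, nbytes: int) -> None:
        # Reserve before buffering; a failed reservation triggers a spill.
        if not self.group.try_reserve(nbytes):
            self.spill()
            if not self.group.try_reserve(nbytes):
                raise MemoryError("block does not fit even after spilling")
        self.in_memory_bytes += nbytes


group = WorkloadGroup(limit_bytes=100)
op = SpillableOperator(group)
for size in (40, 40, 40):
    op.add_block(size)

print(op.spill_count, op.spilled_bytes, op.in_memory_bytes)  # 1 80 40
```

In this sketch the third block fails to reserve, so the operator spills its 80 buffered bytes and then reserves successfully. The real scheduling, pausing, and spill policies are more involved than a single retry.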

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jan 26, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg mrhhsg marked this pull request as draft January 26, 2025 10:20
@mrhhsg
Member Author

mrhhsg commented Jan 26, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32548 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit bbb25f1b0e644ce1b6368ccaba3ed10d144321ce, data reload: false

------ Round 1 ----------------------------------
q1	17797	5532	5403	5403
q2	2073	305	174	174
q3	10877	1213	726	726
q4	10219	961	546	546
q5	7652	2390	2168	2168
q6	200	172	134	134
q7	931	763	600	600
q8	9225	1383	1171	1171
q9	5228	4953	4898	4898
q10	6854	2344	1901	1901
q11	473	282	260	260
q12	346	365	215	215
q13	17759	3731	3151	3151
q14	230	229	216	216
q15	547	461	469	461
q16	638	627	598	598
q17	555	879	333	333
q18	7246	6703	6558	6558
q19	2001	965	539	539
q20	303	320	190	190
q21	2870	2166	1983	1983
q22	377	341	323	323
Total cold run time: 104401 ms
Total hot run time: 32548 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5571	5487	5449	5449
q2	242	339	243	243
q3	2267	2659	2350	2350
q4	1445	1811	1418	1418
q5	4341	4762	4714	4714
q6	170	165	126	126
q7	2077	2007	1857	1857
q8	2659	2890	2689	2689
q9	7349	7233	7324	7233
q10	3065	3337	2763	2763
q11	609	518	487	487
q12	657	739	612	612
q13	3606	4025	3296	3296
q14	293	296	282	282
q15	535	480	480	480
q16	660	714	667	667
q17	1263	1741	1263	1263
q18	7794	7586	7479	7479
q19	800	851	1069	851
q20	2027	2102	1920	1920
q21	5876	5238	5022	5022
q22	634	626	587	587
Total cold run time: 53940 ms
Total hot run time: 51788 ms
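
The report does not label its columns. The sketch below recomputes the Round 1 totals from the rows above, assuming the four numbers per query are one cold run, two hot runs, and the faster of the two hot runs, all in milliseconds. That reading is an inference from the arithmetic, not something the bot states, and `parse_rows` is a helper written for this illustration.

```python
# Round 1 rows copied from the TPC-H report above (times in milliseconds).
ROUND_1 = """
q1 17797 5532 5403 5403
q2 2073 305 174 174
q3 10877 1213 726 726
q4 10219 961 546 546
q5 7652 2390 2168 2168
q6 200 172 134 134
q7 931 763 600 600
q8 9225 1383 1171 1171
q9 5228 4953 4898 4898
q10 6854 2344 1901 1901
q11 473 282 260 260
q12 346 365 215 215
q13 17759 3731 3151 3151
q14 230 229 216 216
q15 547 461 469 461
q16 638 627 598 598
q17 555 879 333 333
q18 7246 6703 6558 6558
q19 2001 965 539 539
q20 303 320 190 190
q21 2870 2166 1983 1983
q22 377 341 323 323
"""


def parse_rows(text):
    """Map each query name to its list of integer timings."""
    rows = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        rows[name] = [int(v) for v in values]
    return rows


rows = parse_rows(ROUND_1)

# Assumed layout: [cold, hot run 1, hot run 2, best hot run].
cold_total = sum(v[0] for v in rows.values())
hot_total = sum(v[3] for v in rows.values())
best_is_min = all(v[3] == min(v[1], v[2]) for v in rows.values())

print(cold_total, hot_total, best_is_min)  # 104401 32548 True
```

Both sums match the reported "Total cold run time" and "Total hot run time" for Round 1, which supports the assumed column layout.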

@doris-robot

TPC-DS: Total hot run time: 193144 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit bbb25f1b0e644ce1b6368ccaba3ed10d144321ce, data reload: false

query1	1309	933	949	933
query2	6247	2093	2102	2093
query3	11101	4725	4800	4725
query4	32690	23660	23070	23070
query5	4694	593	458	458
query6	304	199	195	195
query7	3994	493	320	320
query8	300	252	237	237
query9	9296	2656	2635	2635
query10	473	307	277	277
query11	17877	15362	14970	14970
query12	171	116	106	106
query13	1575	528	408	408
query14	9577	7163	7283	7163
query15	243	210	185	185
query16	7572	672	517	517
query17	1550	754	572	572
query18	1449	387	313	313
query19	202	219	174	174
query20	120	114	117	114
query21	214	133	108	108
query22	4606	4550	4389	4389
query23	34401	33484	33503	33484
query24	6430	2361	2369	2361
query25	497	469	407	407
query26	741	279	160	160
query27	2143	505	343	343
query28	5274	2506	2471	2471
query29	550	564	430	430
query30	221	189	161	161
query31	956	895	819	819
query32	84	58	59	58
query33	479	360	316	316
query34	786	899	529	529
query35	832	875	771	771
query36	1020	1060	974	974
query37	127	106	88	88
query38	4405	4158	4247	4158
query39	1492	1455	1456	1455
query40	215	127	110	110
query41	60	57	56	56
query42	124	109	105	105
query43	524	541	499	499
query44	1352	851	865	851
query45	183	186	172	172
query46	879	1066	666	666
query47	1901	1948	1871	1871
query48	374	423	333	333
query49	715	551	436	436
query50	642	699	409	409
query51	4337	4355	4311	4311
query52	104	100	96	96
query53	248	278	192	192
query54	508	529	448	448
query55	89	79	88	79
query56	283	287	251	251
query57	1188	1184	1139	1139
query58	252	236	249	236
query59	3236	3204	3027	3027
query60	289	286	265	265
query61	126	124	121	121
query62	808	736	692	692
query63	236	205	192	192
query64	2993	1057	693	693
query65	3473	3272	3299	3272
query66	759	397	295	295
query67	16158	15588	15530	15530
query68	8486	902	555	555
query69	477	288	257	257
query70	1185	1170	1105	1105
query71	438	279	256	256
query72	5830	3961	4086	3961
query73	655	756	370	370
query74	10293	9180	8833	8833
query75	3975	3170	2656	2656
query76	3552	1177	781	781
query77	762	434	275	275
query78	9942	10230	9266	9266
query79	3655	843	588	588
query80	692	516	450	450
query81	507	276	242	242
query82	656	148	125	125
query83	200	183	154	154
query84	285	100	71	71
query85	767	363	311	311
query86	364	317	277	277
query87	4412	4441	4391	4391
query88	4789	2181	2141	2141
query89	417	335	287	287
query90	1805	191	191	191
query91	135	141	106	106
query92	65	57	50	50
query93	2322	928	546	546
query94	662	417	298	298
query95	338	263	261	261
query96	493	626	284	284
query97	3382	3396	3310	3310
query98	217	200	191	191
query99	1506	1405	1261	1261
Total cold run time: 291322 ms
Total hot run time: 193144 ms

@doris-robot

ClickBench: Total hot run time: 31.37 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit bbb25f1b0e644ce1b6368ccaba3ed10d144321ce, data reload: false

query1	0.03	0.04	0.03
query2	0.08	0.05	0.05
query3	0.23	0.05	0.06
query4	1.65	0.08	0.09
query5	0.55	0.55	0.54
query6	1.19	0.72	0.74
query7	0.02	0.01	0.02
query8	0.07	0.05	0.05
query9	0.56	0.50	0.51
query10	0.56	0.56	0.56
query11	0.17	0.12	0.13
query12	0.15	0.12	0.13
query13	0.61	0.61	0.59
query14	2.71	2.86	2.88
query15	0.92	0.86	0.83
query16	0.38	0.39	0.40
query17	1.04	1.04	1.07
query18	0.19	0.18	0.19
query19	1.94	1.88	1.97
query20	0.01	0.01	0.01
query21	15.41	0.98	0.65
query22	0.77	0.80	0.72
query23	14.96	1.50	0.68
query24	2.22	0.36	0.23
query25	0.15	0.09	0.08
query26	0.27	0.19	0.17
query27	0.08	0.08	0.08
query28	13.42	1.29	0.55
query29	12.66	4.11	3.41
query30	0.24	0.08	0.07
query31	2.85	0.62	0.40
query32	3.23	0.59	0.49
query33	3.01	3.04	3.05
query34	16.43	5.20	4.52
query35	4.64	4.68	4.58
query36	0.62	0.49	0.48
query37	0.20	0.16	0.16
query38	0.15	0.15	0.15
query39	0.05	0.05	0.04
query40	0.18	0.14	0.12
query41	0.09	0.05	0.05
query42	0.07	0.05	0.05
query43	0.05	0.04	0.05
Total cold run time: 104.81 s
Total hot run time: 31.37 s
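
The ClickBench rows carry three unlabeled numbers per query and no "best" column. The sketch below assumes they are one cold run followed by two hot runs, in seconds, and that the hot total adds up the faster hot run of each query. This interpretation is an assumption checked against the reported totals, and `totals` is a helper written for this illustration. `Decimal` is used so the two-decimal values add up exactly.

```python
from decimal import Decimal

# Rows copied from the ClickBench report above (times in seconds).
CLICKBENCH = """
query1 0.03 0.04 0.03
query2 0.08 0.05 0.05
query3 0.23 0.05 0.06
query4 1.65 0.08 0.09
query5 0.55 0.55 0.54
query6 1.19 0.72 0.74
query7 0.02 0.01 0.02
query8 0.07 0.05 0.05
query9 0.56 0.50 0.51
query10 0.56 0.56 0.56
query11 0.17 0.12 0.13
query12 0.15 0.12 0.13
query13 0.61 0.61 0.59
query14 2.71 2.86 2.88
query15 0.92 0.86 0.83
query16 0.38 0.39 0.40
query17 1.04 1.04 1.07
query18 0.19 0.18 0.19
query19 1.94 1.88 1.97
query20 0.01 0.01 0.01
query21 15.41 0.98 0.65
query22 0.77 0.80 0.72
query23 14.96 1.50 0.68
query24 2.22 0.36 0.23
query25 0.15 0.09 0.08
query26 0.27 0.19 0.17
query27 0.08 0.08 0.08
query28 13.42 1.29 0.55
query29 12.66 4.11 3.41
query30 0.24 0.08 0.07
query31 2.85 0.62 0.40
query32 3.23 0.59 0.49
query33 3.01 3.04 3.05
query34 16.43 5.20 4.52
query35 4.64 4.68 4.58
query36 0.62 0.49 0.48
query37 0.20 0.16 0.16
query38 0.15 0.15 0.15
query39 0.05 0.05 0.04
query40 0.18 0.14 0.12
query41 0.09 0.05 0.05
query42 0.07 0.05 0.05
query43 0.05 0.04 0.05
"""


def totals(text):
    """Return (cold total, hot total), taking the faster hot run per query."""
    cold = Decimal(0)
    hot = Decimal(0)
    for line in text.strip().splitlines():
        _, first, second, third = line.split()
        cold += Decimal(first)
        hot += min(Decimal(second), Decimal(third))
    return cold, hot


cold_total, hot_total = totals(CLICKBENCH)
print(cold_total, hot_total)  # 104.81 31.37
```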

@mrhhsg mrhhsg force-pushed the spill_rebased branch 2 times, most recently from 9dce485 to ccc257c February 14, 2025 08:08
@mrhhsg
Member Author

mrhhsg commented Feb 14, 2025

run buildall

@mrhhsg
Member Author

mrhhsg commented Feb 14, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.25% (1061/1290)
Line Coverage: 65.79% (17575/26713)
Region Coverage: 65.34% (8662/13257)
Branch Coverage: 55.21% (4666/8452)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6613edc989fbdcfcdc3ba88dc6198c37980bf465_6613edc989fbdcfcdc3ba88dc6198c37980bf465_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 31551 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6613edc989fbdcfcdc3ba88dc6198c37980bf465, data reload: false

------ Round 1 ----------------------------------
q1	17623	5279	5148	5148
q2	2047	305	171	171
q3	10390	1283	749	749
q4	10209	1021	521	521
q5	7539	2319	2361	2319
q6	193	175	137	137
q7	903	756	613	613
q8	9340	1372	1151	1151
q9	4957	4651	4760	4651
q10	6828	2292	1875	1875
q11	462	279	262	262
q12	344	354	219	219
q13	17765	3713	3095	3095
q14	221	227	224	224
q15	523	471	457	457
q16	655	619	580	580
q17	597	874	341	341
q18	7003	6207	6080	6080
q19	1439	967	538	538
q20	311	332	194	194
q21	2825	2246	1926	1926
q22	370	334	300	300
Total cold run time: 102544 ms
Total hot run time: 31551 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5190	5131	5161	5131
q2	239	335	226	226
q3	2159	2721	2328	2328
q4	1483	1817	1363	1363
q5	4299	4157	4185	4157
q6	209	165	126	126
q7	1900	1841	1740	1740
q8	2631	2643	2549	2549
q9	7247	7187	7133	7133
q10	3009	3273	2801	2801
q11	585	539	485	485
q12	701	788	657	657
q13	3477	3930	3270	3270
q14	270	307	266	266
q15	511	476	446	446
q16	667	698	652	652
q17	1135	1590	1369	1369
q18	7509	7357	7403	7357
q19	807	868	939	868
q20	1984	2009	1873	1873
q21	5540	5064	4745	4745
q22	618	579	537	537
Total cold run time: 52170 ms
Total hot run time: 50079 ms

@doris-robot

TPC-DS: Total hot run time: 190786 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6613edc989fbdcfcdc3ba88dc6198c37980bf465, data reload: false

query1	1308	976	931	931
query2	6249	1885	1960	1885
query3	10964	4425	4594	4425
query4	56741	25630	23329	23329
query5	5212	496	477	477
query6	383	194	187	187
query7	5142	503	281	281
query8	331	242	236	236
query9	6586	2627	2620	2620
query10	451	313	266	266
query11	15100	15137	14891	14891
query12	156	107	102	102
query13	1214	521	408	408
query14	10124	6895	6266	6266
query15	221	200	177	177
query16	7137	624	470	470
query17	1088	742	580	580
query18	1601	422	322	322
query19	206	209	177	177
query20	126	130	126	126
query21	214	130	117	117
query22	4360	4633	4371	4371
query23	34241	33551	33297	33297
query24	5817	2466	2431	2431
query25	467	502	395	395
query26	697	279	152	152
query27	1821	481	337	337
query28	2896	2528	2490	2490
query29	585	565	462	462
query30	226	203	171	171
query31	901	900	790	790
query32	72	61	62	61
query33	485	370	295	295
query34	761	859	510	510
query35	804	817	751	751
query36	952	1010	906	906
query37	117	102	76	76
query38	4204	4387	4153	4153
query39	1478	1483	1450	1450
query40	228	121	105	105
query41	55	52	51	51
query42	123	112	99	99
query43	511	533	497	497
query44	1319	823	829	823
query45	178	175	170	170
query46	865	1066	655	655
query47	1816	1874	1833	1833
query48	388	419	309	309
query49	705	531	429	429
query50	726	760	420	420
query51	4261	4284	4245	4245
query52	108	105	97	97
query53	235	272	201	201
query54	481	509	427	427
query55	88	80	82	80
query56	314	273	269	269
query57	1174	1179	1165	1165
query58	248	244	249	244
query59	2959	3047	2852	2852
query60	281	284	272	272
query61	128	123	124	123
query62	745	743	713	713
query63	226	193	195	193
query64	1771	1065	699	699
query65	3355	3161	3149	3149
query66	782	395	298	298
query67	15884	15456	15246	15246
query68	7510	866	501	501
query69	536	288	264	264
query70	1224	1121	1083	1083
query71	489	300	260	260
query72	5770	3666	3782	3666
query73	1317	734	359	359
query74	8988	9099	8946	8946
query75	3705	3160	2698	2698
query76	4169	1167	730	730
query77	613	358	270	270
query78	10148	10086	9345	9345
query79	2949	816	582	582
query80	709	516	476	476
query81	522	280	244	244
query82	528	123	96	96
query83	312	171	211	171
query84	290	98	75	75
query85	801	349	312	312
query86	415	306	303	303
query87	4531	4463	4420	4420
query88	3793	2207	2184	2184
query89	404	317	288	288
query90	1822	194	191	191
query91	139	144	109	109
query92	69	58	56	56
query93	2113	1049	575	575
query94	662	401	303	303
query95	349	264	247	247
query96	476	557	268	268
query97	3329	3431	3259	3259
query98	229	209	204	204
query99	1442	1419	1266	1266
Total cold run time: 301745 ms
Total hot run time: 190786 ms

@doris-robot

ClickBench: Total hot run time: 31.46 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 6613edc989fbdcfcdc3ba88dc6198c37980bf465, data reload: false

query1	0.04	0.04	0.03
query2	0.10	0.05	0.05
query3	0.28	0.05	0.06
query4	1.61	0.07	0.08
query5	0.55	0.55	0.54
query6	1.20	0.73	0.74
query7	0.02	0.02	0.02
query8	0.05	0.05	0.05
query9	0.63	0.53	0.53
query10	0.58	0.58	0.58
query11	0.26	0.12	0.12
query12	0.24	0.12	0.13
query13	0.62	0.62	0.61
query14	2.72	2.80	2.81
query15	0.98	0.88	0.86
query16	0.37	0.37	0.39
query17	1.07	1.00	1.03
query18	0.18	0.17	0.19
query19	1.93	1.92	1.99
query20	0.02	0.01	0.02
query21	15.36	0.97	0.66
query22	0.92	1.04	0.75
query23	14.71	1.46	0.73
query24	7.49	0.86	0.38
query25	0.18	0.09	0.09
query26	0.62	0.23	0.18
query27	0.09	0.08	0.09
query28	11.07	1.13	0.54
query29	12.57	4.13	3.42
query30	0.27	0.08	0.06
query31	2.81	0.60	0.41
query32	3.22	0.58	0.50
query33	3.02	3.06	3.06
query34	16.56	5.08	4.40
query35	4.52	4.42	4.53
query36	0.63	0.51	0.50
query37	0.21	0.17	0.17
query38	0.17	0.16	0.15
query39	0.05	0.04	0.04
query40	0.20	0.15	0.15
query41	0.10	0.06	0.05
query42	0.06	0.04	0.05
query43	0.05	0.05	0.05
Total cold run time: 108.33 s
Total hot run time: 31.46 s

@mrhhsg
Member Author

mrhhsg commented Feb 15, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.25% (1061/1290)
Line Coverage: 65.73% (17579/26746)
Region Coverage: 65.27% (8662/13271)
Branch Coverage: 55.15% (4667/8462)
Coverage Report: http://coverage.selectdb-in.cc/coverage/85b6160dc75487d3d60304943d05388aa020513b_85b6160dc75487d3d60304943d05388aa020513b_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 31894 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 85b6160dc75487d3d60304943d05388aa020513b, data reload: false

------ Round 1 ----------------------------------
q1	17683	5208	5524	5208
q2	2055	310	168	168
q3	10802	1254	747	747
q4	10276	1013	539	539
q5	8177	2387	2361	2361
q6	189	173	133	133
q7	896	755	607	607
q8	9320	1305	1097	1097
q9	4888	4739	4866	4739
q10	6817	2322	1886	1886
q11	478	282	262	262
q12	362	389	218	218
q13	17777	3707	3090	3090
q14	247	242	212	212
q15	513	478	472	472
q16	629	605	585	585
q17	588	863	349	349
q18	6856	6162	6235	6162
q19	1896	965	552	552
q20	314	322	195	195
q21	2852	2227	1988	1988
q22	370	337	324	324
Total cold run time: 103985 ms
Total hot run time: 31894 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5183	5104	5129	5104
q2	236	335	230	230
q3	2150	2666	2322	2322
q4	1482	1825	1371	1371
q5	4240	4143	4188	4143
q6	254	167	126	126
q7	1902	1877	1751	1751
q8	2617	2680	2586	2586
q9	7204	7146	7173	7146
q10	3033	3224	2797	2797
q11	572	511	495	495
q12	698	788	596	596
q13	3321	4008	3315	3315
q14	291	304	270	270
q15	509	469	450	450
q16	622	703	653	653
q17	1159	1635	1310	1310
q18	7653	7420	7355	7355
q19	800	779	836	779
q20	1984	2052	1866	1866
q21	5582	5059	4778	4778
q22	642	571	534	534
Total cold run time: 52134 ms
Total hot run time: 49977 ms

@doris-robot

TPC-DS: Total hot run time: 190712 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 85b6160dc75487d3d60304943d05388aa020513b, data reload: false

query1	1324	948	949	948
query2	6155	1907	1847	1847
query3	11134	4759	4628	4628
query4	54449	25816	23279	23279
query5	5161	554	469	469
query6	328	194	207	194
query7	4883	507	293	293
query8	319	254	228	228
query9	5801	2652	2645	2645
query10	434	311	246	246
query11	15537	15038	14910	14910
query12	162	111	110	110
query13	1045	511	395	395
query14	10523	6345	6785	6345
query15	208	204	188	188
query16	7103	689	499	499
query17	1093	718	587	587
query18	1577	429	326	326
query19	211	196	219	196
query20	121	126	121	121
query21	216	126	109	109
query22	4546	4408	4307	4307
query23	34005	33274	33393	33274
query24	5897	2455	2415	2415
query25	466	458	402	402
query26	728	277	153	153
query27	1840	491	331	331
query28	2778	2531	2492	2492
query29	570	559	427	427
query30	214	187	157	157
query31	888	867	816	816
query32	78	64	63	63
query33	450	372	303	303
query34	773	874	492	492
query35	821	819	731	731
query36	951	1004	902	902
query37	121	96	68	68
query38	4175	4303	4113	4113
query39	1516	1418	1466	1418
query40	227	116	110	110
query41	52	49	49	49
query42	121	106	107	106
query43	504	525	482	482
query44	1387	810	817	810
query45	179	178	173	173
query46	886	1078	672	672
query47	1847	1874	1805	1805
query48	406	426	315	315
query49	704	538	415	415
query50	721	769	437	437
query51	4369	4274	4286	4274
query52	103	102	94	94
query53	229	262	196	196
query54	492	487	420	420
query55	87	82	80	80
query56	275	281	267	267
query57	1183	1191	1122	1122
query58	253	239	242	239
query59	2764	2914	2662	2662
query60	291	269	272	269
query61	122	126	120	120
query62	744	720	689	689
query63	233	188	188	188
query64	2019	1085	663	663
query65	3283	3234	3214	3214
query66	724	397	298	298
query67	16149	15681	15460	15460
query68	7075	897	510	510
query69	531	293	270	270
query70	1191	1148	1122	1122
query71	495	294	260	260
query72	6099	3600	3737	3600
query73	1457	736	355	355
query74	8952	8979	9062	8979
query75	3782	3144	2697	2697
query76	4206	1182	741	741
query77	602	360	282	282
query78	10158	10133	9294	9294
query79	2241	832	584	584
query80	632	574	436	436
query81	503	279	237	237
query82	613	124	107	107
query83	171	177	148	148
query84	282	98	75	75
query85	757	350	300	300
query86	378	290	301	290
query87	4383	4403	4351	4351
query88	3635	2211	2199	2199
query89	405	318	287	287
query90	1890	192	191	191
query91	135	144	107	107
query92	69	60	62	60
query93	1574	1054	585	585
query94	646	384	297	297
query95	346	263	252	252
query96	476	557	268	268
query97	3329	3410	3336	3336
query98	224	214	204	204
query99	1455	1360	1232	1232
Total cold run time: 297647 ms
Total hot run time: 190712 ms

@doris-robot

ClickBench: Total hot run time: 31.48 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 85b6160dc75487d3d60304943d05388aa020513b, data reload: false

query1	0.04	0.03	0.03
query2	0.10	0.04	0.04
query3	0.28	0.06	0.06
query4	1.61	0.08	0.07
query5	0.55	0.57	0.56
query6	1.20	0.73	0.72
query7	0.02	0.01	0.02
query8	0.05	0.05	0.04
query9	0.62	0.52	0.52
query10	0.57	0.58	0.56
query11	0.25	0.12	0.13
query12	0.25	0.13	0.13
query13	0.63	0.62	0.61
query14	2.70	2.67	2.67
query15	1.00	0.87	0.88
query16	0.38	0.38	0.37
query17	1.03	1.05	1.05
query18	0.18	0.19	0.17
query19	1.90	1.78	2.02
query20	0.01	0.01	0.01
query21	15.35	0.97	0.65
query22	0.93	1.00	0.80
query23	14.70	1.50	0.72
query24	7.62	0.88	0.39
query25	0.17	0.10	0.09
query26	0.61	0.23	0.18
query27	0.08	0.08	0.08
query28	11.05	1.18	0.56
query29	12.60	4.07	3.46
query30	0.28	0.08	0.07
query31	2.81	0.60	0.41
query32	3.23	0.58	0.50
query33	2.99	3.07	3.07
query34	16.56	5.14	4.46
query35	4.51	4.51	4.50
query36	0.63	0.50	0.49
query37	0.21	0.18	0.16
query38	0.17	0.16	0.16
query39	0.05	0.04	0.05
query40	0.20	0.16	0.16
query41	0.10	0.05	0.05
query42	0.07	0.06	0.05
query43	0.05	0.04	0.05
Total cold run time: 108.34 s
Total hot run time: 31.48 s

@mrhhsg
Member Author

mrhhsg commented Feb 15, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.25% (1061/1290)
Line Coverage: 65.77% (17590/26746)
Region Coverage: 65.32% (8668/13271)
Branch Coverage: 55.25% (4675/8462)
Coverage Report: http://coverage.selectdb-in.cc/coverage/2b1cf1157d3b6e4b28be72dd5337d6ab6905237e_2b1cf1157d3b6e4b28be72dd5337d6ab6905237e_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 31885 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2b1cf1157d3b6e4b28be72dd5337d6ab6905237e, data reload: false

------ Round 1 ----------------------------------
q1	17575	5235	5180	5180
q2	2050	300	178	178
q3	10824	1240	781	781
q4	10351	1011	544	544
q5	9154	2370	2386	2370
q6	196	167	133	133
q7	888	759	610	610
q8	9321	1342	1125	1125
q9	4891	4829	4799	4799
q10	6832	2308	1886	1886
q11	467	294	263	263
q12	347	355	221	221
q13	17774	3706	3043	3043
q14	225	223	208	208
q15	522	474	465	465
q16	646	628	581	581
q17	590	849	347	347
q18	6746	6160	6118	6118
q19	1444	945	543	543
q20	320	343	192	192
q21	2926	2198	1986	1986
q22	371	339	312	312
Total cold run time: 104460 ms
Total hot run time: 31885 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5665	5170	5172	5170
q2	236	324	230	230
q3	2153	2675	2315	2315
q4	1456	1819	1403	1403
q5	4206	4101	4168	4101
q6	210	166	124	124
q7	1866	1918	1757	1757
q8	2615	2701	2597	2597
q9	7200	7216	7126	7126
q10	3052	3232	2766	2766
q11	588	507	488	488
q12	692	766	602	602
q13	3377	3931	3254	3254
q14	268	308	276	276
q15	515	467	469	467
q16	644	689	654	654
q17	1141	1584	1357	1357
q18	7514	7470	7456	7456
q19	825	806	875	806
q20	1996	2011	1869	1869
q21	5553	5097	4683	4683
q22	667	589	543	543
Total cold run time: 52439 ms
Total hot run time: 50044 ms

@doris-robot

TPC-DS: Total hot run time: 190515 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2b1cf1157d3b6e4b28be72dd5337d6ab6905237e, data reload: false

query1	1301	948	947	947
query2	6096	1860	1874	1860
query3	11144	4453	4536	4453
query4	55170	24632	22971	22971
query5	5291	513	491	491
query6	390	208	193	193
query7	5273	534	292	292
query8	335	264	238	238
query9	7272	2641	2635	2635
query10	458	314	261	261
query11	15359	15241	14904	14904
query12	153	107	106	106
query13	1257	522	392	392
query14	10699	6645	6558	6558
query15	196	190	183	183
query16	6972	692	475	475
query17	1062	701	566	566
query18	1525	409	310	310
query19	191	187	153	153
query20	123	131	120	120
query21	209	124	105	105
query22	4611	4547	4423	4423
query23	34054	33410	33462	33410
query24	5608	2461	2409	2409
query25	451	460	391	391
query26	649	283	161	161
query27	1747	510	346	346
query28	2765	2519	2459	2459
query29	613	570	451	451
query30	214	198	162	162
query31	891	880	797	797
query32	76	63	68	63
query33	464	355	316	316
query34	748	870	512	512
query35	814	837	763	763
query36	941	992	914	914
query37	119	108	75	75
query38	4238	4193	4124	4124
query39	1523	1474	1518	1474
query40	222	125	110	110
query41	57	60	56	56
query42	121	106	108	106
query43	516	517	496	496
query44	1290	814	808	808
query45	193	176	169	169
query46	912	1057	653	653
query47	1846	1886	1792	1792
query48	390	412	324	324
query49	697	531	447	447
query50	750	749	417	417
query51	4342	4322	4294	4294
query52	111	106	97	97
query53	241	257	191	191
query54	487	487	415	415
query55	78	80	81	80
query56	262	286	257	257
query57	1178	1184	1146	1146
query58	278	246	249	246
query59	2751	2923	2767	2767
query60	286	273	272	272
query61	120	119	121	119
query62	702	737	677	677
query63	235	193	187	187
query64	1402	1056	681	681
query65	3291	3236	3233	3233
query66	760	392	296	296
query67	15866	15536	15203	15203
query68	6662	878	502	502
query69	549	342	269	269
query70	1203	1094	1116	1094
query71	487	297	262	262
query72	5955	3637	3654	3637
query73	1181	741	354	354
query74	9199	9148	8971	8971
query75	3677	3158	2712	2712
query76	4246	1193	749	749
query77	615	355	284	284
query78	10035	9978	9241	9241
query79	3725	822	570	570
query80	699	537	446	446
query81	510	274	238	238
query82	658	126	95	95
query83	314	171	151	151
query84	291	95	72	72
query85	795	348	384	348
query86	368	297	278	278
query87	4502	4580	4218	4218
query88	2833	2248	2223	2223
query89	434	314	289	289
query90	1968	190	191	190
query91	135	136	105	105
query92	77	60	56	56
query93	2204	1051	578	578
query94	683	410	302	302
query95	341	269	257	257
query96	483	561	276	276
query97	3336	3414	3265	3265
query98	214	212	205	205
query99	1452	1411	1276	1276
Total cold run time: 300217 ms
Total hot run time: 190515 ms

@mrhhsg
Member Author

mrhhsg commented Feb 16, 2025

run buildall

@doris-robot

TPC-DS: Total hot run time: 184587 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ee4b33e6b5a433f1a8f3b9cdcc4b2c8898fe06fa, data reload: false

query1	999	381	397	381
query2	6574	1875	1847	1847
query3	6798	221	211	211
query4	25989	23657	23320	23320
query5	4323	670	497	497
query6	310	195	201	195
query7	4612	499	294	294
query8	308	235	219	219
query9	8614	2511	2489	2489
query10	466	300	248	248
query11	15701	15089	14957	14957
query12	152	108	103	103
query13	1647	515	378	378
query14	9462	6321	6609	6321
query15	214	203	176	176
query16	7152	617	497	497
query17	1179	717	575	575
query18	1961	409	302	302
query19	196	190	161	161
query20	119	120	120	120
query21	208	135	103	103
query22	4311	4253	4270	4253
query23	33991	33032	33070	33032
query24	7803	2391	2386	2386
query25	552	488	463	463
query26	1237	280	158	158
query27	2118	501	320	320
query28	3886	2439	2422	2422
query29	750	542	419	419
query30	236	190	160	160
query31	928	875	794	794
query32	73	68	65	65
query33	555	355	304	304
query34	776	855	514	514
query35	813	822	776	776
query36	946	977	878	878
query37	160	97	75	75
query38	4254	4108	4085	4085
query39	1449	1397	1385	1385
query40	212	113	104	104
query41	62	61	61	61
query42	123	109	97	97
query43	486	512	474	474
query44	1317	796	788	788
query45	181	168	162	162
query46	873	1035	633	633
query47	1756	1777	1698	1698
query48	388	421	304	304
query49	803	509	449	449
query50	678	714	439	439
query51	4179	4161	4161	4161
query52	110	106	94	94
query53	223	248	187	187
query54	495	472	427	427
query55	78	78	81	78
query56	259	256	269	256
query57	1132	1125	1065	1065
query58	260	228	256	228
query59	2617	2718	2458	2458
query60	278	322	262	262
query61	131	121	119	119
query62	792	727	666	666
query63	227	201	195	195
query64	4364	1006	673	673
query65	3187	3153	3121	3121
query66	1128	399	301	301
query67	15914	15632	15351	15351
query68	8613	877	516	516
query69	460	293	259	259
query70	1207	1133	1100	1100
query71	449	278	282	278
query72	5423	3640	3750	3640
query73	757	746	363	363
query74	9222	9244	8890	8890
query75	3744	3135	2686	2686
query76	3608	1175	737	737
query77	798	353	287	287
query78	9869	10260	9432	9432
query79	2337	835	610	610
query80	616	532	448	448
query81	485	283	254	254
query82	666	124	98	98
query83	174	174	166	166
query84	247	101	77	77
query85	783	395	308	308
query86	339	299	273	273
query87	4465	4447	4432	4432
query88	3236	2175	2152	2152
query89	405	325	292	292
query90	1931	195	194	194
query91	135	138	114	114
query92	75	63	66	63
query93	1108	1081	579	579
query94	658	469	298	298
query95	347	263	266	263
query96	486	550	262	262
query97	3297	3485	3304	3304
query98	222	206	203	203
query99	1449	1399	1275	1275
Total cold run time: 271686 ms
Total hot run time: 184587 ms

@doris-robot

ClickBench: Total hot run time: 31.4 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ee4b33e6b5a433f1a8f3b9cdcc4b2c8898fe06fa, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.03
query3	0.23	0.06	0.06
query4	1.63	0.10	0.10
query5	0.56	0.55	0.55
query6	1.20	0.73	0.72
query7	0.02	0.01	0.01
query8	0.05	0.04	0.03
query9	0.58	0.54	0.53
query10	0.56	0.58	0.59
query11	0.15	0.11	0.11
query12	0.15	0.11	0.11
query13	0.61	0.60	0.60
query14	2.78	2.71	2.80
query15	0.92	0.86	0.85
query16	0.38	0.38	0.37
query17	1.02	1.04	1.02
query18	0.22	0.20	0.20
query19	1.89	1.97	1.87
query20	0.01	0.02	0.01
query21	15.36	0.90	0.55
query22	0.78	1.17	0.72
query23	15.09	1.38	0.67
query24	7.29	1.45	1.06
query25	0.51	0.25	0.12
query26	0.50	0.16	0.15
query27	0.05	0.05	0.06
query28	9.22	0.88	0.45
query29	12.61	4.05	3.35
query30	0.27	0.09	0.07
query31	2.81	0.58	0.40
query32	3.25	0.55	0.47
query33	3.01	3.00	2.97
query34	15.72	5.15	4.56
query35	4.53	4.57	4.52
query36	0.67	0.50	0.49
query37	0.10	0.06	0.06
query38	0.05	0.04	0.03
query39	0.03	0.02	0.03
query40	0.16	0.14	0.13
query41	0.08	0.02	0.03
query42	0.03	0.02	0.03
query43	0.04	0.04	0.03
Total cold run time: 105.23 s
Total hot run time: 31.4 s

@jacktengg
Contributor

run performance

@doris-robot

TPC-H: Total hot run time: 31557 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ee4b33e6b5a433f1a8f3b9cdcc4b2c8898fe06fa, data reload: false

------ Round 1 ----------------------------------
q1	17584	5433	5159	5159
q2	2052	314	171	171
q3	10398	1254	728	728
q4	10219	1052	536	536
q5	7554	2466	2302	2302
q6	198	169	136	136
q7	925	754	605	605
q8	9316	1320	1170	1170
q9	5042	4701	4650	4650
q10	6820	2329	1868	1868
q11	482	281	277	277
q12	344	359	223	223
q13	17753	3692	3025	3025
q14	228	229	210	210
q15	511	462	461	461
q16	646	622	593	593
q17	576	862	346	346
q18	6439	6206	6149	6149
q19	889	947	533	533
q20	302	321	192	192
q21	2763	2159	1923	1923
q22	372	329	300	300
Total cold run time: 101413 ms
Total hot run time: 31557 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5182	5189	5133	5133
q2	236	339	234	234
q3	2146	2673	2278	2278
q4	1412	1823	1397	1397
q5	4234	4130	4153	4130
q6	208	160	125	125
q7	1849	1825	1669	1669
q8	2635	2661	2518	2518
q9	7391	7225	7243	7225
q10	3006	3152	2782	2782
q11	563	518	498	498
q12	687	762	593	593
q13	3511	3933	3186	3186
q14	278	302	281	281
q15	513	455	468	455
q16	627	669	654	654
q17	1132	1541	1366	1366
q18	7599	7473	7404	7404
q19	782	849	970	849
q20	2001	1991	1890	1890
q21	5345	4925	4855	4855
q22	609	580	541	541
Total cold run time: 51946 ms
Total hot run time: 50063 ms

@doris-robot

TPC-DS: Total hot run time: 184789 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ee4b33e6b5a433f1a8f3b9cdcc4b2c8898fe06fa, data reload: false

query1	995	404	374	374
query2	6540	1923	1842	1842
query3	6798	217	212	212
query4	26649	23852	23600	23600
query5	4917	693	491	491
query6	312	220	191	191
query7	4619	498	291	291
query8	304	252	243	243
query9	8638	2545	2528	2528
query10	506	304	271	271
query11	15329	16500	14984	14984
query12	176	111	108	108
query13	1663	511	402	402
query14	9521	7487	6258	6258
query15	203	193	181	181
query16	7550	618	475	475
query17	1551	751	570	570
query18	1966	401	311	311
query19	227	194	162	162
query20	120	120	119	119
query21	213	125	105	105
query22	4457	4534	4669	4534
query23	34331	33327	33063	33063
query24	7335	2370	2397	2370
query25	528	458	387	387
query26	1218	277	150	150
query27	2024	513	331	331
query28	3834	2435	2379	2379
query29	720	533	421	421
query30	239	182	151	151
query31	928	873	790	790
query32	76	60	62	60
query33	547	351	301	301
query34	764	848	496	496
query35	817	827	748	748
query36	941	976	863	863
query37	114	101	73	73
query38	4272	4266	4128	4128
query39	1468	1405	1372	1372
query40	211	113	103	103
query41	54	53	54	53
query42	125	101	99	99
query43	485	513	484	484
query44	1259	785	778	778
query45	179	170	166	166
query46	853	1036	638	638
query47	1761	1794	1729	1729
query48	394	405	312	312
query49	787	496	397	397
query50	657	733	430	430
query51	4178	4162	4101	4101
query52	103	104	91	91
query53	226	259	187	187
query54	489	491	420	420
query55	86	86	88	86
query56	273	274	249	249
query57	1156	1144	1101	1101
query58	252	235	242	235
query59	2560	2543	2437	2437
query60	289	298	262	262
query61	121	115	119	115
query62	805	743	697	697
query63	231	194	188	188
query64	4460	991	670	670
query65	3253	3130	3130	3130
query66	1058	410	307	307
query67	15821	15515	15360	15360
query68	8683	878	525	525
query69	467	297	261	261
query70	1190	1096	1092	1092
query71	480	280	280	280
query72	5257	3519	3657	3519
query73	764	719	345	345
query74	9257	8962	9025	8962
query75	3845	3161	2652	2652
query76	3737	1155	723	723
query77	795	367	287	287
query78	10079	10213	9306	9306
query79	2997	856	601	601
query80	661	520	462	462
query81	505	281	245	245
query82	721	126	92	92
query83	174	167	157	157
query84	239	100	78	78
query85	808	362	313	313
query86	395	299	332	299
query87	4372	4500	4450	4450
query88	3796	2188	2158	2158
query89	409	319	282	282
query90	1927	202	191	191
query91	140	136	110	110
query92	75	65	60	60
query93	1755	1074	585	585
query94	647	413	298	298
query95	358	262	258	258
query96	500	538	274	274
query97	3368	3407	3300	3300
query98	222	208	266	208
query99	1469	1420	1256	1256
Total cold run time: 275628 ms
Total hot run time: 184789 ms

@doris-robot

ClickBench: Total hot run time: 30.5 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ee4b33e6b5a433f1a8f3b9cdcc4b2c8898fe06fa, data reload: false

query1	0.03	0.04	0.03
query2	0.07	0.03	0.03
query3	0.24	0.08	0.08
query4	1.59	0.10	0.10
query5	0.57	0.57	0.55
query6	1.20	0.71	0.71
query7	0.02	0.02	0.01
query8	0.04	0.03	0.04
query9	0.59	0.54	0.54
query10	0.58	0.57	0.59
query11	0.15	0.11	0.11
query12	0.15	0.12	0.12
query13	0.63	0.60	0.60
query14	2.67	2.79	2.79
query15	0.94	0.85	0.85
query16	0.37	0.39	0.39
query17	1.00	1.00	1.01
query18	0.22	0.20	0.20
query19	2.10	1.79	1.79
query20	0.01	0.01	0.01
query21	15.38	0.90	0.54
query22	0.76	1.22	0.62
query23	14.96	1.40	0.64
query24	6.94	1.95	0.34
query25	0.39	0.34	0.10
query26	0.69	0.15	0.13
query27	0.06	0.05	0.06
query28	9.53	0.92	0.43
query29	12.57	3.99	3.35
query30	0.25	0.09	0.07
query31	2.84	0.59	0.39
query32	3.22	0.55	0.47
query33	2.97	3.02	3.00
query34	15.71	5.20	4.54
query35	4.53	4.52	4.58
query36	0.66	0.49	0.49
query37	0.09	0.06	0.06
query38	0.06	0.04	0.04
query39	0.03	0.03	0.02
query40	0.17	0.13	0.13
query41	0.08	0.02	0.03
query42	0.03	0.02	0.02
query43	0.04	0.02	0.02
Total cold run time: 105.13 s
Total hot run time: 30.5 s

@@ -209,6 +250,91 @@ void WorkloadGroup::add_mem_tracker_limiter(std::shared_ptr<MemTrackerLimiter> m
_mem_tracker_limiter_pool[group_num].trackers.end(), mem_tracker_ptr);
}

int64_t WorkloadGroup::free_overcommited_memory(int64_t need_free_mem, RuntimeProfile* profile) {
Contributor

TODO: merge WorkloadGroup::free_overcommited_memory and WorkloadGroup::gc_memory into a single method.

auto& queries_list = it->second;
const auto& wg = it->first;

LOG_EVERY_T(INFO, 120) << "Paused queries count: " << queries_list.size();
Contributor

This log line is not very useful, since it does not even print the wg id.

continue;
}
if (is_low_watermark) {
query_ctx->set_low_memory_mode();
Contributor

Why are all queries put into low memory mode only when the wg has a slot policy set and is_low_watermark is true?

<< " enabled hard limit, but the slot count < 1, could not take affect";
} else {
// If the query enable hard limit, then not use weighted info any more, just use the settings limit.
query_weighted_mem_limit = (int64_t)((wg_high_water_mark_except_load *
Contributor

In that case, if the wg's slot_memory_policy() == TWgSlotMemoryPolicy::FIXED, the query_weighted_mem_limit values of all queries in this wg will add up to more than wg_high_water_mark_except_load.

// If there are some query in paused list, then limit should take effect.
expected_query_weighted_mem_limit =
total_used_slot_count > 0
? (int64_t)((wg_high_water_mark_except_load + total_used_slot_count) *
Contributor

Why is this wg_high_water_mark_except_load + total_used_slot_count? The two variables have different units: one is in bytes, the other is a slot count.
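
One dimensionally consistent reading of the review comment above is that the group's byte budget should be scaled by a slot ratio rather than added to a slot count. The sketch below is only an assumption about the intent, not the code in this PR; `per_query_limit_bytes` and its parameters are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Doris: splits a workload group's byte
// budget across queries in proportion to the slots each query holds.
// bytes * (slots / slots) stays in bytes, whereas bytes + slots mixes units.
int64_t per_query_limit_bytes(int64_t wg_budget_bytes, int64_t query_slots,
                              int64_t total_used_slots) {
    if (total_used_slots <= 0 || query_slots <= 0) {
        // No slot information: fall back to the whole group budget.
        return wg_budget_bytes;
    }
    // Compute the ratio in floating point so small slot counts do not truncate to 0.
    const double ratio = static_cast<double>(query_slots) /
                         static_cast<double>(total_used_slots);
    return static_cast<int64_t>(static_cast<double>(wg_budget_bytes) * ratio);
}
```

With a 1000-byte budget and one of four used slots, this helper yields 250 bytes, and the result keeps the unit of the budget.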

}
}
// calculate per query weighted memory limit
debug_msg = "Query Memory Summary: \n";
Contributor

This should be debug_msg +=. debug_msg was initialized with the wg information above, and this assignment overwrites it.
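
The difference the comment above points at can be shown in isolation. This is a minimal sketch with a hypothetical `build_summary` function standing in for the code under review; only the `"Query Memory Summary: \n"` literal comes from the diff.

```cpp
#include <string>

// Hypothetical stand-in for the summary-building code under review.
std::string build_summary(const std::string& wg_info, const std::string& query_info) {
    std::string debug_msg = wg_info;          // group-level header written first
    debug_msg += "Query Memory Summary: \n";  // '+=' appends; '=' here would discard wg_info
    debug_msg += query_info;
    return debug_msg;
}
```

Replacing the first `+=` with `=` would drop the group header from the returned string, which is the behavior the reviewer flags.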

// memory failed and we did not hanle it.
if (!query_ctx->is_pure_load_task()) {
query_ctx->set_mem_limit(query_weighted_mem_limit);
query_ctx->set_adjusted_mem_limit(expected_query_weighted_mem_limit);
Contributor

QueryContext does not explain the difference between _adjusted_mem_limit and mem_limit. I suggest adding a comment; otherwise a reader has to understand this whole function to know what they mean.

You could also consider adding a method to QueryContext, for example:

void effect_adjusted_mem_limit() {
     set_mem_limit(adjusted_mem_limit());
}

That would make the meaning of the two easier to understand.
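
The helper suggested in the comment above could look roughly like the following. `QueryContextSketch` is a simplified, hypothetical stand-in for the real `QueryContext`, and the comments describing the two limits reflect one reading of this thread, not documented semantics.

```cpp
#include <cstdint>

// Simplified, hypothetical stand-in for QueryContext; the real class has many more members.
class QueryContextSketch {
public:
    // Limit currently enforced when the query allocates or reserves memory.
    void set_mem_limit(int64_t bytes) { _mem_limit = bytes; }
    int64_t mem_limit() const { return _mem_limit; }

    // Limit the workload group manager would like to apply (assumed meaning).
    // It is only recorded here and has no effect until copied into _mem_limit.
    void set_adjusted_mem_limit(int64_t bytes) { _adjusted_mem_limit = bytes; }
    int64_t adjusted_mem_limit() const { return _adjusted_mem_limit; }

    // Helper suggested in the review: make the adjusted limit take effect.
    void effect_adjusted_mem_limit() { set_mem_limit(adjusted_mem_limit()); }

private:
    int64_t _mem_limit = 0;
    int64_t _adjusted_mem_limit = 0;
};
```

A caller would set both limits during the periodic refresh and call `effect_adjusted_mem_limit()` only at the point where the tighter limit should start to apply.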

return true;
} else if (time_in_queue >= config::spill_in_paused_queue_timeout_ms) {
// if cannot find any memory to release, then let the query continue to run as far as possible
// or cancelled by gc if memory is really not enough.
Contributor

In practice, once the query continues running with disable_reserve_memory set, it will most likely throw an exception and terminate when the Allocator checks the query memory tracker limit. It will only be cancelled by gc when process memory exceeds its limit.
This is only a comment issue, so it is fine to leave it unchanged.

}
} else if (query_ctx->paused_reason().is<ErrorCode::WORKLOAD_GROUP_MEMORY_EXCEEDED>()) {
// Only deal with non overcommit workload group.
if (wg->enable_memory_overcommit()) {
Contributor

Queries in a wg that allows overcommit should not be put into the paused list because of WORKLOAD_GROUP_MEMORY_EXCEEDED in the first place, should they?

Here the paused query is resumed after set_memory_sufficient. Will it be put into the paused list again because of WORKLOAD_GROUP_MEMORY_EXCEEDED?

Another question: can a query enter the paused list multiple times? If so, that could lead to frequent pausing and resuming.

// should set the query's limit only.
// Check the query's reserve with expected limit.
if (query_ctx->adjusted_mem_limit() <
query_ctx->get_mem_tracker()->consumption() + query_it->reserve_size_) {
Contributor

After a query is paused because of WORKLOAD_GROUP_MEMORY_EXCEEDED and the wg does not allow overcommit, if adjusted_mem_limit > query memory used + reserve_size at this point, why modify the query mem limit and then let the query's pipeline tasks continue running?

After resuming, the query will only enter the paused list again or call set_low_memory_mode. Shouldn't handle_single_query_ be called to spill or cancel instead?

Contributor

I see now: after the query mem limit is modified, the query is expected to be paused again with QUERY_MEMORY_EXCEEDED as the reason, and handle_single_query_ will then be called to spill.


// Return the expected free bytes if memtable could flush
int64_t WorkloadGroupMgr::flush_memtable_from_current_group_(WorkloadGroupPtr wg,
int64_t need_free_mem) {
Contributor

The parameter need_free_mem is not used.

continue;
}
if (!has_changed_hard_limit) {
update_queries_limit_(wg, true);
Contributor

Is this update_queries_limit_ call necessary? refresh_wg_weighted_memory_limit is called every 20 ms and already performs the update.

Even if an update is needed here, it should be moved above the following code:

if (query_ctx->adjusted_mem_limit() <
                    query_ctx->get_mem_tracker()->consumption() + query_it->reserve_size_) {

query_ctx->get_mem_tracker()->consumption())
<< ", wg: " << wg->debug_string();
}
if (wg->slot_memory_policy() == TWgSlotMemoryPolicy::NONE) {
Contributor

When multiple queries are paused because of WORKLOAD_GROUP_MEMORY_EXCEEDED and slot_memory_policy equals None, the logic seems unreasonable.

When slot_memory_policy is not None:
On the first handle_paused_queries call, queries with memory used + reserve > adjusted_mem_limit have their adjusted_mem_limit applied as mem_limit and continue running; the other queries keep waiting.
On the second handle_paused_queries call, the queries over their memory limit are expected to be paused with QUERY_MEMORY_EXCEEDED and will then spill or be cancelled; once memory is released, the other queries continue running.

When slot_memory_policy equals None:
On the first handle_paused_queries call, queries with memory used + reserve > adjusted_mem_limit likewise have their adjusted_mem_limit applied as mem_limit and continue running. The other queries go straight to handle_single_query_ to spill or be cancelled.

The unreasonable part is that, when slot_memory_policy equals None, queries over their memory limit continue running (although they are expected to be paused again with QUERY_MEMORY_EXCEEDED), while queries that are not over their limit are spilled directly.

@jacktengg
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 31797 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fda7c1ab32dc1efb6fd20744a388f5256d24d4d7, data reload: false

------ Round 1 ----------------------------------
q1	17723	5409	5205	5205
q2	2054	305	164	164
q3	10756	1336	722	722
q4	10288	1022	536	536
q5	8193	2526	2325	2325
q6	188	164	131	131
q7	936	762	611	611
q8	9307	1289	1117	1117
q9	5066	4570	4869	4570
q10	6806	2312	1880	1880
q11	473	284	262	262
q12	346	358	222	222
q13	17789	3766	3109	3109
q14	243	220	204	204
q15	509	446	464	446
q16	646	644	605	605
q17	569	884	346	346
q18	6735	6216	6254	6216
q19	1618	960	559	559
q20	321	318	194	194
q21	2915	2222	2062	2062
q22	374	346	311	311
Total cold run time: 103855 ms
Total hot run time: 31797 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5292	5225	5194	5194
q2	238	328	227	227
q3	2127	2725	2277	2277
q4	1465	1802	1407	1407
q5	4304	4118	4175	4118
q6	209	161	124	124
q7	1863	1924	1793	1793
q8	2659	2572	2690	2572
q9	7222	7061	7105	7061
q10	3009	3231	2784	2784
q11	574	514	486	486
q12	684	821	584	584
q13	3555	3958	3230	3230
q14	289	283	261	261
q15	517	475	478	475
q16	649	654	665	654
q17	1160	1606	1350	1350
q18	7567	7355	7092	7092
q19	827	874	1007	874
q20	2015	2039	1900	1900
q21	5521	5091	4958	4958
q22	657	600	553	553
Total cold run time: 52403 ms
Total hot run time: 49974 ms

@doris-robot

TPC-DS: Total hot run time: 184125 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fda7c1ab32dc1efb6fd20744a388f5256d24d4d7, data reload: false

query1	976	390	379	379
query2	6550	1937	1853	1853
query3	6790	218	229	218
query4	26259	24224	23112	23112
query5	5367	681	500	500
query6	320	218	172	172
query7	4617	505	293	293
query8	297	235	225	225
query9	8652	2515	2529	2515
query10	495	317	258	258
query11	15857	15369	14913	14913
query12	164	113	105	105
query13	1651	521	387	387
query14	9329	6945	6396	6396
query15	207	184	176	176
query16	7238	646	482	482
query17	1164	717	570	570
query18	1953	410	314	314
query19	205	195	163	163
query20	122	117	111	111
query21	216	127	108	108
query22	4028	4413	4494	4413
query23	34632	33426	32837	32837
query24	7655	2402	2441	2402
query25	537	499	409	409
query26	1233	282	159	159
query27	2448	490	333	333
query28	4185	2415	2385	2385
query29	766	638	407	407
query30	242	183	161	161
query31	948	861	789	789
query32	81	69	62	62
query33	550	359	318	318
query34	817	844	507	507
query35	795	802	766	766
query36	950	991	883	883
query37	126	104	122	104
query38	4071	4233	4044	4044
query39	1443	1410	1431	1410
query40	211	117	103	103
query41	53	50	50	50
query42	116	106	103	103
query43	502	528	477	477
query44	1316	786	785	785
query45	180	177	165	165
query46	885	1061	661	661
query47	1756	1772	1697	1697
query48	385	407	306	306
query49	751	514	436	436
query50	705	728	434	434
query51	4186	4173	4138	4138
query52	108	103	94	94
query53	229	257	188	188
query54	486	494	404	404
query55	81	81	80	80
query56	273	273	262	262
query57	1122	1119	1064	1064
query58	260	248	239	239
query59	2593	2723	2439	2439
query60	288	282	254	254
query61	120	148	118	118
query62	800	723	686	686
query63	240	197	198	197
query64	4324	974	648	648
query65	3239	3183	3164	3164
query66	1086	393	304	304
query67	15763	15503	15378	15378
query68	8650	897	519	519
query69	495	305	277	277
query70	1223	1153	1121	1121
query71	471	301	263	263
query72	5432	3685	3858	3685
query73	805	755	345	345
query74	8978	8903	8818	8818
query75	4023	3184	2717	2717
query76	3758	1178	747	747
query77	791	406	286	286
query78	10074	10322	9342	9342
query79	2614	829	599	599
query80	622	526	465	465
query81	510	285	241	241
query82	727	125	99	99
query83	176	172	155	155
query84	241	96	76	76
query85	755	360	302	302
query86	386	312	286	286
query87	4513	4573	4334	4334
query88	3634	2174	2159	2159
query89	386	314	290	290
query90	1943	196	192	192
query91	137	141	113	113
query92	70	65	53	53
query93	1690	1065	585	585
query94	654	399	299	299
query95	357	266	262	262
query96	486	555	267	267
query97	3330	3381	3252	3252
query98	228	209	201	201
query99	1604	1402	1262	1262
Total cold run time: 275757 ms
Total hot run time: 184125 ms

@doris-robot

ClickBench: Total hot run time: 30.72 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit fda7c1ab32dc1efb6fd20744a388f5256d24d4d7, data reload: false

query1	0.04	0.03	0.04
query2	0.07	0.04	0.03
query3	0.23	0.07	0.07
query4	1.61	0.10	0.10
query5	0.56	0.55	0.56
query6	1.19	0.73	0.71
query7	0.02	0.02	0.02
query8	0.04	0.03	0.04
query9	0.58	0.55	0.52
query10	0.58	0.59	0.58
query11	0.16	0.11	0.11
query12	0.14	0.11	0.11
query13	0.61	0.60	0.59
query14	2.69	2.83	2.70
query15	0.92	0.85	0.85
query16	0.38	0.38	0.38
query17	1.06	1.07	1.05
query18	0.22	0.20	0.19
query19	1.94	1.76	2.01
query20	0.01	0.02	0.01
query21	15.36	0.91	0.56
query22	0.76	1.23	0.75
query23	14.81	1.38	0.64
query24	7.05	2.45	0.63
query25	0.55	0.07	0.30
query26	0.59	0.16	0.14
query27	0.05	0.05	0.05
query28	9.79	0.84	0.44
query29	12.52	3.93	3.29
query30	0.25	0.09	0.07
query31	2.82	0.60	0.39
query32	3.23	0.55	0.46
query33	2.97	3.01	3.02
query34	15.78	5.16	4.51
query35	4.53	4.53	4.57
query36	0.66	0.50	0.48
query37	0.10	0.06	0.06
query38	0.04	0.04	0.04
query39	0.03	0.02	0.03
query40	0.16	0.13	0.12
query41	0.07	0.03	0.02
query42	0.03	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 105.23 s
Total hot run time: 30.72 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 44.62% (11895/26660)
Line Coverage: 34.10% (99550/291909)
Region Coverage: 33.24% (50913/153145)
Branch Coverage: 28.81% (25568/88736)
Coverage Report: http://coverage.selectdb-in.cc/coverage/fda7c1ab32dc1efb6fd20744a388f5256d24d4d7_fda7c1ab32dc1efb6fd20744a388f5256d24d4d7/report/index.html

}
}

if (has_query_exceed_process_memlimit) {
Contributor

This should be if (!has_query_exceed_process_memlimit), i.e. the condition should be negated, right?

0.05 &&
doris::GlobalMemoryArbitrator::last_wg_trigger_cache_capacity_adjust_weighted >
0.05) {
doris::GlobalMemoryArbitrator::last_wg_trigger_cache_capacity_adjust_weighted =
Contributor

TODO: adjusting the cache capacity here is not ideal; later, see whether there is a better way to implement it.

// need to check config::disable_memory_gc here, if not, when config::disable_memory_gc == true,
// cache is not adjusted, query_it->cache_ratio_ will always be 1, and this if branch will nenver
// execute, this query will never be resumed, and will deadlock here
if ((!config::disable_memory_gc && query_it->cache_ratio_ < 0.05) ||
Contributor

In practice, releasing cache takes time: fully releasing a page cache of tens of GB takes close to 1 s.

The intended meaning of cache_ratio_ < 0.05 here is "all caches have now been released; if memory cannot be freed elsewhere, spill", but in reality the caches may not have finished releasing.

TODO: add a method to CacheManager that returns the usage ratio of all caches; if it is below some threshold, treat that as "all caches have been released, so there is no need to wait for them to free memory".
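
The TODO above could be sketched as follows. Everything here is hypothetical: `CacheUsage`, `total_cache_usage_ratio`, and `caches_released` are illustrative names, not existing Doris APIs, and the 0.05 default only mirrors the threshold used in the diff.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-cache snapshot; Doris's real cache classes are not modeled here.
struct CacheUsage {
    int64_t used_bytes;
    int64_t capacity_bytes;
};

// Returns total used bytes divided by total capacity across all caches.
double total_cache_usage_ratio(const std::vector<CacheUsage>& caches) {
    int64_t used = 0;
    int64_t capacity = 0;
    for (const auto& c : caches) {
        used += c.used_bytes;
        capacity += c.capacity_bytes;
    }
    if (capacity <= 0) {
        return 0.0;  // no cache configured: nothing left to release
    }
    return static_cast<double>(used) / static_cast<double>(capacity);
}

// True when caches are effectively drained, so a paused query need not wait for them.
bool caches_released(const std::vector<CacheUsage>& caches, double threshold = 0.05) {
    return total_cache_usage_ratio(caches) < threshold;
}
```

Checking measured usage rather than the requested capacity ratio would address the reviewer's concern that a capacity adjustment can be issued well before the memory is actually freed.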


// 1. Sort all memory limiter in all overcommit wg, and cancel the top usage task that with most memory.
// 2. Maybe not valid because it's memory not exceed limit.
int64_t WorkloadGroupMgr::cancel_top_query_in_overcommit_group_(int64_t need_free_mem,
Contributor

TODO: I will implement this later, but as mentioned above, where revoking from overcommitted wgs should be triggered is still open to discussion.

}
if (hard_limit) {
freed_mem = cancel_top_query_in_overcommit_group_(need_free_mem - total_freed_mem,
doris::QUERY_MIN_MEMORY, profile.get());
Contributor

Why check hard_limit here and then set a lower_bound for the cancel?

return 0;
}

int64_t WorkloadGroupMgr::revoke_memory_from_other_group_(std::shared_ptr<QueryContext> requestor,
Contributor

So once a wg that supports overcommit has overcommitted its memory, the only path is that, after process memory exceeds the soft limit, queries from other wgs release this overcommitted wg's memory, right?

This call relationship looks reasonable, but could it lead to frequent calls, for example 100 paused queries all trying to revoke from other wgs?
For a requirement like this, my first instinct would be to revoke from overcommitted wgs in the GC thread according to some policy.
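
A centralized policy of the kind mentioned above might, for instance, pick groups by how far they exceed their limit, so that one pass covers the shortfall instead of every paused query issuing its own revoke. The sketch below is purely illustrative; `GroupUsage` and `plan_revoke` are hypothetical and do not correspond to code in this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical snapshot of one workload group's memory state.
struct GroupUsage {
    int64_t id;
    int64_t used_bytes;
    int64_t limit_bytes;
    // Bytes used beyond the group's limit; 0 if the group is within its limit.
    int64_t excess() const { return std::max<int64_t>(0, used_bytes - limit_bytes); }
};

// Returns the ids of groups to revoke from, largest excess first, stopping
// once the planned excess covers need_free_bytes or no overcommitted group remains.
std::vector<int64_t> plan_revoke(std::vector<GroupUsage> groups, int64_t need_free_bytes) {
    std::sort(groups.begin(), groups.end(),
              [](const GroupUsage& a, const GroupUsage& b) { return a.excess() > b.excess(); });
    std::vector<int64_t> plan;
    int64_t planned = 0;
    for (const auto& g : groups) {
        if (planned >= need_free_bytes || g.excess() == 0) {
            break;
        }
        plan.push_back(g.id);
        planned += g.excess();
    }
    return plan;
}
```

Running such a plan from a single background thread would make the revoke frequency independent of the number of paused queries, at the cost of some latency before a paused query sees freed memory.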

continue;
}
}
} else {
Contributor

Suppose query A has already revoked from other overcommitted wgs and set has_revoked_from_other_group to true.

Letting the other queries simply continue here is unreasonable, because query A's reserve size may be small, so only a small piece of memory was revoked from the other overcommitted wgs.

The semantics of letting the other queries continue is "all memory that can be released has been released, so just run everything regardless", but there may still be overcommitted wgs that have not released all of their overcommitted memory.

VLOG_DEBUG << "Query: " << print_id(query_ctx->query_id())
<< " is resumed after revoke memory from other group.";
query_it = queries_list.erase(query_it);
// Do not care if the revoked_size > reserve size, and try to run again.
Contributor

Here the query continues running regardless of whether revoked_size is larger than the reserve size. Is that really fine?

Because adjusted_mem_limit has not been made effective, the query will not be paused with QUERY_MEMORY_EXCEEDED after it resumes.

When the query is paused again, it will most likely still be because process memory is exceeded, so it will keep revoking from overcommitted wgs until revoke = 0; only when no overcommitted wg remains will the current query spill. This looks reasonable, but I suspect there are pitfalls in production, for example queries appearing to stutter.

}
if (doris::GlobalMemoryArbitrator::last_affected_cache_capacity_adjust_weighted <
0.05 &&
query_it->cache_ratio_ > 0.05) {
Contributor

The issue here is the same as in the analysis of the meaning of cache_ratio_ < 0.05 above.

TODO: add a method to CacheManager that returns the usage ratio of all caches; if it is below some threshold, treat that as "all caches have been released, so there is no need to wait for them to free memory".

@jacktengg
Contributor

run cloudut

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.25% (1061/1290)
Line Coverage: 65.73% (17580/26746)
Region Coverage: 65.27% (8662/13271)
Branch Coverage: 55.14% (4666/8462)
Coverage Report: http://coverage.selectdb-in.cc/coverage/fda7c1ab32dc1efb6fd20744a388f5256d24d4d7_fda7c1ab32dc1efb6fd20744a388f5256d24d4d7_cloud/report/index.html

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 24, 2025
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

Contributor

@yiguolei yiguolei left a comment

LGTM

Contributor

@dataroaring dataroaring left a comment

LGTM

@yiguolei yiguolei merged commit 34756a4 into apache:master Feb 25, 2025
26 of 30 checks passed
mrhhsg added a commit to mrhhsg/doris that referenced this pull request Feb 25, 2025
And also fix the WorkloadGroupTest::testCreateNormal

After the spill and reserve code (apache#47462) was merged, the properties of `WorkloadGroup` changed.
hello-stephen pushed a commit that referenced this pull request Feb 25, 2025
…48300)

### What problem does this PR solve?

And also fix the WorkloadGroupTest::testCreateNormal
    
After the spill and reserve code (#47462) was merged, the properties of
`WorkloadGroup` changed.
Gabriel39 pushed a commit that referenced this pull request Feb 26, 2025
…k_ptr" thrown" in DataStreamRecvrTest (#48319)

There are conflicts between the following two PRs. 
One PR uses unique_ptr to construct query_ctx, while the other PR uses
shared_from_this

#48188
#47462
Labels
approved Indicates a PR has been approved by one committer. reviewed

9 participants