Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[enhancement](cloud) reconnect after the RPC request to the meta service fails #45668

Merged
merged 8 commits into from
Jan 2, 2025

Conversation

luwei16
Copy link
Contributor

@luwei16 luwei16 commented Dec 19, 2024

No description provided.

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@luwei16
Copy link
Contributor Author

luwei16 commented Dec 19, 2024

run buildall

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.87% (10132/26064)
Line Coverage: 29.79% (85203/285983)
Region Coverage: 28.90% (43455/150384)
Branch Coverage: 25.43% (22154/87134)
Coverage Report: http://coverage.selectdb-in.cc/coverage/efb1c77428e77bb213addbaba4bc017ce6b707bf_efb1c77428e77bb213addbaba4bc017ce6b707bf/report/index.html

}

private:
static Status get_pooled_client(std::shared_ptr<MetaService_Stub>* stub) {
static Status get_pooled_client(std::shared_ptr<MetaService_Stub>* stub,
MetaServiceProxy** proxy) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try not to use double stars. is it possible to use a pointer to an unique_ptr?

@@ -447,6 +481,7 @@ Status CloudMetaMgr::sync_tablet_rowsets(CloudTablet* tablet, bool warmup_delta_
.tag("partition_id", tablet->partition_id())
.tag("tried", tried)
.tag("sleep", duration_ms);
proxy->set_unhealthy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not set_unhealthy() right after we get if (cntl.Failed())? it should not rely on retry_times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not set_unhealthy() right after we get if (cntl.Failed())? it should not rely on retry_times

done

return Status::RpcError("failed to get delete bitmap: {}", cntl.ErrorText());

int retry_times = 0;
brpc::Controller cntl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put it into the while loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start and end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

std::shared_ptr<MetaService_Stub> stub;
RETURN_IF_ERROR(MetaServiceProxy::get_client(&stub));
MetaServiceProxy* proxy;
RETURN_IF_ERROR(MetaServiceProxy::get_proxy(&proxy));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use a new proxy every time we retry?

}

private:
static Status get_pooled_client(std::shared_ptr<MetaService_Stub>* stub) {
static Status get_pooled_client(std::shared_ptr<MetaService_Stub>* stub,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment for this methods including behavior and params

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

static Status get_proxy(MetaServiceProxy** proxy) {
std::shared_ptr<MetaService_Stub> stub;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment for this function, and the stub is a placeholder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -3307,4 +3307,6 @@ public static int metaServiceRpcRetryTimes() {
"For disabling certain SQL queries, the configuration item is a list of simple class names of AST"
+ "(for example CreateRepositoryStmt, CreatePolicyCommand), separated by commas."})
public static String block_sql_ast_names = "";

public static long ms_rpc_reconn_interval_ms = 20000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use full name meta_service_rpc_reconnect_interval_ms and add comment for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -111,5 +111,9 @@ DECLARE_mBool(enable_use_cloud_unique_id_from_fe);

DECLARE_Bool(enable_cloud_tablet_report);

DECLARE_mInt32(delete_bitmap_rpc_retry_times);

DECLARE_mInt64(ms_rpc_reconn_interval_ms);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use full name meta_service_rpc_reconnect_interval_ms and add comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@luwei16
Copy link
Contributor Author

luwei16 commented Dec 31, 2024

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 32519 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit adf07dd8043524e5a154d8508d06e4e0517b633b, data reload: false

------ Round 1 ----------------------------------
q1	17680	6123	6011	6011
q2	2048	317	184	184
q3	10529	1225	706	706
q4	10203	873	432	432
q5	7534	2171	1998	1998
q6	202	182	153	153
q7	915	752	602	602
q8	9242	1359	1115	1115
q9	5145	4968	4925	4925
q10	6777	2307	1857	1857
q11	485	282	255	255
q12	344	359	220	220
q13	17790	3616	2966	2966
q14	247	237	217	217
q15	565	501	494	494
q16	652	636	595	595
q17	561	835	325	325
q18	6993	6487	6444	6444
q19	2042	975	537	537
q20	301	319	184	184
q21	2855	2158	1991	1991
q22	363	341	308	308
Total cold run time: 103473 ms
Total hot run time: 32519 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6260	6195	6177	6177
q2	232	324	239	239
q3	2267	2640	2365	2365
q4	1464	1822	1406	1406
q5	4365	4775	4765	4765
q6	187	177	141	141
q7	2083	2018	1845	1845
q8	2633	2837	2718	2718
q9	7375	7378	7381	7378
q10	3063	3358	2751	2751
q11	583	536	510	510
q12	696	822	606	606
q13	3348	3750	3046	3046
q14	286	318	280	280
q15	558	515	492	492
q16	654	683	654	654
q17	1259	1727	1253	1253
q18	7748	7400	7488	7400
q19	838	1133	1075	1075
q20	1982	2061	1924	1924
q21	5746	5221	4857	4857
q22	646	599	614	599
Total cold run time: 54273 ms
Total hot run time: 52481 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 195756 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit adf07dd8043524e5a154d8508d06e4e0517b633b, data reload: false

query1	1311	922	897	897
query2	6511	2362	2405	2362
query3	10949	4736	4616	4616
query4	33087	23462	23596	23462
query5	3827	596	468	468
query6	277	203	189	189
query7	3983	487	309	309
query8	309	245	243	243
query9	9358	2737	2707	2707
query10	468	316	243	243
query11	17858	15337	15134	15134
query12	156	111	107	107
query13	1581	541	412	412
query14	9974	6754	7811	6754
query15	223	213	204	204
query16	8005	667	471	471
query17	1610	788	601	601
query18	2074	446	364	364
query19	223	174	156	156
query20	129	121	117	117
query21	202	122	110	110
query22	4761	4642	4754	4642
query23	34956	33721	33258	33258
query24	6387	2269	2292	2269
query25	451	443	396	396
query26	698	247	152	152
query27	1944	474	341	341
query28	5510	2460	2459	2459
query29	561	503	437	437
query30	203	185	155	155
query31	971	955	841	841
query32	78	56	59	56
query33	483	354	291	291
query34	763	851	530	530
query35	813	847	764	764
query36	1052	1095	980	980
query37	116	105	75	75
query38	4109	4301	4144	4144
query39	1536	1501	1487	1487
query40	199	120	104	104
query41	53	49	45	45
query42	120	106	104	104
query43	543	539	495	495
query44	1290	820	832	820
query45	195	183	170	170
query46	894	1060	658	658
query47	2019	2027	1969	1969
query48	380	405	317	317
query49	701	488	394	394
query50	650	646	395	395
query51	7293	7264	7174	7174
query52	104	104	92	92
query53	231	256	181	181
query54	489	523	411	411
query55	82	85	77	77
query56	260	256	237	237
query57	1232	1225	1184	1184
query58	238	229	240	229
query59	3359	3266	3122	3122
query60	272	269	252	252
query61	115	106	144	106
query62	866	804	757	757
query63	231	190	184	184
query64	2915	1061	646	646
query65	3338	3247	3329	3247
query66	776	465	307	307
query67	16528	15752	15517	15517
query68	8363	738	522	522
query69	440	293	247	247
query70	1175	1172	1126	1126
query71	414	281	242	242
query72	6228	3880	3846	3846
query73	639	752	358	358
query74	10109	9222	8853	8853
query75	4363	3140	2677	2677
query76	4699	1183	749	749
query77	914	344	281	281
query78	10199	10408	9412	9412
query79	5294	891	577	577
query80	738	501	449	449
query81	487	325	236	236
query82	645	158	120	120
query83	192	160	151	151
query84	274	90	79	79
query85	732	377	317	317
query86	351	322	311	311
query87	4582	4431	4274	4274
query88	3706	2266	2222	2222
query89	434	340	296	296
query90	2101	190	192	190
query91	134	132	176	132
query92	64	57	51	51
query93	2857	879	523	523
query94	660	396	269	269
query95	345	263	242	242
query96	488	614	279	279
query97	2750	2792	2701	2701
query98	211	212	195	195
query99	1664	1532	1432	1432
Total cold run time: 299132 ms
Total hot run time: 195756 ms

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.89% (10122/26027)
Line Coverage: 29.89% (85540/286210)
Region Coverage: 29.01% (43723/150700)
Branch Coverage: 25.54% (22301/87316)
Coverage Report: http://coverage.selectdb-in.cc/coverage/adf07dd8043524e5a154d8508d06e4e0517b633b_adf07dd8043524e5a154d8508d06e4e0517b633b/report/index.html

@doris-robot
Copy link

ClickBench: Total hot run time: 31.37 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit adf07dd8043524e5a154d8508d06e4e0517b633b, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.03	0.03
query3	0.25	0.07	0.07
query4	1.61	0.10	0.10
query5	0.40	0.42	0.40
query6	1.16	0.66	0.64
query7	0.02	0.02	0.01
query8	0.04	0.03	0.04
query9	0.60	0.49	0.50
query10	0.54	0.55	0.54
query11	0.15	0.11	0.11
query12	0.14	0.12	0.11
query13	0.60	0.60	0.60
query14	2.75	2.75	2.89
query15	0.92	0.83	0.84
query16	0.39	0.39	0.37
query17	1.01	1.07	1.03
query18	0.21	0.21	0.21
query19	1.97	1.87	2.00
query20	0.01	0.01	0.01
query21	15.36	0.94	0.58
query22	0.76	0.85	0.74
query23	15.20	1.48	0.52
query24	2.93	0.99	2.25
query25	0.17	0.08	0.10
query26	0.30	0.14	0.13
query27	0.05	0.06	0.04
query28	14.38	1.52	1.05
query29	12.61	3.92	3.24
query30	0.25	0.09	0.07
query31	2.82	0.64	0.39
query32	3.22	0.54	0.46
query33	3.12	3.08	3.15
query34	16.92	5.07	4.52
query35	4.48	4.45	4.50
query36	0.67	0.49	0.48
query37	0.09	0.06	0.06
query38	0.04	0.04	0.03
query39	0.03	0.02	0.02
query40	0.17	0.14	0.13
query41	0.07	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.6 s
Total hot run time: 31.37 s

be/src/cloud/config.cpp Outdated Show resolved Hide resolved
be/src/cloud/config.h Outdated Show resolved Hide resolved
@gavinchou
Copy link
Contributor

run buildall

Copy link
Contributor

github-actions bot commented Jan 1, 2025

PR approved by at least one committer and no changes requested.

Copy link
Contributor

github-actions bot commented Jan 1, 2025

PR approved by anyone and no changes requested.

gavinchou
gavinchou previously approved these changes Jan 1, 2025
@gavinchou
Copy link
Contributor

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Jan 1, 2025
@gavinchou
Copy link
Contributor

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 32626 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8f6312ed8f8b276c55b5cd33c326de6762a26598, data reload: false

------ Round 1 ----------------------------------
q1	17664	6115	6029	6029
q2	2053	320	168	168
q3	10656	1255	759	759
q4	10208	876	439	439
q5	7502	2167	1981	1981
q6	200	184	147	147
q7	887	755	599	599
q8	9237	1366	1154	1154
q9	5311	4923	4906	4906
q10	6757	2309	1855	1855
q11	465	283	263	263
q12	352	363	211	211
q13	17761	3554	3017	3017
q14	222	235	216	216
q15	553	503	503	503
q16	644	629	581	581
q17	580	842	328	328
q18	6994	6445	6406	6406
q19	1778	979	541	541
q20	310	310	185	185
q21	2968	2202	2027	2027
q22	358	337	311	311
Total cold run time: 103460 ms
Total hot run time: 32626 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6289	6202	6244	6202
q2	238	331	239	239
q3	2218	2673	2364	2364
q4	1446	1843	1357	1357
q5	4325	4746	4794	4746
q6	184	184	140	140
q7	2040	1998	1803	1803
q8	2648	2798	2654	2654
q9	7433	7246	7287	7246
q10	3082	3343	2865	2865
q11	586	509	488	488
q12	638	735	576	576
q13	3368	3794	3114	3114
q14	308	317	292	292
q15	559	527	521	521
q16	667	695	641	641
q17	1232	1734	1275	1275
q18	7649	7554	7262	7262
q19	819	1215	1137	1137
q20	2011	2023	1907	1907
q21	5837	5279	4735	4735
q22	612	621	577	577
Total cold run time: 54189 ms
Total hot run time: 52141 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 196710 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8f6312ed8f8b276c55b5cd33c326de6762a26598, data reload: false

query1	1275	971	934	934
query2	6491	2344	2315	2315
query3	11016	4720	4724	4720
query4	32990	23648	23281	23281
query5	4665	613	430	430
query6	297	203	197	197
query7	3994	491	320	320
query8	311	243	240	240
query9	9363	2653	2632	2632
query10	456	303	254	254
query11	18210	15767	15327	15327
query12	152	110	102	102
query13	1690	539	412	412
query14	11296	7391	6913	6913
query15	257	229	189	189
query16	7882	685	487	487
query17	1529	759	614	614
query18	2122	425	355	355
query19	203	173	158	158
query20	121	112	118	112
query21	215	122	109	109
query22	4893	4607	4595	4595
query23	35170	33682	33647	33647
query24	6234	2373	2362	2362
query25	473	460	402	402
query26	786	278	153	153
query27	2457	472	336	336
query28	5199	2476	2474	2474
query29	612	548	432	432
query30	217	182	150	150
query31	970	940	832	832
query32	72	61	59	59
query33	467	347	313	313
query34	768	889	534	534
query35	810	820	754	754
query36	1027	1054	952	952
query37	116	96	72	72
query38	4405	4369	4216	4216
query39	1536	1452	1483	1452
query40	209	116	101	101
query41	43	42	45	42
query42	128	103	102	102
query43	525	535	501	501
query44	1363	837	853	837
query45	186	173	176	173
query46	896	1089	667	667
query47	2025	1986	1925	1925
query48	384	403	315	315
query49	706	482	411	411
query50	689	663	397	397
query51	7276	7410	7232	7232
query52	102	99	93	93
query53	233	255	188	188
query54	479	506	416	416
query55	86	78	78	78
query56	254	253	240	240
query57	1262	1247	1161	1161
query58	245	228	244	228
query59	3366	3171	3087	3087
query60	276	265	254	254
query61	117	108	108	108
query62	871	826	769	769
query63	237	227	184	184
query64	3277	1054	660	660
query65	3320	3235	3299	3235
query66	816	414	300	300
query67	16457	15825	15461	15461
query68	9388	755	516	516
query69	495	293	254	254
query70	1214	1162	1177	1162
query71	462	291	261	261
query72	5893	3881	3854	3854
query73	688	760	371	371
query74	10574	9259	9114	9114
query75	4732	3158	2667	2667
query76	5553	1183	798	798
query77	1050	353	303	303
query78	10141	10348	9343	9343
query79	4444	878	588	588
query80	762	546	424	424
query81	478	276	234	234
query82	701	151	124	124
query83	192	164	143	143
query84	283	99	76	76
query85	743	375	302	302
query86	357	313	263	263
query87	4627	4392	4384	4384
query88	3541	2213	2216	2213
query89	459	338	293	293
query90	2129	186	181	181
query91	134	128	109	109
query92	65	55	58	55
query93	2785	877	538	538
query94	659	421	281	281
query95	331	270	254	254
query96	477	622	273	273
query97	2697	2807	2725	2725
query98	214	196	194	194
query99	1717	1551	1412	1412
Total cold run time: 304560 ms
Total hot run time: 196710 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.86 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8f6312ed8f8b276c55b5cd33c326de6762a26598, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.03	0.03
query3	0.23	0.07	0.07
query4	1.60	0.10	0.10
query5	0.42	0.41	0.40
query6	1.14	0.66	0.64
query7	0.03	0.02	0.02
query8	0.04	0.03	0.03
query9	0.60	0.49	0.50
query10	0.56	0.56	0.55
query11	0.15	0.10	0.09
query12	0.14	0.11	0.11
query13	0.62	0.61	0.60
query14	2.78	2.72	2.75
query15	0.89	0.83	0.82
query16	0.39	0.37	0.37
query17	1.07	1.07	1.07
query18	0.23	0.20	0.21
query19	1.89	1.83	2.00
query20	0.01	0.01	0.01
query21	15.38	0.89	0.59
query22	0.75	0.79	0.61
query23	15.35	1.46	0.54
query24	3.26	1.71	0.65
query25	0.17	0.22	0.10
query26	0.30	0.13	0.14
query27	0.06	0.04	0.05
query28	13.76	1.51	1.04
query29	13.00	4.00	3.29
query30	0.24	0.09	0.06
query31	2.83	0.57	0.38
query32	3.23	0.54	0.45
query33	3.04	3.07	3.12
query34	16.54	5.14	4.46
query35	4.50	4.48	4.51
query36	0.67	0.51	0.48
query37	0.10	0.07	0.05
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.17	0.13	0.12
query41	0.08	0.03	0.03
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.47 s
Total hot run time: 30.86 s

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.90% (10123/26024)
Line Coverage: 29.89% (85553/286186)
Region Coverage: 29.01% (43710/150655)
Branch Coverage: 25.54% (22304/87320)
Coverage Report: http://coverage.selectdb-in.cc/coverage/8f6312ed8f8b276c55b5cd33c326de6762a26598_8f6312ed8f8b276c55b5cd33c326de6762a26598/report/index.html

@gavinchou gavinchou added the p0_c label Jan 2, 2025
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 2, 2025
Copy link
Collaborator

@TangSiyang2001 TangSiyang2001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

github-actions bot commented Jan 2, 2025

PR approved by at least one committer and no changes requested.

@gavinchou gavinchou merged commit 45f440d into apache:master Jan 2, 2025
25 of 27 checks passed
luwei16 added a commit to luwei16/incubator-doris that referenced this pull request Jan 3, 2025
gavinchou pushed a commit that referenced this pull request Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/3.0.4-merged p0_c reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants