
[CARBONDATA-4263]support query with latestSegment #4189

Open
wants to merge 1 commit into base: master

Conversation

@MarvinLitt (Contributor)

Support querying only the latest segment when the table's TBLPROPERTIES include "query_latest_segment".

Why is this PR needed?

Some scenarios have these characteristics:
  • the number of data rows does not change;
  • the data in each column keeps increasing and changing.
In such scenarios it is faster to load the full data each time using the LOAD command, and a query then only needs to read the latest segment. We need a way to make a table behave like this.

What changes were proposed in this PR?

add a new property: query_latest_segment
when set to 'true', queries read only the latest segment
when set to 'false' or not set, there is no impact

Does this PR introduce any user interface change?

  • No

Is any new testcase added?

  • Yes, a new LatestSegmentTestCases suite is added
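Based on the description above, usage might look like the following sketch. The table, columns, and file path are made up for illustration; only the property name comes from this PR.

```sql
-- Hypothetical usage of the proposed table property.
CREATE TABLE sensor_readings (
  device_id STRING,
  reading DOUBLE
)
STORED AS carbondata
TBLPROPERTIES ('query_latest_segment'='true');

-- Each load writes the full data set into a new segment;
-- with the property set, queries read only the newest segment.
LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE sensor_readings;
SELECT * FROM sensor_readings;
```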

@CarbonDataQA2

Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/211/

@CarbonDataQA2

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5806/

@CarbonDataQA2

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4063/

@MarvinLitt (Contributor, Author)

retest this please

@CarbonDataQA2

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4079/

@CarbonDataQA2

Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/227/

@CarbonDataQA2

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5824/

@MarvinLitt (Contributor, Author)

retest this please

@CarbonDataQA2

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5825/

@CarbonDataQA2

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4080/

@MarvinLitt (Contributor, Author)

retest this please

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5826/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4081/

@CarbonDataQA2

Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/229/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4086/

@brijoobopanna (Contributor)

litao, please check whether this solution can be made generic using the UDF 'insegment' already available in the code, exposing it to the user in the query statement rather than as a config property @kunal642 @jackylk @ajantha-bhat @akashrn5 @QiangCai

the number of data rows does not change;
the data in each column keeps increasing and changing.
In such scenarios it is faster to load the full data each time using the LOAD command, so the query only needs to read the latest segment, and we need a way to make a table behave like this.
@MarvinLitt (Contributor, Author)

retest this please

@MarvinLitt (Contributor, Author)

litao, please check whether this solution can be made generic using the UDF 'insegment' already available in the code, exposing it to the user in the query statement rather than as a config property @kunal642 @jackylk @ajantha-bhat @akashrn5 @QiangCai

Hi Brijoo, I checked the SEGMENT MANAGEMENT docs. That capability cannot meet the demand, and it gives me no way to add a table-level configuration. Segment management is configured with SET, but not all tables need to query only the latest segment, and the business cannot know which queries should use the latest segment and which should use all segments. So I can't think of any method other than specifying the configuration when creating the table.

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5833/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4089/

@CarbonDataQA2

Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/236/

* @param validSegments the input segments to search
* @return the latest segment for query
*/
public List<Segment> getLatestSegment(List<Segment> validSegments) {
Contributor
if we need a single segment, then why return type is List?

Contributor Author
To be consistent with the external interfaces; in addition, if multiple latest segments are ever required, the return type stays consistent.
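The exchange above can be illustrated with a minimal, self-contained sketch. This is not CarbonData's real code: the Segment class here is a simplified stand-in with only a load start time, and the real implementation may select the latest segment differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LatestSegmentSketch {
    // Simplified stand-in for CarbonData's Segment; only the fields needed here.
    static class Segment {
        final String segmentNo;
        final long loadStartTime;
        Segment(String segmentNo, long loadStartTime) {
            this.segmentNo = segmentNo;
            this.loadStartTime = loadStartTime;
        }
    }

    // Returns the most recently loaded segment, wrapped in a List so the
    // signature stays consistent with the other segment-pruning interfaces.
    static List<Segment> getLatestSegment(List<Segment> validSegments) {
        if (validSegments == null || validSegments.isEmpty()) {
            return Collections.emptyList();
        }
        Segment latest = Collections.max(validSegments,
            Comparator.comparingLong((Segment s) -> s.loadStartTime));
        List<Segment> result = new ArrayList<>();
        result.add(latest);
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("0", 100L));
        segments.add(new Segment("1", 200L));
        segments.add(new Segment("2", 300L));
        List<Segment> latest = getLatestSegment(segments);
        System.out.println(latest.size());            // 1
        System.out.println(latest.get(0).segmentNo);  // 2
    }
}
```

The List return type is the design point under discussion: the method could return a single Segment, but wrapping it keeps callers uniform with the other pruning paths.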

*/
public Segment[] getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope,
List<Segment> validSegments) {
String segmentString = job.getConfiguration().get(INPUT_SEGMENT_NUMBERS, "");
Contributor
Call getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope) to get the segments set in the configuration, instead of writing the code again.

Contributor Author
The old getSegmentsToAccess function only uses INPUT_SEGMENT_NUMBERS as input to get the segment list. But now we need to get segments not only from INPUT_SEGMENT_NUMBERS but also the latest segment, so validSegments needs to be used. If we used getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope), we would need to resolve readCommittedScope into validSegments, which the calling functions have already done. So I chose function overloading to implement this.
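As a sketch only, the overload the author describes might dispatch like the following. All names, and the precedence between an explicitly configured segment list and the latest-segment flag, are illustrative assumptions rather than CarbonData's actual behavior.

```java
import java.util.Arrays;
import java.util.List;

public class SegmentsToAccessSketch {
    // Stand-in for the real lookup, which reads INPUT_SEGMENT_NUMBERS from
    // the Hadoop JobContext configuration; here it is passed in directly.
    static List<String> getSegmentsToAccess(String configuredSegments,
                                            boolean queryLatestSegment,
                                            List<String> validSegments) {
        if (!configuredSegments.isEmpty()) {
            // Assumed precedence: an explicit segment list set via
            // configuration wins over the table property.
            return Arrays.asList(configuredSegments.split(","));
        }
        if (queryLatestSegment && !validSegments.isEmpty()) {
            // Only the most recent valid segment (last in load order here).
            return validSegments.subList(validSegments.size() - 1, validSegments.size());
        }
        return validSegments;
    }

    public static void main(String[] args) {
        List<String> valid = Arrays.asList("0", "1", "2");
        System.out.println(getSegmentsToAccess("", true, valid));    // [2]
        System.out.println(getSegmentsToAccess("0,1", true, valid)); // [0, 1]
        System.out.println(getSegmentsToAccess("", false, valid));   // [0, 1, 2]
    }
}
```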

@kunal642 (Contributor)

@MarvinLitt

  1. the new CSV is not needed, as we already have several CSVs in our resources; use any one of the existing ones (this is not needed if point 2 is done)
  2. better to use insert instead of load, and no need for a filter condition; just use select * because this is not related to any filter.
  3. I can see only 2-3 conditions in your test cases; better to add them to an existing test suite. Please avoid creating new test files unless absolutely necessary.

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 12, 2021

  • the new CSV is not needed, as we already have several CSVs in our resources; use any one of the existing ones (this is not needed if point 2 is done)
  • better to use insert instead of load, and no need for a filter condition; just use select * because this is not related to any filter.
  • I can see only 2-3 conditions in your test cases; better to add them to an existing test suite. Please avoid creating new test files unless absolutely necessary.
  1. I used the LOAD command in the test case because there are many scenarios that use LOAD, and I hope to cover this command in the test case.
  2. The new CSV file is needed because latest-table-data.csv is not the same as the old one: I removed some values from some columns to check whether the latest segment is picked correctly.
  3. Yes, of course the test cases can be moved to an existing test file; if needed I will do so.

@kunal642

@QiangCai (Contributor)

can we check why "overwrite data" is much slower than "load data"?

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 13, 2021

can we check why "overwrite data" is much slower than "load data"?

It is obvious: overwrite needs to check the data already in the segments and load does not, so they differ greatly. In principle there is a huge performance gap between overwrite and load. Using INSERT OVERWRITE TABLE ... SELECT * FROM another table requires loading the CSV data as a temp table and then selecting all of it, which may take more time. This scenario is very special, and performance is the key point. In addition, if insert overwrite is used, queries will also take more time.
@QiangCai

@QiangCai (Contributor)

if we can fix the performance issue of load overwrite, does it satisfy your requirement?

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 16, 2021

if we can fix the performance issue of load overwrite, does it satisfy your requirement?

Yes, if we can make the "insert overwrite" command as quick as the "load" command, that would solve the problem. But achieving consistent performance may take some time and is difficult in the short term. In order not to lose customers, should we merge this PR first? When insert overwrite performance is fixed, it can be switched seamlessly. The customer doesn't focus on which commands are used, but on performance.
What do you think, @QiangCai?

@jackylk (Contributor)

jackylk commented Aug 18, 2021

I suggest we locate the performance issue in INSERT OVERWRITE and fix it in the first place, instead of creating a patch solution which we may remove later, creating a compatibility problem.

@kunal642 (Contributor)

I agree with jacky

@vikramahuja1001 (Contributor)

I had a discussion with @MarvinLitt, and it seems the performance issue in OVERWRITE is related to the environment; after the environment was fixed, the performance degradation was no longer observed. @MarvinLitt to discuss in the community whether the requirement is needed.

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 24, 2021

The LOAD OVERWRITE command can produce the same result as this PR, and I tested that the performance of LOAD and LOAD OVERWRITE is the same. So this scenario can use LOAD OVERWRITE. The INSERT OVERWRITE command's performance is not good because it needs to do something like an update; since that is helpful for the update scenario, we can make a plan to improve INSERT OVERWRITE performance.
@jackylk @kunal642 @vikramahuja1001

@kunal642 (Contributor)

@MarvinLitt You can raise a JIRA for the insert overwrite performance issue so that someone in the community can pick it up. Please close this PR, as your scenario can be handled through LOAD OVERWRITE for now.
