
[CARBONDATA-4263]support query with latestSegment #4189

Open
wants to merge 1 commit into base: master

Conversation

@MarvinLitt (Contributor)

Support querying only the latest segment when the table's TBLPROPERTIES include "query_latest_segment".

Why is this PR needed?

Some scenarios have these characteristics:
  • the number of data rows does not change;
  • the data in each column keeps increasing and changing.
In such scenarios it is faster to load the full data each time using the LOAD command, and a query then only needs to read the latest segment. We need a way to make a table behave like this.

What changes were proposed in this PR?

add a new property: query_latest_segment
when set to 'true', queries read only the latest segment
when set to 'false' or not set, there is no impact

Does this PR introduce any user interface change?

  • No

Is any new testcase added?

  • Yes, a new LatestSegmentTestCases suite is added
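Based on the description above, usage might look like the following sketch. The table, columns, and file path are made up for illustration; only the property name comes from this PR.

```sql
-- Hypothetical usage of the proposed table property.
CREATE TABLE sensor_readings (
  device_id STRING,
  reading DOUBLE
)
STORED AS carbondata
TBLPROPERTIES ('query_latest_segment'='true');

-- Each load writes the full data set into a new segment;
-- with the property set, queries read only the newest segment.
LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE sensor_readings;
SELECT * FROM sensor_readings;
```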

@CarbonDataQA2

Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/211/

@CarbonDataQA2

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5806/

@CarbonDataQA2

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4063/

@MarvinLitt (Contributor, Author)

retest this please

@CarbonDataQA2

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4079/

@CarbonDataQA2

Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/227/

@CarbonDataQA2

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5824/

@MarvinLitt (Contributor, Author)

retest this please

@CarbonDataQA2

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5825/

@CarbonDataQA2

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4080/

@MarvinLitt (Contributor, Author)

retest this please

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5826/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4081/

@CarbonDataQA2

Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/229/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4086/

@brijoobopanna (Contributor)

litao, please check whether this solution can be made generic using the UDF 'insegment' already available in the code, exposing it to the user in the query statement rather than as a config property @kunal642 @jackylk @ajantha-bhat @akashrn5 @QiangCai

the number of data rows does not change;
the data in each column keeps increasing and changing.
In such scenarios it is faster to load the full data each time using the LOAD command, so the query only needs to read the latest segment, and we need a way to make a table behave like this.
@MarvinLitt (Contributor, Author)

retest this please

@MarvinLitt (Contributor, Author)

litao, please check whether this solution can be made generic using the UDF 'insegment' already available in the code, exposing it to the user in the query statement rather than as a config property @kunal642 @jackylk @ajantha-bhat @akashrn5 @QiangCai

Hi Brijoo, I checked the SEGMENT MANAGEMENT docs. That capability cannot meet the demand, and it gives me no way to add a table-level configuration. Segment management is configured with SET, but not all tables need to query only the latest segment, and the business cannot know which queries should use the latest segment and which should use all segments. So I can't think of any method other than specifying the configuration when creating the table.

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5833/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4089/

@CarbonDataQA2

Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/236/

* @param validSegments the input segments to search
* @return the latest segment for query
*/
public List<Segment> getLatestSegment(List<Segment> validSegments) {
Contributor
if we need a single segment, then why return type is List?

Contributor Author
To be consistent with the external interfaces; in addition, if multiple latest segments are ever required, the return type stays consistent.
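The exchange above can be illustrated with a minimal, self-contained sketch. This is not CarbonData's real code: the Segment class here is a simplified stand-in with only a load start time, and the real implementation may select the latest segment differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LatestSegmentSketch {
    // Simplified stand-in for CarbonData's Segment; only the fields needed here.
    static class Segment {
        final String segmentNo;
        final long loadStartTime;
        Segment(String segmentNo, long loadStartTime) {
            this.segmentNo = segmentNo;
            this.loadStartTime = loadStartTime;
        }
    }

    // Returns the most recently loaded segment, wrapped in a List so the
    // signature stays consistent with the other segment-pruning interfaces.
    static List<Segment> getLatestSegment(List<Segment> validSegments) {
        if (validSegments == null || validSegments.isEmpty()) {
            return Collections.emptyList();
        }
        Segment latest = Collections.max(validSegments,
            Comparator.comparingLong((Segment s) -> s.loadStartTime));
        List<Segment> result = new ArrayList<>();
        result.add(latest);
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("0", 100L));
        segments.add(new Segment("1", 200L));
        segments.add(new Segment("2", 300L));
        List<Segment> latest = getLatestSegment(segments);
        System.out.println(latest.size());            // 1
        System.out.println(latest.get(0).segmentNo);  // 2
    }
}
```

The List return type is the design point under discussion: the method could return a single Segment, but wrapping it keeps callers uniform with the other pruning paths.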

*/
public Segment[] getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope,
List<Segment> validSegments) {
String segmentString = job.getConfiguration().get(INPUT_SEGMENT_NUMBERS, "");
Contributor
Call getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope) to get the segments set in the configuration, instead of writing the code again.

Contributor Author
The old getSegmentsToAccess function only uses INPUT_SEGMENT_NUMBERS as input to get the segment list. But now we need to get segments not only from INPUT_SEGMENT_NUMBERS but also the latest segment, so validSegments needs to be used. If we used getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope), we would need to resolve readCommittedScope into validSegments, which the calling functions have already done. So I chose function overloading to implement this.
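As a sketch only, the overload the author describes might dispatch like the following. All names, and the precedence between an explicitly configured segment list and the latest-segment flag, are illustrative assumptions rather than CarbonData's actual behavior.

```java
import java.util.Arrays;
import java.util.List;

public class SegmentsToAccessSketch {
    // Stand-in for the real lookup, which reads INPUT_SEGMENT_NUMBERS from
    // the Hadoop JobContext configuration; here it is passed in directly.
    static List<String> getSegmentsToAccess(String configuredSegments,
                                            boolean queryLatestSegment,
                                            List<String> validSegments) {
        if (!configuredSegments.isEmpty()) {
            // Assumed precedence: an explicit segment list set via
            // configuration wins over the table property.
            return Arrays.asList(configuredSegments.split(","));
        }
        if (queryLatestSegment && !validSegments.isEmpty()) {
            // Only the most recent valid segment (last in load order here).
            return validSegments.subList(validSegments.size() - 1, validSegments.size());
        }
        return validSegments;
    }

    public static void main(String[] args) {
        List<String> valid = Arrays.asList("0", "1", "2");
        System.out.println(getSegmentsToAccess("", true, valid));    // [2]
        System.out.println(getSegmentsToAccess("0,1", true, valid)); // [0, 1]
        System.out.println(getSegmentsToAccess("", false, valid));   // [0, 1, 2]
    }
}
```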

@kunal642 (Contributor)

@MarvinLitt

  1. the new CSV is not needed, as we already have several CSVs in our resources; use any one of the existing ones (this is not needed if point 2 is done)
  2. better to use insert instead of load, and no need for a filter condition; just use select * because this is not related to any filter.
  3. I can see only 2-3 conditions in your test cases; better to add them to an existing test suite. Please avoid creating new test files unless absolutely necessary.

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 12, 2021

  • the new CSV is not needed, as we already have several CSVs in our resources; use any one of the existing ones (this is not needed if point 2 is done)
  • better to use insert instead of load, and no need for a filter condition; just use select * because this is not related to any filter.
  • I can see only 2-3 conditions in your test cases; better to add them to an existing test suite. Please avoid creating new test files unless absolutely necessary.
  1. I used the LOAD command in the test case because there are many scenarios that use LOAD, and I hope to cover this command in the test case.
  2. The new CSV file is needed because latest-table-data.csv is not the same as the old one: I removed some values from some columns to check whether the latest segment is picked correctly.
  3. Yes, of course the test cases can be moved to an existing test file; if needed I will do so.

@kunal642

@QiangCai (Contributor)

can we check why "overwrite data" is much slower than "load data"?

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 13, 2021

can we check why "overwrite data" is much slower than "load data"?

It is obvious: overwrite needs to check the data already in the segments and load does not, so they differ greatly. In principle there is a huge performance gap between overwrite and load. Using INSERT OVERWRITE TABLE ... SELECT * FROM another table requires loading the CSV data as a temp table and then selecting all of it, which may take more time. This scenario is very special, and performance is the key point. In addition, if insert overwrite is used, queries will also take more time.
@QiangCai

@QiangCai (Contributor)

if we can fix the performance issue of load overwrite, does it satisfy your requirement?

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 16, 2021

if we can fix the performance issue of load overwrite, does it satisfy your requirement?

Yes, if we can make the "insert overwrite" command as quick as the "load" command, that would solve the problem. But achieving consistent performance may take some time and is difficult in the short term. In order not to lose customers, should we merge this PR first? When insert overwrite performance is fixed, it can be switched seamlessly. The customer doesn't focus on which commands are used, but on performance.
What do you think, @QiangCai?

@jackylk (Contributor)

jackylk commented Aug 18, 2021

I suggest we locate the performance issue in INSERT OVERWRITE and fix it in the first place, instead of creating a patch solution which we may remove later, creating a compatibility problem.

@kunal642 (Contributor)

I agree with jacky

@vikramahuja1001 (Contributor)

I had a discussion with @MarvinLitt, and it seems the performance issue in OVERWRITE is related to the environment; after the environment was fixed, the performance degradation was no longer observed. @MarvinLitt to discuss in the community whether the requirement is needed.

@MarvinLitt (Contributor, Author)

MarvinLitt commented Aug 24, 2021

The LOAD OVERWRITE command can produce the same result as this PR, and I tested that the performance of LOAD and LOAD OVERWRITE is the same. So this scenario can use LOAD OVERWRITE. The INSERT OVERWRITE command's performance is not good because it needs to do something like an update; since that is helpful for the update scenario, we can make a plan to improve INSERT OVERWRITE performance.
@jackylk @kunal642 @vikramahuja1001

@kunal642 (Contributor)

@MarvinLitt You can raise a JIRA for the insert overwrite performance issue so that someone in the community can pick it up. Please close this PR, as your scenario can be handled through LOAD OVERWRITE for now.
