merge from apache master #7

Open · wants to merge 3,218 commits into base: master
Conversation

mayunSaicmotor (Owner)

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [CARBONDATA-<Jira issue #>] Description of pull request

  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).

  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.

  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

  • Testing done

     Please provide details on:
     - whether new unit test cases have been added, or why no new tests are required;
     - what manual testing you have done;
     - any additional information to help reviewers test this change.
    
  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


vikramahuja1001 and others added 16 commits December 21, 2020 16:32
…V flow

Why is this PR needed?
There are multiple issues with the Delete Segment API:

1. It does not use the latest LoadMetadataDetails while writing the table status file, and can therefore remove the table status entry of a concurrently loaded in-progress or successful segment.
2. The code reads the table status file twice.
3. Under concurrent queries, both queries access checkAndReloadSchema for MVs on all databases and try to create a file at the same location; HDFS grants the lock to one and fails the other, which fails that query.

What changes were proposed in this PR?
1. Read the table status file only once.
2. Use the latest table status to mark the segment as Marked for Delete, so no concurrency issues arise.
3. Make touchMDT and checkAndReloadSchema synchronized, so that only one caller can access them at a time (a minimal sketch follows).
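
A minimal sketch of the synchronization idea, assuming an illustrative MVSchemaManager object (not CarbonData's actual class); the point is that a synchronized block serializes concurrent callers so two queries never race to create the same schema file on HDFS:

```scala
object MVSchemaManager {
  // Lock object guarding all schema-file reads and rewrites.
  private val schemaLock = new Object

  def checkAndReloadSchema(): Unit = schemaLock.synchronized {
    // Read or recreate the MV schema file here. A second concurrent
    // query blocks until the first finishes instead of racing on the
    // same HDFS location and failing with a lock error.
  }
}
```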

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

This closes #4059
…Sync during query

Why is this PR needed?
Added logs for MVs and a method to verify whether an MV is in sync during query.

What changes were proposed in this PR?
1. Move the MV enable check to the beginning, to avoid transforming the logical plan unnecessarily.
2. Add a log entry if an exception occurs while fetching the MV schema.
3. Check whether the MV is in sync before allowing query rewrite.
4. Reuse the already-read LoadMetadataDetails to get mergedLoadMapping.
5. Set no-dictionary schema types for the insert-partition flow (missed from [CARBONDATA-4077]).

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #4060
…h index server

Why is this PR needed?
The asJava call used here converts to Java "in place": to save time and memory it
does not copy the data, but simply wraps the Scala collection in a class that
conforms to the Java interface, and the Java serializer is not able to serialize that wrapper.

What changes were proposed in this PR?
Convert it to a concrete Java list, so that the serializer receives a plain, serializable list (see the sketch below).
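
A minimal sketch of the difference, assuming the plain Java serializer is in play (variable names are illustrative):

```scala
import scala.collection.JavaConverters._

val scalaSegments: Seq[String] = Seq("0", "1", "2")

// asJava only wraps the Scala collection; the wrapper class is not
// guaranteed to be Java-serializable.
val wrapped: java.util.List[String] = scalaSegments.asJava

// Copying into a concrete java.util.ArrayList produces an object the
// Java serializer can always handle.
val copied: java.util.List[String] =
  new java.util.ArrayList[String](scalaSegments.asJava)
```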

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

This closes #4061
… have scheme, the default will be local file system, which is not the file system defined by fs.defaultFS

Why is this PR needed?
When a table is created with a location and the location has no scheme, it defaults to the local file system, which is not the file system defined by fs.defaultFS.

What changes were proposed in this PR?
If the location has no scheme, prepend the fs.defaultFS scheme to the location (see the sketch below).
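
A minimal sketch of the qualification step using the standard Hadoop API (the helper name is illustrative, not CarbonData's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// If the user-supplied location has no scheme, resolve it against
// fs.defaultFS; otherwise leave it untouched.
def qualifyLocation(location: String, conf: Configuration): String = {
  val path = new Path(location)
  if (path.toUri.getScheme == null) {
    // e.g. "/warehouse/t1" becomes "hdfs://namenode:8020/warehouse/t1"
    FileSystem.get(conf).makeQualified(path).toString
  } else {
    location
  }
}
```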

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #4065
…rift is Set

Why is this PR needed?
After converting the expression to an IN expression for a main table with SI, the
expression is not processed if column drift is enabled, and the query fails with an
NPE during resolveFilter. The exception stack is attached in the JIRA issue.

What changes were proposed in this PR?
Process the filter expression after adding the implicit expression.

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #4063
…which leads to memory leak

Why is this PR needed?
When there are two Spark applications and one drops a table, some cached information about that table
stays in the other application and cannot be removed by any means, including the "Drop metacache" command.
This causes a memory leak; over time the leak accumulates and finally
leads to a driver OOM. The leak points are:
1) tableModifiedTimeStore in CarbonFileMetastore;
2) segmentLockMap in BlockletDataMapIndexStore;
3) absoluteTableIdentifierByteMap in SegmentPropertiesAndSchemaHolder;
4) tableInfoMap in CarbonMetadata.

What changes were proposed in this PR?
Use an expiring map to cache the table information in CarbonMetadata and the modified time in
CarbonFileMetaStore, so that stale information is cleared automatically after the expiration
time. Operations in BlockletDataMapIndexStore no longer need to be locked, so all logic
related to segmentLockMap is removed. (A sketch of the expiring-cache idea follows.)
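
A minimal sketch of the expiring-cache idea, using Guava's CacheBuilder as a stand-in (the PR's actual map type may differ; the 3600-second value is hypothetical and would come from the new carbon.metacache.expiration.seconds property):

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.CacheBuilder

// Entries not accessed within the window are evicted automatically, so
// metadata of a table dropped by another application cannot pile up forever.
val tableInfoCache = CacheBuilder.newBuilder()
  .expireAfterAccess(3600L, TimeUnit.SECONDS) // carbon.metacache.expiration.seconds
  .build[String, String]()

tableInfoCache.put("db1_table1", "serialized TableInfo")
val cached = Option(tableInfoCache.getIfPresent("db1_table1")) // None after expiry
```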

Does this PR introduce any user interface change?
New configuration carbon.metacache.expiration.seconds is added.

Is any new testcase added?
No

This closes #4057
… case of concurrent load,

compact and clean files operation

Why is this PR needed?
There were two issues in the clean files post-event listener:

1. In concurrent cases, a wrong path was used while writing the entry back to the table status file,
so the table status file was not updated for the SI table.
2. While writing the LoadMetadataDetails to the table status file in concurrent scenarios,
only the unwanted segments were written instead of all the segments, which could leave stale
segments in the SI table.
Because of these two issues, when a select query was executed on the SI table, the table status could
contain an entry for a segment whose carbondata file had been deleted, throwing an IOException.
3. The segment ID is null when writing a Hive table.

What changes were proposed in this PR?
1 & 2. Use the correct table status path and send the correct LoadMetadataDetails to be updated in
the table status file. Now a select query fired on the SI table will not throw a
carbondata file not found exception.
3. Set the load model after the committer's setup job.

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

This closes #4066
…table after concurrent Load & Compaction operation

Why is this PR needed?
When concurrent LOAD and COMPACTION are in progress on a main table having SI, the SILoadEventListenerForFailedSegments listener is called to repair failed SI segments, if any. It compares the SI and main table segment statuses; if there is a mismatch, it adds that specific load to failedLoads to be re-loaded.

During compaction, SI is updated first and then the main table. So in some cases the SI segment is in COMPACTED state while the main table is still in SUCCESS state (the compaction may still be in progress, or some operation may have failed). SI index repair adds those segments to failedLoads after checking whether the segment lock can be acquired. But if the main table compaction has already finished by the time the SI repair comparison runs, the repair can still acquire the segment lock and add those loads to failedLoads (even though the main table load is COMPACTED). After the concurrent operation finishes, some SI segments are left marked INSERT_IN_PROGRESS, leaving SI and main table segments in an inconsistent state.

What changes were proposed in this PR?
Acquire the compaction lock on the main table (to ensure compaction is not running), and only then compare the SI and main table load details to repair SI segments (see the sketch below).
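
A minimal sketch of the lock-then-compare idea using CarbonData's lock factory; the repair callback and helper name are illustrative, not the PR's exact code:

```scala
import org.apache.carbondata.core.locks.{CarbonLockFactory, LockUsage}
import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier

def repairUnderCompactionLock(identifier: AbsoluteTableIdentifier)
                             (repair: () => Unit): Unit = {
  val compactionLock =
    CarbonLockFactory.getCarbonLockObj(identifier, LockUsage.COMPACTION_LOCK)
  try {
    if (compactionLock.lockWithRetries()) {
      // Compaction cannot start while this lock is held, so the SI vs
      // main-table status comparison cannot race with a running compaction.
      repair()
    }
  } finally {
    compactionLock.unlock()
  }
}
```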

Does this PR introduce any user interface change?
No

Is any new testcase added?
No (concurrent scenario)

This closes #4067
…e in Presto integration

Why is this PR needed?
FTs for the following cases have been added. Here the store is created by Spark and read by Presto:

update without local-dict
delete operations on table
minor, major, custom compaction
add and delete segments
test update with inverted index
read with partition columns
Filter on partition columns
Bloom index
test range columns
read streaming data

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #4031
…der in SDK

Why is this PR needed?
Currently, the SDK pagination reader does not support filter expressions, and it also returns wrong results after performing IUD operations through the SDK.

What changes were proposed in this PR?
If a filter is present or an update/delete operation was performed, get the total rows in the splits after building the carbon reader; otherwise, get the row count from the details info of each split.
Handled ArrayIndexOutOfBoundsException and return zero when rowCountInSplits.size() == 0 (see the sketch below).
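
A minimal sketch of the guard, assuming rowCountInSplits holds cumulative row counts per split (the names are illustrative, not the SDK's actual fields):

```scala
def totalRows(rowCountInSplits: java.util.List[java.lang.Long]): Long = {
  if (rowCountInSplits.isEmpty) {
    // No splits means no rows; returning 0 avoids the
    // ArrayIndexOutOfBoundsException on an empty list.
    0L
  } else {
    // With cumulative counts, the last element is the total.
    rowCountInSplits.get(rowCountInSplits.size() - 1)
  }
}
```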

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #4068
Why is this PR needed?
1. Block SI creation on binary column.
2. Block alter table drop column directly on SI table.
3. Create table as like should not be allowed for SI tables.
4. Filter with like should not scan SI table.
5. Currently compaction is allowed on the SI table. Because of this, if only the SI table
is compacted, running a filter query on the main table causes more data to be
scanned from the SI table, which degrades performance.

What changes were proposed in this PR?
1. Blocked SI creation on binary columns.
2. Blocked alter table drop column directly on SI tables.
3. Handled Create table as like for SI tables.
4. Handled filter with like so that it does not scan the SI table.
5. Blocked direct compaction on SI tables and added FTs for SI compaction scenarios.
6. Added FTs for compression and range columns on SI tables.

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #4037
Why is this PR needed?
To support the MERGE INTO SQL command in CarbonData.
The previous Scala parser had trouble parsing the complicated MERGE INTO SQL command.

What changes were proposed in this PR?
Add an ANTLR parser and support parsing the MERGE INTO SQL command into a DataSet command.

Does this PR introduce any user interface change?
Yes.
The PR introduces the MERGE INTO SQL Command.

Is any new testcase added?
Yes

This closes #4032

Co-authored-by: Zhangshunyu <[email protected]>
Why is this PR needed?
Since version 2.0, Carbon supports starting the Spark ThriftServer with CarbonExtensions.

What changes were proposed in this PR?
Add documentation for starting the Spark ThriftServer with CarbonExtensions.

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

This closes #4077
entry when there is no update/insert data

Why is this PR needed?
1. After #3999, when an update happens on the table, a new segment
is created for the updated data. But when there is no data to update,
the segments are still created, and the table status has in-progress
entries for those empty segments. This leads to unnecessary segment
directories and an increase in table status entries.
2. After this, clean files does not clean these empty segments.
3. When the source table has no data, CTAS results in the same
problem.

What changes were proposed in this PR?
When no data is present during an update, mark the segment as Marked
for Delete so that clean files takes care of deleting it (see the sketch below);
CTAS was already handled. Added test cases.
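
A minimal sketch of the status flip, assuming CarbonData's SegmentStatus enum; the newLoadDetail/updatedRowCount variables are illustrative, not the PR's exact code:

```scala
import org.apache.carbondata.core.statusmanager.{LoadMetadataDetails, SegmentStatus}

val newLoadDetail = new LoadMetadataDetails()
val updatedRowCount = 0L // illustrative: the update touched no rows

// An empty update segment is marked for delete instead of being left
// as an in-progress entry, so clean files can remove it later.
if (updatedRowCount == 0) {
  newLoadDetail.setSegmentStatus(SegmentStatus.MARKED_FOR_DELETE)
}
```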

This closes #4018
…ry on

sort column giving wrong result with IndexServer

Why is this PR needed?
1. Creating a table and reading from SDK-written files fails in a cluster with
java.nio.file.NoSuchFileException: hdfs:/hacluster/user/hive/warehouse/carbon.store/default/sdk.
2. After fixing the above path issue, a filter query on a sort column gives
the wrong result with the IndexServer.

What changes were proposed in this PR?
1. In getAllDeleteDeltaFiles, used CarbonFiles.listFiles instead of Files.walk
to handle custom file systems such as HDFS (see the sketch below).
2. In PruneWithFilter, isResolvedOnSegment is used in the filterResolver step.
Set the table and expression on the executor side, so the IndexServer can use them
in the filterResolver step.
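
A minimal sketch of the listing fix: java.nio's Files.walk cannot resolve "hdfs:/..." paths, while the Hadoop FileSystem API dispatches on the path's scheme. The ".deletedelta" suffix check and the helper name are illustrative stand-ins for CarbonData's own CarbonFiles.listFiles:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

def listDeleteDeltaFiles(segmentDir: String, conf: Configuration): Seq[Path] = {
  val path = new Path(segmentDir)
  val fs = path.getFileSystem(conf) // resolves hdfs://, s3a://, file://, ...
  val found = ArrayBuffer[Path]()
  val it = fs.listFiles(path, true) // recursive walk, like Files.walk
  while (it.hasNext) {
    val status = it.next()
    if (status.getPath.getName.endsWith(".deletedelta")) {
      found += status.getPath
    }
  }
  found
}
```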

This closes #4064
…hancement

Why is this PR needed?
Spatial index feature optimization of CarbonData

What changes were proposed in this PR?
1. Update the spatial index encoding algorithm, which reduces the properties required to create a geo table.
2. Enhance geo query UDFs: support querying a geo table with a polygon list, polyline list, or geoId range list, and add some geo transformation utility UDFs.
3. Data loading (both LOAD and INSERT INTO) allows the user to supply the spatial index column; it is still generated internally when the user does not provide it.

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #4012
chenliang613 and others added 30 commits October 10, 2023 22:28
#117: The longitude has six decimal places and the latitude has five. Why is the length the same after conversion?
…a types (#4263)

Why is this PR needed?
CHAR and VARCHAR are no longer supported as String data types in Carbon. They should be removed from the documentation's description.

What changes were proposed in this PR?
CHAR and VARCHAR no longer appear as two String data types in the documentation.

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

Co-authored-by: tangchuan <[email protected]>
Bumps [pyarrow](https://github.com/apache/arrow) from 0.11.1 to 14.0.1.
- [Commits](apache/arrow@apache-arrow-0.11.1...go/v14.0.1)

---
updated-dependencies:
- dependency-name: pyarrow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Minor refactor the build docs

* Fix review comments

* Update build/README.md
* upgrade thrift version

* change to use 0.20.0

---------

Co-authored-by: jacky <[email protected]>
Bumps org.apache.commons:commons-compress from 1.4.1 to 1.26.0.

---
updated-dependencies:
- dependency-name: org.apache.commons:commons-compress
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* add github action for building

* Revert "[WIP] Optimize geo module, the feature seems less be used (#4353)"

This reverts commit 29607c3.

* Revert "[WIP] Optimize geo module, the feature seems less be used"

This reverts commit 71abab0.

* cache thrift