
merge from apache master #7

Open - wants to merge 3,218 commits into base: master

This pull request is big! We're only showing the most recent 250 commits.

Commits on Dec 21, 2020

  1. [CARBONDATA-4092] Fix concurrent issues in delete segment APIs and MV flow
    
    Why is this PR needed?
    There are multiple issues with the delete segment API:
    
    It does not use the latest loadmetadatadetails while writing to the table status file, and so can remove the table status entry of any concurrently loaded insert-in-progress/success segment.
    The code reads the table status file twice.
    Under concurrent queries, both access checkAndReloadSchema for MV on all databases; two different queries try to create a file at the same location, HDFS grants the lock to one and fails the other, thus failing the query.
    
    What changes were proposed in this PR?
    Read the table status file only once.
    Use the latest table status to mark the segment as marked-for-delete, so no concurrency issues arise.
    Made the touchMDT and checkAndReloadSchema methods synchronized, so that only one instance can access them at a time; a sketch follows below.
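
    A minimal sketch of the synchronization change, with hypothetical method bodies (the real methods live in CarbonData's MV schema-provider code):

    ```scala
    object MVSchemaProvider {
      // Only one thread per JVM may touch the modified-time (MDT) file at a
      // time, so concurrent queries no longer race to create the same HDFS file.
      def touchMDT(systemFolderPath: String): Unit = this.synchronized {
        // create or update the last-modified-time file
      }

      def checkAndReloadSchema(databaseName: String): Unit = this.synchronized {
        // reload MV schemas only if the MDT file has changed
      }
    }
    ```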
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4059
    vikramahuja1001 authored and ajantha-bhat committed Dec 21, 2020
    Commit ecebee5
  2. [CARBONDATA-4093] Added logs for MV and a method to verify if the MV is in sync during query

    Why is this PR needed?
    Added logs for MV and a method to verify whether the MV is in sync during query.
    
    What changes were proposed in this PR?
    1. Move the MV enable check to the beginning, to avoid transforming the logical plan unnecessarily.
    2. Log an error if an exception occurs while fetching the MV schema.
    3. Check whether the MV is in sync and only then allow query rewrite.
    4. Reuse the already-read LoadMetadataDetails to get mergedLoadMapping.
    5. Set no-dictionary schema types for the insert-partition flow - missed from [CARBONDATA-4077].
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4060
    Indhumathi27 authored and akashrn5 committed Dec 21, 2020
    Commit aae93c1

Commits on Dec 22, 2020

  1. [CARBONDATA-4094]: Fix fallback count(*) issue on partition table with index server
    
    Why is this PR needed?
    The asJava call converts to Java "in place": to save time and memory it does not copy the data, but
    simply wraps the Scala collection in a class that conforms to the Java interface. The Java
    serializer is therefore unable to serialize it.
    
    What changes were proposed in this PR?
    Convert it to a concrete list, which the serializer can handle.
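
    A small illustration of the difference, as a sketch (the collection contents are made up):

    ```scala
    import scala.collection.JavaConverters._

    val splits: Seq[String] = Seq("part-0", "part-1")

    // asJava only wraps the Scala Seq; the wrapper class is not
    // java.io.Serializable, so Java serialization fails on it.
    val wrapped: java.util.List[String] = splits.asJava

    // Copying into a concrete java.util.ArrayList yields a serializable list.
    val copied = new java.util.ArrayList[String](splits.asJava)
    ```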
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4061
    vikramahuja1001 authored and Indhumathi27 committed Dec 22, 2020
    Commit 1dfcdec

Commits on Dec 23, 2020

  1. [CARBONDATA-4089] Create table with location, if the location doesn't have scheme, the default will be local file system, which is not the file system defined by fs.defaultFS
    
    Why is this PR needed?
    When a table is created with a location that has no scheme, the location defaults to the local file system, which is not the file system defined by fs.defaultFS.
    
    What changes were proposed in this PR?
    If the location has no scheme, prepend the fs.defaultFS scheme to the location.
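
    A sketch of the idea using the Hadoop FileSystem API (not the exact CarbonData code):

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    def qualifyLocation(location: String, conf: Configuration): String = {
      val path = new Path(location)
      if (path.toUri.getScheme == null) {
        // getFileSystem resolves via fs.defaultFS when no scheme is present
        path.getFileSystem(conf).makeQualified(path).toString
      } else {
        location
      }
    }
    ```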
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4065
    jack86596 authored and ajantha-bhat committed Dec 23, 2020
    Commit c8cec12
  2. [CARBONDATA-4095] Fix Select Query with SI filter fails, when columnDrift is Set
    
    Why is this PR needed?
    After converting the expression to an IN expression for a main table with SI, the expression
    is not processed when ColumnDrift is enabled. The query fails with an NPE during
    resolveFilter. The exception is attached in the JIRA.
    
    What changes were proposed in this PR?
    Process the filter expression after adding implicit expression
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4063
    Indhumathi27 authored and akashrn5 committed Dec 23, 2020
    Commit 11ae435
  3. [CARBONDATA-4088] Drop metacache didn't clear some cache information which leads to memory leak
    
    Why is this PR needed?
    When there are two Spark applications and one drops a table, some cache information for that table
    stays in the other application and cannot be removed by any method, including the "Drop metacache" command.
    This causes a memory leak. Over time the leak accumulates, which
    finally leads to driver OOM. The leak points are:
    1) tableModifiedTimeStore in CarbonFileMetastore;
    2) segmentLockMap in BlockletDataMapIndexStore;
    3) absoluteTableIdentifierByteMap in SegmentPropertiesAndSchemaHolder;
    4) tableInfoMap in CarbonMetadata.
    
    What changes were proposed in this PR?
    Use an expiring map to cache the table information in CarbonMetadata and the modified time in
    CarbonFileMetaStore, so that stale information is cleared automatically after the expiration
    time. Operations in BlockletDataMapIndexStore do not need to be locked; remove all the logic
    related to segmentLockMap. A sketch of the expiry idea follows below.
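
    A minimal, self-contained sketch of time-based expiry (illustrative only, not the actual CarbonMetadata code):

    ```scala
    import java.util.concurrent.ConcurrentHashMap

    class ExpiringCache[K, V](expirationMs: Long) {
      private case class Entry(value: V, insertedAt: Long)
      private val map = new ConcurrentHashMap[K, Entry]()

      def put(key: K, value: V): Unit =
        map.put(key, Entry(value, System.currentTimeMillis()))

      // Entries older than expirationMs are dropped on access, so stale table
      // info from other applications cannot accumulate and leak memory.
      def get(key: K): Option[V] = {
        val e = map.get(key)
        if (e == null) None
        else if (System.currentTimeMillis() - e.insertedAt > expirationMs) {
          map.remove(key)
          None
        } else Some(e.value)
      }
    }
    ```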
    
    Does this PR introduce any user interface change?
    New configuration carbon.metacache.expiration.seconds is added.
    
    Is any new testcase added?
    No
    
    This closes #4057
    jack86596 authored and akashrn5 committed Dec 23, 2020
    Commit 385d9ab

Commits on Dec 29, 2020

  1. [CARBONDATA-4099] Fixed select query on main table with a SI table in case of concurrent load, compact and clean files operation
    
    Why is this PR needed?
    There were 2 issues in the clean files post event listener:
    
    1. In concurrent cases, while writing the entry back to the table status file, a wrong path was given,
    due to which the table status file was not updated in the case of the SI table.
    2. While writing the loadMetadataDetails to the table status file during concurrent scenarios,
    we were writing only the unwanted segments and not all the segments, which could leave segments
    stale in the SI table.
    Due to these two issues, when a select query was executed on the SI table, the table status would have an entry
    for a segment whose carbondata file had been deleted, thus throwing an IOException.
    3. Segment ID is null when writing a hive table.
    
    What changes were proposed in this PR?
    1. & 2. Used the correct table status path and sent the correct loadMetadataDetails to be updated in
    the table status file. Now when a select query is fired on the SI table, it will not throw
    a carbondata-file-not-found exception.
    3. Set the load model after the committer's setup job.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4066
    vikramahuja1001 authored and akashrn5 committed Dec 29, 2020
    Commit 316939b

Commits on Dec 30, 2020

  1. [CARBONDATA-4100] Fix SI segments are in inconsistent state with maintable after concurrent Load & Compaction operation
    
    Why is this PR needed?
    When concurrent LOAD and COMPACTION are in progress on a main table having SI, the SILoadEventListenerForFailedSegments listener is called to repair failed SI segments, if any. It compares the SI and main table segment status; if there is a mismatch, it adds that specific load to failedLoads to be re-loaded again.
    
    During compaction, SI is updated first and then the main table. So, in some cases, the SI segment will be in COMPACTED state while the main table segment is still in SUCCESS state (the compaction may still be in progress, or some operation may have failed). SI index repair adds those segments to failedLoads after checking that the segment lock can be acquired. But if the main table compaction has finished by the time the SI repair comparison is done, it can still acquire the segment lock and add those loads to failedLoads (even though the main table load is COMPACTED). After the concurrent operation finishes, some SI segments are left marked as INSERT_IN_PROGRESS. This leads to an inconsistent state between the SI and main table segments.
    
    What changes were proposed in this PR?
    Acquire the compaction lock on the main table (to ensure compaction is not running), and then compare the SI and main table load details to repair the SI segments.
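
    A minimal sketch of the proposed flow; getCompactionLock, compareLoadDetails and reloadFailedSegments are placeholder helpers, not CarbonData APIs:

    ```scala
    val compactionLock = getCompactionLock(mainTable) // placeholder helper
    if (compactionLock.lockWithRetries()) {
      try {
        // safe to compare: no compaction can run while we hold the lock
        val failedLoads = compareLoadDetails(mainTable, indexTable)
        reloadFailedSegments(failedLoads)
      } finally {
        compactionLock.unlock()
      }
    } else {
      // a compaction is in progress; skip SI repair for now
    }
    ```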
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No (concurrent scenario)
    
    This closes #4067
    Indhumathi27 authored and ajantha-bhat committed Dec 30, 2020
    Commit 19f9027
  2. [CARBONDATA-4073] Added FT for missing scenarios and removed dead code in Presto integration
    
    Why is this PR needed?
    FTs for the following cases have been added. Here the store is created by Spark and read by Presto.
    
    update without local-dict
    delete operations on table
    minor, major, custom compaction
    add and delete segments
    test update with inverted index
    read with partition columns
    Filter on partition columns
    Bloom index
    test range columns
    read streaming data
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4031
    akkio-97 authored and ajantha-bhat committed Dec 30, 2020
    Commit 8831af4

Commits on Jan 5, 2021

  1. [CARBONDATA-3987] Handled filter and IUD operation for pagination reader in SDK
    
    Why is this PR needed?
    Currently, the SDK pagination reader does not support filter expressions, and it also returns wrong results after IUD operations performed through the SDK.
    
    What changes were proposed in this PR?
    If a filter is present, or an update/delete operation has occurred, get the total rows in the splits after building the carbon reader; otherwise get the row count from the detail info of each split.
    Handled ArrayIndexOutOfBoundsException and return zero when rowCountInSplits.size() == 0.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4068
    nihal0107 authored and ajantha-bhat committed Jan 5, 2021
    Commit 44db434

Commits on Jan 6, 2021

  1. [CARBONDATA-4070] [CARBONDATA-4059] Fixed SI issues and improved FT

    Why is this PR needed?
    1. Block SI creation on binary column.
    2. Block alter table drop column directly on SI table.
    3. Create table as like should not be allowed for SI tables.
    4. Filter with like should not scan SI table.
    5. Compaction is currently allowed on SI tables. Because of this, if only the SI table
    is compacted, running a filter query on the main table causes a larger data
    scan of the SI table, which degrades performance.
    
    What changes were proposed in this PR?
    1. Blocked SI creation on binary column.
    2. Blocked alter table drop column directly on SI table.
    3. Handled Create table as like for SI tables.
    4. Handled filter with like to not scan SI table.
    5. Block the direct compaction on SI table and add FTs for compaction scenario of SI.
    6. Added FT for compression and range column on SI table.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4037
    nihal0107 authored and Indhumathi27 committed Jan 6, 2021
    Commit 4d8a01f

Commits on Jan 11, 2021

  1. [CARBONDATA-4065] Support MERGE INTO SQL Command

    Why is this PR needed?
    To support the MERGE INTO SQL command in CarbonData.
    The previous Scala parser had trouble parsing the complicated MERGE INTO SQL command.
    
    What changes were proposed in this PR?
    Add an ANTLR parser, and support parsing the MERGE INTO SQL command into a dataset command.
    
    Does this PR introduce any user interface change?
    Yes.
    The PR introduces the MERGE INTO SQL Command.
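
    An illustrative use of the new command from Spark SQL (table and column names are made up; the exact accepted grammar is defined by the new ANTLR parser, and spark is an existing SparkSession):

    ```scala
    spark.sql(
      """MERGE INTO target t
        |USING source s
        |ON t.id = s.id
        |WHEN MATCHED THEN UPDATE SET t.value = s.value
        |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
        |""".stripMargin)
    ```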
    
    Is any new testcase added?
    Yes
    
    This closes #4032
    
    Co-authored-by: Zhangshunyu <[email protected]>
    2 people authored and QiangCai committed Jan 11, 2021
    Commit e019806

Commits on Jan 19, 2021

  1. [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions

    Why is this PR needed?
    Since version 2.0, Carbon supports starting the Spark ThriftServer with CarbonExtensions.
    
    What changes were proposed in this PR?
    Add documentation on starting the Spark ThriftServer with CarbonExtensions.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4077
    QiangCai authored and ajantha-bhat committed Jan 19, 2021
    Commit 2129466

Commits on Jan 21, 2021

  1. [CARBONDATA-4055] Fix creation of empty segment directory and meta entry when there is no update/insert data
    
    Why is this PR needed?
    1. After #3999, when an update happens on the table, a new segment
    is created for the updated data. But when there is no data to update,
    segments are still created, and the table status has in-progress
    entries for those empty segments. This leads to unnecessary segment
    directories and an increase in table status entries.
    2. After this, clean files does not clean these empty segments.
    3. When the source table has no data, CTAS results in the same
    problem mentioned above.
    
    What changes were proposed in this PR?
    When no data is present during an update, mark the segment as marked-for-delete
    so that clean files takes care of deleting the segment.
    CTAS was already handled; added test cases.
    
    This closes #4018
    akashrn5 authored and kunal642 committed Jan 21, 2021
    Commit aa2121e

Commits on Jan 22, 2021

  1. [CARBONDATA-4096] SDK read fails from cluster and SDK read filter query on sort column giving wrong result with IndexServer
    
    Why is this PR needed?
    1. Creating a table and reading from SDK-written files fails in a cluster with
    java.nio.file.NoSuchFileException: hdfs:/hacluster/user/hive/warehouse/carbon.store/default/sdk.
    2. After fixing the above path issue, a filter query on a sort column gives
    the wrong result with IndexServer.
    
    What changes were proposed in this PR?
    1. In getAllDeleteDeltaFiles, used CarbonFiles.listFiles instead of Files.walk
    to handle custom file types.
    2. In PruneWithFilter, isResolvedOnSegment is used in the filterResolver step.
    Set the table and expression on the executor side, so the index server can use them
    in the filterResolver step.
    
    This closes #4064
    ShreelekhyaG authored and kunal642 committed Jan 22, 2021
    Commit 7585656

Commits on Jan 25, 2021

  1. [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement
    
    Why is this PR needed?
    Spatial index feature optimization of CarbonData
    
    What changes were proposed in this PR?
    1. Update the spatial index encoding algorithm, which reduces the properties required to create a geo table.
    2. Enhance the geo query UDFs: support querying a geo table with a polygon list, polyline list, or geoId range list, and add some geo-transforming utility UDFs.
    3. Data loading (both LOAD and INSERT INTO) allows the user to supply the spatial index column, which is still generated internally when the user does not provide it.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4012
    shenjiayu17 authored and ajantha-bhat committed Jan 25, 2021
    Commit 5971417

Commits on Jan 27, 2021

  1. [CARBONDATA-4097] ColumnVectors should not be initialized as ColumnVectorWrapperDirect for alter tables
    
    Why is this PR needed?
    Direct filling of column vectors is not allowed for altered tables,
    but their column vectors were being initialized as ColumnVectorWrapperDirect.
    
    What changes were proposed in this PR?
    Changed the initialization of column vectors to ColumnVectorWrapper
    for altered tables.
    
    This closes #4062
    Karan980 authored and kunal642 committed Jan 27, 2021
    Commit f5e35cd

Commits on Jan 29, 2021

  1. [CARBONDATA-4104] Vector filling for complex decimal type needs to be handled
    
    Why is this PR needed?
    Vector filling for complex decimal types whose precision is greater than 18 is not handled properly.
    For example:
    array<decimal(20,3)>
    
    What changes were proposed in this PR?
    Ensured proper vector filling based on the page data type.
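
    An illustrative branch, not the actual CarbonData code; longFromBytes is an assumed helper. The point is that unscaled values of decimals with precision <= 18 fit in a long, while higher precision values must be rebuilt from the page's raw bytes:

    ```scala
    import java.math.{BigDecimal, BigInteger}

    def toDecimal(unscaled: Array[Byte], precision: Int, scale: Int): BigDecimal =
      if (precision <= 18) {
        // assumed helper: reads the unscaled long from the page bytes
        BigDecimal.valueOf(longFromBytes(unscaled), scale)
      } else {
        // precision > 18: decode via BigInteger from the raw bytes
        new BigDecimal(new BigInteger(unscaled), scale)
      }
    ```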
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4073
    akkio-97 authored and ajantha-bhat committed Jan 29, 2021
    Commit 54f8697

Commits on Jan 30, 2021

  1. [CARBONDATA-4109] Improve carbondata coverage for presto-integration code
    
    Why is this PR needed?
    A few scenarios had missing coverage in the presto-integration code. This PR aims to improve coverage by covering all such scenarios.
    
    Dead code: ObjectStreamReader.java was created with the aim of querying complex types, but ComplexTypeStreamReader was created instead, making ObjectStreamReader obsolete.
    
    What changes were proposed in this PR?
    Test cases added for scenarios in the presto-integration code that were not covered earlier.
    Removed dead code.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4074
    akkio-97 authored and ajantha-bhat committed Jan 30, 2021
    Commit 46a46a0

Commits on Feb 2, 2021

  1. [CARBONDATA-4112] Data mismatch issue in SI global sort merge flow

    Why is this PR needed?
    When the data files of an SI segment are merged, the SI table ends up with more rows than the main table.
    
    What changes were proposed in this PR?
    The CARBON_INPUT_SEGMENT property was not set before creating the DataFrame from the SI segment, so the DataFrame was created from all the rows in the table rather than only from the particular segment.
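
    A hypothetical sketch of scoping a read to one segment before building the DataFrame (carbon.input.segments.<db>.<table> is the documented CarbonData property for this; the names here are illustrative):

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    def readSegment(spark: SparkSession, db: String, table: String,
                    segmentId: String): DataFrame = {
      // restrict subsequent reads of this table to the given segment
      spark.sql(s"SET carbon.input.segments.$db.$table = $segmentId")
      spark.sql(s"SELECT * FROM $db.$table")
    }
    ```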
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4083
    Karan980 authored and ajantha-bhat committed Feb 2, 2021
    Commit 5a2edc3
  2. [CARBONDATA-4113] Partition prune and cache fix when carbon.read.partition.hive.direct is disabled
    
    Why is this PR needed?
    When carbon.read.partition.hive.direct is false, select queries on a
    partition table give invalid results. For a single partition, partition
    values are appended to form a wrong path when loaded by the same segment.
    Example: for a partition on column b, path: /tablepath/b=1/b=2
    
    What changes were proposed in this PR?
    In PartitionCacheManager, changes were made to handle single and multiple partitions.
    Encoded the URI path to handle space characters in the string.
    
    This closes #4084
    ShreelekhyaG authored and kunal642 committed Feb 2, 2021
    Commit 440ab03

Commits on Feb 4, 2021

  1. [CARBONDATA-4082] Fix alter table add segment query on adding a segment having delete delta files
    
    Why is this PR needed?
    When a segment is added to a carbon table by an alter table add segment query
    and that segment also has a delete delta file present in it, then on querying
    the carbon table the deleted rows appear in the result.
    
    What changes were proposed in this PR?
    Update the tableStatus and tableUpdateStatus files in the correct way for
    segments having delete delta files.
    
    This closes #4070
    Karan980 authored and kunal642 committed Feb 4, 2021
    Commit aa7efda
  2. [CARBONDATA-4107] Added related MV tables Map to fact table and added lock while touchMDTFile
    
    Why is this PR needed?
    1. After the MV multi-tenancy PR, the mv system folder moved to database level. Hence,
    during each operation (insert/load/IUD/show MV/query), we list all the databases in the
    system, collect the MV schemas, and check whether any MV is mapped to the table.
    Collecting MV schemas from all databases degrades query performance, whether or not
    the table actually has an MV.
    
    2. When different JVM processes call the touchMDTFile method, file creation and deletion can
    happen at the same time. This may fail the operation.
    
    What changes were proposed in this PR?
    1. Added a table property relatedMVTablesMap to the fact tables of an MV during MV creation. During
    any operation, check whether the table has an MV using this property; if it does,
    collect schemas from only the related databases. In this way, we avoid collecting MV schemas
    for tables which have no MV.
    
    2. Take a global-level lock on the system folder location to update the last modified time.
    
    NOTE: For compatibility scenarios, a refresh MV operation can be performed to update these table properties.
    
    Does this PR introduce any user interface change?
    Yes.
    For compatibility scenarios, a refresh MV operation can be performed to update these table properties.
    
    Is any new testcase added?
    No
    
    This closes #4076
    Indhumathi27 authored and akashrn5 committed Feb 4, 2021
    Commit 9b04540
  3. [CARBONDATA-4111] Filter query having invalid results after add segment to table having SI with Indexserver
    
    Why is this PR needed?
    When the index server is enabled, a filter query on an SI column after alter table
    add SDK segment on the main table throws NoSuchMethodException, and the rows added
    by the SDK segment are not returned in the result.
    
    What changes were proposed in this PR?
    Added the segment path in the index server flow, as it is used to identify an external segment
    in the filter resolver step. There is no need to load to SI if it is an add load command.
    Declared a default constructor for SegmentWrapperContainer.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4080
    ShreelekhyaG authored and Indhumathi27 committed Feb 4, 2021
    Commit afbf531
  4. [CARBONDATA-4102] Added UT and FT to improve coverage of SI module.

    Why is this PR needed?
    Added UT and FT to improve coverage of SI module and also removed the dead or unused code.
    
    What changes were proposed in this PR?
    Added UT and FT to improve coverage of SI module and also removed the dead or unused code.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4071
    nihal0107 authored and Indhumathi27 committed Feb 4, 2021
    Commit ec1c0ca

Commits on Feb 10, 2021

  1. [CARBONDATA-4122] Use CarbonFile API instead of java File API for Flink CarbonLocalWriter
    
    Why is this PR needed?
    Currently, only two writers (Local & S3) are supported for Flink carbon streaming. If a user wants to ingest data from Flink in carbon format directly into an HDFS carbon table, there is no writer type to support it.
    
    What changes were proposed in this PR?
    Since the code for writing Flink stage data is the same for the local and HDFS file systems, we can use the existing CarbonLocalWriter to write data into HDFS by using the CarbonFile API instead of the Java File API.

    Changed the code to use the CarbonFile API instead of java.io.File.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4090
    Indhumathi27 authored and ajantha-bhat committed Feb 10, 2021
    Commit 115182d

Commits on Feb 17, 2021

  1. [CARBONDATA-4125] SI compatibility issue fix

    Why is this PR needed?
    Currently, while upgrading a table store with SI, we have to execute the REFRESH TABLE and
    REGISTER INDEX commands to refresh and register the index to the main table. Also, during
    SI creation we add a property 'indexTableExists' to the main table, to identify whether the table
    has an SI. If a table has an SI, then we load the index information for that table
    from Hive {org.apache.spark.sql.secondaryindex.hive.CarbonInternalMetastore#refreshIndexInfo}.
    indexTableExists defaults to 'false' for all tables which have no SI, and for
    SI tables this property is not added.

    {org.apache.spark.sql.secondaryindex.hive.CarbonInternalMetastore#refreshIndexInfo} will
    be called on any command to refresh the indexInfo. The indexTableExists property should be either
    true (main table) or null (SI) in order to get the index information from Hive and set it on the
    carbon table.

    Issue 1:
    While upgrading tables with SI, after refreshing the main table and SI, if the user runs any operation
    such as select or show cache, the indexTableExists property is set to false. After register
    index, on any operation involving the SI (load or select),
    {org.apache.spark.sql.secondaryindex.hive.CarbonInternalMetastore#refreshIndexInfo} does not
    update the index information on the SI table, since indexTableExists is false. Hence, loads to the SI
    will fail.

    Issue 2:
    While upgrading tables with SI, after refreshing the main table and SI, if the user performs any operation
    like update, alter, or delete on the SI table, registering it as an index does not validate
    the alter operations done on that table.
    
    What changes were proposed in this PR?
    Issue 1:
    While registering an SI table as an index, check whether the SI table has the indexTableExists property and
    remove it. For an already registered index, allow re-registering the index to remove the property.

    Issue 2:
    Added validations to check whether the SI has undergone a load/update/delete/alter operation before
    registering it as an index, and throw an exception if so.
    
    This closes #4087
    Indhumathi27 authored and akashrn5 committed Feb 17, 2021
    Commit 791857b
  2. [CARBONDATA-4124] Fix Refresh MV which does not exist error message

    Why is this PR needed?
    Refreshing an MV which does not exist does not throw a proper carbon error message;
    it throws a "Table NOT found" message from Spark. This is because getSchema
    returns null if the schema is not present.
    
    What changes were proposed in this PR?
    1. Check if getSchema is null and throw a "No such MV" exception.
    2. While dropping a table, drop the MV first and then drop the fact table from the metastore, to avoid
    a NullPointerException when trying to access the fact table while dropping the MV.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4091
    Indhumathi27 authored and akashrn5 committed Feb 17, 2021
    Commit 91f1b69
  3. [CARBONDATA-4117][CARBONDATA-4123] cg index and bloom index query issue with Index server
    
    Why is this PR needed?
    1. A test CG index query with the index server fails with an NPE: while initializing the index model,
    a parsing error is thrown when trying to uncompress with snappy.
    2. A bloom index query with the index server gives incorrect results when splits have more than one blocklet.
    Blocklet-level details are not serialized for the index server, as it is treated as a block-level cache.
    
    What changes were proposed in this PR?
    1. Set the segment and schema details on the BlockletIndexInputSplit object. While writing the
    min/max object, write the byte size instead of the position.
    2. Create a BlockletIndex when the bloom filter is used, so that in the createBlocklet step isBlockCache
    is set to false.
    
    This closes #4089
    ShreelekhyaG authored and kunal642 committed Feb 17, 2021
    Commit 3f1db97

Commits on Feb 18, 2021

  1. [CARBONDATA-3962] Fixed concurrent load failure with flat folder structure.
    
    Why is this PR needed?
    PR #3904 added code to remove the fact directory, and because of this a concurrent
    load fails with a file-not-found exception.
    
    What changes were proposed in this PR?
    Reverted PR 3904.
    
    This closes #4905
    nihal0107 authored and Indhumathi27 committed Feb 18, 2021
    Commit 1cab165
  2. [CARBONDATA-4126] Concurrent compaction failed with load on table

    Why is this PR needed?
    Concurrent compaction was failing when run in parallel with a load.
    During a load we acquire a SegmentLock for the particular segment; when
    we try to acquire that same lock during compaction, we cannot, and
    the compaction fails.
    
    What changes were proposed in this PR?
    Skip compaction for segments whose SegmentLock cannot be acquired,
    instead of throwing an exception; see the sketch below.
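
    A minimal sketch of the new behaviour; candidateSegments, acquireSegmentLock and log are placeholders, not exact CarbonData APIs:

    ```scala
    // segments whose lock is held by an ongoing load are skipped
    // rather than failing the whole compaction
    val compactableSegments = candidateSegments.filter { segmentId =>
      val acquired = acquireSegmentLock(segmentId) // placeholder helper
      if (!acquired) {
        log(s"Skipping segment $segmentId: lock held by a concurrent load")
      }
      acquired
    }
    ```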
    
    This closes #4093
    Karan980 authored and kunal642 committed Feb 18, 2021
    Commit 5ec3536
  3. [CARBONDATA-4121] Prepriming is not working in Index Server

    Why is this PR needed?
    Pre-priming is not working in the index server. Server.getRemoteUser
    returns null in the async pre-priming call, which results in an
    NPE and crashes the index server application. Issue introduced by PR #3952.
    
    What changes were proposed in this PR?
    Compute the Server.getRemoteUser value before making the async pre-priming
    call, and then use that same value inside the async call. Code reset to the state before PR #3952.
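
    A sketch of the fix (names are illustrative): capture the caller's identity on the RPC handler thread, where Server.getRemoteUser() is valid, and pass it into the async task instead of reading it there, where it is null:

    ```scala
    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

    def triggerPrepriming(remoteUser: String): Unit = Future {
      // run pre-priming on behalf of remoteUser; never call
      // Server.getRemoteUser() inside this async block
    }
    ```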
    
    This closes #4088
    Karan980 authored and kunal642 committed Feb 18, 2021
    Commit 59ad77a

Commits on Mar 3, 2021

  1. [CARBONDATA-4115] Successful load and insert will return segment ID

    Why is this PR needed?
    Currently a successful load or insert SQL statement returns an empty Seq in CarbonData; we need it to return the segment ID.
    
    What changes were proposed in this PR?
    Successful load and insert will return segment ID.
    
    Does this PR introduce any user interface change?
    Yes. (Successful load and insert will return segment ID.)
    
    Is any new testcase added?
    Yes
    
    This closes #4086
    areyouokfreejoe authored and ajantha-bhat committed Mar 3, 2021
    Commit 0112268

Commits on Mar 5, 2021

  1. [CARBONDATA-4137] Refactor CarbonDataSourceScan without the sources.Filter of Spark 3
    
    Why is this PR needed?
    1. In Spark 3, org.apache.spark.sql.sources.Filter is sealed, so carbon can't extend it in carbon code.
    2. The name of the CarbonLateDecodeStrategy class is misleading, and the code is complex and hard to read.
    3. CarbonDataSourceScan can be the same for 2.3 and 2.4, and should support both batch reading and row reading.
    
    What changes were proposed in this PR?
    1. Translate Spark Expressions to carbon Expressions directly, skipping the Spark Filter step. Remove all Spark Filters from carbon code.
      old flow: Spark Expression => Spark Filter => Carbon Expression
      new flow: Spark Expression => Carbon Expression
    2. Remove the filter reordering; expression reordering still needs to be implemented (added CARBONDATA-4138).
    3. Separate CarbonLateDecodeStrategy into CarbonSourceStrategy and DMLStrategy, and simplify the code of CarbonSourceStrategy.
    4. Move CarbonDataSourceScan back to the source folder and use one CarbonDataSourceScan for all versions.
      CarbonDataSourceScan supports both VectorReader and RowReader; Carbon will not use RowDataSourceScanExec.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    QiangCai authored and MarvinLitt committed Mar 5, 2021
    Commit 8f2ee7f

Commits on Mar 9, 2021

  1. [CARBONDATA-4133] Concurrent Insert Overwrite with static partition on Index server fails
    
    Why is this PR needed?
    Concurrent insert overwrite with a static partition on the index server fails. When the index server
    and pre-priming are enabled, pre-priming is triggered even when the load fails, because the call sits in a finally block.
    There is also performance degradation with the index server due to #4080.
    
    What changes were proposed in this PR?
    Removed the triggerPrepriming call from the finally block; see the sketch below.
    Reverted #4080 and used a boolean flag to identify the external segment.
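
    A sketch of the control-flow change; doLoad, triggerPrepriming and releaseLocks are placeholders:

    ```scala
    try {
      doLoad()
      triggerPrepriming() // moved out of finally: never runs if doLoad() throws
    } finally {
      releaseLocks()      // cleanup still always runs
    }
    ```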
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No, tested in cluster.
    
    This closes #4096
    ShreelekhyaG authored and Indhumathi27 committed Mar 9, 2021
    Commit 35c4b33

Commits on Mar 10, 2021

  1. [CARBONDATA-4141] Index Server is not caching indexes for external tables with sdk segments
    
    Why is this PR needed?
    Indexes cached in the executor cache are not dropped when drop table is called for an external table
    with SDK segments, because external tables with SDK segments have no metadata such as a table
    status file. So in the drop table command we send zero segments to the index server's clearIndexes job,
    which clears nothing on the executor side. So when we drop this type of table, the executor-side
    indexes are not dropped. Now when we again create an external table at the same location and run
    select * or select count(*), the indexes for this table are not cached, because indexes for the
    same location are already present. Show metacache on this newly created table uses the new tableId,
    but the indexes present have the old tableId, whose table is already dropped. So show metacache returns
    nothing, because of the tableId mismatch.
    
    What changes were proposed in this PR?
    Prepared the validSegments from the index files present at the external table location and sent them to the index server's clearIndexes job through IndexInputFormat.
    
    This closes #4099
    Karan980 authored and kunal642 committed Mar 10, 2021
    Commit 25c5687

Commits on Mar 12, 2021

  1. [CARBONDATA-4075] Using withEvents instead of fireEvent

    Why is this PR needed?
    The withEvents method simplifies the code that fires events.
    
    What changes were proposed in this PR?
    Refactor code to use the withEvents method instead of fireEvent
    
    This closes #4078
    QiangCai authored and Indhumathi27 committed Mar 12, 2021
    Commit d5b3b8c

Commits on Mar 15, 2021

  1. [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation
    
    Why is this PR needed?
    Currently, in the clean files operation, the user does not know how much space will be freed.
    The idea is to add support for a dry run in clean files, which tells the user how much space
    the clean files operation will free without cleaning the actual data.
    
    What changes were proposed in this PR?
    This PR has the following changes:
    
    1. Support dry run in clean files: show the user how much space the
       clean files operation will free and how much space is left (releasable after the expiration time)
       after the operation; see the usage sketch after this list.
    2. Clean files output: total size released during the clean files operation.
    3. Option to disable clean files statistics, in case the user does not want them.
    4. Clean files log: enhance the clean files log to print the name of every file being
    deleted, in the info log.
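
    Hypothetical usage of the dry run option (the option name follows the feature description here; verify the exact syntax against the CarbonData documentation):

    ```scala
    spark.sql("CLEAN FILES FOR TABLE mydb.mytable OPTIONS('dryrun'='true')").show()
    // reports the space that would be freed, without deleting any data
    ```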
    
    This closes #4072
    vikramahuja1001 authored and akashrn5 committed Mar 15, 2021
    Commit d9f69ae

Commits on Mar 19, 2021

  1. [CARBONDATA-4144] During compaction, the segment lock of SI table is not released in abnormal scenarios.
    
    Why is this PR needed?
    When a compaction operation fails, the segment lock of the SI table is not released. On running compaction again,
    the segment lock of the SI table cannot be obtained and compaction does nothing, but in the tablestatus
    file of the SI table the merged segment status is set to success, the segment file is
    xxx_null.segments, and the value of indexsize is 0.
    
    What changes were proposed in this PR?
    If an exception occurs, release the obtained segment locks.
    If getting the segment locks failed, do not update the segment status.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4102
    liuhe0702 authored and akashrn5 committed Mar 19, 2021
    Commit bce6481
  2. [CARBONDATA-4145] Query fails and the message "File does not exist: xxxx.carbondata" is displayed
    
    Why is this PR needed?
    If an exception occurs while the refresh index command is executed after a task has already
    succeeded, subsequent queries fail.
    Reason: after the compaction task executes successfully, the old carbondata files are
    deleted. If another exception then occurs, the deleted files are found to be missing.
    This PR fixes this issue.
    
    What changes were proposed in this PR?
    When all tasks are successful, the driver deletes the old carbondata files.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4103
    liuhe0702 authored and akashrn5 committed Mar 19, 2021
    Commit a4921e9
  3. [CARBONDATA-4149] Fix query issues after alter add partition.

    Why is this PR needed?
    A query with SI after add partition based on location on a partition table gives incorrect results.
    1. While pruning, if it's an external segment, it should use ExternalSegmentResolver, and there is no
       need to use ImplicitIncludeFilterExecutor, as an external segment is not added in the SI table.
    2. If the partition table has external partitions, after compaction the new files are loaded
       to the external path.
    3. Data is not loaded to the child table (MV) after executing the add partition command.
    
    What changes were proposed in this PR?
    1. Add the path to loadMetadataDetails for an external partition. It is used to identify the partition as an
       external segment.
    2. After compaction, to avoid keeping any link to the external partition, the compacted files
       are added as a new partition in the table. To update the partition spec details in the hive metastore,
       (drop partition + add partition) operations are performed.
    3. Added Load Pre and Post listeners in CarbonAlterTableAddHivePartitionCommand to trigger the data
       load to the materialized view.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4107
    ShreelekhyaG authored and Indhumathi27 committed Mar 19, 2021
    Commit b00efca
  4. [CARBONDATA-4148] Reindex failed when SI has stale carbonindexmerge file

    Why is this PR needed?
    Reindex fails when the SI has a stale carbonindexmerge file, throwing a FileNotFoundException.
    This is because SegmentFileStore.getIndexFiles stores the mapping of index file to index merge file;
    when a stale carbon index merge file exists, the index merge file will not be null. During index file
    merging, a new index merge file is created with the same name as before in the same location.
    At the end of CarbonIndexFileMergeWriter.writeMergeIndexFileBasedOnSegmentFile, the carbon index
    files are deleted. Since the index merge file is stored in the indexFiles list, the newly created
    index merge file is also deleted, which leads to the FileNotFoundException.
    
    What changes were proposed in this PR?
    1. SegmentFileStore.getIndexFiles no longer stores the mapping of index file to index merge file, which was redundant.
    2. SegmentFileStore.getIndexOrMergeFiles returns both index files and index merge files, so the
       function name was incorrect; renamed to getIndexAndMergeFiles.
    3. CarbonLoaderUtil.getActiveExecutor actually gets the active node, so the function name was incorrect;
       renamed to getActiveNode, and replaced all "executor" with "node" in the function assignBlocksByDataLocality.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4105
    jack86596 authored and Indhumathi27 committed Mar 19, 2021
    Commit b74645e

Commits on Mar 21, 2021

  1. [CARBONDATA-4147] Fix re-arrange schema in logical relation on MV partition table having sort column
    
    Why is this PR needed?
    After PR-3615, we avoid rearranging the catalog table schema if it is already re-arranged.
    For an MV on a partition table, we always move the partition column to the end of the MV partition table.
    The catalog table will also have the column schema in the same order (partition column last). Hence, in
    this case, we do not re-arrange the logical relation in the catalog table again.

    But if there is a sort column present in the MV table, then the selected column schema and the catalog table
    schema will not be in the same order. In that case, we have to re-arrange the catalog table schema.
    Currently, we use rearrangedIndex to re-arrange the catalog table logical relation, but
    rearrangedIndex keeps the index of the partition column at the end, whereas the catalog table already has the
    partition column at the end. Hence, we re-arrange the partition column index
    again in the catalog table relation, which leads to insertion failure.
    
    Example:
    Create MV on columns: c1, c2 (partition), c3(sort_column), c4
    Problem:
    Create order: c1,c2,c3,c4
    Create order index: 0,1,2,3
    
    Rearranged Index:
    Existing Catalog table schema order: c1, c3, c4, c2 (for MV, partition column will be moved to Last)
    Rearrange index: 2,0,3,1
    After re-arrange, catalog table order: c4, c1, c2, c3 (which is wrong)
    
    Solution:
    Change MV create order as below
    New Create order: c1,c4,c3,c2
    Create order index: 0,1,2,3
    
    Rearranged Index:
    Existing Catalog table schema order: c1, c3, c4, c2 (for MV, partition column will be moved to Last)
    Rearrange index: 1,0,2,3
    After Re-arrange catalog table order: c3,c1,c4,c2
    
    What changes were proposed in this PR?
    In MV case, if there is any column schema order change apart from partition column, then re-arrange
    index of only those columns and use the same to re-arrange catalog table logical relation.
    
    This closes #4106
    Indhumathi27 authored and akashrn5 committed Mar 21, 2021
    Commit 8d17de6

Commits on Mar 23, 2021

  1. [CARBONDATA-4146] Query fails and the error message "unable to get file status" is displayed

    The query is normal after the "drop metacache on table" command is executed.
    
    Why is this PR needed?
    During compact execution, the status of the new segment is set to success before index
    files are merged. After index files are merged, the carbonindex files are deleted.
    As a result, the query task cannot find the cached carbonindex files.
    
    What changes were proposed in this PR?
    Set the status of the new segment to succeeded after index files are merged.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4104
    liuhe0702 authored and akashrn5 committed Mar 23, 2021
    Commit 6ab3647
  2. [CARBONDATA-4153] Fix DoNot Push down not equal to filter with Cast on SI
    
    Why is this PR needed?
    A NOT EQUAL TO filter on an SI index column should not be pushed down to the SI table.
    Currently, where x != '2' is not pushed down to SI, but where x != 2 is pushed down to SI.

    This is because "x != 2" is wrapped in a CAST expression like NOT EQUAL TO(cast(x as int) = 2).
    
    What changes were proposed in this PR?
    Handle the CAST case while checking whether a filter must not be pushed down to SI; an illustrative pattern follows below.
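
    An illustrative pattern using Spark Catalyst expression types (indexCols is an assumed set of SI column names, not a CarbonData API):

    ```scala
    import org.apache.spark.sql.catalyst.expressions._

    // unwrap a Cast around the attribute before deciding that a not-equal-to
    // filter touches an index column and must NOT be pushed down to SI
    def isNotEqualOnIndexColumn(expr: Expression, indexCols: Set[String]): Boolean =
      expr match {
        case Not(EqualTo(a: AttributeReference, _)) => indexCols.contains(a.name)
        case Not(EqualTo(c: Cast, _)) => c.child match {
          case a: AttributeReference => indexCols.contains(a.name)
          case _ => false
        }
        case _ => false
      }
    ```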
    
    This closes #4108
    Indhumathi27 authored and kunal642 committed Mar 23, 2021
    Commit fd0ff22
  3. [CARBONDATA-4155] Fix Create table like table with MV

    Why is this PR needed?
    PR-4076 added a new table property to the fact table.
    While executing the create table like command, this property
    is not excluded, which leads to a parsing exception.
    
    What changes were proposed in this PR?
    Remove MV related info from destination table properties
    
    This closes #4111
    Indhumathi27 authored and kunal642 committed Mar 23, 2021
    Commit 0f53bdb
  4. [CARBONDATA-4149] Fix query issues after alter add empty partition location
    
    Why is this PR needed?
    A query with SI after add partition based on an empty location on a partition
    table gives incorrect results. PR 4107 fixed the issue for add
    partition when the location is not empty.
    
    What changes were proposed in this PR?
    While creating the block id, get the segment number from the file name for
    the external partition. This block id is added to SI and used
    for pruning. To identify an external partition during the compaction
    process, instead of checking against loadmetapath, check whether the
    file path starts with the table path (filepath.startswith(tablepath)).
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4112
    ShreelekhyaG authored and Indhumathi27 committed Mar 23, 2021
    Commit f5e4c89

Commits on Mar 25, 2021

  1. [CARBONDATA-4156] Fix Writing Segment Min max with all blocks of a segment
    
    Why is this PR needed?
    PR-3999 removed some code related to getting the segment min/max from all blocks.
    Because of this, if a segment has more than one block, the min/max is currently
    written considering only one block.
    
    What changes were proposed in this PR?
    Reverted the specific code from the above PR. Removed unwanted synchronization from some methods.
    
    This closes #4101
    Indhumathi27 authored and kunal642 committed Mar 25, 2021
    Commit 865ec9b

Commits on Mar 26, 2021

  1. [CARBONDATA-4154] Fix various concurrent issues with clean files

    Why is this PR needed?
    There are two issues in the clean files operation when run concurrently with multiple load operations:

    1. Dry run can show negative space freed for clean files under a concurrent load.
    2. Accidental deletion of an insert-in-progress (ongoing load) segment during the clean files operation.

    What changes were proposed in this PR?
    To solve the negative dry-run result: save the old loadMetadataDetails before the clean files operation and compare it with the loadMetadataDetails after the operation, ignoring any newly added entry; in effect, take the intersection of the new and old details to report the correct space freed (see the sketch after this section).
    For the load failure issue: there are scenarios where a load is ongoing (insert-in-progress state, segment lock held), and by the time the clean files operation releases the final table status lock, the load has completed and released its segment lock, yet in the final list of loadMetadataDetails to be deleted that load can still appear as insert-in-progress with its segment lock released. The clean files operation would delete such loads. To solve this, instead of passing a boolean that decides whether the table status needs updating, pass a list of load numbers and delete only those load numbers.
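
    A sketch of the intersection idea; oldDetails/newDetails are placeholder snapshots, and getLoadName mirrors the LoadMetadataDetails accessor (treat the exact name as an assumption):

    ```scala
    // only segments present in BOTH snapshots are compared, so a load that
    // started mid-operation cannot make the reported freed space negative
    val before: Set[String] = oldDetails.map(_.getLoadName).toSet
    val comparable = newDetails.filter(d => before.contains(d.getLoadName))
    ```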
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4109
    vikramahuja1001 authored and ajantha-bhat committed Mar 26, 2021
    Commit d535a1e

Commits on Mar 28, 2021

  1. add .asf.yaml

    chenliang613 committed Mar 28, 2021
    Commit 4ec3e58

Commits on Mar 29, 2021

  1. Commit 603133f
  2. Commit baa1f69

Commits on Apr 15, 2021

  1. Enable github's merge function

    chenliang613 authored Apr 15, 2021
    Commit db8666c
  2. Configuration menu
    Copy the full SHA
    6be1691 View commit details
    Browse the repository at this point in the history

Commits on Apr 19, 2021

  1. [CARBONDATA-4161] Describe complex columns

    Why is this PR needed?
    Currently, describe formatted displays the column information
    of a table plus some additional information. When complex
    types such as ARRAY, STRUCT, and MAP are present in the
    table, the column definition can be long and difficult to
    read in a nested format.
    
    What changes were proposed in this PR?
    The DESCRIBE output can be formatted to avoid long lines
    for multiple fields. We can pass the column name to the
    command and visualize its structure with child fields.
    
    Does this PR introduce any user interface change?
    Yes.
    DDL Commands:
    DESCRIBE COLUMN fieldname ON [db_name.]table_name;
    DESCRIBE short [db_name.]table_name;
    
    Is any new testcase added?
    Yes
    
    This closes #4113
    ShreelekhyaG authored and Indhumathi27 committed Apr 19, 2021
    Commit f67c8fa

Commits on Apr 20, 2021

  1. [CARBONDATA-4163] Support adding of single-level complex columns(array/struct)
    
    Why is this PR needed?
    This PR enables adding single-level complex columns (only array and struct)
    to a carbon table. Commands -
    ALTER TABLE <table_name> ADD COLUMNS(arr1 ARRAY (double) )
    ALTER TABLE <table_name> ADD COLUMNS(struct1 STRUCT<a:int, b:string>)
    The default value for the column in case of old rows will be null.
    
    What changes were proposed in this PR?
    1. Create instances of ColumnSchema for each of the children. By doing this,
       each child column has its own ordinal. The new columns are first
       identified and stored in a flat structure. For example, for arr1 array(int),
       two column schemas are created - arr1 and arr1.val, the first being the parent
       and the second its child, each with its own ordinal.
    2. Later, while updating the schema evolution entry, we only account for the newly
       added parent columns while discarding children columns (as they are no longer
       required; otherwise we would have the child as a separate column in the schema).
    3. Using the schema evolution entry, the final schema is updated. Since ColumnSchemas
       are stored as a flat structure, we later convert them to a nested structure of type Dimensions.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4115
    akkio-97 authored and Indhumathi27 committed Apr 20, 2021
    Commit d01d9f5
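
    A minimal sketch of the commands above, assuming a `spark` session and
    standard Spark type syntax; the table name t1 is illustrative:

      spark.sql("CREATE TABLE t1(id INT) STORED AS carbondata")
      spark.sql("INSERT INTO t1 SELECT 1")
      // add single-level complex columns; old rows read back null for them
      spark.sql("ALTER TABLE t1 ADD COLUMNS(arr1 ARRAY<double>)")
      spark.sql("ALTER TABLE t1 ADD COLUMNS(struct1 STRUCT<a:int, b:string>)")
      spark.sql("SELECT id, arr1, struct1 FROM t1").show(false)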

Commits on Apr 22, 2021

  1. [CARBONDATA-4158]Add Secondary Index as a coarse-grain index and use …

    …secondary indexes for Presto queries
    
    Why is this PR needed?
    At present, secondary indexes are leveraged for query pruning via spark plan modification.
    This approach is tightly coupled with spark because the plan modification is specific to
    the spark engine. To use secondary indexes for Presto or Hive queries, it is not
    feasible to modify the query plans as the current approach requires. Thus the need arises
    for an engine-agnostic approach to using secondary indexes in query pruning.
    
    What changes were proposed in this PR?
    1. Add Secondary Index as a coarse-grain index.
    2. Add a new insegment() UDF to support querying within particular segments
    3. Control the use of Secondary Index as a coarse-grain index for pruning with
    the property 'carbon.coarse.grain.secondary.index'
    4. Use the Index Server driver for Secondary Index pruning
    5. Use Secondary Indexes with Presto queries
    
    This closes #4110
    VenuReddy2103 authored and kunal642 committed Apr 22, 2021
    Commit 09ad509
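
    A hedged sketch of items 2 and 3 above, assuming a `spark` session; the
    table name and segment ids are illustrative, and the exact insegment()
    argument format is an assumption:

      // enable coarse-grain secondary index pruning
      spark.sql("SET carbon.coarse.grain.secondary.index=true")
      // restrict the query to the given segments via the new insegment() UDF
      spark.sql("SELECT * FROM maintable WHERE insegment('0,1') AND city = 'shenzhen'").show()
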
  2. [CARBONDATA-4037] Improve the table status and segment file writing

    Why is this PR needed?
    Currently, we update the table status and segment files multiple times for a
    single IUD/merge/compact operation and delete the index files immediately after
    merge. When concurrent queries run, a query may try to access segment index
    files that are no longer present, which is an availability issue.
    
    What changes were proposed in this PR?
    1. Generate segment file after merge index, and update table status at the beginning
       and after merge index. If the merge index/table status update fails, the load will also fail.
       order:
       create table status file => index files => merge index => generate segment file => update table status
     * The same order is now maintained for SI, compaction, IUD, addHivePartition, addSegment scenarios.
     * Whenever a segment file needs to be updated for the main table, a new segment file is created
       instead of updating the existing one.
    2. When compact 'segment_index' is triggered,
       For new tables - if there are no index files to merge, a warning is logged and the command exits.
       For old tables - index files are not deleted.
    3. After SI small files merge,
       For newly loaded SI segments - DeleteOldIndexOrMergeFiles deletes them immediately after merge.
       For segments that are already present (rebuild) - old index files and data files are not deleted.
    4. Removed the carbon.merge.index.in.segment property from config-parameters. This property
       is to be used only for debugging/test purposes.
    
    Note: Cleaning of stale index/segment files is to be handled in CARBONDATA-4074
    
    This closes #3988
    ShreelekhyaG authored and akashrn5 committed Apr 22, 2021
    Commit 71910fb

Commits on Apr 26, 2021

  1. [CARBONDATA-4173][CARBONDATA-4174] Fix inverted index query issue and…

    … handle exception for desc column
    
    Why is this PR needed?
    After creating an inverted index on a dimension column, some filter queries give incorrect results.
    Also handle the exception for a non-existing higher-level child column in desc column.
    
    What changes were proposed in this PR?
    While sorting byte arrays with the inverted index, we use the compareTo method of ByteArrayColumnWithRowId. It was sorting based on the last byte only; made changes to sort based on the entire byte length when a dictionary is used.
    Handled the exception and added a testcase.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4124
    ShreelekhyaG authored and ajantha-bhat committed Apr 26, 2021
    Commit 3a6e4a4
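
    The fix amounts to an unsigned lexicographic comparison over the entire byte
    length instead of only the trailing byte; a minimal, self-contained sketch
    of such a comparison (not the actual ByteArrayColumnWithRowId code):

      def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
        val minLen = math.min(a.length, b.length)
        var i = 0
        while (i < minLen) {
          // compare as unsigned bytes so 0xFF sorts after 0x01
          val cmp = (a(i) & 0xFF) - (b(i) & 0xFF)
          if (cmp != 0) return cmp
          i += 1
        }
        a.length - b.length // on a common prefix, the shorter array sorts first
      }
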
  2. [CARBONDATA-4172] Select query having parent and child struct column …

    …in projection returns incorrect results
    
    Why is this PR needed?
    After PR-3574, a scenario was missed during code refactoring.
    Currently, if a select query has both a parent and its child struct column in the projection,
    only the child column is pushed down to carbon for filling the result. For the other columns in the parent struct, the output is null.
    
    What changes were proposed in this PR?
    If the parent struct column is also present in the projection, push down only the parent column to carbon.
    
    This closes #4123
    Indhumathi27 authored and kunal642 committed Apr 26, 2021
    Commit 3b411bb
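
    An illustrative query shape for this fix, assuming a `spark` session;
    table/column names are hypothetical:

      spark.sql("CREATE TABLE t3(id INT, st STRUCT<a:INT, b:STRING>) STORED AS carbondata")
      spark.sql("INSERT INTO t3 SELECT 1, named_struct('a', 10, 'b', 'x')")
      // both the parent struct and a child field are projected; with the fix,
      // only the parent column is pushed down and st is no longer null
      spark.sql("SELECT st, st.a FROM t3").show(false)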

Commits on Apr 27, 2021

  1. [CARBONDATA-4167][CARBONDATA-4168] Fix case sensitive issues and inpu…

    …t validation for Geo values.
    
    Why is this PR needed?
    1. The SPATIAL_INDEX property and the POLYGON, LINESTRING, and RANGELIST UDFs are case sensitive.
    2. SPATIAL_INDEX.xx.gridSize and SPATIAL_INDEX.xxx.conversionRatio accept negative values.
    3. Geo UDFs accept invalid values.
    
    What changes were proposed in this PR?
    1. Converted properties to lower case and made the UDFs case insensitive.
    2. Added validation.
    3. Refactored readAllIIndexOfSegment.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4118
    ShreelekhyaG authored and Indhumathi27 committed Apr 27, 2021
    Commit e5b1dd0
  2. [CARBONDATA-4170] Support dropping of parent complex columns(array/st…

    …ruct/map)
    
    Why is this PR needed?
    This PR supports dropping parent complex columns (single and multi-level)
    from a carbon table. Dropping a parent column will in turn drop all of
    its child columns too.
    
    What changes were proposed in this PR?
    Child columns are prefixed with their parent column name, so the identified
    columns are added to the delete-column list and the schema is updated based
    on that. Test cases cover up to 3 levels of nesting.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4121
    akkio-97 authored and Indhumathi27 committed Apr 27, 2021
    Commit 2f93479
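
    A minimal sketch, assuming a `spark` session; table/column names are
    illustrative:

      spark.sql("CREATE TABLE t2(id INT, str1 STRUCT<a:INT, b:STRUCT<c:STRING>>, arr1 ARRAY<INT>) STORED AS carbondata")
      // dropping the parent removes str1 and all its children (str1.a, str1.b, str1.b.c)
      spark.sql("ALTER TABLE t2 DROP COLUMNS(str1)")
      spark.sql("ALTER TABLE t2 DROP COLUMNS(arr1)")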

Commits on Apr 30, 2021

  1. [HOTFIX] Remove hitcount link due to not working

    Why is this PR needed?
    The hitcount link in the README md file is not working.
    
    What changes were proposed in this PR?
    Remove the hitcount link as it's not required.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4128
    chenliang613 authored and akashrn5 committed Apr 30, 2021
    Commit 7350c33

Commits on May 10, 2021

  1. [CARBONDATA-4166] Geo spatial Query Enhancements

    Why is this PR needed?
    Currently, for the IN_POLYGON_LIST and IN_POLYLINE_LIST UDFs, polygons need to be
    specified in SQL. If the polygon list grows in size, the SQL also becomes very long,
    which may affect query performance, as the SQL parsing cost grows.
    If polygons are instead defined as a column in a new dimension table, then a spatial
    dimension table join can be supported, enabling aggregation on spatial table columns
    based on polygons.
    
    What changes were proposed in this PR?
    Support IN_POLYGON_LIST and IN_POLYLINE_LIST with a SELECT query on the
    polygon table.
    Support the IN_POLYGON filter as a join condition for spatial JOIN queries.
    
    Does this PR introduce any user interface change?
    Yes.
    
    Is any new testcase added?
    Yes
    
    This closes #4127
    Indhumathi27 authored and ajantha-bhat committed May 10, 2021
    Commit c825730
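
    A hedged, hypothetical sketch of the first change: polygons kept in a
    dimension table and fed to IN_POLYGON_LIST via a select. The exact argument
    format is an assumption, and table names are illustrative:

      // polygon rows come from the dimension table rather than a long SQL literal
      spark.sql(
        """SELECT * FROM spatialtable
          | WHERE IN_POLYGON_LIST('SELECT polygon FROM polygontable', 'OR')""".stripMargin).show()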

Commits on May 11, 2021

  1. [CARBONDATA-4175] [CARBONDATA-4162] Leverage Secondary Index till seg…

    …ment level
    
    Why is this PR needed?
    In the existing architecture, if the parent (main) table and the SI table don't have
    the same valid segments, we disable the SI table. From the next query
    onwards, we scan and prune only the parent table until the next load or
    REINDEX command (as these commands bring the parent and SI table segments back in sync).
    Because of this, queries take longer while SI is disabled.
    
    What changes were proposed in this PR?
    Instead of disabling the SI table (when parent and child table segments are not in sync),
    we will prune on SI tables for all the valid segments (segments with status success,
    marked for update and load partial success), and the rest of the segments will be pruned by the parent table.
    As of now, a query on the SI table can be pruned in two ways:
    a) With SI as a datamap.
    b) With spark plan rewrite.
    This PR contains changes for both methods, so SI is leveraged till segment level.
    
    This closes #4116
    nihal0107 authored and kunal642 committed May 11, 2021
    Commit 8996369

Commits on May 20, 2021

  1. [CARBONDATA-4188] Fixed select query with small table page size after…

    … alter add column
    
    Why is this PR needed?
    A select query on a table with a long string data type and small page size throws
    ArrayIndexOutOfBoundsException after alter add columns.
    The query fails because, after changing the schema, the number of rows set per page in
    the bitsetGroup (RestructureIncludeFilterExecutorImpl.applyFilter()) is not correct.
    
    What changes were proposed in this PR?
    Set the correct number of rows inside every page of bitsetGroup.
    
    This closes #4137
    nihal0107 authored and kunal642 committed May 20, 2021
    Commit 41a756f
  2. [CARBONDATA-4185] Doc Changes for Heterogeneous format segments in ca…

    …rbondata
    
    Why is this PR needed?
    Heterogeneous format segments in carbondata documentation.
    
    What changes were proposed in this PR?
    Add the segment feature background and its impact on existing carbondata features
    
    This closes #4134
    maheshrajus authored and kunal642 committed May 20, 2021
    Commit 861ba2e

Commits on May 24, 2021

  1. [CARBONDATA-4184] alter table Set TBLPROPERTIES for RANGE_COLUMN sets…

    … unsupported
    
    datatype(complex_datatypes/Binary/Boolean/Decimal) as RANGE_COLUMN
    
    Why is this PR needed?
    Alter table set command was not validating unsupported dataTypes for range column.
    
    What changes were proposed in this PR?
    Added validation for unsupported dataTypes before setting range column value.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4133
    Karan980 authored and Indhumathi27 committed May 24, 2021
    Commit 35091a2
  2. [CARBONDATA-4189] alter table validation issues

    Why is this PR needed?
    1. The alter table duplicate-columns check was missed for dimensions/complex columns
    2. Alter table properties should not support long strings for complex columns
    
    What changes were proposed in this PR?
    1. Changed the dimension columns list type when preparing dimension columns
       [LinkedHashSet to Scala Seq] to handle duplicate columns
    2. Added a check to throw an exception in case of long strings for complex columns
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4138
    maheshrajus authored and Indhumathi27 committed May 24, 2021
    Commit 07c98e8

Commits on May 25, 2021

  1. [CARBONDATA-4183] Local sort Partition Load and Compaction fix

    Why is this PR needed?
    Currently, the number of tasks for a local sort load on a partition table is decided based on input file size. In this case, the data will not be properly sorted, as more tasks are launched. For compaction, the number of tasks equals the number of partitions. If the data for a partition is huge, compaction can fail with OOM under low-memory configurations.
    
    What changes were proposed in this PR?
    When the local sort task-level property is enabled:
    
    For local sort load, divide input files based on node locality (number of tasks = number of nodes), which does the local sorting properly.
    For compaction, launch tasks based on the task IDs within a partition, so more tasks are launched per partition.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4130
    Indhumathi27 authored and ajantha-bhat committed May 25, 2021
    Commit a90243c

Commits on Jun 2, 2021

  1. [CARBONDATA-4186] Fixed insert failure when partition column present …

    …in local sort scope
    
    Why is this PR needed?
    Currently, when we create a table with a partition column and put the same column in the
    local sort scope, the insert query fails with an ArrayIndexOutOfBounds exception.
    
    What changes were proposed in this PR?
    Handled the ArrayIndexOutOfBounds exception; earlier the array size was not increasing because the data
    was inconsistent and in the wrong order for sort columns and isDimNoDictFlags.
    
    This closes #4132
    nihal0107 authored and kunal642 committed Jun 2, 2021
    Commit 01fd120
  2. [CARBONDATA-4191] update table for primitive column not working when …

    …complex child
    
    column name and primitive column name match
    
    Why is this PR needed?
    Updating a primitive column does not work when a complex column's child name and the
    primitive column name are the same.
    When an update for a primitive column is received, we check the complex child columns;
    if a column name matches, an UnsupportedOperationException is returned.
    
    What changes were proposed in this PR?
    Currently, we ignore the prefix of all columns and pass only the column/child
    column info to the update command.
    New changes: pass the full column name (alias name/table name.columnName) as given
    by the user, and add checks for handling the unsupported update operation on complex columns.
    
    This closes #4139
    maheshrajus authored and kunal642 committed Jun 2, 2021
    Commit 4c04f7c

Commits on Jun 4, 2021

  1. [Doc] syntax and format issues in README.md and how-to-contribute-to-…

    …apache-carbondata.md
    
    Why is this PR needed?
    To improve the quality of README.md and how-to-contribute-to-apache-carbondata.md.
    
    What changes were proposed in this PR?
    Syntax and format changes.
    
    This closes #4136
    Sunt-ing authored and Indhumathi27 committed Jun 4, 2021
    Commit 26e9182
  2. [CARBONDATA-4192] UT cases correction for validating the exception me…

    …ssage correctly
    
    Why is this PR needed?
    Currently, when we check the exception message as below, the test does not assert/fail/
    catch if the message content is different.
    `intercept[UnsupportedOperationException](
     sql("update test set(a)=(4) where id=1").collect()).getMessage.contains("abc")`
    
    What changes were proposed in this PR?
    1. Added assert condition like below for validating the exception message correctly
       `assert(intercept[UnsupportedOperationException](
        sql("update test set(a)=(4) where id=1").collect()).getMessage.contains("abc"))`
    2. Added assert condition to check exception message for some test cases which are
       not checking exception message
    3. Fixed add segment doc heading related issues
    
    This closes #4140
    maheshrajus authored and Indhumathi27 committed Jun 4, 2021
    Commit 8740016

Commits on Jun 7, 2021

  1. [CARBONDATA-4193] Fix compaction failure after alter add complex column.

    Why is this PR needed?
    1. When we perform compaction after alter add of a complex column, the query fails with an
       ArrayIndexOutOfBounds exception. While converting and adding a row after the merge step
       in WriteStepRowUtil.fromMergerRow, as a complex dimension is present, the complexKeys
       array is accessed but has no values, which throws the exception.
    2. Creating SI with globalsort on a newly added complex column throws a TreeNodeException
       (Caused by: java.lang.RuntimeException: Couldn't find positionId#172 in [arr2#153])
    
    What changes were proposed in this PR?
    1. While restructuring the row, added changes to fill complexKeys with default values (null
       values for children) according to the latest schema.
       In the SI query result processor, used the column property isParentColumnComplex to identify
       any complex type. If the complex index column is not present in the parent table block,
       assigned the SI row value to empty bytes.
    2. For SI with globalsort, in case of a complex type projection, the TableProperties object in
       carbonEnv is not the same as in the carbonTable object, so requiredColumns was not
       updated with positionId. So, update tableproperties from the carbon env itself.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4142
    ShreelekhyaG authored and akashrn5 committed Jun 7, 2021
    Commit fee8b18
  2. [CARBONDATA-4196] Allow zero or more white space in GEO UDFs

    Why is this PR needed?
    Currently, the regex of the geo UDFs does not allow zero spaces between the
    UDF name and the parenthesis; it always expects a single space in
    between, for ex: linestring (120.184179 30.327465). Because of
    this, using the UDFs without a space sometimes does not give
    the expected result.
    
    What changes were proposed in this PR?
    Allow zero or more spaces between the UDF name and the parenthesis.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4145
    nihal0107 authored and Indhumathi27 committed Jun 7, 2021
    Commit 70643df
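
    The essence of the fix is a pattern that tolerates zero or more whitespace
    characters between the UDF name and its opening parenthesis; a minimal
    sketch (not the actual carbon regex):

      // \s* (zero or more) replaces the previous single mandatory space
      val lineString = """(?i)linestring\s*\(([^)]*)\)""".r
      Seq("LINESTRING (120.184179 30.327465)", "linestring(120.184179 30.327465)")
        .foreach { s => println(lineString.findFirstMatchIn(s).map(_.group(1))) } // both match
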
  3. [CARBONDATA-4143] Enable UT with index server and fix related issues

    Why is this PR needed?
    Enable running UTs with the index server.
    Fix the below issues:
    1. With the index server enabled, a select query gives incorrect results with
       SI when parent and child table segments are not in sync.
    2. When reindex is triggered, if stale files are present in the segment
       directory, the segment file is written with incorrect file names
       (both valid index and stale mergeindex file names). As a result, duplicate
       data is present in the SI table, but there are no errors/incorrect query results.
    
    What changes were proposed in this PR?
    Use the flag useIndexServer; excluded some test cases from running with the index server.
    1. While pruning from the index server, missingSISegments values were not
       being considered. Passed those values down and set them on the filter.
    2. Before loading data to an SI segment, added changes to delete the segment
       directory if already present.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4098
    ShreelekhyaG authored and Indhumathi27 committed Jun 7, 2021
    Commit d838e3b

Commits on Jun 10, 2021

  1. [CARBONDATA-4179] Support renaming of complex columns (array/struct)

    Why is this PR needed?
    This PR enables renaming of complex columns - the parent as well as child columns at nested levels
    example: if the schema contains columns - str1 struct<a:int, b:string>, arr1 array<long>
    1. alter table <table_name> change str1 str2 struct<a:int, b:string>
    2. alter table <table_name> change arr1 arr2 array<long>
    3. Changing parent name as well as child name
    4. alter table <table_name> change str1 str2 struct<abc:int, b:string>
    NOTE - The rename operation fails if the structure of the complex column has been altered.
    This check ensures the old and new columns are compatible with each other, meaning
    the number of children and complex levels must be unaltered while attempting to rename.
    
    What changes were proposed in this PR?
    1. Parses the incoming new complex type. Create a nested DatatypeInfo structure.
    2. This DatatypeInfo is then passed on to the AlterTableDataTypeChangeModel.
    3. Validation for compatibility and duplicate columns happens here.
    4. Add the parent column to the schema evolution entry.
    5. Update the spark catalog table.
    Limitation - Renaming is not supported for Map types yet
    
    Does this PR introduce any user interface change?
    Yes
    
    Is any new testcase added?
    Yes
    
    This closes #4129
    akkio-97 authored and Indhumathi27 committed Jun 10, 2021
    Commit cfa02dd
  2. [CARBONDATA-4202] Fix issue when refresh main table with MV

    Why is this PR needed?
    When trying to register a table from an old store which has an MV, it fails with a parser
    error (syntax issue while creating the table): it tries to create the table with the
    relatedmvtablesmap property, which is not valid.
    
    What changes were proposed in this PR?
    1. Removed relatedmvtablesmap from table properties in RefreshCarbonTableCommand
    2. After the main table has been registered, made changes to get the MV schema
       from the system folder and register it.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4147
    ShreelekhyaG authored and Indhumathi27 committed Jun 10, 2021
    Commit 90841bc

Commits on Jun 16, 2021

  1. [CARBONDATA-4206] Support rename SI table

    Why is this PR needed?
    Currently, renaming an SI table can succeed, but after the rename, inserts and queries on
    the main table fail with a 'no such table' exception. This is because, after the SI table
    is renamed, the main table's tblproperties are not updated and still store the old SI
    table name; when referring to the SI table, carbon looks it up by the old name, which leads to the exception.
    
    What changes were proposed in this PR?
    After SI table renamed, update the main table's tblproperties with new SI information.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4149
    jack86596 authored and Indhumathi27 committed Jun 16, 2021
    Commit f1da9e8

Commits on Jun 18, 2021

  1. [CARBONDATA-4208] Wrong Exception received for complex child long str…

    …ing columns
    
    Why is this PR needed?
    When we create a table with a complex column whose child columns have the long string
    data type, a 'column not found in table' exception is thrown. Instead, it should
    throw an exception saying that complex child columns do not support the long string
    data type.
    
    What changes were proposed in this PR?
    Added a check: if a complex child column has the long string data type, throw the correct
    exception.
    Exception: MalformedCarbonCommandException
    Exception Message: Complex child column cannot be set as LONG_STRING_COLUMNS
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4150
    maheshrajus authored and akashrn5 committed Jun 18, 2021
    Commit 65fad98
  2. [CARBONDATA-4212] Fix case sensitive issue with Update query having A…

    …lias Table name
    
    Why is this PR needed?
    An update query having an alias table name fails with an 'Unsupported complex types' error,
    even if the table does not have any complex columns.
    
    What changes were proposed in this PR?
    Check the columnName irrespective of case
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4152
    Indhumathi27 authored and akashrn5 committed Jun 18, 2021
    Commit 95ab745
  3. [CARBONDATA-4213] Fix update/delete issue in index server

    Why is this PR needed?
    During update/delete, the segment file entry in the segment came as an empty
    string, due to which the segment file could not be read.
    
    What changes were proposed in this PR?
    1. Changed the empty string to NULL
    2. Added empty segment file condition while creating SegmentFileStore.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4153
    vikramahuja1001 authored and Indhumathi27 committed Jun 18, 2021
    Commit fdd00ab

Commits on Jun 19, 2021

  1. [CARBONDATA-4211] Fix - from xx Insert into select fails if an SQL st…

    …atement contains multiple inserts
    
    Why is this PR needed?
    When multiple inserts are used in a single query, it fails from SparkPlan with java.lang.ClassCastException:
    GenericInternalRow cannot be cast to UnsafeRow.
    For every successful insert/load we return the Segment ID as a row. For multiple inserts we also return
    a row containing the Segment ID, but while processing in spark the ClassCastException is thrown.
    
    What changes were proposed in this PR?
    When a multiple-insert query is given, its plan has a Union node. Based on its presence, made changes
    to use the flag isMultipleInserts to call the class UnionCommandExec, and implemented a custom sideEffectResult which
    converts GenericInternalRow to UnsafeRow and returns it.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4151
    ShreelekhyaG authored and akashrn5 committed Jun 19, 2021
    Commit d8f7df9
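
    An illustrative multi-insert statement of the kind this fix addresses,
    assuming a `spark` session; table names are hypothetical. The plan for such
    a statement contains the Union node mentioned above:

      spark.sql(
        """FROM src_table
          |INSERT INTO target_a SELECT id, name WHERE id < 100
          |INSERT INTO target_b SELECT id, name WHERE id >= 100""".stripMargin)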

Commits on Jun 22, 2021

  1. [CARBONDATA-4217] Fix rename SI table, other applications didn't get …

    …reflected issue
    
    Why is this PR needed?
    After one application renames an SI table, other applications do not see the
    change, which makes queries on the SI column fail.
    
    What changes were proposed in this PR?
    After updating the index info of the parent table, persist the schema info so that other
    applications can refresh the table metadata in time.
    
    This closes #4155
    jack86596 authored and Indhumathi27 committed Jun 22, 2021
    Commit d5cb011

Commits on Jun 23, 2021

  1. [CARBONDATA-4214] inserting NULL value when timestamp value received …

    …from FROM_UNIXTIME(0)
    
    Why is this PR needed?
    NULL was filled in when a timestamp value came from FROM_UNIXTIME(0), because the original
    insert RDD value [internalRow] received by spark is zero in this case. If the original column
    value [internalRow] is zero, the insert flow adds NULL and gives NULL to spark,
    so querying the same column returns NULL instead of the timestamp value.
    Problem code: if (internalRow.getLong(index) == 0) { internalRow.setNullAt(index) }
    
    What changes were proposed in this PR?
    Removed the null-filling check for the zero-value case; now the internalRow timestamp
    value is set only if the internalRow value is non-null/non-empty.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4154
    maheshrajus authored and Indhumathi27 committed Jun 23, 2021
    Commit 18665cc
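
    An illustrative reproduction, assuming a `spark` session; the table name is
    hypothetical:

      spark.sql("CREATE TABLE tstable(t TIMESTAMP) STORED AS carbondata")
      spark.sql("INSERT INTO tstable SELECT FROM_UNIXTIME(0)")
      // with the fix this returns the epoch timestamp (rendered in the
      // session time zone) instead of NULL
      spark.sql("SELECT t FROM tstable").show(false)
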
  2. [CARBONDATA-4190] Integrate Carbondata with Spark 3.1.1 version

    Why is this PR needed?
    To integrate Carbondata with Spark3.1.1
    
    What changes were proposed in this PR?
    Refactored code to add changes to support Spark 3.1.1 along with Spark 2.3 and 2.4 versions
    Changes:
    
    1. Compile Related Changes
    	1. New Spark package in MV, Streaming and spark-integration.
    	2. API wise changes as per spark changes
    2. Spark has moved to Proleptic Gregorian Calendar, due to which timestamp related changes in carbondata are also required.
    3. Show segment by select command refactor
    4. A few Lucene test cases are ignored due to a deadlock in the spark DAGScheduler, which does not allow them to work.
    5. Alter rename: Parser enabled in Carbon and check for carbon
    6. doExecuteColumnar() changes in CarbonDataSourceScan.scala
    7. char/varchar changes from spark side.
    8. Rule name changed in MV
    9. In univocity parser, CSVParser version changed.
    10. New Configs added in SparkTestQueryExecutor to keep some behaviour same as 2.3 and 2.4
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4141
    vikramahuja1001 authored and akashrn5 committed Jun 23, 2021
    Commit 8ceb4fd
  3. [CARBONDATA-4225] Fix Update performance issues when auto merge compa…

    …ction is enabled
    
    Why is this PR needed?
    1. When auto-compaction is enabled, during update we try to do compaction after the
       insert. Auto-compaction throws an exception after multiple retries, since carbon does not allow
       concurrent compaction and update.
    2. dataframe.rdd.isEmpty launches a Job. This code is called two times, and the result
       is not reused.
    
    What changes were proposed in this PR?
    1. Avoid trying to do auto-compaction during update.
    2. Reuse the dataframe.rdd.isEmpty result and avoid launching a second Job.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4156
    Indhumathi27 authored and akashrn5 committed Jun 23, 2021
    Commit d4ddd07
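
    A minimal sketch of change 2, assuming a Spark DataFrame; the surrounding
    update logic is elided:

      import org.apache.spark.sql.DataFrame

      def processUpdate(updated: DataFrame): Unit = {
        val noRows = updated.rdd.isEmpty() // launches a single Spark job
        if (!noRows) {
          // ... write the updated rows / update table status ...
        }
        // later checks reuse `noRows` instead of triggering a second job
      }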

Commits on Jun 26, 2021

  1. [HOTFIX] Correct CI build status

    The apache Jenkins CI address changed; correct the CI build status.
    chenliang613 authored Jun 26, 2021
    Commit 899b7ae
  2. Commit 5e2adad

Commits on Jun 29, 2021

  1. [CARBONDATA-4230] table properties not updated with lower-case and table

    comment is not working in carbon spark3.1
    
    Why is this PR needed?
    1. Table properties are stored case-sensitively; when we query a table
       property in lower case, the property cannot be found, so the table
       create command fails. This was introduced by the spark 3.1 integration changes.
    2. The table comment is displayed as byte code in a spark 3.1 cluster.
       CommentSpecContext changed in 3.1.
    
    What changes were proposed in this PR?
    1. Convert to lower case and store in table properties.
    2. Get the string value from the commentSpec and set it as the table comment.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No, a test case is already present, but it did not fail in the local UT setup, as the create
    flow differs between the local UT env and a real cluster setup.
    
    This closes #4163
    maheshrajus authored and Indhumathi27 committed Jun 29, 2021
    Commit 65462ff

Commits on Jul 5, 2021

  1. Commit aefa977
  2. Commit 718490e
  3. [HOTFIX]Revert wrong pom changes commit during prepare release process.

    Why is this PR needed?
    Due to a wrong branch release, wrong pom changes are present.
    
    What changes were proposed in this PR?
    Revert the pom changes.
    
    This closes #4167
    akashrn5 authored and kunal642 committed Jul 5, 2021
    Commit c7a3d6d

Commits on Jul 7, 2021

  1. [CARBONDATA-4232] Add missing doc change for secondary index.

    Why is this PR needed?
    Documentation changes were not handled in PR 4116
    
    What changes were proposed in this PR?
    Added missing documentation.
    
    This closes #4164
    nihal0107 authored and Indhumathi27 committed Jul 7, 2021
    Commit 88fdf60

Commits on Jul 14, 2021

  1. [CARBONDATA-4210] Handle 3.1 parsing failures related to alter comple…

    …x types
    
    Why is this PR needed?
    For spark 2.3 and 2.4, parsing of alter commands is done by spark, which is not the case for 3.1.
    
    What changes were proposed in this PR?
    So carbon is responsible for the parsing here.
    Test cases previously ignored due to this issue are now enabled.
    
    This closes #4162
    akkio-97 authored and kunal642 committed Jul 14, 2021
    Commit 02e7723

Commits on Jul 27, 2021

  1. [CARBONDATA-4204][CARBONDATA-4231] Fix add segment error message,

    index server failed testcases and dataload fail error on update
    
    Why is this PR needed?
    1. When the path is empty in carbon add segments,
    StringIndexOutOfBoundsException is thrown.
    2. Index server UT failures fix.
    3. Update fails with a dataload fail error if the bad
    records action is set to force with spark 3.1.
    
    What changes were proposed in this PR?
    1. Added a check to see if the path is empty and then throw
    a valid error message.
    2. Used checkAnswer instead of assert in test cases so
    that the order of rows returned is the same with or
    without the index server. Excluded 2 test cases where explain
    with query statistics is used, as we do not set any
    pruning info from the index server.
    3. On the update command, dataframe.persist is called and,
    with the latest spark 3.1 changes, spark returns a cloned
    SparkSession from the cacheManager with all specified
    configurations disabled. It now uses a different
    sparkSession for 3.1 which is not initialized in CarbonEnv,
    so CarbonEnv.init is called, where a new CarbonSessionInfo is
    created with no sessionParams; hence the properties that were
    set were not accessible. Made changes to set the existing
    sessionParams from currentThreadSessionInfo when a new
    carbonSessionInfo object is created.
    
    This closes #4157
    ShreelekhyaG authored and kunal642 committed Jul 27, 2021
    Commit c9a5231
  2. [CARBONDATA-4250] Ignoring presto random test cases

    Why is this PR needed?
    Presto test cases fail randomly and add time to CI verification of other PRs.
    
    What changes were proposed in this PR?
    The presto random test cases are ignored for now and will be fixed under the JIRAs raised.
    1. JIRA [CARBONDATA-4250] raised for ignoring the presto test cases for now, as these random
       failures were causing PR CI failures.
    2. JIRA [CARBONDATA-4249] raised for fixing the presto random tests in the concurrent scenario.
       More details on the issue reproduction and problem snippet are on that JIRA.
    3. [CARBONDATA-4254] raised to fix "Test alter add for structs enabling local dictionary"
       and "CarbonIndexFileMergeTestCaseWithSI.Verify command of index merge"
    
    This closes #4176
    maheshrajus authored and Indhumathi27 committed Jul 27, 2021
    Commit 0337c32

Commits on Jul 28, 2021

  1. [CARBONDATA-4251][CARBONDATA-4253] Optimize Clean Files Performance

    Why is this PR needed?
     1) When executing the clean files command, it cleans up all the carbonindex and
        carbonmergeindex files that ever existed, even though carbonindex files have been
        merged into carbonmergeindex and deleted. When tens of thousands
        of carbonindex files once existed after the completion of compaction,
        the clean files command takes several hours to clean index files which
        don't actually exist. We only need to clean up the existing
        carbonmergeindex or carbonindex files.
     2) The rename command lists the partitions of the table, but the partition
        information is not actually used. If the table has hundreds of thousands of
        partitions, the performance of rename table degrades a lot.
    
    What changes were proposed in this PR?
     1) There is a variable indexOrMergeFiles, which holds all existing index files.
        The CLEAN FILES command now deletes the existing files instead of deleting all
        files in 'indexFilesMap', which holds every '.carbonindex' file that once
        existed. Cleaning 'indexOrMergeFiles' improves CLEAN FILES performance a lot.
     2) Skip listing partitions during rename table, since the partition
        information is not used.
    
    This closes #4183
    marchpure authored and Indhumathi27 committed Jul 28, 2021
    Commit 9aaeba5

Commits on Jul 29, 2021

  1. [CARBONDATA-4248] Fixed upper case column name in explain command

    Why is this PR needed?
    The explain command with an upper-case column name fails with a 'key not found' exception.
    
    What changes were proposed in this PR?
    Changed the column name to lower case before converting the spark data type to the carbon data type.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4175
    nihal0107 authored and Indhumathi27 committed Jul 29, 2021
    Commit f2698fe
  2. [CARBONDATA-4247][CARBONDATA-4241] Fix Wrong timestamp value query re…

    …sults for data before
    
    1900 years with Spark 3.1
    
    Why is this PR needed?
    1. Spark 3.1 stores timestamp values as Julian micros and rebases timestamp values with
    JulianToGregorianMicros during query.
    -> Since carbon parses and formats timestamp values with SimpleDateFormat, queries give
    incorrect results when spark rebases with JulianToGregorianMicros.
    2. CARBONDATA-4241 -> Global sort load and compaction fail on a table having a timestamp column
    
    What changes were proposed in this PR?
    1. Use java Instant to parse new timestamp values. For old stores queried with Spark 3.1,
    rebase the timestamp value from Julian to Gregorian micros.
    2. If the timestamp value is of type Instant, convert the value to a java timestamp.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No (Existing testcase is sufficient)
    
    This closes #4177
    Indhumathi27 authored and akashrn5 committed Jul 29, 2021
    Commit feb0521
  3. [CARBONDATA-4242]Improve cdc performance and introduce new APIs for U…

    …PSERT, DELETE, INSERT and UPDATE
    
    Why is this PR needed?
    1. In the existing solution, when we perform the join of the source and target datasets for tagging records to delete, update and insert, we scan all the data of the target table and then join with the source dataset. But the source data may be small, and its range may cover only some hundreds of carbondata files out of thousands in the target table. Pruning is the main bottleneck here: scanning all records and joining them results in a lot of shuffle and reduces performance.
    2. Source data caching was not there; caching the source data improves its multiple scans, and since the input source data is small, we can persist the dataset.
    3. When performing the join, we used to first get the Row object, operate on it, cast each datatype to the spark datatype, and then convert to an InternalRow object for further processing of the joined data. This adds extra deserializeToObject and map nodes in the DAG and increases time.
    4. During tagging of records (the join operation), we prepared a new projection of the required columns, which involves preparing an internal row object as explained in point 3 and then applying an eval function on each row to prepare the projection. This applies the same expression eval on the joined data - repeated work that increases time.
    5. The join operation used all the columns of the source dataset plus the required columns of the target table, like the join key column and other columns such as tupleID, status_on_mergeds etc. When there are many columns in the table, execution time increases due to heavy data shuffling.
    6. The current merge APIs are a little complex, generalized, and confusing to the user for simple upsert, delete and insert operations.
    
    What changes were proposed in this PR?
    1. Add pruning logic before the join operation. Compare the incoming rows against an interval-tree data structure that holds each carbondata file path with its min and max, to identify the carbondata files where the incoming rows can be present. In favourable scenarios this scans far fewer files rather than blindly scanning all the carbondata files in the target table.
    2. Cache the incoming source dataset (srcDS.cache()), so the cached data is used in all operations and speed improves; uncache() after the merge operation.
    3. Instead of operating on the Row object and then converting to InternalRow, operate directly on the InternalRow object to avoid the datatype conversions.
    4. Instead of evaluating the expression again based on the required projection columns of the matching conditions, directly identify the indexes required for the output row and access those indices on the incoming internal row after step 3; evaluation is avoided, and array access by index gives O(1) performance.
    5. During the join (tagging of records), do not include all column data; include just the join key columns and identify the tupleIDs to delete and the rows to insert. This avoids a lot of shuffle and improves performance significantly.
    6. Introduce new APIs for UPSERT, UPDATE, DELETE and INSERT and make the user-exposed APIs simple: the user only gives the key column for the join, the source dataset, and the operation type. These new APIs use all the improvements above and avoid the unnecessary operations of the existing merge APIs.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4148
    akashrn5 authored and ajantha-bhat committed Jul 29, 2021
    Commit 1e2fc4c
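
    A minimal, self-contained sketch of the min/max pruning idea in change 1;
    the names are illustrative, and a linear scan stands in for the interval
    tree described above:

      case class FileRange(path: String, min: Long, max: Long)

      // keep only files whose [min, max] range can contain an incoming key
      def pruneFiles(files: Seq[FileRange], keys: Seq[Long]): Seq[FileRange] =
        files.filter(f => keys.exists(k => k >= f.min && k <= f.max))

      val files = Seq(FileRange("part-0", 0L, 99L), FileRange("part-1", 100L, 199L))
      println(pruneFiles(files, Seq(42L, 57L)).map(_.path)) // List(part-0)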

Commits on Jul 30, 2021

  1. [CARBONDATA-4255] Prohibit Create/Drop Database when databaselocation…

    … is inconsistent
    
    Why is this PR needed?
    When carbon.storelocation and spark.sql.warehouse.dir are configured to
    different values, the database location may be inconsistent. When the DROP DATABASE
    command is executed, both locations (the carbon dblocation and the hive
    dblocation) may be cleared, which may confuse users.
    
    What changes were proposed in this PR?
    Drop database is prohibited when the database location is inconsistent.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4186
    marchpure authored and Indhumathi27 committed Jul 30, 2021
    Commit 3c81c7a

Commits on Aug 1, 2021

  1. Commit aceaa44

Commits on Aug 3, 2021

  1. Commit 62354e3

Commits on Aug 5, 2021

  1. Commit 7b0d8f6

Commits on Aug 7, 2021

  1. Commit 11dab76
  2. [CARBONDATA-4268][Doc][summer-2021] Add new dev mailing list (website…

    …) link and update the Nabble address This closes #4195
    chenliang613 committed Aug 7, 2021
    Commit a5bb652
  3. [Doc][summer-2021] Add TOC and format how-to-contribute-to-apache-car…

    …bondata.md
    
    GitHub Flavored Markdown does not support automatic TOC generation in Markdown files, so anchors are used to implement a TOC of headings.
    Jeromestein authored and chenliang613 committed Aug 7, 2021
    Commit e8f8c02
  4. [CARBONDATA-4266][Doc][summer-2021] Add TOC and format how-to-contrib…

    …ute-to-apache-carbondata.md This closes #4192
    chenliang613 committed Aug 7, 2021
    Commit d4abe76

Commits on Aug 8, 2021

  1. Update quick-start-guide.md

    Modify minor errors and correct some misunderstandings in the document
    
    Create quick-start-guide.md
    ChanceXin authored and chenliang613 committed Aug 8, 2021
    Commit 926b67b
  2. [CARBONDATA-4267][Doc][summer-2021]Update and modify some content in …

    …quick-start-guide.md This closes #4197
    chenliang613 committed Aug 8, 2021
    Commit fac48be

Commits on Aug 11, 2021

  1. [CARBONDATA-4256] Fixed parsing failure on SI creation for complex co…

    …lumn
    
    Why is this PR needed?
    Currently, SI creation on a complex column that includes a child column
    referenced with a dot (.) fails with a parse exception.
    
    What changes were proposed in this PR?
    Handled parsing for create index on complex column.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4187
    nihal0107 authored and Indhumathi27 committed Aug 11, 2021
    Commit bdd4a8c
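
    A hedged sketch of the failing shape, assuming a `spark` session; the table
    and column names are hypothetical, and the dot notation for the child
    column follows the description above:

      spark.sql("CREATE TABLE cmplx(id INT, persons ARRAY<STRUCT<name:STRING>>) STORED AS carbondata")
      // the child column is referenced with a dot, which previously failed to parse
      spark.sql("CREATE INDEX idx_name ON TABLE cmplx(persons.name) AS 'carbondata'")
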
  2. [CARBONDATA-4091] support prestosql 333 integartion with carbon

    Why is this PR needed?
    Currently carbondata is integrated with presto-sql 316, which is 1.5 years old.
    Many good features and optimizations have come into presto since,
    like dynamic filtering, the Rubix data cache, and some performance improvements.
    
    It is always good to use the latest version; the latest is presto-sql 348.
    But jumping from 316 to 348 is too many changes at once.
    So, to utilize these new features and based on customer demand, this PR
    upgrades presto-sql to version 333. It will be upgraded again
    to a more recent version in a few months.
    
    Note:
    This is a plain integration to support all existing features of presto 316;
    deep integration to support new features like dynamic filtering and the
    Rubix cache will be handled in another PR.
    
    What changes were proposed in this PR?
    1. Adapt to the new hive adapter changes, like some constructor changes;
       made a carbonDataConnector to support CarbonDataHandleResolver
    2. Java 11 removed the ConstructorAccessor class, so the unsafe class is used for
       reflection (presto 333 depends on java 11 at runtime)
    3. POM changes to support presto 333
    
    Note: a JAVA 11 environment is needed for running presto 333 with carbon, and the
    jvm property "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" must also be added
    
    This closes #4034
    ajantha-bhat authored and Indhumathi27 committed Aug 11, 2021
    Commit 1ccf295

Commits on Aug 16, 2021

  1. [CARBONDATA-4269] Update url and description for new prestosql-guide.md

    Why is this PR needed?
    PrestoSQL has changed its name to Trino. Facebook established the Presto Foundation at The Linux Foundation®, which led to prestosql having to change its name.
    More information here: https://trino.io/blog/2020/12/27/announcing-trino.html
    
    What changes were proposed in this PR?
    1. Change the url to prestosql 333
    2. Added a description indicating that PrestoSQL has been renamed to Trino
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4202
    czy006 authored and ajantha-bhat committed Aug 16, 2021
    Commit 5804060

Commits on Aug 19, 2021

  1. Commit 0e59ddb

Commits on Aug 22, 2021

  1. [CARBONDATA-4272]carbondata test case not including the load command …

    …with overwrite This closes #4207
    chenliang613 committed Aug 22, 2021
    Commit 9f9ea1f

Commits on Aug 24, 2021

  1. [CARBONDATA-4119][CARBONDATA-4238][CARBONDATA-4237][CARBONDATA-4236] …

    …Support geo insert without geoId and document changes
    
    Why is this PR needed?
    1. To insert without geoId (like load) on a geo table.
    2. [CARBONDATA-4119] : User input for the geoId column is not validated.
    3. [CARBONDATA-4238] : Documentation issue in ddl-of-carbondata.md#add-columns
    4. [CARBONDATA-4237] : Documentation issues in streaming-guide.md, file-structure-of-carbondata.md and sdk-guide.md.
    5. [CARBONDATA-4236] : Documentation issues in configuration-parameters.md.
    6. The imported processing class in streaming-guide.md is wrong
    
    What changes were proposed in this PR?
    1. Made changes to support insert on a geo table with an auto-generated geoId.
    2. [CARBONDATA-4119] : Added documentation about insert with a custom geoId. Changes in docs/spatial-index-guide.md
    3. Other documentation changes added.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4205
    ShreelekhyaG authored and Indhumathi27 committed Aug 24, 2021
    Commit 8de65a2
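
    A hedged sketch following the docs' geo table shape (timevalue, longitude,
    latitude), assuming a `spark` session; the values are illustrative:

      // geoId is auto-generated when omitted, mirroring load behaviour
      spark.sql("INSERT INTO source_index SELECT 1575428400000, 116285807, 40084087")
      // a custom geoId can still be supplied as the leading column
      spark.sql("INSERT INTO source_index SELECT 0, 1575428400000, 116285807, 40084087")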

Commits on Aug 26, 2021

  1. [CARBONDATA-4164][CARBONDATA-4198][CARBONDATA-4199][CARBONDATA-4234] …

    …Support alter add map, multilevel complex columns and rename/change datatype.
    
    Why is this PR needed?
    Support alter add for map and multilevel complex columns, and change datatype at nested levels for complex types.
    
    What changes were proposed in this PR?
    1. Support adding of single-level and multi-level map columns
    2. Support adding of multi-level complex columns(array/struct)
    3. Support renaming of map columns including nested levels
    4. Alter change datatype at nested levels (array/map/struct)
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4180
    ShreelekhyaG authored and Indhumathi27 committed Aug 26, 2021
    Commit f52aa20

Commits on Aug 31, 2021

  1. [CARBONDATA-4274] Fix create partition table error with spark 3.1

    Why is this PR needed?
    With spark 3.1, we can create a partition table by giving partition
    columns from schema.
    Like below example:
    create table partitionTable(c1 int, c2 int, v1 string, v2 string)
    stored as carbondata partitioned by (v2,c2)
    
    When the table is created by a SparkSession with CarbonExtension,
    the catalog table is created with the specified partitions.
    But in a cluster / with carbon session, creating a partition
    table with the above syntax creates a normal table with no partitions.
    
    What changes were proposed in this PR?
    partitionByStructFields is empty when partition column names are
    given directly, so a partition table was not being created. Made
    changes to identify the partition column names and get the struct
    field and datatype info from the table columns (see the sketch below).
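
    A minimal sketch of that lookup (a hypothetical helper, not the actual CarbonData code):

        import org.apache.spark.sql.types.{StructField, StructType}

        // Resolve partition columns given only their names, by finding the matching
        // field (name + datatype) in the full table schema.
        def resolvePartitionFields(schema: StructType, names: Seq[String]): Seq[StructField] =
          names.map { n =>
            schema.find(_.name.equalsIgnoreCase(n)).getOrElse(
              throw new IllegalArgumentException(s"Partition column $n not found in schema"))
          }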
    
    This closes #4208
    ShreelekhyaG authored and kunal642 committed Aug 31, 2021
    ca659b5

Commits on Sep 1, 2021

  1. [CARBONDATA-4271] Support DPP for carbon

    Why is this PR needed?
    This PR enables Dynamic Partition Pruning for carbon.
    
    What changes were proposed in this PR?
    CarbonDatasourceHadoopRelation has to extend HadoopFsRelation,
    because spark has added a check to use DPP only for relations matching HadoopFsRelation.
    Apply the dynamic filter, get the runtimePartitions, and set them on CarbonScanRDD for pruning.
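
    Spark's eligibility check is roughly the following (a paraphrased sketch of its
    dynamic partition pruning rule, not the verbatim Spark source):

        import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
        import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

        // Only scans backed by a HadoopFsRelation qualify for DPP, which is why
        // CarbonDatasourceHadoopRelation must extend HadoopFsRelation.
        def eligibleForDpp(plan: LogicalPlan): Boolean =
          plan.collectFirst {
            case LogicalRelation(_: HadoopFsRelation, _, _, _) => true
          }.isDefined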
    
    This closes #4199
    Indhumathi27 authored and kunal642 committed Sep 1, 2021
    bdc9484
  2. [CARBONDATA-4273] Fix Cannot create external table with partitions

    Why is this PR needed?
    Create partition table with location fails with unsupported message.
    
    What changes were proposed in this PR?
    This scenario works in cluster mode. The same check can be applied in local
    mode as well, so that a partition table can be created with a location.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4211
    Indhumathi27 authored and akashrn5 committed Sep 1, 2021
    42f6982
  3. [CARBONDATA-4278] Avoid refetching all indexes to get segment properties

    Why is this PR needed?
    When the block index (BlockIndex) is already available, there is no need to prepare the indexes (List[BlockIndex]) from the available segments and partition locations, which might delay query performance.
    
    What changes were proposed in this PR?
    Directly get the segment properties if a block index (BlockIndex) is available:
    
          // If the first index is already a BlockIndex, read the segment properties
          // from it directly instead of re-preparing the index list for the segment.
          if (segmentIndices.get(0) instanceof BlockIndex) {
            segmentProperties =
                segmentPropertiesFetcher.getSegmentPropertiesFromIndex(segmentIndices.get(0));
          } else {
            // Otherwise fall back to building the properties from the segment
            // and its partition locations.
            segmentProperties =
                segmentPropertiesFetcher.getSegmentProperties(segment, partitionLocations);
          }
    getSegmentPropertiesFromIndex reads the segment properties directly from the block index.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No. Index-related test cases are already present which cover the added code.
    
    This closes #4209
    maheshrajus authored and ajantha-bhat committed Sep 1, 2021
    226228f

Commits on Sep 8, 2021

  1. [CARBONDATA-4282] Fix issues with table having complex columns relate…

    …d to long string, SI, local dictionary
    
    Why is this PR needed?
    1. Insert/load fails after alter add complex column if the table contains long string columns.
    2. Create index on an array of complex columns (map/struct) throws a null pointer exception instead of a correct error message.
    3. Alter table property local dictionary include/exclude with a newly added map column is failing.
    
    What changes were proposed in this PR?
    1. The datatypes array and the data row are in different orders, leading to a ClassCastException. Made changes to add newly added complex columns after the long string columns and other dimensions in carbonTableSchemaCommon.scala.
    2. For complex columns, SI creation is allowed only on arrays of primitive types. Check if the child column is of complex type and throw an exception. Changes made in SICreationCommand.scala.
    3. In AlterTableUtil.scala, while validating local dictionary columns, array and struct types are handled but map type is missed. Added a check for complex types.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4214
    ShreelekhyaG authored and ajantha-bhat committed Sep 8, 2021
    4d8bc9e

Commits on Sep 16, 2021

  1. [CARBONDATA-4277] geo instance compatibility fix

    Why is this PR needed?
    The CustomIndex interface extends Serializable, and for a store written by a
    different version, if the serialization id doesn't match, it throws
    java.io.InvalidClassException during load/update/query operations.
    
    What changes were proposed in this PR?
    As the instance is stored in the table properties, made changes to
    initialize and update the instance while refreshing the table. Also added
    a static serialId to the CustomIndex interface.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No, tested in cluster
    
    This closes #4216
    ShreelekhyaG authored and Indhumathi27 committed Sep 16, 2021
    7199357
  2. [CARBONDATA-4284] Load/insert after alter add column on partition tab…

    …le with complex column fails
    
    Why is this PR needed?
    Insert after alter add column on a partition table with a complex column fails with a BufferUnderflowException.
    The order of columns in the TableSchema is different after alter add column.
    Ex: If the partition is of dimension type, when the table is created the schema column order is
    dimension columns (partition column included) + complex column.
    After alter add, the order of columns in the schema was changed by moving the partition column to the end:
    complex column + partition column.
    Due to this change in order, the indexing in fillDimensionAndMeasureDetails is wrong, as it
    expects the complex column to always be last, which causes the BufferUnderflowException while flattening the complex row.
    
    What changes were proposed in this PR?
    After alter add, removed the change that moves the partition column to the end.
    
    This closes #4215
    ShreelekhyaG authored and kunal642 committed Sep 16, 2021
    3b29bcb
  3. [CARBONDATA-4286] Fixed measure comparator

    Why is this PR needed?
    A select query with an AND filter condition on a table returns an empty result
    even though valid data is present in the table.
    
    Root cause: Currently, while building the min-max index at block level,
    the unsafe byte comparator is used for both dimension and measure
    columns, which returns incorrect results for measure columns.
    
    What changes were proposed in this PR?
    Use different comparators for dimension and measure columns, as is
    already done when writing the min-max index at blocklet level (see the sketch below).
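
    A minimal sketch of the idea (a hypothetical helper; CarbonData's actual comparator
    classes differ, and the measure branch assumes 8-byte big-endian long values):

        import java.nio.ByteBuffer
        import java.util.Comparator

        // Dimensions can be compared as raw unsigned bytes, but measures must be
        // compared on their decoded numeric values.
        def minMaxComparator(isMeasure: Boolean): Comparator[Array[Byte]] =
          if (isMeasure)
            (a: Array[Byte], b: Array[Byte]) =>
              java.lang.Long.compare(ByteBuffer.wrap(a).getLong, ByteBuffer.wrap(b).getLong)
          else
            (a: Array[Byte], b: Array[Byte]) => java.util.Arrays.compareUnsigned(a, b)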
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4217
    nihal0107 authored and akashrn5 committed Sep 16, 2021
    2d1907b

Commits on Sep 20, 2021

  1. [CARBONDATA-4285] Fix alter add complex columns with global sort comp…

    …action failure
    
    Why is this PR needed?
    Alter add complex columns with global sort compaction is failing due to:
    
    ArrayIndexOutOfBounds exception: Currently a default complex delimiter list of size 3 is
    created in global sort compaction. The map case needs an extra complex delimiter to handle the key-value pair.
    Bad record handling: When complex columns are added after data has been inserted, the complex
    columns hold null data for previously loaded segments. This null value gets treated as a bad
    record and the compaction fails.
    
    What changes were proposed in this PR?
    In the global sort compaction flow, create the default complex delimiter list with size 4, as is
    already done in the load flow.
    Bad record handling is pruned for the compaction case. There is no need to check bad records during
    compaction, as they were already checked while loading; compaction only re-inserts data from
    previously loaded segments.
    
    This closes #4218
    maheshrajus authored and kunal642 committed Sep 20, 2021
    22342f8
  2. [CARBONDATA-4288][CARBONDATA-4289] Fix various issues with Index Serv…

    …er caching mechanism.
    
    Why is this PR needed?
    There are 2 issues in the Index Server flow:
    In the case of a main table with an SI table, with pre-priming disabled and index server
    enabled, a new load to the main table and SI table puts the cache for the main table in the index
    server. The cache is loaded again when a select query is fired. This happens because,
    during the load to the SI table, getSplits is called on the main table segment which is in Insert In
    Progress state. The index server considers this segment a legacy segment because its index
    size is 0, and does not put its entry in the tableToExecutor mapping. In the getSplits method,
    isRefreshNeeded is false the first time getSplits is called. During the select query, in the
    getSplits method, isRefreshNeeded is true and the previously loaded entry is removed from the
    driver, but since there is no entry for that table in the tableToExecutor mapping, the previous
    cache value becomes dead cache and always stays in the index server. The newly loaded cache
    is loaded to a new executor, and 2 copies of the cache for the same segment are maintained.
    Concurrent select queries to the index server show wrong cache values in the Index Server.
    
    What changes were proposed in this PR?
    The following changes are proposed to the index server code:
    Remove the cache object from the index server in case the segment is Insert In Progress, and
    in the case of a legacy segment add the value to the tableToExecutor mapping so that the cache
    is also removed from the executor side.
    Concurrent queries were able to add duplicate cache values to other executors. Changed the logic
    of the assign-executors method so that concurrent queries cannot add cache for the same segment
    to other executors.
    
    This closes #4219
    vikramahuja1001 authored and kunal642 committed Sep 20, 2021
    ce860d0

Commits on Oct 7, 2021

  1. [CARBONDATA-4243] Fixed si with column meta cache on same column

    Why is this PR needed?
    Currently, a select query fails when the table has SI and column_meta_cache
    on the same columns and uses the to_date() UDF. This happens because pushDownFilters
    is null in CarbonDataSourceScanHelper, causing a null pointer exception.
    
    What changes were proposed in this PR?
    Passed Seq.empty instead of a null value for pushDownFilters in CarbonDataSourceScan.doCanonicalize (see the sketch below).
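
    In spirit (a one-line sketch of the change, not the exact patch):

        // before: pushDownFilters was null, which caused the NPE during doCanonicalize
        // after: an empty sequence is passed instead
        val pushDownFilters: Seq[org.apache.spark.sql.sources.Filter] = Seq.empty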
    
    This closes #4225
    nihal0107 authored and Indhumathi27 committed Oct 7, 2021
    9944936
  2. [CARBONDATA-4228] [CARBONDATA-4203] Fixed update/delete after alter a…

    …dd segment
    
    Why is this PR needed?
    Deleted records reappear, or updated records show old values, in select
    queries. This is because after horizontal compaction the delete delta file for an external
    segment is written to the default path, which is Fact\part0\segment_x\, whereas for an
    external segment the delete delta file should be written to the path
    where the segment is present.
    
    What changes were proposed in this PR?
    After a delete/update operation on the segment, horizontal compaction is triggered.
    Now, after horizontal compaction for external segments, the delete delta file is
    written to the segment path instead of the default path.
    
    This closes #4220
    nihal0107 authored and kunal642 committed Oct 7, 2021
    bca62cd
  3. [CARBONDATA-4293] Make Table created without external keyword as Tran…

    …sactional table
    
    Why is this PR needed?
    Currently, when you create a table with a location (without the external keyword) in cluster mode,
    the corresponding table is created as a transactional table. If the external keyword is
    present, then it is created as a non-transactional table. This scenario was not handled
    in local mode.
    
    What changes were proposed in this PR?
    Made changes to check whether the external keyword is present or not. If it is not present, the
    corresponding table is made a transactional table.
    
    This closes #4221
    Indhumathi27 authored and kunal642 committed Oct 7, 2021
    5a710f9

Commits on Oct 8, 2021

  1. [CARBONDATA-4215] Fix query issue after add segment other formats wit…

    …h vector read disabled
    
    Why is this PR needed?
    If carbon.enable.vector.reader is disabled and parquet/orc segments are added
    to a carbon table, then a query fails with java.lang.ClassCastException:
    org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to
    org.apache.spark.sql.catalyst.InternalRow. When the vector reader property is
    disabled, ColumnarBatchScan's supportBatch is overridden
    to false while scanning, but supportBatch of an external file format like
    ParquetFileFormat is not overridden and defaults to true.
    
    What changes were proposed in this PR?
    Made changes to override supportBatch of external file formats based on the
    carbon.enable.vector.reader property (a sketch follows).
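
    A minimal sketch of the idea, assuming Spark's FileFormat API (the wrapper class
    below is illustrative, not the actual CarbonData change):

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
        import org.apache.spark.sql.types.StructType

        // Force the delegate format's supportBatch to honour the carbon
        // vector-reader setting instead of its own default (true).
        class VectorAwareParquetFormat(vectorReaderEnabled: Boolean) extends ParquetFileFormat {
          override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean =
            vectorReaderEnabled && super.supportBatch(sparkSession, schema)
        }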
    
    This closes #4226
    ShreelekhyaG authored and Indhumathi27 committed Oct 8, 2021
    8b3d78b

Commits on Oct 12, 2021

  1. [CARBONDATA-4292] Spatial index creation using spark dataframe

    Why is this PR needed?
    To support spatial index creation using spark data frame
    
    What changes were proposed in this PR?
    Added spatial properties in carbonOptions and edited existing testcases.
    
    Does this PR introduce any user interface change?
    Yes
    
    Is any new testcase added?
    Yes
    
    This closes #4222
    ShreelekhyaG authored and Indhumathi27 committed Oct 12, 2021
    b8d9a97

Commits on Oct 21, 2021

  1. [CARBONDATA-4298][CARBONDATA-4281] Empty bad record support for compl…

    …ex type
    
    Why is this PR needed?
    1. The IS_EMPTY_DATA_BAD_RECORD property is not supported for complex types.
    2. To update the documentation that COLUMN_META_CACHE and RANGE_COLUMN
       don't support complex datatypes.
    
    What changes were proposed in this PR?
    1. Made changes to pass down the IS_EMPTY_DATA_BAD_RECORD property and
       throw an exception. Store an empty complex value instead of storing
       a null value, which matches the hive table result.
    2. Updated the document and added a testcase (a usage sketch follows).
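
    Hypothetical usage of the option (assuming CarbonData's standard bad-record load
    options; the path and table name are illustrative):

        // Treat empty complex values in the input as bad records and fail the load
        spark.sql("""
          LOAD DATA INPATH 'hdfs://host/data.csv' INTO TABLE t
          OPTIONS('BAD_RECORDS_ACTION'='FAIL', 'IS_EMPTY_DATA_BAD_RECORD'='TRUE')
        """)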
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4228
    ShreelekhyaG authored and Indhumathi27 committed Oct 21, 2021
    305851e

Commits on Oct 23, 2021

  1. [CARBONDATA-4306] Fix Query Performance issue for Spark 3.1

    Why is this PR needed?
    Currently, with Spark 3.1, some rules are applied many times, resulting in performance degradation.
    
    What changes were proposed in this PR?
    Changed the rule apply strategy from Fixed to Once, and CarbonOptimizer now directly extends SparkOptimizer, avoiding applying the same rules many times.
    
    This Closes #4229
    Indhumathi27 authored and kunal642 committed Oct 23, 2021
    8953cde

Commits on Oct 26, 2021

  1. [CARBONDATA-4303] Columns mismatch when insert into table with static…

    … partition
    
    Why is this PR needed?
    When inserting into a table with a static partition, the source projection should not contain
    the static partition columns, while the target table has all columns. The column number
    comparison between the source and the target table is therefore: source table column
    number = target table column number - static partition column number.
    
    What changes were proposed in this PR?
    Before doing the column number comparison, remove the static partition columns
    from the target table (see the example below).
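
    For example (a hypothetical table, using standard Spark SQL syntax): with one static
    partition column, the SELECT supplies one column fewer than the target table has.

        // target t has 3 columns (c1, v1, v2); v2 is supplied as a static partition value,
        // so the source projection provides only 3 - 1 = 2 columns
        spark.sql("CREATE TABLE t (c1 INT, v1 STRING) PARTITIONED BY (v2 STRING) STORED AS carbondata")
        spark.sql("INSERT INTO t PARTITION (v2 = 'a') SELECT 1, 'x'")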
    
    This Closes #4233
    jack86596 authored and Indhumathi27 committed Oct 26, 2021
    9dbd2a5

Commits on Oct 28, 2021

  1. [CARBONDATA-4240]: Added missing properties on the configurations page

    Why is this PR needed?
    A few user-facing properties that were missing from the configurations
    page have been added.
    
    What changes were proposed in this PR?
    Addition of missing properties
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This Closes #4210
    pratyakshsharma authored and akashrn5 committed Oct 28, 2021
    7d94691
  2. [CARBONDATA-4194] Fixed presto read after update/delete from spark

    Why is this PR needed?
    After an update/delete with spark on a table which contains an array/struct column,
    reading from presto throws a class cast exception.
    This is because after an update/delete the page contains a vector of type
    ColumnarVectorWrapperDirectWithDeleteDelta, which was being typecast to
    CarbonColumnVectorImpl, causing the typecast exception.
    After fixing this (added an instanceof check) it started throwing IllegalArgumentException,
    because:
    
    1. With local dictionary enabled, CarbondataPageSource.load calls
    ComplexTypeStreamReader.putComplexObject before setting the correct number
    of rows (it doesn't subtract deleted rows), and it throws IllegalArgumentException while
    building blocks for child elements.
    2. The position count is wrong in the case of a struct. It should subtract
    the number of deleted rows in LocalDictDimensionDataChunkStore.fillVector.
    This is not required in the case of an array, because the
    data length of the array already takes care of deleted rows in
    ColumnVectorInfo.getUpdatedPageSizeForChildVector.
    
    What changes were proposed in this PR?
    First fixed the class cast exception by putting an instanceof condition in the if block.
    Then subtracted the deleted row count before calling ComplexTypeStreamReader.putComplexObject
    in DirectCompressCodec.decodeAndFillVector. Also handled deleted rows in the struct case
    in LocalDictDimensionDataChunkStore.fillVector.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This Closes #4224
    nihal0107 authored and akashrn5 committed Oct 28, 2021
    07b41a5

Commits on Nov 15, 2021

  1. [CARBONDATA-4296]: schema evolution, enforcement and deduplication ut…

    …ilities added
    
    Why is this PR needed?
    This PR adds schema enforcement, schema evolution and deduplication capabilities for
    the carbondata streamer tool specifically. For the existing IUD scenarios, some work
    needs to be done to handle them completely, for example -
    1. passing default values and storing them in table properties.
    
    Changes proposed for phase 2 -
    1. Handling delete use cases with the upsert operation/command itself. Right now we
    consider an update as delete + insert. With the new streamer tool, it is possible that
    the user sets upsert as the operation type and the incoming stream has delete records as well.
    
    What changes were proposed in this PR?
    Configs and utility methods are added for the following use cases -
    1. Schema enforcement
    2. Schema evolution - add column, delete column, data type change scenarios
    3. Deduplicate the incoming dataset against the incoming dataset itself. This is useful
    in scenarios where the incoming stream of data has multiple updates for the same record
    and we want to pick the latest (see the sketch after this list).
    4. Deduplicate the incoming dataset against the existing target dataset. This is useful
    when the operation type is set to INSERT and the user does not want to insert duplicate records.
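
    A minimal sketch of use case 3 (hypothetical column names; the streamer tool's actual
    config-driven implementation differs):

        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions.{col, row_number}

        // Keep only the latest update per key within the incoming micro-batch itself.
        def dedupIncoming(incoming: DataFrame, keyCol: String, tsCol: String): DataFrame = {
          val w = Window.partitionBy(col(keyCol)).orderBy(col(tsCol).desc)
          incoming.withColumn("_rn", row_number().over(w))
            .filter(col("_rn") === 1)
            .drop("_rn")
        }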
    
    This closes #4227
    pratyakshsharma authored and kunal642 committed Nov 15, 2021
    3be05d2

Commits on Nov 25, 2021

  1. Supplementary information for add segment syntax.

    1. add segment option (partition)
    2. segment-management-on-carbondata.md links to addsegment-guide.md
    bieremayi committed Nov 25, 2021
    81c2e29

Commits on Nov 26, 2021

  1. [CARBONDATA-4305] Support Carbondata Streamer tool for incremental fe…

    …tch and merge from kafka and DFS Sources
    
    Why is this PR needed?
    In the current Carbondata CDC solution, if a user wants to integrate it with a streaming source, they
    need to write a separate spark application to capture changes, which is an overhead. We should be able to
    incrementally capture the data changes from primary databases and incrementally ingest
    them into the data lake so that the overall latency decreases. The former is taken care of by
    log-based CDC systems like Maxwell and Debezium. Here is a solution for the second aspect using Apache Carbondata.
    
    What changes were proposed in this PR?
    The Carbondata streamer tool is a spark streaming application which enables users to incrementally ingest data
    from various sources, like Kafka (a standard pipeline would be MYSQL => debezium => (kafka + schema registry) => Carbondata streamer tool)
    and DFS, into their data lakes. The tool comes with out-of-the-box support for almost all types of schema
    evolution use cases. With the streamer tool, only add-column support is provided for now, with drop-column and
    other schema change capabilities in line for the upcoming days. Please refer to the design document for
    more details about the usage and working of the tool.
    
    This closes #4235
    akashrn5 authored and kunal642 committed Nov 26, 2021
    18840af

Commits on Nov 29, 2021

  1. Add FAQ: How to manage mixed file formats in a carbondata table.

    1. add segment example.
    2. faq.md links to addsegment-guide.md
    bieremayi committed Nov 29, 2021
    598d1ce
  2. 885a21c

Commits on Dec 2, 2021

  1. remove add segment refer

    bieremayi committed Dec 2, 2021
    69ab06c

Commits on Dec 4, 2021

  1. 7af81ad

Commits on Dec 7, 2021

  1. 42d59be
  2. Revert "remove add segment refer"

    This reverts commit 69ab06c.
    bieremayi committed Dec 7, 2021
    580f7f6
  3. Revert "FAQ: carbon rename to carbondata"

    This reverts commit 885a21c.
    bieremayi committed Dec 7, 2021
    341f1bf
  4. ce5747d
  5. f544e59
  6. c29fee2

Commits on Dec 18, 2021

  1. Update docs/addsegment-guide.md

    thanks!
    
    Co-authored-by: Indhumathi27 <[email protected]>
    bieremayi and Indhumathi27 authored Dec 18, 2021
    379f5ad
  2. Update docs/addsegment-guide.md

    thanks!
    
    Co-authored-by: Indhumathi27 <[email protected]>
    bieremayi and Indhumathi27 authored Dec 18, 2021
    a1b6d99

Commits on Dec 20, 2021

  1. fc3914f
  2. 861fc67
  3. 01f8e1a
  4. 053d080
  5. c0211fc
  6. 0ced3c8
  7. f266a73

Commits on Dec 22, 2021

  1. [CARBONDATA-4316] Fix horizontal compaction failure for partition tables

    Why is this PR needed?
    Horizontal compaction fails for partition tables, leaving many delete
    delta files for a single block and thus slower query performance.
    This happens because during horizontal compaction the delta file
    path prepared for the partition table is wrong, which fails to identify
    the path and fails the operation.
    
    What changes were proposed in this PR?
    If it is a partition table, read the segment file and identify the
    partition where the block is present to prepare a proper partition path.
    
    This closes #4240
    akashrn5 authored and kunal642 committed Dec 22, 2021
    d629dc0
  2. [CARBONDATA-4317] Fix TPCDS performance issues

    Why is this PR needed?
    The following issues have degraded TPCDS query performance:
    1. If a dynamic filter is not present in the partitionFilters set, that filter is skipped instead of being pushed down to spark.
    2. In some cases, nodes like Exchange / Shuffle are not reused, because the CarbonDataSourceScan plans are not matched.
    3. Accessing the metadata on the canonicalized plan throws an NPE.
    
    What changes were proposed in this PR?
    1. Check if the dynamic filter is present in the partitionFilters set. If not, push down the filter.
    2. Match the plans by converting them to canonicalized form and normalizing the expressions.
    3. Move variables used in metadata(), to avoid the NPE while comparing plans.
    
    This closes #4241
    Indhumathi27 authored and kunal642 committed Dec 22, 2021
    0f1d2a4

Commits on Dec 28, 2021

  1. [CARBONDATA-4319] Fixed clean files not deleting stale delete delta…

    … files after horizontal compaction
    
    Why is this PR needed?
    After horizontal compaction was performed on partition and non-partition tables, the clean files
    operation was not deleting the stale delete delta files. The code had been removed as part of the clean
    files refactoring done previously.
    
    What changes were proposed in this PR?
    Clean files with the force option now handles removal of these stale delta files as well as the stale
    tableupdatestatus file, for both partition and non-partition tables.
    
    This closes #4245
    vikramahuja1001 authored and kunal642 committed Dec 28, 2021
    a072e7a
  2. [CARBONDATA-4308]: added docs for streamer tool configs

    Why is this PR needed?
    Documentation for the CDC streamer tool is missing.
    
    What changes were proposed in this PR?
    Added the documentation for the CDC streamer tool, containing the configs
    along with images and example commands to try out.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #4243
    pratyakshsharma authored and akashrn5 committed Dec 28, 2021
    970f11d

Commits on Dec 29, 2021

  1. [CARBONDATA-4318] Improve load overwrite performance for partition tables

    Why is this PR needed?
    With the increase in the number of overwrite loads for a partition table,
    the time taken for each load keeps increasing over time. This is because:
    
    1. Whenever a load overwrite for a partition table is fired, it basically means
    that we need to overwrite or drop the partitions if anything overlaps with the
    current partitions getting loaded. Since carbondata stores the partition
    information in the segment files, to identify and drop partitions it
    reads all the previous segment files to identify and drop the overwritten
    partitions, which leads to a decrease in performance.
    
    2. After a partition load is completed, a cleanSegments method is called which
    again reads the segment file and table status file to identify Marked for Delete
    segments to clean. But since force clean is false and the timeout is
    more than a day by default, it's not necessary to call this method.
    Clean files should handle this part.
    
    What changes were proposed in this PR?
    1. We already have the information about the current partitions, so first
    identify if there are any partitions to overwrite; only if there are do we read segment
    files to call dropPartition, else we don't read the segment files unnecessarily.
    It also contains other refactoring to avoid reading the table status file.
    2. There is no need to call clean segments after every load. Clean files will take care
    of deleting the expired ones.
    
    This closes #4242
    akashrn5 authored and kunal642 committed Dec 29, 2021
    308906e

Commits on Jan 13, 2022

  1. [CARBONDATA-4320] Fix clean files removing wrong delta files

    Why is this PR needed?
    In the case where there are multiple delete delta files in a partition
    of a partition table, some delta files were being missed by the check and deleted,
    thus changing the results during query.
    
    What changes were proposed in this PR?
    Fixed the logic which checks which delta file to delete. Now checking
    the deltaStartTime and comparing it with deltaEndTime so as to consider
    all the delta files during clean files.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes, one test case has been added.
    
    This closes #4246
    vikramahuja1001 authored and akashrn5 committed Jan 13, 2022
    05aff87

Commits on Feb 14, 2022

  1. [CARBONDATA-4322] Apply local sort task level property for insert

    Why is this PR needed?
    Currently, when carbon.partition.data.on.tasklevel is enabled with
    local sort, the number of tasks launched for a load is based on
    node locality. But for the insert command, the local sort task level
    property is not applied, causing the number of tasks
    launched to be based on the input files.
    
    What changes were proposed in this PR?
    Included changes to apply the carbon.partition.data.on.tasklevel property
    for the insert command as well (see below). Used DataLoadCoalescedRDD to coalesce
    the partitions and a DataLoadCoalescedUnwrapRDD to unwrap partitions
    from DataLoadPartitionWrap and iterate.
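
    For reference, this is a carbon system property; it can be set like this (a sketch
    assuming the usual CarbonProperties API):

        import org.apache.carbondata.core.util.CarbonProperties

        // Distribute partition data at task level during local-sort loads and inserts
        CarbonProperties.getInstance()
          .addProperty("carbon.partition.data.on.tasklevel", "true")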
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4248
    ShreelekhyaG authored and Indhumathi27 committed Feb 14, 2022
    59f23c0

Commits on Mar 4, 2022

  1. [CARBONDATA-4325] Update Data frame supported options in document and…

    … fix partition table creation with df spatial property
    
    Why is this PR needed?
    1. Only specific properties are supported via dataframe options. The documentation needs to be updated.
    2. Create partition table fails with the spatial index property for a carbon table created with a dataframe in spark-shell.
    
    What changes were proposed in this PR?
    1. Added the dataframe-supported properties to the documentation.
    2. Using spark-shell, the table gets created with a carbon session, and catalogTable.properties
    is empty here. Now getting the properties from catalogTable.storage.properties to access the properties that were set.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No, tested in cluster.
    
    This closes #4250
    ShreelekhyaG authored and Indhumathi27 committed Mar 4, 2022
    c840b5f
  2. [CARBONDATA-4326] MV not hitting with multiple sessions issue fix

    Why is this PR needed?
    An MV created in beeline is not hit in sql/shell, and vice versa, if both
    beeline and sql/shell are running in parallel. Currently, if the view
    catalog for a particular session is already initialized, the schemas
    are not reloaded each time. So when an mv is created in another session
    and queried from the currently open session, the mv is not hit.
    
    What changes were proposed in this PR?
    1. Reload the mv catalog every time to getSchemas from the path. Register the
    schema if not present in the catalog and deregister the schema if it's dropped.
    2. When create SI is triggered, there is no need to try rewriting the plan and
    check for mv schemas. So, returning the plan if DeserializeToObject is present.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No, tested in cluster
    
    This closes #4251
    ShreelekhyaG authored and Indhumathi27 committed Mar 4, 2022
    19343a7

Commits on Mar 7, 2022

  1. 9b74951
  2. e25d5b6
  3. [CARBONDATA-4306] Fix Query Performance issue for Spark 3.1

    Why is this PR needed?
    Some non-partition filters, which cannot be handled by carbon, are not pushed down to spark.
    
    What changes were proposed in this PR?
    If the partition filter set is non-empty and the filter column is not a partition column, then push the filter down to spark.
    
    This closes #4252
    Indhumathi27 authored and kunal642 committed Mar 7, 2022
    a838531

Commits on Mar 18, 2022

  1. [CARBONDATA-4327] Update documentation related to partition

    Why is this PR needed?
    Drop partition with data is not supported and a few of the links are not working.
    
    What changes were proposed in this PR?
    Removed unsupported syntax and duplicate headings, and updated the header with proper links.
    
    This closes #4254
    ShreelekhyaG authored and kunal642 committed Mar 18, 2022
    41831ce

Commits on Mar 29, 2022

  1. [CARBONDATA-4328] Load parquet table with options error message fix

    Why is this PR needed?
    If a parquet table is created and a load statement with options is
    triggered, then it fails with NoSuchTableException:
    Table ${tableIdentifier.table} does not exist.
    
    What changes were proposed in this PR?
    As parquet table load is not handled, added a check to filter out
    non-carbon tables in the parser, so that the spark parser can handle the statement.
    
    This closes #4253
    ShreelekhyaG authored and Indhumathi27 committed Mar 29, 2022
    d6ce946

Commits on Apr 1, 2022

  1. [CARBONDATA-4329] Fix multiple issues with External table

    Why is this PR needed?
    Issue 1:
    When we create an external table on a transactional table location,
    a schema file is already present. While creating the external table,
    which is also transactional, the schema file is overwritten.
    
    Issue 2:
    If an external table is created on a location where the source table
    already exists, dropping the external table deletes the table data,
    and queries on the source table fail.
    
    What changes were proposed in this PR?
    Avoid writing the schema file if the table type is external and transactional.
    Don't drop the external table location data if table_type is external.
    
    This closes #4255
    Indhumathi27 authored and kunal642 committed Apr 1, 2022
    46b62cf

Commits on Apr 28, 2022

  1. [CARBONDATA-4330] Incremental Dataload of Average aggregate in MV

    Why is this PR needed?
    Currently, whenever an MV is created with an average aggregate, a full
    refresh is done, meaning the whole MV is reloaded for any newly
    added segments. This slows down the loading. With incremental
    data load, only the newly added segments need to be loaded to the MV.
    
    What changes were proposed in this PR?
    If avg is present, rewrite the query with the sum and count of the
    columns to create the MV, and use them to derive avg (sketched below).
    Refer: https://docs.google.com/document/d/1kPEMCX50FLZcmyzm6kcIQtUH9KXWDIqh-Hco7NkTp80/edit
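
    Conceptually, the rewrite looks like this (a sketch, not the exact internal plan;
    table and column names are hypothetical):

        // user-facing MV definition with avg
        spark.sql("CREATE MATERIALIZED VIEW mv1 AS SELECT a, avg(b) FROM t GROUP BY a")
        // internally maintained incrementally as sum/count per group, roughly:
        //   SELECT a, sum(b), count(b) FROM t GROUP BY a
        // avg(b) is then derived at query time as sum(b) / count(b), so newly added
        // segments only contribute partial sums and counts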
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4257
    ShreelekhyaG authored and Indhumathi27 committed Apr 28, 2022
    45acd67

Commits on May 27, 2022

  1. [CARBONDATA-4336] Table Status Versioning

    Why is this PR needed?
    Currently, carbondata stores the records of a transaction (load/insert/IUD/add/drop segment)
    in a metadata file named `tablestatus`, which is present in the Metadata directory.
    If the tablestatus file is lost, then the metadata for the transactions cannot be recovered
    directly, as there is no previous version file available for tablestatus. Hence, if we support
    versioning for tablestatus files, it becomes easy to recover the current version tablestatus
    meta from previous version tablestatus files.
    
    Please refer to Table Status Versioning & Recovery Tool for more info.
    
    What changes were proposed in this PR?
    -> On each transaction commit, commit the latest load metadata details to a new version file (see the sketch below)
    -> Update the latest tablestatus version timestamp in the table properties [CarbonTable cache] and in the hive metastore
    -> Added a table status version tool which can recover the latest transaction details based on the old version files
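
    A minimal sketch of the versioning idea (the naming scheme below is hypothetical,
    chosen only to illustrate one-file-per-commit versioning):

        // Each transaction commit writes a fresh version file instead of
        // overwriting the single `tablestatus` file, so older versions survive.
        def versionedTableStatusPath(metadataDir: String, commitTs: Long): String =
          s"$metadataDir/tablestatus_$commitTs"

        // e.g. versionedTableStatusPath("/store/db/t/Metadata", 1653640000000L)
        //  =>  "/store/db/t/Metadata/tablestatus_1653640000000"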
    
    Does this PR introduce any user interface change?
    Yes
    
    Is any new testcase added?
    Yes
    
    This closes #4261
    Indhumathi27 authored and akashrn5 committed May 27, 2022
    57e76ee

Commits on Jun 2, 2022

  1. [CARBONDATA-4335] Disable MV by default

    Why is this PR needed?
    Currently materialized view (mv) is enabled by default. In concurrent scenarios
    with mv enabled by default, each session goes through the list of databases
    even though mv is not used. Due to this, query time increased.
    
    What changes were proposed in this PR?
    Disable mv by default, as users use mv rarely. If required, the user can enable
    and use it (see below).
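
    Assuming the switch is exposed as a carbon property (the property key below is an
    assumption; check CarbonCommonConstants for the exact name):

        import org.apache.carbondata.core.util.CarbonProperties

        // Opt back in to MV query rewrite for deployments that need it
        CarbonProperties.getInstance().addProperty("carbon.enable.mv", "true")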
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4264
    maheshrajus authored and Indhumathi27 committed Jun 2, 2022
    33408be

Commits on Jun 22, 2022

  1. [CARBONDATA-4341] Drop Index Fails after TABLE RENAME

    Why is this PR needed?
    Drop Index Fails after TABLE RENAME
    
    What changes were proposed in this PR?
    After a table rename, the SI tables' property parentTableName is updated
    with the latest name, and the index metadata gets updated. The table is dropped
    from the metadata cache so that it is reloaded with the updated
    property when fetched next time.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4279
    ShreelekhyaG authored and Indhumathi27 committed Jun 22, 2022
    4b8846d

Commits on Jun 23, 2022

  1. [CARBONDATA-4339] Fix NullPointerException in load overwrite on partit…

    …ion table
    
    Why is this PR needed?
    After delete segment and clean files with the force option true, the load overwrite
    operation throws a null pointer exception. This is because when clean files with
    force is done, all remaining marked-for-delete segments except the 0th and last
    segments are moved to the tablestatus.history file, irrespective of the status of
    the 0th and last segments. During an overwrite load, the overwritten partition
    is dropped. Since all the segments were physically deleted by clean
    files, and the load model's load metadata details list still contains the 0th segment,
    which is marked for delete, the operation fails.
    
    What changes were proposed in this PR?
    When the valid segments are collected, filter on the segment's status to avoid the failure.
    
    This closes #4280
    akashrn5 authored and Indhumathi27 committed Jun 23, 2022
    93b0af2

Commits on Jun 27, 2022

  1. [CARBONDATA-4344] Create MV fails with "LOCAL_DICTIONARY_INCLUDE/LOCA…

    …L_DICTIONARY_EXCLUDE column: does not exist in table. Please check the DDL" error
    
    Why is this PR needed?
    Create MV fails with a "LOCAL_DICTIONARY_INCLUDE/LOCAL_DICTIONARY_EXCLUDE column: does not exist in table.
    Please check the DDL" error.
    The error occurs only in this scenario: Create Table --> Load --> Alter Add Columns --> Drop table --> Refresh Table --> Create MV
    and not in a direct scenario like: Create Table --> Load --> Alter Add Columns --> Create MV
    
    What changes were proposed in this PR?
    1. After the add column command, the LOCAL_DICTIONARY_INCLUDE and LOCAL_DICTIONARY_EXCLUDE properties
       are added to the table even if the column lists are empty. So, when an MV is created next, since a
       LOCAL_DICTIONARY_EXCLUDE column is defined it tries to access its columns and fails.
       --> Added an empty check before adding the properties to the table to resolve this.
    2. In the direct scenario after add column, the schema gets updated in the catalog table but
       the table properties are not updated. Made changes to update the table properties in the catalog table.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4282
    ShreelekhyaG authored and Indhumathi27 committed Jun 27, 2022
    858afc7

Commits on Jul 1, 2022

  1. [CARBONDATA-4345] update/delete operations failed when other format s…

    …egment deleted from carbon table
    
    Why is this PR needed?
    Update/delete operations failed when other format segments were deleted from a carbon table.
    Steps to reproduce:
    1. create a carbon table and load the data
    2. create parquet/orc tables and load the data
    3. add parquet/orc format segments to the carbon table with the alter add segment command
    4. perform update/delete operations on the carbon table; they will fail as the table
       contains mixed format segments. This is expected behaviour.
    5. delete the other format segments which were added in step 3
    6. try to perform update/delete operations on the carbon table; they should not fail
    
    For update/delete operations, we check whether other format segments are present
    in the table path. If found, carbondata throws an exception saying mixed
    format segments exist, even though the other format segments were deleted from the table.
    
    What changes were proposed in this PR?
    When checking whether other format segments are present in the carbon table, only
    SUCCESS/PARTIAL_SUCCESS segments should be considered.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4285
    maheshrajus authored and Indhumathi27 committed Jul 1, 2022
    b8511b6

Commits on Jul 11, 2022

  1. [CARBONDATA-4342] Fix Desc Columns shows New Column added, even thoug…

    …h Alter ADD column query failed
    
    Why is this PR needed?
    1. When the spark.carbon.hive.schema.store property is enabled, alter operations fail
    with a ClassCastException.
    2. When an alter add/drop/rename column operation failed due to the issue mentioned above,
    the revert schema operation was not reverting back to the old schema.
    
    What changes were proposed in this PR?
    1. Use org.apache.spark.sql.hive.CarbonSessionCatalogUtil#getClient to get the HiveClient,
    to avoid the ClassCastException.
    2. Revert the schema in the spark catalog table as well, in case of failure.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4277
    Indhumathi27 authored and akashrn5 committed Jul 11, 2022
    8691cb7

Commits on Jul 19, 2022

  1. [CARBONDATA-4338] Moving dropped partition data to trash

    Why is this PR needed?
    When a drop partition operation is performed, carbondata
    modifies only the table status file and does not delete the actual
    partition folder, which contains data and index files. To
    comply with hive behaviour, carbondata should also delete
    the dropped partition folder in storage [hdfs/obs/etc.].
    Before deleting, carbondata keeps a copy in the Trash folder.
    The user can restore it by checking the partition name and timestamp.
    
    What changes were proposed in this PR?
    Moved the dropped partition folder files to the trash folder.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    Yes
    
    This closes #4276
    maheshrajus authored and Indhumathi27 committed Jul 19, 2022
    04b1756

Commits on Apr 8, 2023

  1. b690cf2
  2. remove wechat info

    chenliang613 committed Apr 8, 2023
    8e9ffd5
  3. 8b8345a
  4. 2f0241d
  5. c780221
  6. Merge pull request #4300 from xubo245/issue-4298

     [ISSUE-4298] Fixed mailing list issue
    chenliang613 authored Apr 8, 2023
    4af8af4
  7. Merge pull request #4301 from xubo245/issue-4299

    [ISSUE-4299] Fixed compile issue with spark 2.3
    chenliang613 authored Apr 8, 2023
    92f4dff

Commits on Apr 9, 2023

  1. 3cc0367
  2. Merge pull request #4307 from xubo245/ISSUE-4305-magicNumber

    [ISSUE-4305] Optimize the magic number
    chenliang613 authored Apr 9, 2023
    cba9a8a
  3. f92ae07
  4. 9d43c78

Commits on Apr 10, 2023

  1. [ISSUE-4306] Fix the error of SDKS3SchemaReadExample (#4312)

    Fix the issue when reading schema from S3
    xubo245 authored Apr 10, 2023
    01dd526
  2. 44c2bca

Commits on Apr 13, 2023

  1. b941983

Commits on Apr 24, 2023

  1. [ISSUE-4305] Optimize the usage of static method (#4309)

    A static method shouldn't be called on an object; it should be called on the class.
    xubo245 authored Apr 24, 2023
    2439589

Commits on Jun 8, 2023

  1. f31edd6

Commits on Jun 26, 2023

  1. Add new example: Using CarbonData for visualization in a notebook (#4318)

    * Add new example: Using CarbonData for visualization in a notebook
    
    * Update the example: Using CarbonData in a notebook
    xubo245 authored Jun 26, 2023
    208afe9
  2. [ISSUE-4305] Optimize the constants and variable style (#4311)

    CONSTANTS should be like: UPPER_NAME;
    variables should be lowerCamelCase.
    xubo245 authored Jun 26, 2023
    6e031c0

Commits on Jul 8, 2023

  1. 8264b3b

Commits on Aug 20, 2023

  1. Create maven.yml

    chenliang613 authored Aug 20, 2023
    cd180c9
  2. Update maven.yml

    chenliang613 authored Aug 20, 2023
    95b50e8

Commits on Oct 1, 2023

  1. optimize code smells in presto module (#4332)

    optimize equals
    xubo245 authored Oct 1, 2023
    13a2c97
  2. [ISSUE-4329] optimize some code smells in presto module (#4330)

    A static method shouldn't be called on an object; it should be called on the class.
    xubo245 authored Oct 1, 2023
    beb426c
  3. Update maven.yml

    chenliang613 authored Oct 1, 2023
    9f604fc
  4. Update maven.yml

    chenliang613 authored Oct 1, 2023
    95a6407
  5. Update maven.yml

    chenliang613 authored Oct 1, 2023
    4462461
  6. Update maven.yml

    chenliang613 authored Oct 1, 2023
    af9c6c3
  7. d499699
  8. fix ci issues

    chenliang613 committed Oct 1, 2023
    bcb30a5

Commits on Oct 10, 2023

  1. Create maven.yml

    chenliang613 authored Oct 10, 2023
    30c1aa8
  2. Update maven.yml

    chenliang613 authored Oct 10, 2023
    84cfd20
  3. Update pom.xml

    chenliang613 authored Oct 10, 2023
    504a5ae
  4. Update maven.yml

    chenliang613 authored Oct 10, 2023
    66cb3a3
  5. Update maven.yml

    chenliang613 authored Oct 10, 2023
    Commit: 13ac2ef
  6. Update pom.xml

    chenliang613 authored Oct 10, 2023
    Commit: 0e0523e
  7. Update maven.yml

    chenliang613 authored Oct 10, 2023
    Commit: 0cae3d1
  8. Update maven.yml

    chenliang613 authored Oct 10, 2023
    Commit: fd66031
  9. Update maven.yml

    chenliang613 authored Oct 10, 2023
    Commit: 38fdb16

Commits on Oct 17, 2023

  1. fix

    chenliang613 committed Oct 17, 2023
    Commit: ebe4101
  2. Optimize code smell in the presto module (#4331)

    add override annotations (see the sketch below)
    xubo245 authored Oct 17, 2023
    Commit: 4618808
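
    "Add override" in Java presumably means annotating overriding methods
    with @Override so the compiler verifies them; a minimal sketch (the
    types are hypothetical):

        class Base {
          String name() { return "base"; }
        }

        class Derived extends Base {
          @Override  // compiler now checks this really overrides Base.name()
          String name() { return "derived"; }
        }

        public class OverrideExample {
          public static void main(String[] args) {
            System.out.println(new Derived().name());  // prints "derived"
          }
        }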

Commits on Oct 19, 2023

  1. Commit 448564a
  2. Commit f18846c
  3. Commit 39dd8ce

Commits on Nov 5, 2023

  1. Update pom.xml

    chenliang613 authored Nov 5, 2023
    Commit: dd74408
  2. line 117 (#4237)

    #117: the longitude has six decimal places and the latitude has five digits. Why are they the same length after conversion?
    WANGSHIHUAI authored Nov 5, 2023
    Commit: 1e327f2

Commits on Nov 6, 2023

  1. fix testcase error (#4337)

    Co-authored-by: QiangCai <[email protected]>
    QiangCai and QiangCai authored Nov 6, 2023
    Commit: 57de4a3

Commits on Nov 8, 2023

  1. [ISSUE-4338] Fix checkstyle issue in sdk module (#4339)

    Co-authored-by: QiangCai <[email protected]>
    QiangCai and QiangCai authored Nov 8, 2023
    Commit: 4a1b36f

Commits on Nov 11, 2023

  1. [CARBONDATA-4333][Doc] Update the declaration of supported String data types (#4263)
    
    Why is this PR needed?
    CHAR and VARCHAR are no longer supported as String data types in Carbon, so they should be removed from the documentation's description.
    
    What changes were proposed in this PR?
    CHAR and VARCHAR no longer appear as String data types in the documentation.
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    Co-authored-by: tangchuan <[email protected]>
    tangchuan92 and tangchuan authored Nov 11, 2023
    Commit: 7abc7cd
  2. Commit 53d3370
  3. Update pom.xml

    chenliang613 authored Nov 11, 2023
    Commit: 64ecd77
  4. Update pom.xml

    chenliang613 authored Nov 11, 2023
    Commit: 48f5976

Commits on Nov 19, 2023

  1. [ISSUE-4342] Fix test case errors (#4343)

    Co-authored-by: QiangCai <[email protected]>
    QiangCai and QiangCai authored Nov 19, 2023
    Commit: 7195869
  2. Commit d326118
  3. Commit a6e9e37

Commits on Dec 2, 2023

  1. Commit bcc7137

Commits on Dec 9, 2023

  1. Bump pyarrow from 0.11.1 to 14.0.1 in /python (#4341)

    Bumps [pyarrow](https://github.com/apache/arrow) from 0.11.1 to 14.0.1.
    - [Commits](apache/arrow@apache-arrow-0.11.1...go/v14.0.1)
    
    ---
    updated-dependencies:
    - dependency-name: pyarrow
      dependency-type: direct:production
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Dec 9, 2023
    Commit: dee20b8

Commits on Mar 16, 2024

  1. Commit c1b0e9c

Commits on Mar 22, 2024

  1. Minor refactor of the build/README.md (#4349)

    * Minor refactor of the build docs
    
    * Fix review comments
    
    * Update build/README.md
    git-hulk authored Mar 22, 2024
    Commit: 74e6e93

Commits on Apr 6, 2024

  1. Commit 71abab0

Commits on Jun 30, 2024

  1. modify thrift version (#4356)

    Co-authored-by: jacky <[email protected]>
    jackylk and jacky authored Jun 30, 2024
    Commit: 5ff36b6
  2. [CARBONDATA-4349] Upgrade thrift version (#4355)

    * upgrade thrift version
    
    * change to use 0.20.0
    
    ---------
    
    Co-authored-by: jacky <[email protected]>
    jackylk and jacky authored Jun 30, 2024
    Commit: f370d20

Commits on Jul 6, 2024

  1. Bump org.apache.commons:commons-compress in /integration/presto (#4345)

    Bumps org.apache.commons:commons-compress from 1.4.1 to 1.26.0.
    
    ---
    updated-dependencies:
    - dependency-name: org.apache.commons:commons-compress
      dependency-type: direct:production
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Jul 6, 2024
    Commit: 8f9ce4e
  2. Commit 2c78847
  3. Commit 29607c3

Commits on Oct 5, 2024

  1. [ISSUE-4351] Add GitHub Action for building (#4358)

    * add GitHub Action for building
    
    * Revert "[WIP] Optimize geo module, the feature seems less be used (#4353)"
    
    This reverts commit 29607c3.
    
    * Revert "[WIP] Optimize geo module, the feature seems less be used"
    
    This reverts commit 71abab0.
    
    * cache thrift
    kevinjmh authored Oct 5, 2024
    Commit: e0ac69a