-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Support some compress functions #47307
base: master
Are you sure you want to change the base?
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and remember to format your file
|
||
Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments, | ||
uint32_t result, size_t input_rows_count) const override { | ||
// LOG(INFO) << "Executing FunctionCompress with " << input_rows_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove these commented lines
col_data[idx] = '0', col_data[idx + 1] = 'x'; | ||
for (int i = 0; i < 4; i++) { | ||
unsigned char byte = (value >> (i * 8)) & 0xFF; | ||
col_data[idx + 2 + i * 2] = "0123456789ABCDEF"[byte >> 4]; // 高4位 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont use Chinese
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and make magic values
|
||
auto st = compression_codec->compress(data, &compressed_str); | ||
|
||
if (!st.ok()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment about when will it fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add cases like regression-test/suites/query_p0/sql_functions/test_template_one_arg.groovy
did
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont need modify this file anymore
std::string func_name = "compress"; | ||
InputTypeSet input_types = {TypeIndex::String}; | ||
|
||
// 压缩多个不同的字符串 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont use Chinese comment
std::string uncompressed; | ||
Slice data; | ||
Slice uncompressed_slice; | ||
for (int row = 0; row < input_rows_count; row++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use size_t
, not int
illegal = 1; | ||
} else { | ||
if (data[0] != '0' || data[1] != 'x') { | ||
LOG(INFO) << "illegal: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont log info here
if (x >= 'A' && x <= 'F') return true; | ||
return false; | ||
}; | ||
auto trans = [](char x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use from_chars
and to_chars
to replace your user implemented lambdas
// Print the compressed string (after compression) | ||
// LOG(INFO) << "Compressed string at row " << row << ": " | ||
// << std::string(reinterpret_cast<const char*>(col_data.data())); | ||
col_offset[row] = col_offset[row - 1] + 10 + compressed_str.size() * 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this value for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first ten digits of the compress value are "0x" and eight digits long, followed by each digit split into two hexadecimal values
fa80a74
to
ca2b27e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the correcteness
const auto& str = arg_column.get_data_at(row); | ||
data = Slice(str.data, str.size); | ||
|
||
auto st = compression_codec->compress(data, &compressed_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when will compress fail?
Slice data; | ||
for (size_t row = 0; row < input_rows_count; row++) { | ||
null_map[row] = false; | ||
const auto& str = arg_column.get_data_at(row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to use virtual function here
\N \N | ||
|
||
-- !const_not_nullable -- | ||
0x05000000789C73C92FCA2C060005B00202 0x446F726973 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
carefully review your result!!!
Slice data; | ||
Slice uncompressed_slice; | ||
for (size_t row = 0; row < input_rows_count; row++) { | ||
auto check = [](char x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try to use std function firstly
const auto& str = arg_column.get_data_at(row); | ||
data = Slice(str.data, str.size); | ||
|
||
int illegal = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not bool?
unsigned char* src = compressed_str.data(); | ||
{ | ||
for (size_t i = 0; i < compressed_str.size(); i++) { | ||
col_data[idx] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so tricky here. try to improve code like it
Slice data; | ||
Slice uncompressed_slice; | ||
for (size_t row = 0; row < input_rows_count; row++) { | ||
std::function<bool(char)> check = [](char x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use isxdigit
?
be/src/util/block_compression.cpp
Outdated
@@ -854,8 +854,13 @@ class ZlibBlockCompression : public BlockCompressionCodec { | |||
Slice s(*output); | |||
|
|||
auto zres = ::compress((Bytef*)s.data, &s.size, (Bytef*)input.data, input.size); | |||
if (zres != Z_OK) { | |||
return Status::InvalidArgument("Fail to do ZLib compress, error={}", zError(zres)); | |||
if (zres == Z_MEM_ERROR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also change other same calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split them to another PR may be better
implements UnaryExpression, ExplicitlyCastableSignature, PropagateNullable { | ||
|
||
public static final List<FunctionSignature> SIGNATURES = ImmutableList.of( | ||
FunctionSignature.ret(StringType.INSTANCE).args(StringType.INSTANCE)); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
implements UnaryExpression, ExplicitlyCastableSignature, AlwaysNullable { | ||
|
||
public static final List<FunctionSignature> SIGNATURES = ImmutableList.of( | ||
FunctionSignature.ret(StringType.INSTANCE).args(StringType.INSTANCE)); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
||
unsigned int length = 0; | ||
for (size_t i = 2; i <= 9; i += 2) { | ||
unsigned char byte = (hex_ctoi.at(data[i]) << 4) + hex_ctoi.at(data[i + 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove hex_ctoi
and just use from_chars
unsigned int length = 0; | ||
for (size_t i = 2; i <= 9; i += 2) { | ||
unsigned char byte = (hex_ctoi.at(data[i]) << 4) + hex_ctoi.at(data[i + 1]); | ||
length += (byte << (8 * (i / 2 - 1))); //Little Endian : 0x01000000 -> 1 |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
std::string uncompressed; | ||
Slice data; | ||
Slice uncompressed_slice; | ||
for (size_t row = 0; row < input_rows_count; row++) { |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
} | ||
idx += 10; | ||
|
||
col_data.resize(col_data.size() + 2 * compressed_str.size()); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
//Converts a hexadecimal readable string to a compressed byte stream | ||
std::string s(((int)data.size - 10) / 2, ' '); // byte stream data.size >= 10 | ||
for (size_t i = 10, j = 0; i < data.size; i += 2, j++) { | ||
s[j] = (hex_ctoi.at(data[i]) << 4) + hex_ctoi.at(data[i + 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
6ef40dd
to
b8b3bad
Compare
run buildall |
run buildall |
TPC-H: Total hot run time: 32100 ms
|
TPC-DS: Total hot run time: 184954 ms
|
ClickBench: Total hot run time: 30.68 s
|
run buildall |
run buildall |
TPC-H: Total hot run time: 32971 ms
|
TPC-DS: Total hot run time: 191631 ms
|
ClickBench: Total hot run time: 30.67 s
|
run buildall |
TPC-H: Total hot run time: 32328 ms
|
TPC-DS: Total hot run time: 185720 ms
|
ClickBench: Total hot run time: 30.3 s
|
run buildall |
0377751
to
c94f814
Compare
c94f814
to
9263dea
Compare
run buildall |
TPC-H: Total hot run time: 32060 ms
|
TPC-DS: Total hot run time: 184635 ms
|
ClickBench: Total hot run time: 31.04 s
|
TeamCity be ut coverage result: |
} | ||
|
||
// first ten digits represent the length of the uncompressed string | ||
col_data.resize(col_data.size() + 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L111 and L120's resize could be merged.
if (data[0] != '0' || data[1] != 'x') { | ||
illegal = true; | ||
} | ||
for (size_t i = 2; i <= 9; i += 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why +=2
here? should be ++
?
} | ||
|
||
{ | ||
std::string func_name = "uncompress"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add some case of uncompressing a string which is making from a valid compressed string with minor modifications.
like, for valid '0x1204', try to uncompress '0x12F4' which in invalid to get the NULL
result.
} | ||
|
||
uncompressed.resize(length); | ||
uncompressed_slice = Slice(uncompressed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the uncompressed
, just resize col_data
to add its length with length
. so just point uncompressed_slice
into col_data
. then we can do decompress inplace and no need to do heavily memcpy
(at L237)
What problem does this PR solve?
Added the compress and uncompressed functions similar to mysql
Issue Number: close #45530
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)