doris

Author	SHA1	Message	Date
Mingyu Chen	40be6a0b05	[fix](hive) do not split compress data file and support lz4/snappy block codec (#23245 ) 1. do not split compress data file Some data file in hive is compressed with gzip, deflate, etc. These kinds of file can not be splitted. 2. Support lz4 block codec for hive scan node, use lz4 block codec instead of lz4 frame codec 4. Support snappy block codec For hadoop snappy 5. Optimize the `count()` query of csv file For query like `select count() from tbl`, only need to split the line, no need to split the column. Need to pick to branch-2.0 after this PR: #22304	2023-08-26 12:59:05 +08:00
slothever	f66f161017	[fix](multi-catalog)fix hive table with cosn location issue (#23409 ) Sometimes, the partitions of a hive table may on different storage, eg, some is on HDFS, others on object storage(cos, etc). This PR mainly changes: 1. Fix the bug of accessing files via cosn. 2. Add a new field `fs_name` in TFileRangeDesc This is because, when accessing a file, the BE will get a hdfs client from hdfs client cache, and different file in one query request may have different fs name, eg, some of are `hdfs://`, some of are `cosn://`, so we need to specify fs name for each file, otherwise, it may return error: `reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://[172.xxxx:4007](http://172.xxxxx:4007/)`	2023-08-26 00:16:00 +08:00
Qi Chen	8af1e7f27f	[Fix](orc-reader) Fix incorrect result if null partition fields in orc file. (#23369 ) Fix incorrect result if null partition fields in orc file. ### Root Cause Theoretically, the underlying file of the hive partition table should not contain partition fields. But we found that in some user scenarios, the partition field will exist in the underlying orc/parquet file and are null values. As a result, the pushed down partition field which are null values. filter incorrectly. ### Solution we handle this case by only reading non-partition fields. The parquet reader is already handled this way, this PR handles the orc reader.	2023-08-26 00:13:11 +08:00
Qi Chen	a3a951c71d	[Fix](multi-catalog) Fix load string dict issue for transactional hive tables. (#23306 ) Fix load string dict issue for transactional hive tables. The column name need to pass 'row.column_name'. apache/doris-thirdparty#112	2023-08-26 00:09:12 +08:00
Qi Chen	29273771f7	[Fix](multi-catalog) Fix hive incorrect result by disable string dict filter if exprs contain null expr. (#23361 ) Issue Number: close #21960 Fix hive incorrect result by disable string dict filter if exprs contain null expr.	2023-08-25 21:16:43 +08:00
Qi Chen	caddcc6215	[Fix](orc-reader) Fix decimal type check for ColumnValueRange issue and use primitive_type. (#23424 ) Fix decimal type check for ColumnValueRange issue and use primitive_type in orc_reader. Because in #22842 the `CppType` of `PrimitiveTypeTraits<TYPE_DECIMALXXX> ` were changed.	2023-08-24 23:26:41 +08:00
daidai	2dda44d7b5	[fix](csv-reader)fix bug of multi-char delimiter in csv reader fix bug that csv_reader parse line in order to get column.	2023-08-23 15:19:13 +08:00
lihangyu	527293aa41	[refactor](dynamic table) remove dynamic table (#23298 )	2023-08-23 14:15:14 +08:00
Pxl	8ed4045df9	[Chore](primitive-type) remove VecPrimitiveTypeTraits (#22842 )	2023-08-23 08:37:40 +08:00
Pxl	1a1f86486d	[Improvement](function) opt for case when (#23068 ) opt for case when	2023-08-22 18:31:40 +08:00
Ashin Gau	9d2e23b1aa	[fix](parquet) A row of complex type may be stored across more pages (#23277 ) A row of complex type may be stored across two(or more) pages, and the parameter `align_rows` indicates that whether the reader should read the remaining value of the last row in previous page.	2023-08-22 14:47:10 +08:00
Ashin Gau	5ff7b57fc1	[fix](parquet) parquet reader confuses logical/physical/slot id of columns (#23198 ) `ParquetReader` confuses logical/physical/slot id of columns. If only reading the scalar types, there's nothing wrong, but when reading complex types, `RowGroup` and `PageIndex` will get wrong statistics. Therefore, if the query contains complex types and pushed-down predicates, the probability of the result set is incorrect.	2023-08-22 13:35:29 +08:00
plat1ko	d4694167a8	[Enhancement](chore) Some Status relevant enhancement (#23072 )	2023-08-21 14:14:38 +08:00
Ashin Gau	4bf055c818	[fix](parquet) the key colum of map type in parquet may be nullable (#23180 ) Fix errors when reading map type with nullable key column in parquet file. `ParquetReader` support to read nullable key column, but add a check to prevent reading nullable key column. Unfortunately, this check error was not thrown correctly, causing the BE to crash, and thrown meaningless error logs in be.out: ``` ... 11# doris::vectorized::ParquetReader::get_columns(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, doris::TypeDescriptor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, doris::TypeDescriptor> > >, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) at /root/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:508 12# doris::vectorized::VFileScanner::_get_next_reader() in /root/yun_you_external/output/be/lib/doris_be 13# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState, doris::vectorized::Block, bool*) at /root/doris/be/src/vec/exec/scan/vfile_scanner.cpp:241 ... ```	2023-08-20 22:59:18 +08:00
daidai	419e922a69	[fix](json)Fix the bug that does not stop when reading json files (#23062 ) * [fix](json)Fix the bug that does not stop when reading json files	2023-08-18 18:23:19 +08:00
Ashin Gau	795006ea3d	[fix](multi-catalog) conversion of compatible numerical types (#23113 ) Hive support schema change, but doesn't rewrite the parquet file, so the physical type of parquet file may not equal the logical type of table schema.	2023-08-18 14:05:33 +08:00
wuwenchi	a5ca6cadd6	[Improvement] Optimize count operation for iceberg (#22923 ) Iceberg has its own metadata information, which includes count statistics for table data. If the table does not contain equli'ty delete, we can get the count data of the current table directly from the count statistics.	2023-08-18 09:57:51 +08:00
Qi Chen	314f5a5143	[Fix](orc-reader) Fix filling partition or missing column used incorrect row count. (#23096 ) [Fix](orc-reader) Fix filling partition or missing column used incorrect row count. `_row_reader->nextBatch` returns number of read rows. When orc lazy materialization is turned on, the number of read rows includes filtered rows, so caller must look at `numElements` in the row batch to determine how many rows were not filtered which will to fill to the block. In this case, filling partition or missing column used incorrect row count which will cause be crash by `filter.size() != offsets.size()` in filter column step. When orc lazy materialization is turned off, add `_convert_dict_cols_to_string_cols(block, nullptr)` if `(block->rows() == 0)`.	2023-08-17 23:26:11 +08:00
Siyang Tang	b49dc8042d	[feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load (#22539 ) ## Proposed changes Refactor thoughts: close #22383 Descriptions about `enclose` and `escape`: #22385 ## Further comments 2023-08-09: It's a pity that experiment shows that the original way for parsing plain CSV is faster. Therefor, the refactor is only applied on enclose related code. The plain CSV parser use the original logic. Fallback of performance is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the write column behavior, proved by the flame graph. Trimming escape will be enable after fix: #22411 is merged Cases should be discussed: 1. When an incomplete enclose appears in the beginning of a large scale data, the line delimiter will be unreachable till the EOF, will the buffer become extremely large? 2. What if an infinite line occurs in the case? Essentially, `1.` is equivalent to this. Only support stream load as trial in this PR, avoid too many unrelated changes. Docs will be added when `enclose` and `escape` is available for all kinds of load.	2023-08-15 09:23:53 +08:00
amory	5e2748d2b4	[Improve](complex-type)update orc reader for complex type and add regress tests (#22856 )	2023-08-12 07:06:12 +08:00
DongLiang-0	db69457576	[fix](avro)Fix S3 TVF avro format reading failure (#22199 ) This pr fixes two issues: 1. when using s3 TVF to query files in AVRO format, due to the change of `TFileType`, the originally queried `FILE_S3 ` becomes `FILE_LOCAL`, causing the query failed. 2. currently, both parameters `s3.virtual.key` and `s3.virtual.bucket` are removed. A new `S3Utils` in jni-avro to parse the bucket and key of s3. The purpose of doing this operation is mainly to unify the parameters of s3.	2023-08-11 17:22:48 +08:00
amory	be1e0dcd27	[new-feature](complex-type) support read nested parquet and orc file with complex type (#22793 )	2023-08-10 18:23:07 +08:00
Qi Chen	f2658dc7bd	[Feature](multi-catalog) Truncate char or varchar columns if size is smaller than file columns or not found in the file column schema. (#22318 ) Truncate char or varchar columns if size is smaller than file columns or not found in the file column schema by session var `truncate_char_or_varchar_columns`.	2023-08-10 14:37:20 +08:00
daidai	f1db6bd8c1	[feature](hive)append support for struct and map column type on textfile format of hive table (#22347 ) 1. append support for struct and map column type on textfile format of hive table. 2. optimizer code that array column type. ```mysql +------+------------------------------------+ \| id \| perf \| +------+------------------------------------+ \| 1 \| {"key1":"value1", "key2":"value2"} \| \| 1 \| {"key1":"value1", "key2":"value2"} \| \| 2 \| {"name":"John", "age":"30"} \| +------+------------------------------------+ ``` ```mysql +---------+------------------+ \| column1 \| column2 \| +---------+------------------+ \| 1 \| {10, "data1", 1} \| \| 2 \| {20, "data2", 0} \| \| 3 \| {30, "data3", 1} \| +---------+------------------+ ``` Summarizes support for complex types(support assign delimiter) : 1. array< primitive_type > and array< array< ... > > 2. map< primitive_type , primitive_type > 3. Struct< primitive_type , primitive_type ... >	2023-08-10 13:47:58 +08:00
zzzzzzzs	66784cef71	[Enhancement](Load) Stream Load using SQL (#22509 ) This PR was originally #16940 , but it has not been updated for a long time due to the original author @Cai-Yao . At present, we will merge some of the code into the master first. thanks @Cai-Yao @yiguolei	2023-08-08 13:49:04 +08:00
Xujian Duan	3024b82918	[fix](load)Fix wrong default value for char and varchar of reading json data (#22626 ) If a column is defined as: col VARCHAR/CHAR NULL and no default value. Then we load json data which misses column col, the result queried is not correct: +------+ \| col \| +------+ \| 1 \| +------+ But expect: +------+ \| col \| +------+ \| NULL \| +------+ --------- Co-authored-by: duanxujian <duanxujian@jd.com>	2023-08-05 12:47:27 +08:00
Qi Chen	9c0528daf6	[Opt](orc-reader) opt the performance of date convertion. (#22381 ) Opt the performance of date conversion in orc reader. ``` mysql> select count(l_commitdate) from lineitem; +---------------------+ \| count(l_commitdate) \| +---------------------+ \| 600037902 \| +---------------------+ 1 row in set (1.28 sec) mysql> select count(l_commitdate) from lineitem; +---------------------+ \| count(l_commitdate) \| +---------------------+ \| 600037902 \| +---------------------+ 1 row in set (0.19 sec) ```	2023-08-04 10:52:09 +08:00
Mingyu Chen	1ed1b69485	[refactor](reader) move reader from vec/exec/scan to vec/exec/format (#22371 ) This readers should be in vec/exec/format	2023-08-04 09:47:20 +08:00
Ashin Gau	938f768aba	[fix](parquet) resolve offset check failed in parquet map type (#22510 ) Fix error when reading empty map values in parquet. The `offsets.back()` doesn't not equal the number of elements in map's key column. ### How does this happen Map in parquet is stored as repeated group, and `repeated_parent_def_level` is set incorrectly when parsing map node in parquet schema. ``` the map definition in parquet: optional group <name> (MAP) { repeated group map (MAP_KEY_VALUE) { required <type> key; optional <type> value; } } ``` ### How to fix Set the `repeated_parent_def_level` of key/value node as the definition level of map node. `repeated_parent_def_level` is the definition level of the first ancestor node whose `repetition_type` equals `REPEATED`. Empty array/map values are not stored in doris column, so have to use `repeated_parent_def_level` to skip the empty or null values in ancestor node. For instance, considering an array of strings with 3 rows like the following: `null, [], [a, b, c]` We can store four elements in data column: `null, a, b, c` and the offsets column is: `1, 1, 4` and the null map is: `1, 0, 0` For the `i-th` row in array column: range from `offsets[i - 1]` until `offsets[i]` represents the elements in this row, so we can't store empty array/map values in doris data column. As a comparison, spark does not require `repeated_parent_def_level`, because the spark column stores empty array/map values , and use anther length column to indicate empty values. Please reference: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnVector.java Furthermore, we can also avoid store null array/map values in doris data column. The same three rows as above, We can only store three elements in data column: `a, b, c` and the offsets column is: `0, 0, 3` and the null map is: `1, 0, 0`	2023-08-02 22:33:10 +08:00
Jerry Hu	4bc65aa921	[fix](load) PrefetchBufferedReader Crashing caused updating counter with an invalid runtime profile (#22464 )	2023-08-02 18:19:48 +08:00
HappenLee	3a11de889f	[Opt](exec) opt the performance of date parquet convert by date dict (#22384 ) before： mysql> select count(l_commitdate) from lineitem; +---------------------+ \| count(l_commitdate) \| +---------------------+ \| 600037902 \| +---------------------+ 1 row in set (0.86 sec) after: mysql> select count(l_commitdate) from lineitem; +---------------------+ \| count(l_commitdate) \| +---------------------+ \| 600037902 \| +---------------------+ 1 row in set (0.36 sec)	2023-08-01 12:24:00 +08:00
zclllyybb	f2919567df	[feature](datetime) Support timezone when insert datetime value (#21898 )	2023-07-31 13:08:28 +08:00
HappenLee	4077338284	[Opt](parquet) opt the performance of date convertion (#22360 ) before： ``` mysql> select count(l_commitdate) from lineitem; +---------------------+ \| count(l_commitdate) \| +---------------------+ \| 600037902 \| +---------------------+ 1 row in set (1.61 sec) ``` after: ``` mysql> select count(l_commitdate) from lineitem; +---------------------+ \| count(l_commitdate) \| +---------------------+ \| 600037902 \| +---------------------+ 1 row in set (0.86 sec) ```	2023-07-30 15:54:13 +08:00
daidai	ae8a26335c	[opt](hive)opt select count() stmt push down agg on parquet in hive . (#22115 ) Optimization "select count() from table" stmtement , push down "count" type to BE. support file type : parquet ，orc in hive . 1. 4kfiles , 60kwline num before: 1 min 37.70 sec after: 50.18 sec 2. 50files , 60kwline num before: 1.12 sec after: 0.82 sec	2023-07-29 00:31:01 +08:00
Pxl	f7e0479605	[Chore](refactor) remove some unused code (#22152 ) remove some unused code	2023-07-28 17:30:46 +08:00
Ashin Gau	0d7d9b92db	[fix](multi-catalog) complex types parsing failed, with unexpected nulls and rows (#22228 ) Fix tow bugs: 1. Unexpected null values in array column. If 65535 consecutive values are not null in nullable array column, this error will be triggered. The reason is that the array parser did not handle boundary conditions. 2. The number of rows of key filed, and that of value field in map column are not equal. Similarly, the number of rows among fields in struct column are not the same. This would be triggered when the number of rows are not equal among parquet pages of different columns in a row group.	2023-07-28 10:03:08 +08:00
Qi Chen	7b270d1ae9	[Fix](mutli-catalog) Fix orc reader crashed when hdfs reading error by catching exception. (#22193 ) orc reader crashed when hdfs reading error. 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t, void) at /home/zcp/repo_center/zcp_repo/be/src/common/signal_handler.h:413 1# 0x00007F6F8B3C00C0 in /lib/x86_64-linux-gnu/libc.so.6 2# raise in /lib/x86_64-linux-gnu/libc.so.6 3# abort in /lib/x86_64-linux-gnu/libc.so.6 4# _gnu_cxx::_verbose_terminate_handler() [clone .cold] at ../../../../libstdc+-v3/libsupc+/vterminate.cc:75 5# _cxxabiv1::_terminate(void ()) at ../../../../libstdc+-v3/libsupc+/eh_terminate.cc:48 6# 0x0000555CBC4718C1 in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 7# 0x0000555CBC471A14 in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 8# doris::vectorized::ORCFileInputStream::read(void, unsigned long, unsigned long) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/format/orc/vorc_reader.cpp:121 9# orc::SeekableFileInputStream::Next(void const, int) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 10# orc::DecompressionStream::readHeader() in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 11# orc::DecompressionStream::Next(void const, int) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 12# void orc::RleDecoderV2::next<long>(long, unsigned long, char const) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 13# orc::StringDictionaryColumnReader::loadDictionary() in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 14# orc::StructColumnReader::loadStringDicts(std::unordered_map<unsigned long, std::_cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, std::unordered_map<std::cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, orc::StringDictionary, std::hash<std::cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::_cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, orc::StringDictionary> > >, orc::StringDictFilter const) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 15# orc::RowReaderImpl::startNextStripe(orc::ReadPhase const&) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 16# orc::RowReaderImpl::nextBatch(orc::ColumnVectorBatch&, void) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 17# doris::vectorized::OrcReader::get_next_block(doris::vectorized::Block, unsigned long, bool) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/format/orc/vorc_reader.cpp:1420 18# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState, doris::vectorized::Block, bool) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/scan/vfile_scanner.cpp:250 19# doris::vectorized::VScanner::get_block(doris::RuntimeState, doris::vectorized::Block, bool) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be 20# doris::vectorized::ScannerScheduler::_scanner_scan(doris::vectorized::ScannerScheduler, doris::vectorized::ScannerContext, std::shared_ptr<doris::vectorized::VScanner>) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/scan/scanner_scheduler.cpp:335 21# std::_Function_handler<void (), doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext)::$_1::operator()() const::	2023-07-26 08:57:31 +08:00
Qi Chen	752cec9e19	[Fix](multi-catalog) Fix not single slot filter conjuncts with dict filter issue. (#22052 ) ### Issue Dictionary filtering is a mechanism that directly reads the dictionary encoding of a single string column filter condition for filter comparison. But dictionary filtered single string columns may be included in other multi-column filter conditions. This can cause problems. For example: `select * from multi_catalog.lineitem_string_date_orc where l_commitdate < l_receiptdate and l_receiptdate = '1995-01-01' order by l_orderkey, l_partkey, l_suppkey, l_linenumber limit 10;` `l_receiptdate` is string filter column，it is included by multi-column filter condition `l_commitdate < l_receiptdate`. ### Solution Resolve it by separating the multi-column filter conditions and executing it after the dictionary filter column is converted to string.	2023-07-24 22:31:18 +08:00
lihangyu	20242d9a0e	[Improve](simdjson) put unescaped string value after parsed (#21866 ) In some cases, it is necessary to unescape the original value, such as when converting a string to JSONB. If not unescape, then later jsonb parse will be failed	2023-07-20 10:33:17 +08:00
Mingyu Chen	5fc0a84735	[improvement](catalog) reduce the size thrift params for external table query (#21771 ) ### 1 In previous implementation, for each FileSplit, there will be a `TFileScanRange`, and each `TFileScanRange` contains a list of `TFileRangeDesc` and a `TFileScanRangeParams`. So if there are thousands of FileSplit, there will be thousands of `TFileScanRange`, which cause the thrift data send to BE too large, resulting in: 1. the rpc of sending fragment may fail due to timeout 2. FE will OOM For a certain query request, the `TFileScanRangeParams` is the common part and is same of all `TFileScanRange`. So I move this to the `TExecPlanFragmentParams`. After that, for each FileSplit, there is only a list of `TFileRangeDesc`. In my test, to query a hive table with 100000 partitions, the size of thrift data reduced from 151MB to 15MB, and the above 2 issues are gone. ### 2 Support when setting `max_external_file_meta_cache_num` <=0, the file meta cache for parquet footer will not be used. Because I found that for some wide table, the footer is too large(1MB after compact, and much more after deserialized to thrift), it will consuming too much memory of BE when there are many files. This will be optimized later, here I just support to disable this cache.	2023-07-17 13:37:02 +08:00
HappenLee	a7eb186801	[Bug](CSVReader) fix null pointer coredump in CSVReader in p2 (#20811 )	2023-07-15 22:50:10 +08:00
Siyang Tang	b013f8006d	[enhancement](multi-table) enable mullti table routine load on pipeline engine (#21729 )	2023-07-14 12:16:32 +08:00
Qi Chen	6fd8f5cd2f	[Fix](parquet-reader) Fix parquet string column min max statistics issue which caused query result incorrectly. (#21675 ) In parquet, min and max statistics may not be able to handle UTF8 correctly. Current processing method is using min_value and max_value statistics introduced by PARQUET-1025 if they are used. If not, current processing method is temporarily ignored. A better way is try to read min and max statistics if it contains only ASCII characters. I will improve it in the future PR.	2023-07-14 00:09:41 +08:00
daidai	ff42cd9b49	[feature](hive)add read of the hive table textfile format array type (#21514 )	2023-07-11 22:37:48 +08:00
Mryange	8973610543	[feature](datetime) "timediff" supports calculating microseconds (#21371 )	2023-07-10 19:21:32 +08:00
GoGoWen	469c8b7ece	[Fix](JSON LOAD)fix json load issue when string conform with RFC 4627 #21390 should set: enable_simdjson_reader=false in master as master enable_simdjson_reader=true by default. Issue Number: close #21389 from rapidjson: Query String In addition to GetString(), the Value class also contains GetStringLength(). Here explains why: According to RFC 4627, JSON strings can contain Unicode character U+0000, which must be escaped as "\u0000". The problem is that, C/C++ often uses null-terminated string, which treats \0 as the terminator symbol. To conform with RFC 4627, RapidJSON supports string containing U+0000 character. If you need to handle this, you can use GetStringLength() to obtain the correct string length. For example, after parsing the following JSON to Document d: { "s" : "a\u0000b" } The correct length of the string "a\u0000b" is 3, as returned by GetStringLength(). But strlen() returns 1. GetStringLength() can also improve performance, as user may often need to call strlen() for allocating buffer. Besides, std::string also support a constructor: string(const char* s, size_t count); which accepts the length of string as parameter. This constructor supports storing null character within the string, and should also provide better performance.	2023-07-09 17:16:03 +08:00
Mingyu Chen	2678afd2db	[fix][improvement](fs) add HdfsIO profile and modification time (#21638 ) Refactor the interface of create_file_reader the file_size and mtime are merged into FileDescription, not in FileReaderOptions anymore. Now the file handle cache can get correct file's modification time from FileDescription. Add HdfsIO for hdfs file reader pick from [Enhancement](multi-catalog) Add hdfs read statistics profile. #21442	2023-07-08 14:49:44 +08:00
Mingyu Chen	242a35fa80	[fix](s3) fix s3 fs benchmark tool (#21401 ) 1. fix concurrency bug of s3 fs benchmark tool, to avoid crash on multi thread. 2. Add `prefetch_read` operation to test prefetch reader. 3. add `AWS_EC2_METADATA_DISABLED` env in `start_be.sh` to avoid call ec2 metadata when creating s3 client. 4. add `AWS_MAX_ATTEMPTS` env in `start_be.sh` to avoid warning log of s3 sdk.	2023-07-05 16:20:58 +08:00
caiconghui	db50face41	[fix](time_zone) be compatible with doris old version for CST time_zone when load orc file in broker load (#21263 ) Fix error for broker load with orc file when time_zone is CST of which message is "Failed to create orc row reader. reason = Can't open /usr/share/zoneinfo/CST" Co-authored-by: caiconghui1 <caiconghui1@jd.com>	2023-06-28 09:44:42 +08:00
lihangyu	50c1d55769	[Improve](dynamic schema) support filtering invalid data (#21160 ) * [Improve](dynamic schema) support filtering invalid data 1. Support dynamic schema to filter illegal data. 2. Expand the regular expression for ColumnName to support more column names. 3. Be compatible with PropertyAnalyzer and support legacy tables. 4. Default disable parse multi dimenssion array, since some bug unresolved	2023-06-26 19:32:43 +08:00

1 2 3 4 5

247 Commits