Commit Graph

1422 Commits

Author SHA1 Message Date
8930df3b31 [Feature](iceberg-writer) Implements iceberg partition transform. (#37692)
## Proposed changes

Cherry-pick iceberg partition transform functionality. #36289 #36889

---------

Co-authored-by: kang <35803862+ghkang98@users.noreply.github.com>
Co-authored-by: lik40 <lik40@chinatelecom.cn>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Mingyu Chen <morningman@163.com>
2024-07-13 16:07:50 +08:00
cf2fb6945a [branch-2.1](memory) Refactor LRU cache policy memory tracking (#37658)
pick 
#36235
#35965
2024-07-11 21:04:01 +08:00
62e0230523 [branch-2.1](memory) Add ThreadMemTrackerMgr BE UT (#37654)
## Proposed changes

pick #35518
2024-07-11 21:03:49 +08:00
fed632bf4a [fix](move-memtable) check segment num when closing each tablet (#36753) (#37536)
cherry-pick #36753 and #37660
2024-07-11 20:33:44 +08:00
9f4e7346fb [fix](compaction) fixing the inaccurate statistics of concurrent compaction tasks (#37318) (#37496) 2024-07-10 22:23:25 +08:00
afcc6170f6 [fix](txn_manager) Add ingested rowsets to unused rowsets when removing txn (#37417)
Generally speaking, as long as a rowset has a version, it can be
considered not to be in a pending state. However, if the rowset was
created through ingesting binlogs, it will have a version but should
still be considered in a pending state because the ingesting txn has not
yet been committed.

This PR updates the condition for determining the pending state. If a
rowset is COMMITTED, the txn should be allowed to roll back even if a
version exists.
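
The condition change described above can be sketched as follows. This is a minimal illustration, not the actual txn_manager code: `RowsetState`, `RowsetInfo`, `is_pending_old`, and `is_pending_new` are hypothetical names, and the rule itself is one reading of the description.

```cpp
// Hypothetical, simplified stand-ins; these are not the real Doris types.
enum class RowsetState { PREPARED, COMMITTED, VISIBLE };

struct RowsetInfo {
    RowsetState state;
    bool has_version;
};

// Old rule as described: a rowset that has a version is never pending.
bool is_pending_old(const RowsetInfo& rs) {
    return !rs.has_version;
}

// Updated rule as described: a COMMITTED rowset still counts as pending,
// so its txn may roll back even though a version exists
// (the case of a rowset created by ingesting binlogs).
bool is_pending_new(const RowsetInfo& rs) {
    return !rs.has_version || rs.state == RowsetState::COMMITTED;
}
```

Under this sketch, an ingested rowset with a version and state COMMITTED is rejected by the old check and accepted by the new one.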

Cherry-pick #36551
2024-07-10 14:25:44 +08:00
5280e277e7 [chore](be) Acquire and check MD5 digest of the file to download (#37418)
Cherry-pick #35807, #36621, #36726
2024-07-08 18:55:35 +08:00
ceef9ee123 [feature](serde) support presto compatible output format (#37039) (#37253)
bp #37039
2024-07-04 13:56:05 +08:00
07278e9dcb [improvement](segmentcache) limit segment cache by memory or segment … (#37035)
…num (#37026)

pick #37026
2024-06-30 20:34:13 +08:00
f27ae8fa09 [fix](bitmap) incorrect type of BitmapValue with fastunion (#36834) (#36896) 2024-06-28 11:29:03 +08:00
0cff539810 [feature](function) support new function replace_empty (#36283) (#36656)
#36283
2024-06-21 16:46:22 +08:00
c8f2a3f952 [fix](eq_for_null) fix incorrect logic in function eq_for_null #36004 (#36124)
cherry pick from #36004
cherry pick from #36164
2024-06-21 14:31:21 +08:00
612f2ae961 [feature](api) add BE HTTP /api/load_streams (#36312) (#36338)
cherry-pick #36312
2024-06-16 22:09:04 +08:00
b75533e72b [branch-2.1](beut) fix BE UT (#36147)
only for branch-2.1
2024-06-12 08:21:38 +08:00
596a9a16d3 [chore](Compile) Fix segment cache ut's compile error due to miss cherry-pick (#36099) 2024-06-11 17:12:42 +08:00
a0f3c1cd1e [chore](Compile) Fix S3 file writer ut's compile error due to miss cherry-pick (#36037)
The S3 file writer's UT fails to compile; this PR attempts to fix it.
2024-06-08 22:21:20 +08:00
af779f5cd8 Pick "[fix](gclog) Skip tablet dir without schema hash dir in path gc (#32793)" (#35978)
## Proposed changes
Pick "[fix](gclog) Skip tablet dir without schema hash dir in path gc
(#32793)"
2024-06-06 22:24:30 +08:00
f80b856405 [enhancement](oom) return error when bloom filter allocate memory failed (#35790)
## Proposed changes

1. Return an error when the bloom filter fails to allocate memory.
2. Return an error when deserializing a block, since it may need a lot of memory.

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2024-06-03 18:22:11 +08:00
9dd573888a [bugfix](stdcallonce) replace std callonce with a lock because it is not exception safe (#35126) 2024-06-01 08:00:42 +08:00
9c270e5cdf [fix](delete) Fix unrecognized column name delete handler (#32429) (#35742)
pick doris-master #32429
2024-05-31 20:41:22 +08:00
680be6d19f [fix](ub) fix uninitialized accesses in BE (#35370)
ubsan hints:
```c++
/root/doris/be/src/olap/hll.h:93:29: runtime error: load of value 3078029312, which is not a valid value for type 'HllDataType'
/root/doris/be/src/olap/hll.h:94:23: runtime error: load of value 3078029312, which is not a valid value for type 'HllDataType'
/root/doris/be/src/runtime/descriptors.h:439:38: runtime error: load of value 118, which is not a valid value for type 'bool'
/root/doris/be/src/vec/exec/vjdbc_connector.cpp:61:50: runtime error: load of value 35, which is not a valid value for type 'bool' 
```
2024-05-29 20:31:07 +08:00
b91d2caab8 [Feature](iceberg-writer) Implements iceberg sink basic functionality for inserting into table. (#35587)
backport #34929
2024-05-29 16:40:54 +08:00
8fb28244d6 [improvement](page builder) avoid allocating big memory in ctor (#35493)
2024-05-29 15:03:54 +08:00
7058b31edd [fix](move-memtable) clear load streams before shutdown SegmentFileWriterThreadPool (#35217) 2024-05-28 13:12:03 +08:00
Pxl
b143f0dfe2 [Improvement](date) shortcut for str to date parse (#35288)
shortcut for str to date parse
2024-05-25 17:47:20 +08:00
639c7ee7fb [fix](decimalv2) fix scale of decimalv2 to string (#35222) (#35359)
* [fix](decimalv2) fix scale of decimalv2 to string
2024-05-24 17:20:43 +08:00
309503855e [Fix](bloom filter) Fix bloom filter memory leak (#34871)
Issue: Doris occasionally hits a state where memory usage becomes exceptionally high and does not decrease. The leaked memory is held by Bloom filters kept in memory.

Reason: The segment cache keeps segment objects read from files in memory. It is an LRU cache that evicts older segments when the number of segments exceeds the maximum, or when the total memory size of the cached segment objects exceeds the maximum usage. However, one code path first reads the segment object into memory (size A) and places it into the cache, which records its size as A. Only then does it read the segment's Bloom filter from the file (size B) and assign it to the segment's Bloom filter member variable, so the real size of the segment object becomes A+B. The cache never updates the recorded size, so the actual size of the cached object (A+B) is larger than the size the cache accounts for (A). As the number of cached segment objects grows, real memory usage surges, but the cache does not see its accounted size reaching the eviction limit and evicts nothing. The result is effectively a memory leak.

Solution: Each segment object reads its Bloom filter only once, so the fix is to reorder the steps. Instead of reading the segment, placing it into the cache, and then reading the Bloom filter, the code now reads the segment, reads the Bloom filter, and then places the segment into the cache.
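
The accounting problem and the reordering can be reproduced with a toy cache. The sketch below is an assumption-laden illustration, not the Doris segment cache: `Segment`, `ToySegmentCache`, `load_buggy`, and `load_fixed` are hypothetical names, and the cache evicts only by the byte charge recorded at insert time.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <utility>

// Hypothetical stand-in for a segment object; sizes are in bytes.
struct Segment {
    std::size_t base_bytes = 0;          // "A" in the description above
    std::size_t bloom_filter_bytes = 0;  // "B" in the description above
    std::size_t memory_bytes() const { return base_bytes + bloom_filter_bytes; }
};

// Toy LRU cache that evicts by the charge recorded when an entry is inserted.
class ToySegmentCache {
public:
    explicit ToySegmentCache(std::size_t capacity_bytes) : _capacity(capacity_bytes) {}

    void insert(std::shared_ptr<Segment> seg) {
        // The charge is captured once here and never updated afterwards.
        const std::size_t charge = seg->memory_bytes();
        _entries.emplace_front(std::move(seg), charge);
        _charged += charge;
        // Evict from the cold end, always keeping the newest entry.
        while (_charged > _capacity && _entries.size() > 1) {
            _charged -= _entries.back().second;
            _entries.pop_back();
        }
    }

    // Bytes the cache believes it holds.
    std::size_t charged_bytes() const { return _charged; }

    // Bytes the cached segments really occupy.
    std::size_t actual_bytes() const {
        std::size_t total = 0;
        for (const auto& e : _entries) total += e.first->memory_bytes();
        return total;
    }

private:
    std::size_t _capacity;
    std::size_t _charged = 0;
    std::list<std::pair<std::shared_ptr<Segment>, std::size_t>> _entries;
};

// Buggy order: cache the segment first, then attach the Bloom filter.
void load_buggy(ToySegmentCache& cache, std::size_t a, std::size_t b) {
    auto seg = std::make_shared<Segment>();
    seg->base_bytes = a;
    cache.insert(seg);            // charged as A only
    seg->bloom_filter_bytes = b;  // real size is now A+B, unseen by the cache
}

// Fixed order: attach the Bloom filter first, then cache the segment.
void load_fixed(ToySegmentCache& cache, std::size_t a, std::size_t b) {
    auto seg = std::make_shared<Segment>();
    seg->base_bytes = a;
    seg->bloom_filter_bytes = b;
    cache.insert(std::move(seg)); // charged as A+B
}
```

With a capacity of 100 bytes and five segments of A=10, B=90, the buggy order leaves the cache charging 50 bytes while holding 500, whereas the fixed order keeps both figures at 100.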
2024-05-24 16:23:58 +08:00
a6f7747d29 [feature](datatype) add BE config to allow zero date (#34961)
Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
2024-05-23 19:12:39 +08:00
c23384ff07 [fix](decimal) Fix long string casting to decimalv2 (#35121) 2024-05-22 14:32:29 +08:00
98f8eb5c43 [opt](split) get file splits in batch mode (#34032) (#35107)
bp  #34032
2024-05-21 22:27:07 +08:00
b4a798240a [fix](inverted_index) donot use int32_t for index id to avoid overflow (#35062) 2024-05-21 12:58:38 +08:00
e3e5f18f26 [Fix](Json type) correct cast result for json type (#34764) 2024-05-18 18:40:17 +08:00
eb7eaee386 [fix](function) money format (#34680) 2024-05-18 18:35:29 +08:00
1a24895257 [opt](routine-load) optimize routine load task thread pool and related param(#32282) (#34896) 2024-05-15 12:42:02 +08:00
95b05928fd [fix](compaction) fix time series compaction merge empty rowsets priority #34562 (#34765) 2024-05-14 09:10:09 +08:00
0ae1b9c70a [chore](remove code) Remove dragonbox related (#34528)
* Revert "[refactor](mysql result format) use new serde framework to tuple convert (#25006)"

This reverts commit e5ef0aa6d439c3f9b1f1fe5bc89c9ea6a71d4019.

* run buildall

* MORE

* FIX
2024-05-13 22:16:57 +08:00
32cbd4a583 [chore](status) unify error code between thrift,pb, status.h (#34397)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2024-05-10 14:41:01 +08:00
9b712b03b4 [FIX]fix is_ip_address_in_range func with const param (#34266) 2024-05-10 14:37:20 +08:00
8fdfbcb3c4 Revert "[Opt](func) opt the percentile func performance (#34373) (#34416)"
This reverts commit 509ae425e416b4779ae94eab9c2b21f9850e03c3.
2024-05-07 07:23:48 +08:00
f7900b53ce [enhancement](function) floor/ceil/round/round_bankers can use column as scale argument (#34391) 2024-05-06 22:18:36 +08:00
509ae425e4 [Opt](func) opt the percentile func performance (#34373) (#34416) 2024-05-06 20:10:35 +08:00
0f0c0a266b [opt](parquet)Skip page with offset index (#33082)
Make skip_page() in ColumnChunkReader more efficient: page headers are no longer read if the chunk has page locations.
2024-04-26 15:06:16 +08:00
c631f4f8a8 [fix](schema change) resolve the use count check of source logical column (#33932)
Fix error like:
```
8# google::LogMessageFatal::~LogMessageFatal() in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
 9# doris::vectorized::Block::clear_column_data(int) in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
10# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:514
11# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vfile_scanner.cpp:333
12# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:132
13# doris::vectorized::VScanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:99
```

If the logical converter is consistent, the source logical column is the destination logical column. Previously, the column reference was reset only after the conversion completed, but when an EOF occurred the function returned early, even though EOF is not a true error.
```
if (_logical_converter->is_consistent()) {
    // If logical converter is consistent, _src_logical_column is the final destination column,
    // other components will check the use count
    _src_logical_column.reset();
}
```
2024-04-22 12:31:46 +08:00
7e91e69eb9 [fix](compaction) fix single compaction (#33907)
* [fix](compaction)Fix single compaction to get all local versions #33849

add test and comment

* remove single replica compaction prepare input rowsets

revised
2024-04-19 23:30:25 +08:00
ffd9da44a2 [fix](move-memtable) fix commit may fail due to duplicated reports (#32403) 2024-04-19 15:02:49 +08:00
9b7af4c0cf [feature](schema change) unified schema change for parquet and orc reader (#32873)
Following #25138, this PR unifies the schema change interface for the Parquet and ORC readers; it can be applied to other format readers as well.
The unified schema change interface for all format readers works in two steps:
- First, read the data into the source column according to the column type in the file;
- Second, convert the source column to the destination column with the type planned by FE.
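
The two steps can be illustrated with a small standalone sketch. It assumes, purely for illustration, a file column stored as int32 and a destination type of string planned by FE; `read_source_column` and `convert_to_destination` are hypothetical helpers, not the reader interface introduced by this PR.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Step 1 (hypothetical): read values using the column type stored in the file.
// The input vector stands in for decoded file data.
std::vector<std::int32_t> read_source_column(const std::vector<std::int32_t>& file_values) {
    return file_values;
}

// Step 2 (hypothetical): convert the source column to the destination type,
// assumed here to be string.
std::vector<std::string> convert_to_destination(const std::vector<std::int32_t>& src) {
    std::vector<std::string> dst;
    dst.reserve(src.size());
    for (std::int32_t v : src) {
        dst.push_back(std::to_string(v));
    }
    return dst;
}
```

Keeping the read and the conversion separate is what lets the same conversion step be reused by readers of other formats.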
2024-04-12 15:09:25 +08:00
a4924dabb7 [enhancement](exception) enble exception logic in pipeline execute thread (#33437)
* [enhancement](exception) enble exception logic in pipeline execute thread

* f

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2024-04-12 15:09:25 +08:00
Pxl
5f30463bb3 [Chore](descriptors) remove unused codes for descriptors (#33408)
remove unused codes for descriptors
2024-04-12 15:09:25 +08:00
26d9082b9a [Feature](function) Add function strcmp (#33272) 2024-04-12 15:09:25 +08:00
31984bb4f0 [feature](function) support quote string function #33055 2024-04-12 15:09:25 +08:00