doris

Author	SHA1	Message	Date
yiguolei	42b5d17fa1	[refactor](remove non vec) remove column block and column view (#16022 ) * [refactor](remove non vec) remove column block and column view and column vectorized batch Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-18 12:40:53 +08:00
YueW	e579530c99	[Feature-WIP](inverted index) support use inverted index searcher cache (#16003 ) use inverted index searcher cache to improve query performance dependency pr: #14211 #15807 #15823	2023-01-18 09:30:55 +08:00
Tiewei Fang	1489e3cfbf	[Fix](file system) Make the constructor of `XxxFileSystem` a private method (#15889 ) Since Filesystem inherited std::enable_shared_from_this , it is dangerous to create native point of FileSystem. To avoid this behavior, making the constructor of XxxFileSystem a private method and using the static method create(...) to get a new FileSystem object.	2023-01-13 15:32:16 +08:00
Tiewei Fang	f17d69e450	[feature](file cache)Import `file cache` for remote file reader (#15622 ) The main purpose of this pr is to import `fileCache` for lakehouse reading remote files. Use the local disk as the cache for reading remote file, so the next time this file is read, the data can be obtained directly from the local disk. In addition, this pr includes a few other minor changes Import File Cache: 1. The imported `fileCache` is called `block_file_cache`, which uses lru replacement policy. 2. Implement a new FileRereader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`. Other changes: 1. Add a new interface `fs()` for `FileReader`. 2. `IOContext` adds some statistical information to count the situation of `FileCache` Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>	2023-01-10 12:23:56 +08:00
Xin Liao	cc7a9d92ad	[refactor](non-vec) remove non vec code for indexed column reader (#15409 )	2022-12-30 23:01:54 +08:00
Xin Liao	2f572ccc43	[fix](index) fix that the last element of each batch will be read repeatedly for binary prefix page (#15481 )	2022-12-30 15:36:55 +08:00
Jet He	75aa00d3d0	[Feature](NGram BloomFilter Index) add new ngram bloom filter index to speed up like query (#11579 ) This PR implement the new bloom filter index: NGram bloom filter index, which was proposed in #10733. The new index can improve the like query performance greatly, from our some test case , can get order of magnitude improve. For how to use it you can check the docs in this PR, and the index based on the ```enable_function_pushdown```, you need set it to ```true```, to make the index work for like query.	2022-12-28 18:01:50 +08:00
yiguolei	e640f49b6d	[refactor](non-vec) remove non vectorized predicate and row_block (#15348 ) remove non vectorized predicate and row_block	2022-12-25 21:45:00 +08:00
yiguolei	83a99a0f8b	[refactor](non-vec) Remove non vec code from be (#15278 ) * [refactor](removecode) remove some non-vectorization Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-22 23:28:30 +08:00
plat1ko	f3aea7f0f0	[Enhancement](status) Unify error code and enable customed err msg for BE internal errors (#14744 )	2022-12-11 23:33:18 +08:00
yixiutt	94a6ffb906	[feature](compaction) support vertical_compaction & ordered_data_compaction (#14524 )	2022-12-01 22:15:41 +08:00
Pxl	0e26f28bf2	[Enhancement](runtime-filter) enlarge runtime filter in predicate threshold (#13581 ) enlarge runtime filter in predicate threshold	2022-11-10 15:48:46 +08:00
Pxl	9d8b4bc176	[Enhancement](Dictionary-codec) update dict once on same segment (#13936 ) update dict once on same segment	2022-11-08 10:59:35 +08:00
pengxiangyu	eab8876abc	[Feature](remote) Using heavy schema change if the table is not enable light weight schema change (#13487 )	2022-10-28 15:48:22 +08:00
zhannngchen	4a5095f00d	[cleanup](config) remove unused config push_write_mbytes_per_sec (#13290 )	2022-10-12 15:58:04 +08:00
Adonis Ling	e7f18e998a	[chore](be-ut) Remove useless lines which cause compilation errors (#13053 )	2022-09-30 11:26:25 +08:00
Gabriel	5f7d6e8f2b	[Refactor](predicate) Unify Conditions and ColumnPredicate (#11985 )	2022-08-29 12:11:22 +08:00
plat1ko	db07e51cd3	[refactor](status) Refactor status handling in agent task (#11940 ) Refactor TaggableLogger Refactor status handling in agent task: Unify log format in TaskWorkerPool Pass Status to the top caller, and replace some OLAPInternalError with more detailed error message Status Premature return with the opposite condition to reduce indention	2022-08-29 12:06:01 +08:00
Gabriel	6e6269c682	[Improvement](load) accelerate streamload and compaction (#12119 ) * [Improvement](load) accelerate streamload and compaction	2022-08-28 23:10:47 +08:00
yiguolei	ccff3f5711	[bugfix](light weight schema change) support delete condition in schema change (#11869 ) * [bugfix](light weight schema change) support delete condition in schema change Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-08-26 11:45:55 +08:00
Xin Liao	12c4d1f4dd	[feature-wip](unique-key-merge-on-write) unique key table with MOW supports sequence column (#11808 )	2022-08-17 10:56:14 +08:00
pengxiangyu	e5c2bb9699	[fix](remote)Fix bug for Cache Reader (#11629 )	2022-08-12 13:40:32 +08:00
yiguolei	ea57bf6370	[refactor](delete predicate) Unify delete to segmentiterator (#11650 ) * remove seek columns and unify delete columns in rowset reader Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-08-11 15:12:43 +08:00
Gabriel	2068bf2dea	[Refactor](predicate) Use primitive type as template argument for predicate (#11647 )	2022-08-11 12:06:44 +08:00
yixiutt	0a5fd99d02	[feature-wip](unique-key-merge-on-write) speed up publish_txn (#11557 ) In our origin design, we calc delete bitmap in publish txn, and this operation will cost too much time as it will load segment data and lookup row key in pre rowset and segments.And publish version task should run in order, so it'll lead to timeout in publish_txn. In this pr, we seperate delete_bitmap calculation to tow part, one of it will be done in flush mem table, so this work can run parallel. And we calc final delete_bitmap in publish_txn, get a rowset_id set that should be included and remove rowsets that has been compacted, the rowset difference between memtable_flush and publish_txn is really small so publish_txn become very fast.In our test, publish_txn cost about 10ms. Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-08-08 18:57:55 +08:00
yiguolei	321107cb40	[refactor](schema change) Using tablet schema shared ptr instead of raw ptr (#11475 ) * Using tabletschema shared ptr instead of raw ptrs Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-08-05 11:04:38 +08:00
Lightman	b35daf0a04	[improvement](light-schema-change) Support tablet schema cache (#11131 )	2022-08-01 12:18:00 +08:00
Xin Liao	d4fb27125a	[feature-wip](unique-key-merge-on-write) row id conversion for compaction (#11149 )	2022-07-27 16:32:13 +08:00
Xin Liao	eab8382b4a	[feature-wip](unique-key-merge-on-write) add the implementation of primary key index update, DSIP-018 (#11057 )	2022-07-27 14:17:56 +08:00
plat1ko	0b6d2ae290	[fix] Move s3 fs connect outside the lock critical area (#11026 ) * fix potential bug of S3FileSystem * move s3 fs connect outside the lock critical area	2022-07-23 16:06:29 +08:00
Xinyi Zou	4960043f5e	[enhancement] Refactor to improve the usability of MemTracker (step2) (#10823 )	2022-07-21 17:11:28 +08:00
yiguolei	a2ed4b5c78	[improvement] improvement for light weight schema change (#10860 ) * improvement for dynamic schema not use schema as lru cache key any more. load segment just use the rowset's original schema not the current read schema. generate column reader and column iterator using the original schema, using the read schema if it is a new column. using column unique id as key instead of column ordinals. Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-07-18 17:53:31 +08:00
plat1ko	523d395527	[refactor] Remove alpha rowset meta (#10933 ) * remove alpha_rowset_meta * remove alpha rowset related codes in compaction * remove alpha rowset related codes in RowsetMeta * fix be ut because some ut use alpha rowsetmeta	2022-07-18 08:45:46 +08:00
plat1ko	3bc6655069	[refactor] remove BlockManager (#10913 ) * remove BlockManager * remove deprecated field in tablet meta	2022-07-17 14:10:06 +08:00
plat1ko	eec142ae90	[Enhancement] Use shared file reader when read a segment (#10896 ) * readers under a segment use a shared FileReader * no need to cache fd in LocalFileReader	2022-07-17 07:54:58 +08:00
Gabriel	3b46242483	[feature-wip] Optimize Decimal type (#10794 ) * [feature-wip](decimalv3) support decimalv3 * [feature-wip] Optimize Decimal type Co-authored-by: liaoxin <liaoxinbit@126.com>	2022-07-14 10:50:50 +08:00
Lightman	486cf0ebd4	[Feature] Lightweight schema change of add/drop column (#10136 ) * [Schema Change] support fast add/drop column (#49) * [feature](schema-change) support fast schema change. coauthor: yixiutt * [schema change] Using columns desc from fe to read data. coauthor: Lchangliang * [feature](schema change) schema change optimize for add/drop columns. 1.add uniqueId field for class column. 2.schema change for add/drop columns directly update schema meta Co-authored-by: yixiutt <yixiu@selectdb.com> Co-authored-by: SWJTU-ZhangLei <1091517373@qq.com> [Feature](schema change) fix write and add regression test (#69) Co-authored-by: yixiutt <yixiu@selectdb.com> [schema change] be ssupport that delete use newest schema add delete regression test fix regression case (#107) tmp [feature](schema change) light schema change exclude rollup and agg/uniq/dup key type. [feature](schema change) fe olapTable maxUniqueId write in disk. [feature](schema change) add rpc iface for sc add column. [feature](schema change) add columnsDesc to TPushReq for ligtht sc. resolve the deadlock when schema change (#124) fix columns from fe don't has bitmap_index flag (#134) add update/delete case construct MATERIALIZED schema from origin schema when insert fix not vectorized compaction coredump use segment cache choose newest schema by schema version when compaction (#182) [bugfix](schema change) fix ligth schema change problem. [feature](schema change) light schema change add alter job. (#1) fix be ut [bug] (schema change) unique drop key column should not light schema change [feature](schema change) add schema change regression-test. fix regression test [bugfix](schema change) fix multi alter clauses for light schema change. (#2) [bugfix](schema change) fix multi clauses calculate column unique id (#3) modify PushTask process (#217) [Bugfix](schema change) fix jobId replay cause bdbje exception. [bug](schema change) fix max col unique id repeatitive. (#232) [optimize](schema change) modify pendingMaxColUniqueId generate rule. fix compaction error * fix be ut * fix snapshot load core fix unique_id error (#278) [refact](fe) remove redundant code for light schema change. (#4) [refact](fe) remove redundant code for light schema change. (#4) format fe core format be core fix be ut modify fe meta version fix rebase error flush schema into rowset_meta in old table [refactor](schema change) refact fe light schema change. (#5) delete the change of schemahash and support get max version schema * modify for review * fix be ut * fix schema change test	2022-07-12 19:41:06 +08:00
zhannngchen	3b9cb524bc	[feature-wip](unique-key-merge-on-write) Add rowset tree, based on interval-tree, DSIP-018[3/3] (#10714 ) * port from rowset-tree from kudu * use shared_ptr * some update * add mock rowset * some compatibility update * fix ut fail * reformat code	2022-07-11 23:10:38 +08:00
plat1ko	331fa50501	[feature](cold-data) move cold data to object storage without losing any feature(BE) (#10280 ) This PR supports rowset level data upload on the BE side, so that there can be both cold data and hot data in a tablet, and there is no necessary to prohibit loading new data to cooled tablets. Each rowset is bound to a `FileSystem`, so that the storage layer can read and write rowsets without perceiving the underlying filesystem. The abstracted `RemoteFileSystem` can try local caching strategies with different granularity, instead of caching segment files as before. To avoid conflicts with the code in be/src/io, we temporarily put the file system related code in the be/src/io/fs directory. In the future, `FileReader`s and `FileWriter`s should be unified.	2022-07-08 12:18:39 +08:00
yiguolei	89e56ea67f	[refactor] remove alpha rowset related code and vectorized row batch related code (#10584 )	2022-07-05 20:33:34 +08:00
Zhengguo Yang	d259770b86	[Fix] avoid core dump cause by malformed bitmap type data (#10458 )	2022-06-30 11:27:22 +08:00
Jerry Hu	797f6e1472	[Enhancement]Decode bitshuffle data before adding it into PageCache (#10036 ) * [Enhancement]Decode bitshuffle data before add into PageCache * Fix be ut failed	2022-06-13 09:04:23 +08:00
Adonis Ling	f377c26bf7	[refactor][be] Optimize headers (#9708 )	2022-05-30 16:12:10 +08:00
HappenLee	d330bc3806	[Vectorized](stream-load-vec) Support stream load in vectorized engine (#8709 ) (#9280 ) Implement vectorized stream load. Added fe configuration option `enable_vectorized_load` to enable vectorized stream load. Co-authored-by: tengjp@outlook.com Co-authored-by: mrhhsg@gmail.com Co-authored-by: minghong.zhou@163.com Co-authored-by: HappenLee <happenlee@hotmail.com> Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>	2022-04-29 09:50:51 +08:00
Adonis Ling	bd126f0679	[improvement] Refactor type info for further optimizations. (#8786 ) ## Design: For now, there are two categories of types in Doris, one is for scalar types (such as int, char and etc.) and the other is for composite types (array and etc.). For the sake of performance, we can cache type info of scalar types globally (unique objects) due to the limited number of scalar types. When we consider the composite types, normally, the type info is generated in runtime (we can also use some cache strategy to speed up). The memory thereby should be reclaimed when we create type info for composite types. There are a lots of interfaces to get the type info of a specific type. I reorganized those as the following describes. 1. `const TypeInfo* get_scalar_type_info(FieldType field_type)` The function is used to get the type info of scalar types. Due to the cache, the caller uses the result WITHOUT considering the problems about memory reclaim. 2. `const TypeInfo* get_collection_type_info(FieldType sub_type)` The function is used to get the type info of array types with just ONE depth. Due to the cache, the caller uses the result WITHOUT considering the problems about memory reclaim. 3. `TypeInfoPtr get_type_info(segment_v2::ColumnMetaPB* column_meta_pb)` 4. `TypeInfoPtr get_type_info(const TabletColumn* col)` These functions are used to get the type info of BOTH scalar types and composite types. The caller should be responsible to manage the resources returned. #### About the new type `TypeInfoPtr` `TypeInfoPtr` is an alias type to `unique_ptr` with a custom deleter. 1. For scalar types, the deleter does nothing. 2. For composite types, the deleter reclaim the memory. By analyzing the callers of `get_type_info`, these classes should hold TypeInfoPtr: 1. `Field` 2. `ColumnReader` 3. `DefaultValueColumnIterator` Other classes are either constructed by the foregoing classes or hold those, so they can just use the raw pointer of `TypeInfo` directly for the sake of performance. 1. `ScalarColumnWriter` - holds `Field` 1. `ZoneMapIndexWriter` - created by `ScalarColumnWriter`, use `type_info` from the field in `ScalarColumnWriter` 1. `IndexedColumnWriter` - created by `ZoneMapIndexWriter`, only uses scalar types. 2. `BitmapIndexWriter` - created by `ScalarColumnWriter`, uses `type_info` from the field in `ScalarColumnWriter` 1. `IndexedColumnWriter` - created by `BitmapIndexWriter`, uses `type_info` in `BitmapIndexWriter` and `BitmapIndexWriter` doesn't support `ArrayType`. 3. `BloomFilterIndexWriter` - created by `ScalarColumnWriter`, uses `type_info` from the field in `ScalarColumnWriter` 1. `IndexedColumnWriter` - created by `BloomFilterIndexWriter`, only uses scalar types. 2. `IndexedColumnReader` initializes `type_info` by the field type in meta (only scalar types). 3. `ColumnVectorBatch` 1. `ZoneMapIndexReader` creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `IndexedColumnReader` 2. `BitmapIndexReader` supports scalar types only and it creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `BitmapIndexReader` 3. `BloomFilterIndexWriter` supports scalar types only and it creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `BloomFilterIndexWriter`	2022-04-20 14:47:29 +08:00
yiguolei	e5e0dc421d	[refactor] Change ALL OLAPStatus to Status (#8855 ) Currently, there are 2 status code in BE, one is common/Status.h, and the other is olap/olap_define.h called OLAPStatus. OLAPStatus is just an enum type, it is very simple and could not save many informations, I will unify these code to common/Status.	2022-04-14 11:43:49 +08:00
yiguolei	c872793a23	remove rowset converter since it is useless (#8974 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-04-13 10:40:12 +08:00
Zhengguo Yang	5a44eeaf62	[refactor] Unify all unit tests into one binary file (#8958 ) 1. solved the previous delayed unit test file size is too large (1.7G+) and the unit test link time is too long problem problems 2. Unify all unit tests into one file to significantly reduce unit test execution time to less than 3 mins 3. temporarily disable stream_load_test.cpp, metrics_action_test.cpp, load_channel_mgr_test.cpp because it will re-implement part of the code and affect other tests	2022-04-12 15:30:40 +08:00
caiconghui	4076c5466b	[refactor][improvement](type_info) use template and single instance to refactor get type info logic (#8680 ) 1. use const pointer instead of shared_ptr 2. Restrict array types to support only primitive types and nest up to 9 levels.	2022-04-03 10:10:36 +08:00
pengxiangyu	e63afc1a3c	[feature-wip](remote storage)(step2) add storage_backend_mgr on BE side (#8663 ) 1. add storage backend mgr 2. remove env_remote	2022-03-31 11:13:14 +08:00

1 2 3 4

172 Commits