Commit Graph

88 Commits

Author SHA1 Message Date
e412dd12e8 [chore](build) Use include-what-you-use to optimize includes (PART II) (#18761)
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
2023-04-19 23:11:48 +08:00
1f631c388d [enhance](cooldown)accelerate cooldown task produce efficiency (#16089) 2023-02-10 16:58:27 +08:00
4b6a4b3cf7 [refactor](remove unused code) Remove unused mempool declare or function params (#16222)
* Remove unused mempool declare or function params

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-30 13:03:18 +08:00
5eaa995704 [refactor](some mempool) not memset 0 in default value iterator (#16194)
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-29 22:50:39 +08:00
0b5e71d3b4 [refactor](refactor field) remove unused method (#16068) 2023-01-19 10:16:09 +08:00
d257059e6b [refactor](remove hadoop dpp) remove hadoop dpp code since it is not used (#16009) 2023-01-18 15:01:04 +08:00
bbdf40b6bd [Enhancement](Push Handle) use VParquetScanner in PushHandle (#15980)
* use VParquetScanner in PushHandle

* delete ParquetScanner
2023-01-17 16:21:04 +08:00
97fcad76f8 [enhancement](memtracker) Improve readability (#15716) 2023-01-16 16:30:35 +08:00
58c520dbfd [Feature](remote) Cooldown cold data to object storage only one replica (#15832) 2023-01-14 23:58:00 +08:00
9468711f9f [Bug](join) fix bug null aware left anti join not correct result (#15841) 2023-01-13 10:18:05 +08:00
fe5e5d2bf4 [refactor] separate agg and flush in memtable (#15713) 2023-01-11 10:07:34 +08:00
1018657d9d [Enhancement](SparkLoad): avoid BE OOM in push task, fix #15572 (#15620)
Release the memory pool held by the parquet reader once the data has been flushed by the rowset writer.
Co-authored-by: spaces-x <weixiang06@meituan.com>
2023-01-05 10:20:32 +08:00
ad68764977 [enhancement](tablet) Unify redundant create_rowset_writer methods (#15519)
* Remove redundant create_rowset_writer methods

* Set resource id when setting FS in rowset meta

* fix

* fix ut
2022-12-30 22:57:12 +08:00
b085ff49f0 [refactor](non-vec) delete non-vec data sink (#15283)
* [refactor](non-vec) delete non-vec data sink

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-12-23 14:10:47 +08:00
e9a201e0ec [refactor](non-vec) delete some non-vec exec node (#15239)
* [refactor](non-vec) delete some non-vec exec node
2022-12-22 14:05:51 +08:00
821c12a456 [chore](BE) remove all useless segment group related code #15193
The segment group is unused in the current codebase, so this change removes all related code from Doris. The related protobuf fields are marked as reserved to prevent any future use of them.
2022-12-20 17:11:47 +08:00
f3aea7f0f0 [Enhancement](status) Unify error code and enable customed err msg for BE internal errors (#14744) 2022-12-11 23:33:18 +08:00
58bc254529 [enhancement](BE)add metric for too many version (#14735)
* add one function to check whether the version limit is exceeded

add bvar to indicate that the version limit is exceeded

* resolve

* remove unnecessary header file
2022-12-05 11:37:14 +08:00
915d8989c5 [feature](spark-load)Spark load supports string type data import (#11927) 2022-08-22 08:56:59 +08:00
321107cb40 [refactor](schema change) Using tablet schema shared ptr instead of raw ptr (#11475)
* Using tabletschema shared ptr instead of raw ptrs


Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-08-05 11:04:38 +08:00
003335c1c5 [refactor](schema change) spark dpp need not call convert rowset during load process (#11397)
* remove unused schema change logic in push handler

Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-08-02 10:18:00 +08:00
b35daf0a04 [improvement](light-schema-change) Support tablet schema cache (#11131) 2022-08-01 12:18:00 +08:00
4960043f5e [enhancement] Refactor to improve the usability of MemTracker (step2) (#10823) 2022-07-21 17:11:28 +08:00
82251a6bab [refactor] some refactor of delete predicates (#10816) 2022-07-15 14:13:34 +08:00
486cf0ebd4 [Feature] Lightweight schema change of add/drop column (#10136)
* [Schema Change] support fast add/drop column  (#49)

* [feature](schema-change) support fast schema change. coauthor: yixiutt

* [schema change] Using columns desc from fe to read data. coauthor: Lchangliang

* [feature](schema change) schema change optimize for add/drop columns.

1.add uniqueId field for class column.
2.schema change for add/drop columns directly update schema meta

Co-authored-by: yixiutt <yixiu@selectdb.com>
Co-authored-by: SWJTU-ZhangLei <1091517373@qq.com>

[Feature](schema change) fix write and add regression test (#69)

Co-authored-by: yixiutt <yixiu@selectdb.com>

[schema change] be supports delete using the newest schema

add delete regression test

fix regression case (#107)

tmp

[feature](schema change) light schema change exclude rollup and agg/uniq/dup key type.

[feature](schema change) fe olapTable maxUniqueId write in disk.

[feature](schema change) add rpc iface for sc add column.

[feature](schema change) add columnsDesc to TPushReq for light sc.

resolve the deadlock when schema change (#124)

fix columns from fe not having the bitmap_index flag (#134)

add update/delete case

construct MATERIALIZED schema from origin schema when insert

fix not vectorized compaction coredump

use segment cache

choose newest schema by schema version when compaction (#182)

[bugfix](schema change) fix light schema change problem.

[feature](schema change) light schema change add alter job. (#1)

fix be ut

[bug] (schema change) unique drop key column should not use light schema change

[feature](schema change) add schema change regression-test.

fix regression test

[bugfix](schema change) fix multi alter clauses for light schema change. (#2)

[bugfix](schema change) fix multi clauses calculate column unique id (#3)

modify PushTask process (#217)

[Bugfix](schema change) fix jobId replay cause bdbje exception.

[bug](schema change) fix max col unique id repetitive. (#232)

[optimize](schema change) modify pendingMaxColUniqueId generate rule.

fix compaction error
* fix be ut

* fix snapshot load core

fix unique_id error (#278)

[refact](fe) remove redundant code for light schema change. (#4)

format fe core

format be core

fix be ut

modify fe meta version

fix rebase error

flush schema into rowset_meta in old table

[refactor](schema change) refact fe light schema change. (#5)

delete the change of schemahash and support get max version schema

* modify for review

* fix be ut

* fix schema change test
2022-07-12 19:41:06 +08:00
89e56ea67f [refactor] remove alpha rowset related code and vectorized row batch related code (#10584) 2022-07-05 20:33:34 +08:00
Pxl
5805f8077f [Feature] [Vectorized] Some pre-refactorings or interface additions for schema change part2 (#10003) 2022-06-16 10:50:08 +08:00
2c79d223e4 [refactor][rowset]move rowset writer to a single place (#9368) 2022-05-19 23:57:02 +08:00
4cd579b155 [refactor] Check status precise_code instead of construct OLAPInternalError (#9514)
* check status precise_code instead of construct OLAPInternalError
* move is_io_error to Status
2022-05-12 15:39:29 +08:00
bd126f0679 [improvement] Refactor type info for further optimizations. (#8786)
## Design:

Currently, there are two categories of types in Doris: scalar types (such as int and char) and composite types (such as array). For performance, the type info of scalar types can be cached globally (as unique objects) because the number of scalar types is limited. For composite types, the type info is normally generated at runtime (a cache strategy could also be used to speed this up), so its memory must be reclaimed when type info is created for composite types.

There are many interfaces for getting the type info of a specific type. I reorganized them as follows.
1. `const TypeInfo* get_scalar_type_info(FieldType field_type)`
    This function gets the type info of scalar types. Because of the cache, the caller uses the result **WITHOUT** having to consider memory reclamation.
2. `const TypeInfo* get_collection_type_info(FieldType sub_type)`
    This function gets the type info of array types with just **ONE** level of depth. Because of the cache, the caller uses the result **WITHOUT** having to consider memory reclamation.
3. `TypeInfoPtr get_type_info(segment_v2::ColumnMetaPB* column_meta_pb)`
4. `TypeInfoPtr get_type_info(const TabletColumn* col)`
    These functions get the type info of **BOTH** scalar types and composite types. The caller is responsible for managing the returned resources.

#### About the new type `TypeInfoPtr`
`TypeInfoPtr` is an alias for `unique_ptr` with a custom deleter.
1. For scalar types, the deleter does nothing.
2. For composite types, the deleter reclaims the memory.
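
The deleter behavior described above can be sketched as follows. This is only an illustration, not the actual Doris code: `TypeInfo` here is a simplified stand-in, and `TypeInfoDeleter`, `make_scalar_type_info`, `make_composite_type_info`, and `live_count` are hypothetical names introduced for the sketch.

```cpp
#include <cassert>
#include <memory>

// Simplified stand-in for the real TypeInfo class (assumption for this sketch).
struct TypeInfo {
    explicit TypeInfo(bool scalar) : is_scalar(scalar) { ++live_count(); }
    ~TypeInfo() { --live_count(); }
    // Number of TypeInfo objects currently alive; only used to observe reclaim.
    static int& live_count() {
        static int count = 0;
        return count;
    }
    bool is_scalar;
};

// Hypothetical deleter: frees the object only when it was created at runtime.
// Cached scalar objects are owned by the cache, so the deleter leaves them alone.
struct TypeInfoDeleter {
    TypeInfoDeleter(bool owns_object = false) : owns(owns_object) {}
    void operator()(const TypeInfo* p) const {
        assert(p != nullptr);  // unique_ptr only invokes the deleter for non-null pointers
        if (owns) delete p;
    }
    bool owns;
};

using TypeInfoPtr = std::unique_ptr<const TypeInfo, TypeInfoDeleter>;

// Scalar path: hand out a pointer to one cached object; nothing to reclaim.
inline TypeInfoPtr make_scalar_type_info() {
    static const TypeInfo cached(true);
    return TypeInfoPtr(&cached, TypeInfoDeleter(false));
}

// Composite path: allocate at runtime; the deleter reclaims the memory.
inline TypeInfoPtr make_composite_type_info() {
    return TypeInfoPtr(new TypeInfo(false), TypeInfoDeleter(true));
}
```

With this shape, callers treat both kinds of type info uniformly: destroying a `TypeInfoPtr` is always safe, and only runtime-created objects are actually freed.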

By analyzing the callers of `get_type_info`, these classes should hold TypeInfoPtr:
1. `Field`
2. `ColumnReader`
3. `DefaultValueColumnIterator`

Other classes are either constructed by the foregoing classes or hold those, so they can just use the raw pointer of `TypeInfo` directly for the sake of performance.
1. `ScalarColumnWriter` - holds `Field`
    1. `ZoneMapIndexWriter` - created by `ScalarColumnWriter`, uses `type_info` from the field in `ScalarColumnWriter`
        1. `IndexedColumnWriter` - created by `ZoneMapIndexWriter`, only uses scalar types.
    2. `BitmapIndexWriter` - created by `ScalarColumnWriter`, uses `type_info` from the field in `ScalarColumnWriter`
        1. `IndexedColumnWriter` - created by `BitmapIndexWriter`, uses `type_info` in `BitmapIndexWriter` and  `BitmapIndexWriter` doesn't support `ArrayType`.
    3. `BloomFilterIndexWriter` - created by `ScalarColumnWriter`, uses `type_info` from the field in `ScalarColumnWriter`
        1.  `IndexedColumnWriter` - created by `BloomFilterIndexWriter`, only uses scalar types.
2. `IndexedColumnReader` initializes `type_info` by the field type in meta (only scalar types).
3. `ColumnVectorBatch`
    1. `ZoneMapIndexReader` creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in  `IndexedColumnReader`
    2. `BitmapIndexReader` supports scalar types only and it creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `BitmapIndexReader`
    3. `BloomFilterIndexWriter` supports scalar types only and it creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `BloomFilterIndexWriter`
2022-04-20 14:47:29 +08:00
e5e0dc421d [refactor] Change ALL OLAPStatus to Status (#8855)
Currently, there are two kinds of status code in BE: one is Status in common/Status.h,
and the other is OLAPStatus in olap/olap_define.h.
OLAPStatus is just an enum type; it is very simple and cannot carry much information,
so this change unifies the code on common/Status.
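
To illustrate why a bare enum carries less information than a status object, here is a minimal sketch. The names `LegacyOlapStatus`, `from_legacy`, and this small `Status` class are hypothetical and much simpler than the real classes in BE; they only show the difference in what each form can hold.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Enum-only style: a bare code with no room for context (hypothetical values).
enum LegacyOlapStatus { LEGACY_OK = 0, LEGACY_IO_ERROR = -1 };

// Class style: the same code plus a message describing what went wrong.
class Status {
public:
    static Status OK() { return Status(0, ""); }
    static Status Error(int code, std::string msg) {
        assert(code != 0);  // zero is reserved for success
        return Status(code, std::move(msg));
    }
    bool ok() const { return _code == 0; }
    int code() const { return _code; }
    const std::string& message() const { return _msg; }
    std::string to_string() const {
        if (ok()) return std::string("OK");
        return "[" + std::to_string(_code) + "] " + _msg;
    }

private:
    Status(int code, std::string msg) : _code(code), _msg(std::move(msg)) {}
    int _code;
    std::string _msg;
};

// Hypothetical conversion from the enum to the richer type.
inline Status from_legacy(LegacyOlapStatus s, const std::string& context) {
    if (s == LEGACY_OK) return Status::OK();
    return Status::Error(static_cast<int>(s), context);
}
```

A caller that previously could only compare an enum value can now also log `to_string()` and see the context that was attached where the error occurred.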
2022-04-14 11:43:49 +08:00
290366787c [refactor] refactor code, replace some file with stl libs (#8759)
1. replace ConditionVariables with std::condition_variable
2. replace Mutex with std::mutex
3. replace MonoTime with std::chrono
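
A small sketch of what code looks like after such a replacement, using only the standard library. `GuardedCounter` and `Deadline` are hypothetical classes made up for this example (they are not taken from the Doris codebase), and `std::condition_variable` is left out to keep the sketch short.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical counter guarded by std::mutex (in place of a custom Mutex class).
class GuardedCounter {
public:
    void add(std::int64_t delta) {
        std::lock_guard<std::mutex> guard(_lock);  // unlocks when guard leaves scope
        _value += delta;
    }
    std::int64_t value() const {
        std::lock_guard<std::mutex> guard(_lock);
        return _value;
    }

private:
    mutable std::mutex _lock;
    std::int64_t _value = 0;
};

// Hypothetical deadline helper built on std::chrono (in place of MonoTime).
class Deadline {
public:
    explicit Deadline(std::chrono::milliseconds budget)
            : _end(std::chrono::steady_clock::now() + budget) {
        assert(budget >= std::chrono::milliseconds::zero());
    }
    bool expired() const { return std::chrono::steady_clock::now() >= _end; }
    // Time left before the deadline, clamped to zero once it has passed.
    std::chrono::milliseconds remaining() const {
        auto left = _end - std::chrono::steady_clock::now();
        if (left <= std::chrono::steady_clock::duration::zero()) {
            return std::chrono::milliseconds::zero();
        }
        return std::chrono::duration_cast<std::chrono::milliseconds>(left);
    }

private:
    std::chrono::steady_clock::time_point _end;
};
```

`std::chrono::steady_clock` is used because it is monotonic, which matches the purpose of a monotonic time helper.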
2022-04-13 09:55:29 +08:00
bd0a3369b7 [fix] check disk capacity before writing data (#8887)
1. We forgot to check disk capacity when writing data.
2. TODO: the user specified disk capacity is not used now. We need to find a way to use it.
3. Avoid printing too many compaction logs when there is no suitable version for compaction.
2022-04-08 11:29:49 +08:00
4076c5466b [refactor][improvement](type_info) use template and single instance to refactor get type info logic (#8680)
1. use const pointer instead of shared_ptr
2. Restrict array types to support only primitive types and nest up to 9 levels.
2022-04-03 10:10:36 +08:00
c69dd54116 [refactor](mutex) Use std::mutex to replace Mutex and refactor some lock logic (#8452) 2022-03-24 14:50:02 +08:00
eeae516e37 [Feature](Memory) Hook TCMalloc new/delete automatically counts to MemTracker (#8476)
Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G

Implement a new way of memory statistics based on TCMalloc New/Delete Hook,
MemTracker and TLS, and it is expected that all memory new/delete/malloc/free
of the BE process can be counted.
2022-03-20 23:06:54 +08:00
e17aef9467 [refactor] refactor the implement of MemTracker, and related usage (#8322)
Modify the implementation of MemTracker:
1. Remove a lot of unnecessary logic;
2. Add MemTrackerTaskPool as the ancestor of all query and import trackers; it is used to track the local memory usage of all executing tasks;
3. Add a consume/release cache that triggers a consume/release when the accumulated memory exceeds the parameter mem_tracker_consume_min_size_bytes;
4. Add a new memory leak detection mode (experimental feature), controlled by the parameter memory_leak_detection: throw an exception when the remaining statistical value is greater than the specified range at the time the MemTracker is destructed, and print the accurate statistical value in HTTP;
5. Add Virtual MemTracker, whose consume/release is not synced to its parent. It will be used later, when TCMalloc Hook is introduced to record memory, to record the specified memory independently;
6. Modify the GC logic, register the buffer cached in DiskIoMgr as a GC function, and add other GC functions later;
7. Change the global root node from Root MemTracker to Process MemTracker, and remove Process MemTracker in exec_env;
8. Modify the macro that detects whether the memory has reached the upper limit, modify the parameters and default behavior of creating MemTracker, modify the error message format in mem_limit_exceeded, extend and apply transfer_to, remove Metric in MemTracker, etc.;
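
The consume/release cache from item 3 can be pictured with the sketch below. It is an assumption-laden illustration, not the real MemTracker: `Tracker` and `CachedConsumer` are hypothetical types, `min_size_bytes` stands in for mem_tracker_consume_min_size_bytes, and the choice to flush once the absolute accumulated value reaches the threshold is a guess at the exact rule.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shared tracker; `updates` counts how often it is touched.
struct Tracker {
    std::int64_t consumption = 0;
    int updates = 0;
    void consume(std::int64_t bytes) {
        consumption += bytes;
        ++updates;
    }
};

// Batches small deltas locally and forwards them to the tracker only when
// the accumulated amount reaches the configured minimum size.
class CachedConsumer {
public:
    CachedConsumer(Tracker* tracker, std::int64_t min_size_bytes)
            : _tracker(tracker), _min_size_bytes(min_size_bytes) {
        assert(tracker != nullptr && min_size_bytes > 0);
    }

    void consume(std::int64_t bytes) {
        _untracked += bytes;
        if (_untracked >= _min_size_bytes || _untracked <= -_min_size_bytes) {
            flush();
        }
    }
    void release(std::int64_t bytes) { consume(-bytes); }

    // Push whatever is cached to the tracker, e.g. before reading statistics.
    void flush() {
        if (_untracked != 0) {
            _tracker->consume(_untracked);
            _untracked = 0;
        }
    }
    std::int64_t untracked() const { return _untracked; }

private:
    Tracker* _tracker;
    std::int64_t _min_size_bytes;
    std::int64_t _untracked = 0;
};
```

The trade-off is accuracy for speed: the shared tracker lags behind by at most the threshold per consumer, while most small allocations avoid touching shared state.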

Modify where MemTracker is used:
1. MemPool adds a constructor to create a temporary tracker to avoid a lot of redundant code;
2. Added trackers for global objects such as ChunkAllocator and StorageEngine;
3. Added more fine-grained trackers such as ExprContext;
4. RuntimeState removes FragmentMemTracker, that is, PlanFragmentExecutor mem_tracker, which was previously used for independent statistical scan process memory, and replaces it with _scanner_mem_tracker in OlapScanNode;
5. MemTracker is no longer recorded in ReservationTracker, and ReservationTracker will be removed later;
2022-03-11 22:04:23 +08:00
c86d469baf [Refactor](storage_engine) Use std::shared_mutex to replace RWMutex (#8387) 2022-03-11 18:14:24 +08:00
b40e9144cb [feature-wip][array-type] Refactor type info for nested array. (#8279) 2022-03-02 14:20:39 +08:00
a8a5c0a6a8 [improvement](load) memory usage optimization for load job (#7454)
Reduce memory usage when loading unqualified data
2021-12-24 21:30:28 +08:00
20ef8a6e21 [feature-wip](remote storage)(step1) use a struct instead of string for parameter path, add basic remote method (#7098)
First, we need a parameter to describe whether the data is local or remote.
Then, we need some basic functions to support operations on remote storage.
2021-12-22 22:58:23 +08:00
6c4aeab06f [fix](broker-load) BE may crash when using preceding filter in broker or routine load (#7193)
The broker scan node has two tuple descriptors:
one is the dest tuple and the other is the src tuple.
The src tuple is used to read the lines of the original file,
and the dest tuple is used to save the converted lines.
The preceding filter is executed on the src tuple, so the src tuple descriptor should be used
to initialize the filter expression.
2021-11-30 22:04:05 +08:00
6c6380969b [refactor] replace boost smart ptr with stl (#6856)
1. replace all boost::shared_ptr with std::shared_ptr
2. replace all boost::scoped_ptr with std::unique_ptr
3. replace all boost::scoped_array with std::unique_ptr<T[]>
4. replace all boost::thread with std::thread
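
For reference, the standard-library counterparts of the first three items look roughly like this. `Buffer` and the three helper functions are hypothetical names for the sketch; `std::thread` is omitted to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <utility>

// Hypothetical payload type used only for this example.
struct Buffer {
    explicit Buffer(std::size_t n) : size(n) {}
    std::size_t size;
};

// Shared ownership: std::shared_ptr in place of boost::shared_ptr.
inline std::shared_ptr<Buffer> make_shared_buffer(std::size_t n) {
    return std::make_shared<Buffer>(n);
}

// Sole ownership of one object: std::unique_ptr in place of boost::scoped_ptr.
inline std::unique_ptr<Buffer> make_owned_buffer(std::size_t n) {
    return std::unique_ptr<Buffer>(new Buffer(n));
}

// Sole ownership of an array: std::unique_ptr<T[]> in place of
// boost::scoped_array; delete[] is called automatically.
inline std::unique_ptr<int[]> make_sequence(std::size_t n) {
    assert(n > 0);
    std::unique_ptr<int[]> values(new int[n]);
    std::iota(values.get(), values.get() + n, 0);  // fill with 0, 1, ..., n-1
    return values;
}
```

One behavioral difference worth keeping in mind: unlike `boost::scoped_ptr`, a `std::unique_ptr` can be moved, so ownership can be transferred out of the scope that created it.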
2021-11-17 10:18:35 +08:00
4f744333c2 fix some core in local test: (#6594)
1. inserting a very large string value may cause a coredump
2. some analytic function and agg function results may be incorrect
3. string comparison may coredump when the string value is too large
4. string type in delete condition is not processed correctly
5. add text/blob as aliases of string for compatibility with MySQL
6. fix string type min/max agg that may be processed incorrectly
2021-09-10 09:52:03 +08:00
c65ec3136b [Improvement] spark load without agg and de/serialization (#6270)
fix #6269 

In outline, our changes reduce memory usage to avoid OOM in BE and speed up the calculation.
1. We do not need to do Aggregation in load, which has already been done in the ETL spark job.
2. Based on 1, we do not need to serialize/deserialize bitmap/HLL objects.
2021-08-19 14:15:01 +08:00
8738ce380b Add long text type STRING, with a maximum length of 2GB. Usage is similar to varchar, and there is no guarantee for the performance of storing extremely long data (#6391) 2021-08-18 09:05:40 +08:00
c6aa37f5ef [Alter] Support doing compaction for tablets under alter operation (#6365)
The problem I want to solve is described in #6355.
This CL mainly changes:

1. Support compacting tablets under alter operations

   On the BE side, the compaction logic will select tablets whose state is "TABLET_NOTREADY" to do cumulative compaction.

2. Remove "alter_task" field in tablet's meta on BE side.

   The "alter_task" field has not been used for a long time.

3. Support doing delete operation when table is doing alter operation.

   Previously, when a table is doing alter operation, execution of delete will return error: Table's state is not NORMAL.
   But now, delete can be executed successfully only if the condition column is not under schema change.
   And delete condition will be applied to all materialized indexes.
2021-08-07 21:32:26 +08:00
b423274f17 [Enhance] Make MemTracker more accurate (#5515) (#5516)
* [Enhance] Make MemTracker more accurate (#5515)
 This PR is mainly about:
 1. Improve the readability of MemTrackers' name
 2. Add the MemTracker of:
    * Load
    * Compaction
    * SchemaChange
    * StoragePageCache
    * TabletManager
 3. Change SchemaChange to a Singleton

* revise some code for Code Review

* change the name of mem_tracker

* keep reader_context with the same lifetime as rowset_reader in schema change.

* change vlog notice to log(warning) in schema change
2021-04-08 09:14:55 +08:00
d641a26490 [Refactor] Remove boost filesystem (#5579)
* use std::filesystem instead of boost
Co-authored-by: Mingyu Chen <morningman.cmy@gmail.com>
2021-04-08 09:11:59 +08:00
780900ac9c [Feature] Support preceding filter original data when loading (#5338)
Support conditional filtering of original data in broker load and routine load
eg:

```
LOAD LABEL `label1`
(
DATA INFILE ('bos://cmy-repo/1.csv')
INTO TABLE tbl2
COLUMNS TERMINATED BY '\t'
(event_day, product_id, ocpc_stage, user_id)
SET (
	ocpc_stage = ocpc_stage + 100
)
PRECEDING FILTER user_id = 1381035
WHERE ocpc_stage > 30
)
...
```
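
The effect of the statement above on each row can be sketched in C++ as follows. This is a hedged illustration only: `Row` and `load_rows` are hypothetical, and the sketch assumes the preceding filter is evaluated on the original row before the SET conversion, while WHERE is evaluated afterwards on the converted row.

```cpp
#include <cassert>
#include <vector>

// Hypothetical row layout matching the column list in the example.
struct Row {
    long event_day;
    long product_id;
    long ocpc_stage;
    long user_id;
};

inline std::vector<Row> load_rows(const std::vector<Row>& source) {
    std::vector<Row> loaded;
    for (const Row& src : source) {
        // PRECEDING FILTER user_id = 1381035: evaluated on the original row.
        if (src.user_id != 1381035) continue;

        // SET (ocpc_stage = ocpc_stage + 100): build the converted row.
        Row dest = src;
        dest.ocpc_stage = src.ocpc_stage + 100;

        // WHERE ocpc_stage > 30: evaluated on the converted row.
        if (dest.ocpc_stage > 30) loaded.push_back(dest);
    }
    assert(loaded.size() <= source.size());  // filtering never adds rows
    return loaded;
}
```

Under these assumptions, a row with `user_id = 42` is dropped before any conversion happens, and a row with `ocpc_stage = -80` passes the preceding filter but is dropped by WHERE because its converted value is 20.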
2021-02-07 22:37:48 +08:00