304 Commits

Author SHA1 Message Date
73c3e3ab55 [Feature](x-load) support config min replica num for loading data (#21118) 2023-10-11 21:07:35 +08:00
5a55e47acd [Enhancement](Load) stream tvf support two phase commit (#23800) 2023-10-09 14:15:56 +08:00
6fe060b79e [fix](streamload) fix http_stream retry mechanism (#24978)
If a failure occurs, Doris may retry the request. Because ctx->is_read_schema is a global variable that was not reset in time, the retry could cause exceptions.


---------

Co-authored-by: yiguolei <676222867@qq.com>
2023-10-08 11:16:21 +08:00
642e5cdb69 [Fix](Status) Make Status [[nodiscard]] and handle returned Status correctly (#23395) 2023-09-29 22:38:52 +08:00
452318a9fc [Enhancement](streamload) stream tvf support user specified label (#24219)
stream tvf support user specified label
example:

curl -v --location-trusted -u root: -H "sql: insert into test.t1 WITH LABEL label1 select c1,c2 from http_stream(\"format\" = \"CSV\", \"column_separator\" = \",\")" -T example.csv http://127.0.0.1:8030/api/_http_stream
return:

{
    "TxnId": 2064,
    "Label": "label1",
    "Comment": "",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 2,
    "NumberLoadedRows": 2,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 27,
    "LoadTimeMs": 152,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 83,
    "ReadDataTimeMs": 92,
    "WriteDataTimeMs": 41,
    "CommitAndPublishTimeMs": 24
}
2023-09-27 12:09:35 +08:00
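A client that sends the request from commit 452318a9fc will usually want to check the reply before trusting the load. The following is a minimal sketch of that check, using only the field names visible in the sample reply above. The helper name `summarize_load_result` is hypothetical, and treating the string "true" as the enabled value of `TwoPhaseCommit` is an assumption, since the sample only shows "false".

```python
import json


def summarize_load_result(body: str) -> dict:
    """Parse a stream load reply and raise if the load did not succeed."""
    reply = json.loads(body)
    if reply.get("Status") != "Success":
        raise RuntimeError(
            f"load {reply.get('Label')!r} failed: {reply.get('Message')}"
        )
    return {
        "label": reply["Label"],
        "txn_id": reply["TxnId"],
        "loaded_rows": reply["NumberLoadedRows"],
        # The sample reply carries this flag as a string, not a JSON boolean.
        # Assumption: "true" is the value used when two-phase commit is on.
        "two_phase_commit": reply["TwoPhaseCommit"] == "true",
    }


# Trimmed copy of the reply shown in the commit message.
sample = """
{
    "TxnId": 2064,
    "Label": "label1",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 2,
    "NumberLoadedRows": 2
}
"""

print(summarize_load_result(sample))
```

Checking `Status` rather than the HTTP status code is deliberate in this sketch: the reply body is what carries the load outcome in the example above.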
55d1090137 [feature](insert) Support group commit stream load (#24304) 2023-09-26 20:57:02 +08:00
082bcd820b [feature](insert) Support wal for group commit insert (#23053) 2023-09-26 14:46:24 +08:00
8d4fd76a16 [Feature](StreamLoad2PC) Support commit and abort streamload2PC by label (#24613) 2023-09-25 22:21:27 +08:00
8679095e5c [feature](debug) support debug point used in debug code (#24502) 2023-09-25 17:56:12 +08:00
58ab25ccaa Revert "[Feature](merge-on-write)Support ignore mode for merge-on-write unique table (#21773)" (#24731)
This reverts commit 3ee89aea35726197cb7e94bb4f2c36bc9d50da84.
2023-09-21 21:01:28 +08:00
9c681692bd Revert "[fix] fix http_stream retry mechanism (#23969)" (#24407)
This reverts commit 05e365ea137eb8c92b8e7eedc7d1435e83f065ae.
2023-09-15 10:07:53 +08:00
05e365ea13 [fix] fix http_stream retry mechanism (#23969)
Co-authored-by: yiguolei <676222867@qq.com>
2023-09-14 21:41:11 +08:00
927de33166 [config](log) disable StreamLoad log default and enable in regression pipeline (#24354)
disable StreamLoad log default and enable in regression pipeline
2023-09-14 20:47:26 +08:00
3ee89aea35 [Feature](merge-on-write)Support ignore mode for merge-on-write unique table (#21773) 2023-09-14 18:03:51 +08:00
86aa3802cf [log](config) set streamload record default to enable 2023-09-13 16:32:30 +08:00
c7ae2a7d22 [Refactor & Bugfix](static variables) move some static vairables to exec_env (#24029) 2023-09-13 09:27:03 +08:00
f9a75b5c4f [feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352)
1. append csv serde for serialize to csv and deserialize from csv.
2. let csvReader use csv serde not text_converter.
2023-09-10 00:16:21 +08:00
537369f4e2 [Fix](http) Fix curl return HTTP_ERROR && Add not_found HttpClientTest, fix (#23984)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2023-09-07 10:10:51 +08:00
a6dff2faf0 [Feature](config) allow update multiple be configs in one request (#23702) 2023-09-02 14:26:54 +08:00
25b6e4deb2 [fix](daemon) Fix incorrect initialization order of daemon services (#23578)
Current initialization dependency:

      Daemon ───┬──► StorageEngine ──► ExecEnv ──► Disk/Mem/CpuInfo
                │
                │
BackendService ─┘
However, the original code incorrectly initialized Daemon before StorageEngine.
This PR also stops and joins the threads of daemon services in their destructors, so that daemon services release their resources in reverse order of initialization via RAII.
2023-08-31 19:46:38 +08:00
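The ordering rule described in commit 25b6e4deb2 (initialize dependencies first, release in reverse) can be illustrated outside C++. The sketch below is a Python analogue, not the Doris code: `contextlib.ExitStack` plays the role of RAII, and the service names are taken from the diagram above only as labels. The `service` and `run_backend` helpers are hypothetical.

```python
from contextlib import ExitStack, contextmanager


def run_backend() -> list:
    """Start services in dependency order and record the teardown order."""
    events = []

    @contextmanager
    def service(name):
        events.append(f"start {name}")
        try:
            yield name
        finally:
            # Runs when the ExitStack unwinds, like a C++ destructor.
            events.append(f"stop {name}")

    # Dependencies come first: Daemon needs StorageEngine, which needs
    # ExecEnv, which needs Disk/Mem/CpuInfo.
    init_order = ["Disk/Mem/CpuInfo", "ExecEnv", "StorageEngine", "Daemon"]
    with ExitStack() as stack:
        for name in init_order:
            stack.enter_context(service(name))
        events.append("serving")
    return events


for event in run_backend():
    print(event)
```

Because the stack unwinds last-in, first-out, Daemon is stopped before StorageEngine without any explicit shutdown ordering in the caller.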
Pxl
f35ab37e1e [Bug](materialized-view) fix load db use analyzer to analyze diffrent metaindex (#23673)
fix load db use analyzer to analyze diffrent metaindex
2023-08-31 12:35:38 +08:00
05771e8a14 [Enhancement](Load) stream Load using SQL (#23362)
Using stream load in SQL mode

for example:
example.csv

10000,北京
10001,天津
curl -v --location-trusted -u root: -H "sql: insert into test.t1(c1, c2) select c1,c2 from stream(\"format\" = \"CSV\", \"column_separator\" = \",\")" -T example.csv http://127.0.0.1:8030/api/_stream_load_with_sql
curl -v --location-trusted -u root: -H "sql: insert into test.t2(c1, c2, c3) select c1,c2, 'aaa' from stream(\"format\" = \"CSV\", \"column_separator\" = \",\")" -T example.csv http://127.0.0.1:8030/api/_stream_load_with_sql
curl -v --location-trusted -u root: -H "sql: insert into test.t3(c1, c2) select c1, count(1) from stream(\"format\" = \"CSV\", \"column_separator\" = \",\") group by c1" -T example.csv http://127.0.0.1:8030/api/_stream_load_with_sql
2023-08-30 19:02:48 +08:00
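The curl commands in commit 05771e8a14 need backslash-escaped quotes because the SQL text sits inside a double-quoted shell string. A rough sketch of one way around that is to build the argument list in Python and let `shlex` handle the quoting. The function name `stream_load_sql_curl` and its default host, port, and user are assumptions copied from the examples above; the endpoint is the one this commit uses.

```python
import shlex


def stream_load_sql_curl(sql, data_file, host="127.0.0.1", port=8030,
                         user="root", password="",
                         endpoint="/api/_stream_load_with_sql"):
    """Return the curl argument list for a SQL-mode stream load."""
    return [
        "curl", "-v", "--location-trusted",
        "-u", f"{user}:{password}",
        # The SQL statement travels in a request header named "sql".
        "-H", f"sql: {sql}",
        "-T", data_file,
        f"http://{host}:{port}{endpoint}",
    ]


sql = ('insert into test.t1(c1, c2) select c1,c2 from '
       'stream("format" = "CSV", "column_separator" = ",")')
argv = stream_load_sql_curl(sql, "example.csv")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing `argv` directly to `subprocess.run` would skip the shell altogether, which avoids the escaping problem rather than solving it.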
da9eb79ac4 [Enhancement](Schema hash) Remove schema hash in tablet info (#23516) 2023-08-29 10:05:12 +08:00
ba351af452 [enhancement](thirdparty) upgrade thirdparty libs - again (#23414)
submit again #23290 (not upgrade brpc, because bthread local has error)

protobuf 3.15.0 -> 21.11
glog 0.4.0 -> 0.6.0
lz4 1.9.3 -> 1.9.4
curl 7.79.0 -> 8.2.1
zstd 1.5.2 -> 1.5.5
arrow 7.0.0 -> 13.0.0
abseil 20220623.1 -> 20230125.3
orc 1.7.2 -> 1.9.0
jemalloc for arrow 5.2.1 -> 5.3.0
xsimd 7.0.0 -> 13.0.0
opentelemetry-proto 0.19.0 -> 1.0.0
opentelemetry 1.8.3 -> 1.10.0

new:
c-ares -> 1.19.1
grpc -> 1.54.3
2023-08-26 22:59:10 +08:00
2b6d876280 [feature](move-memtable)[6/7] add options to enable memtable on sink node (#23470)
Co-authored-by: Siyang Tang <82279870+TangSiyang2001@users.noreply.github.com>
2023-08-25 22:32:22 +08:00
9cacf9535a [Opt](functions) Use preloaded cache to accelerate timezone parsing (#22694)
* opt

* bugfix

* fix ut

* fix stylecheck
2023-08-25 10:00:48 +08:00
d4694167a8 [Enhancement](chore) Some Status relevant enhancement (#23072) 2023-08-21 14:14:38 +08:00
81dd00f6e4 [Feature](Compaction) Support do full compaction by table id (#22010) 2023-08-21 11:54:51 +08:00
b49dc8042d [feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load (#22539)
## Proposed changes

Refactor thoughts: close #22383
Descriptions about `enclose` and `escape`: #22385

## Further comments

2023-08-09: 
Unfortunately, experiments show that the original way of parsing plain CSV is faster. Therefore, the refactor is applied only to the enclose-related code, and the plain CSV parser keeps the original logic.

Some performance regression is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the column-write behavior, as the flame graph shows.

Trimming of escape characters will be enabled after the fix in #22411 is merged.

Cases that should be discussed:

1. When an incomplete enclose appears at the beginning of a large data set, the line delimiter is unreachable until EOF. Will the buffer become extremely large?
2. What if an infinitely long line occurs? Essentially, `1.` is equivalent to this case.

This PR supports only stream load as a trial, to avoid too many unrelated changes. Docs will be added when `enclose` and `escape` are available for all kinds of load.
2023-08-15 09:23:53 +08:00
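To make the `enclose` and `escape` options from commit b49dc8042d concrete, here is a small illustrative splitter for a single line. It is a sketch under stated assumptions, not the Doris CSV reader: it assumes single-character separator, enclose, and escape values, it drops the enclose and escape characters from the output, and it raises on an unterminated enclose instead of reading ahead. The function name `split_csv_line` is hypothetical.

```python
def split_csv_line(line, sep=",", enclose='"', escape="\\"):
    """Split one line into fields, honoring enclose and escape characters."""
    fields, buf = [], []
    in_enclose = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape and i + 1 < len(line):
            # Take the next character literally, whatever it is.
            buf.append(line[i + 1])
            i += 2
            continue
        if ch == enclose:
            in_enclose = not in_enclose
        elif ch == sep and not in_enclose:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    if in_enclose:
        # Corresponds to discussion case 1: the closing enclose never came.
        raise ValueError("unterminated enclose")
    fields.append("".join(buf))
    return fields


print(split_csv_line('10000,"Beijing, China",ok'))
print(split_csv_line('"say \\"hi\\"",2'))
```

The `ValueError` branch shows why the two discussion cases matter: a streaming reader cannot know the enclose is unterminated until it reaches EOF, so it has to buffer or impose a limit.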
66784cef71 [Enhancement](Load) Stream Load using SQL (#22509)
This PR continues #16940, which has not been updated for a long time by its original author, @Cai-Yao. For now, part of the code is merged into master first.

Thanks to @Cai-Yao and @yiguolei.
2023-08-08 13:49:04 +08:00
22cbf43b14 [Improvement](binlog) Add full/incr engine clone with binlog (#22678)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2023-08-08 10:03:11 +08:00
19d1f49fbe [improvement](compaction) compaction policy and options in the properties of a table (#22461) 2023-08-01 22:02:23 +08:00
f2919567df [feature](datetime) Support timezone when insert datetime value (#21898) 2023-07-31 13:08:28 +08:00
765f1b6efe [Refactor](load) Extract load public code (#22304) 2023-07-29 12:56:31 +08:00
0f5b973cb9 [Enhancement](http) Add HttpError to HttpClient::execute_with_retry (#21989) 2023-07-20 10:43:05 +08:00
c409fa0f58 [Feature](Compaction)Support full compaction (#21177) 2023-07-16 13:21:15 +08:00
Pxl
ca71048f7f [Chore](status) avoid empty error msg on status (#21454)
avoid empty error msg on status
2023-07-11 13:48:16 +08:00
c470bf56a5 [chore](build) Fix compilation errors reported by GCC-13 (#21215)
Add missing headers to fix the compilation errors reported by GCC-13.
2023-06-27 17:04:44 +08:00
aea719627d Revert "[enhencement](streamload) add on_close callback for httpserver (#20826)" (#20927)
This reverts commit 5b6761acb86852a93351b7b971eb2049fb567aaf.
2023-06-17 10:39:02 +08:00
2e295a1ee9 [Enhancement](http) unify http auth config (#20864) 2023-06-16 16:55:46 +08:00
5b6761acb8 [enhencement](streamload) add on_close callback for httpserver (#20826)
Sometimes connection cannot be released properly during on_free. We need
on_close callback as the last resort.

Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
2023-06-15 13:44:02 +08:00
Pxl
a15a0b9193 [Chore](build) use file(GLOB_RECURSE xxx CONFIGURE_DEPENDS) to replace set cpp (#20461)
use file(GLOB_RECURSE xxx CONFIGURE_DEPENDS) to replace set cpp
2023-06-08 19:36:21 +08:00
Pxl
7dc7ed97eb [Chore](build) remove some unused code and remove some wno (#20326)
remove some unused code about spinlock
remove some wno and fix warning
remove varadic macro usage
2023-06-05 10:48:07 +08:00
42239d635a [fix](tablet_manager_lock) fix create tablet timeout #20067 (#20069) 2023-05-28 23:05:13 +08:00
93933308e6 [Feature-WIP](CCR): Add ccr doris interface (WIP) (#17881) 2023-05-26 23:40:49 +08:00
a05dbd3f81 [chore](compile) Improves PCH cache hit ratio (#19469)
Supplement the be-clion-dev documentation to avoid an undefined DORIS_JAVA_HOME and a missing jni.h when developing in CLion without compiling directly through build.sh.
Complete the classification of header files in pch.h and add some Doris header files that are not modified frequently.
Separate the declarations and definitions in common/config.h. To change a default configuration value, modify it in common/config.cpp.
gen_cpp/version.h is regenerated on every recompilation, which may invalidate the PCH, so version information must now be obtained indirectly rather than directly.
2023-05-10 12:49:01 +08:00
d24dd12b20 [enhancement](http) add fail reply for failed submitting tasks in single-replica-download (#19356) 2023-05-10 10:54:32 +08:00
e08de52ee7 [chore](compile) using PCH for compilation acceleration under clang (#19303) 2023-05-08 19:51:06 +08:00
5bf1396efe [enhancement](load) merge single-replica related services as non-standalone (#18421) 2023-05-06 22:54:56 +08:00
b6c7f3aeb8 [opt](FileCache) Add file cache metrics and management (#19177)
Add file cache metrics and management.
1. Get file cache metrics
> When the file cache performs poorly, there are currently no metrics for investigating the cause. In practice, the hit ratio, disk usage, and the number of removed segments are very important information.

API: `http://be_host:be_webserver_port/metrics`
File cache metrics for each base path start with the `doris_be_file_cache_` prefix. `hits_ratio` is the cache hit ratio since BE startup, and `removed_elements` is the number of segment files removed since BE startup. Every cache path has three queues: index, normal, and disposable. The capacity ratio of the three queues is 1:17:2.
```
doris_be_file_cache_hits_ratio{path="/mnt/datadisk1/gaoxin/file_cache"} 0.500000
doris_be_file_cache_hits_ratio{path="/mnt/datadisk1/gaoxin/small_file_cache"} 0.500000
doris_be_file_cache_removed_elements{path="/mnt/datadisk1/gaoxin/file_cache"} 0
doris_be_file_cache_removed_elements{path="/mnt/datadisk1/gaoxin/small_file_cache"} 0

doris_be_file_cache_normal_queue_max_size{path="/mnt/datadisk1/gaoxin/file_cache"} 912680550400
doris_be_file_cache_normal_queue_max_size{path="/mnt/datadisk1/gaoxin/small_file_cache"} 8500000000
doris_be_file_cache_normal_queue_max_elements{path="/mnt/datadisk1/gaoxin/file_cache"} 217600
doris_be_file_cache_normal_queue_max_elements{path="/mnt/datadisk1/gaoxin/small_file_cache"} 102400

doris_be_file_cache_normal_queue_curr_size{path="/mnt/datadisk1/gaoxin/file_cache"} 14129846
doris_be_file_cache_normal_queue_curr_size{path="/mnt/datadisk1/gaoxin/small_file_cache"} 14874904
doris_be_file_cache_normal_queue_curr_elements{path="/mnt/datadisk1/gaoxin/file_cache"} 18
doris_be_file_cache_normal_queue_curr_elements{path="/mnt/datadisk1/gaoxin/small_file_cache"} 22

...
```
2. Release file cache
> Frequent swapping of segment files can seriously hurt file cache performance. A deletion interface helps users clean up the file cache.

API: `http://be_host:be_webserver_port/api/file_cache?op=release&base_path=${file_cache_base_path}`
Returns the number of released segment files. If `base_path` is not provided in the URL, all cache paths are released.
Calling this API is thread-safe: only segment files that are not currently being read can be released.
```
{"released_elements":22}
```
3. Specify the base path to store cache data
> Currently, regression testing lacks file cache test cases, so the stability of the file cache cannot be guaranteed. This interface is mainly intended for regression testing, where different queries use different paths to verify different usage cases and performance.

Users can set the session variable `file_cache_base_path` to specify the base path for storing cache data. The default, `file_cache_base_path="random"`, means a random path is chosen from the cache paths. If `file_cache_base_path` is not one of the base paths in the BE configuration, a random path is used.
2023-05-05 14:28:01 +08:00
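The metrics output, the 1:17:2 queue ratio, and the release URL from commit b6c7f3aeb8 can be tied together with a short sketch. Everything here is illustrative: the helper names are hypothetical, the parser only handles the single-label `name{path="..."} value` shape shown in the sample output, and the integer-division split is an assumption. The total of 1073741824000 bytes is inferred from the sample, where 17/20 of it matches the reported `normal_queue_max_size`.

```python
import re
from urllib.parse import urlencode

PREFIX = "doris_be_file_cache_"
# Matches lines such as: name{path="/some/dir"} 0.5
METRIC_RE = re.compile(r'^(\w+)\{path="([^"]*)"\}\s+(\S+)$')


def parse_file_cache_metrics(text):
    """Map (metric name without prefix, cache path) to a float value."""
    metrics = {}
    for line in text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m and m.group(1).startswith(PREFIX):
            name = m.group(1)[len(PREFIX):]
            metrics[(name, m.group(2))] = float(m.group(3))
    return metrics


def split_queue_capacity(total_bytes):
    """Split a cache capacity 1:17:2 across the three queues."""
    return {
        "index": total_bytes * 1 // 20,
        "normal": total_bytes * 17 // 20,
        "disposable": total_bytes * 2 // 20,
    }


def release_url(be_host, be_webserver_port, base_path=None):
    """Build the release URL; omitting base_path targets all cache paths."""
    query = {"op": "release"}
    if base_path is not None:
        query["base_path"] = base_path
    return (f"http://{be_host}:{be_webserver_port}"
            f"/api/file_cache?{urlencode(query)}")


sample = '''
doris_be_file_cache_hits_ratio{path="/mnt/datadisk1/gaoxin/file_cache"} 0.500000
doris_be_file_cache_normal_queue_max_size{path="/mnt/datadisk1/gaoxin/file_cache"} 912680550400
'''
parsed = parse_file_cache_metrics(sample)
print(parsed)
print(split_queue_capacity(1073741824000))
print(release_url("be_host", 8040, "/mnt/datadisk1/gaoxin/file_cache"))
```

The port 8040 in the last call is only a placeholder for `be_webserver_port`. Note that `urlencode` percent-encodes the slashes in `base_path`, which is the safe form for a query string.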