Add a scanner isolation class loader so that plugins do not conflict with each other.
The BE obtains scanner classes through a JNI call and uses `JniClassLoader` to load them.
Previously, scanner classes were always loaded from the system class path by default,
so the classes could not be isolated per scanner.
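For reference, a minimal C++ sketch of the BE-side idea, assuming a per-scanner `JniClassLoader` object has already been created and is held as a JNI global reference; the helper name and the example class name are illustrative, not the actual Doris implementation:

```
// Sketch: load a scanner class through a dedicated class loader instead of
// JNIEnv::FindClass(), which resolves classes from the system class path.
#include <jni.h>
#include <string>

// `scanner_loader` is assumed to be a global reference to this scanner's
// JniClassLoader instance (how it is created is omitted here).
jclass load_scanner_class(JNIEnv* env, jobject scanner_loader, const std::string& class_name) {
    jclass loader_clazz = env->GetObjectClass(scanner_loader);
    jmethodID load_class = env->GetMethodID(loader_clazz, "loadClass",
                                            "(Ljava/lang/String;)Ljava/lang/Class;");
    // Class name in binary form, e.g. "org.apache.doris.jdbc.JdbcExecutor" (illustrative).
    jstring jname = env->NewStringUTF(class_name.c_str());
    // Delegates to the scanner's own class loader, so classes from different
    // scanner plugins no longer conflict on the system class path.
    jclass clazz = static_cast<jclass>(env->CallObjectMethod(scanner_loader, load_class, jname));
    env->DeleteLocalRef(jname);
    env->DeleteLocalRef(loader_clazz);
    return clazz;
}
```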
### Two main changes:
- 1. Add minidump replay.
- 2. Change the minidump serialization of statistics messages and some interfaces between the main logic of the Nereids optimizer and minidump.
### Use of nereids ut:
- 1. Save minidump files:
Execute the following commands via mysql-client:
```
set enable_nereids_planner=true;
set enable_minidump=true;
```
Then execute SQL in mysql-client.
- 2. Use the nereids-ut script to run a directory of minidump files:
```
cp -r ${DORIS_HOME}/minidump ${DORIS_HOME}/output/fe && cd ${DORIS_HOME}/output/fe
./nereids_ut --d ${directory_of_minidump_files}
```
### Refactor of minidump
- Move statistics serialization into the input serialization, so statistics are serialized together with the catalogs.
- Generate the minidump file only when the `enable_minidump` flag is set; the minidump module interacts with the main optimizer only through:
`serializeInputsToDumpFile(catalog, statistics, query)` and `serializeOutputsToDumpFile(outputplan)`.
Introduce libunwind to collect stack traces; its cost is negligible and it provides line numbers.
Use StackTraceCache and PHDRCache to speed it up; it is customizable and includes some optimizations.
The other stack trace tools (glog, boost, glibc) remain available in case they are needed.
TODO:
- Currently supports Linux `__x86_64__`, `__arm__`, and `__powerpc__`; `__FreeBSD__` and Apple platforms are not supported.
  Note: `__arm__` and `__powerpc__` have not been verified.
- Support signal handlers.
- libunwind supports `unw_backtrace` for jemalloc.
- Use of the currently undefined compile option `USE_MUSL` is left for later.
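For reference, a minimal sketch of local stack unwinding with the libunwind API (a generic usage example, not the BE's actual implementation; the BE additionally speeds this up with StackTraceCache/PHDRCache, and resolving file/line numbers is a separate step omitted here):

```
// Walk the current thread's stack and print instruction pointers with symbol names.
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <cstdio>

void print_stack_trace() {
    unw_context_t context;
    unw_cursor_t cursor;
    unw_getcontext(&context);           // capture the current register state
    unw_init_local(&cursor, &context);  // in-process (local) unwinding only

    while (unw_step(&cursor) > 0) {
        unw_word_t ip = 0, offset = 0;
        char symbol[256] = {0};
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        if (unw_get_proc_name(&cursor, symbol, sizeof(symbol), &offset) == 0) {
            std::printf("0x%lx: %s + 0x%lx\n", (unsigned long)ip, symbol, (unsigned long)offset);
        } else {
            std::printf("0x%lx: <unknown>\n", (unsigned long)ip);
        }
    }
}
```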
Fix Hadoop short-circuit reading not being enabled in some environments.
- Revert #21430 because it causes a performance degradation issue.
- Add `$HADOOP_CONF_DIR` to `$CLASSPATH`.
- Remove the empty `hdfs-site.xml`, because in some environments it prevents Hadoop short-circuit reading from being enabled.
- Copy the Hadoop common native libs (taken from https://github.com/apache/doris-thirdparty/pull/98) and add them to `LD_LIBRARY_PATH`, because in some environments `LD_LIBRARY_PATH` does not contain the Hadoop common native libs, which prevents Hadoop short-circuit reading from being enabled.
Optimize the usage of the fs benchmark tool:
1. Remove the `Open` benchmark; it is useless.
2. Remove the `Delete` benchmark; it is dangerous.
3. Add a `SingleRead` benchmark; the user can specify an existing file to test the read operation:
`sh bin/run-fs-benchmark.sh --conf=conf/hdfs_read.conf --fs_type=hdfs --operation=single_read`
4. Modify `run-fs-benchmark.sh`: remove the `OPTS` section and use the options of `fs_benchmark_tool` directly.
5. Add some custom counters to the benchmark result (see the sketch after the counter descriptions below), e.g.:
```
--------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1 6864 ms 2385 ms 1 ReadRate=200.936M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1 3919 ms 1828 ms 1 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1 3839 ms 1819 ms 1 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_mean 4874 ms 2011 ms 3 ReadRate=304.054M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_median 3919 ms 1828 ms 3 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_stddev 1724 ms 324 ms 3 ReadRate=89.3768M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_cv 35.37 % 16.11 % 3 ReadRate=29.40%
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_max 6864 ms 2385 ms 3 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_min 3839 ms 1819 ms 3 ReadRate=200.936M/s
```
- For `open_read` and `single_read`, add `ReadRate` as `bytes per second`.
- For `create_write`, add `WriteRate` as `bytes per second`.
- For `exists` and `rename`, add `ExistsCost` and `RenameCost` as `time cost per operation`.
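For reference, a minimal sketch of how a rate counter such as `ReadRate` can be attached to a result using the benchmark library's counter API (a generic example, not the tool's actual code; the read itself is a placeholder):

```
#include <benchmark/benchmark.h>

static void HdfsReadBenchmark(benchmark::State& state) {
    int64_t total_bytes = 0;
    for (auto _ : state) {
        int64_t bytes_read = 0;
        double elapsed_seconds = 0.0;
        // ... read the target file here and fill in bytes_read / elapsed_seconds (placeholder) ...
        state.SetIterationTime(elapsed_seconds); // paired with UseManualTime()
        total_bytes += bytes_read;
    }
    // kIsRate divides the value by the measured time, so the counter is reported
    // in bytes per second, e.g. "ReadRate=200.936M/s" in the output above.
    state.counters["ReadRate"] =
            benchmark::Counter(static_cast<double>(total_bytes), benchmark::Counter::kIsRate);
}
BENCHMARK(HdfsReadBenchmark)->Iterations(1)->Repetitions(3)->UseManualTime();
```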
Add an optional executable binary `fs_benchmark_tool` for testing the performance of file systems such as HDFS and S3.
Usage:
./fs_benchmark_tool --conf my.conf --fs_type=s3 --operation=read --iterations=5
In `my.conf`, you can add any config key-value pairs in the following format:
key1=value1
key2=value2
By default, this binary is not built. It is only built when `BUILD_FS_BENCHMARK=ON` is set.
The binary will be installed in output/be/lib.
As a developer, you can add a new subclass of `BaseBenchmark` to add your own benchmark.
See be/src/io/fs/benchmark/s3_benchmark.hpp for an example, and the sketch below.
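Below is a rough, hypothetical sketch of what such a subclass might look like; the actual `BaseBenchmark` interface lives in be/src/io/fs/benchmark/ and may differ, so the constructor and method names here are assumptions for illustration only:

```
#include <map>
#include <string>

// Hypothetical base class; see be/src/io/fs/benchmark/ for the real interface.
class BaseBenchmark {
public:
    BaseBenchmark(std::string name, int iterations)
            : _name(std::move(name)), _iterations(iterations) {}
    virtual ~BaseBenchmark() = default;
    // Called once per iteration; implement the file-system operation to measure.
    virtual void run() = 0;

protected:
    std::string _name;
    int _iterations;
};

// Hypothetical subclass measuring a single read of one file on the local file system.
class LocalSingleReadBenchmark : public BaseBenchmark {
public:
    LocalSingleReadBenchmark(const std::map<std::string, std::string>& conf, int iterations)
            : BaseBenchmark("LocalSingleReadBenchmark", iterations),
              _file_path(conf.at("file_path")) {}

    void run() override {
        // Open _file_path, read it fully, and record bytes read / elapsed time here.
    }

private:
    std::string _file_path;
};
```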
The java-udf module has become increasingly large and difficult to manage, making it inconvenient to package and use as needed. It needs to be split into multiple sub-modules, such as java-common, java-udf, jdbc-scanner, hudi-scanner, and paimon-scanner.
Co-authored-by: lexluo <lexluo@tencent.com>
- Implement ORC lazy materialization, integrating with the implementations of https://github.com/apache/doris-thirdparty/pull/56 and https://github.com/apache/doris-thirdparty/pull/62.
- Refactor code: move `execute_conjuncts()` and `execute_conjuncts_and_filter_block()` from `parquet_group_reader` to `VExprContext`, so they can be used by both the Parquet reader and the ORC reader.
- Add session variables `enable_parquet_lazy_materialization` and `enable_orc_lazy_materialization` to control whether lazy materialization is enabled.
- Modify `build.sh` to update the apache-orc submodule or download the package every time.
After #19246, compiling the FE automatically generates the Config and Session Variables docs and overwrites the original ones.
This needs to be avoided because the feature is not ready to use yet.
1. Update only the apache-orc git submodule by path instead of updating all modules.
When running `sh build.sh`, updating all modules often fails several times because of the unstable GitHub network, which wastes a lot of time.
2. Add a gitignore entry for be/src/apache-orc/ to avoid accidental commits.
1. Organize HTTP documents
2. Add HTTP interface authentication for the FE
3. **Support HTTPS interface for the FE**
4. Provide an authentication interface
5. Add HTTP interface authentication for the BE
6. Support HTTPS interface for the BE
1. Introduce Hadoop libhdfs.
2. For the Linux x86 platform, use Hadoop libhdfs.
3. For other platforms, use libhdfs3, because we currently don't have Hadoop libhdfs binaries for other platforms.
Co-authored-by: adonis0147 <adonis0147@gmail.com>
This PR ports the codebase to Clang-16.
Upgrade some third-party libraries:
1. Apache BRPC: 1.2.0 -> 1.4.0 (Some bugs are fixed and all patches for 1.2.0 can be removed.)
2. Boost: 1.73.0 -> 1.81.0 (Porting to Clang-16)
3. libclucene: 2.4.6 -> 2.4.8 (Porting to Clang-16)
As part of the Inverted Index DSIP steps, we'd like to contribute our inverted index implementation step by step.
First of all, we need to introduce clucene into the Doris third-party libs, because the inverted index implementation is based on the Lucene API and index file format. We also add our own features and performance improvements on top of clucene, so we need to maintain the repo ourselves.
According to the post https://developer.apple.com/forums/thread/676684, an executable whose size is bigger than 2 GB may fail to start. The size of the executable `doris_be_test` generated by run-be-ut.sh is now 2.1 GB (> 2 GB), so we can't run it on macOS (arm64).
We can separate the debug info from the executable `doris_be_test` to reduce its size. After that, we can run `doris_be_test` successfully.