Loading data from HDFS into Hive moves the source directory into the table's location directory, leading to errors like `Can not get first file, please check uri` in the TVF test.
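For context, Hive's `LOAD DATA INPATH` moves (rather than copies) HDFS files into the table's location, so the original URI becomes invalid afterwards. A minimal sketch of the pattern that triggers this; the path and table name are hypothetical:
```sql
-- Hypothetical HDFS path and table name.
-- Hive MOVES the file into the table's location directory, so a later
-- TVF query that still points at the original URI cannot find the file.
LOAD DATA INPATH '/user/doris/tvf_data/all_types.avro'
INTO TABLE tvf_test_table;
```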
1. Collect external table row counts when executing `ANALYZE DATABASE`.
2. Support showing cached table stats (row count).
3. Support altering external table column stats.
4. Refresh/invalidate the table row count stat memory cache when an analyze task finishes and when table stats are dropped. (The statements involved are sketched below.)
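A rough sketch of the statements involved; the exact syntax and property keys may differ between Doris versions, and the catalog/db/table names are illustrative:
```sql
-- Collect stats (including row counts) for all tables in an external database.
ANALYZE DATABASE hive_ctl.db1;

-- Show the cached table stats (row count).
SHOW TABLE STATS hive_ctl.db1.tbl1;

-- Manually set column stats on an external table; the property keys are illustrative.
ALTER TABLE hive_ctl.db1.tbl1 MODIFY COLUMN col1
SET STATS ('row_count' = '10000', 'ndv' = '1000');

-- Dropping stats should also invalidate the cached row count.
DROP STATS hive_ctl.db1.tbl1;
```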
Currently, for a merge-on-write unique table, the delete bitmap of a rowset is calculated during the flush, commit, and publish phases. In this PR, we add a special mark to every rowset considered when calculating the delete bitmap in these three phases. Before finally merging the delete bitmap into the table meta's delete bitmap, we check whether all the rowsets contain the special mark, to verify that every rowset was considered during the three phases.
Because an executor cannot fail in the publish phase once the coordinator has received successful commit info from all executors, we only print logs if this correctness check fails rather than reporting a failure.
The avro-scanner jar package is reduced from 204 MB to 160 MB.
Hadoop-related dependencies in the original avro pom were packaged directly into the jar, resulting in a jar size of about 200 MB. Since the BE `lib` directory already provides a Hadoop jar environment, those dependencies can be referenced directly instead.
This PR fixes two issues:
1. When using the S3 TVF to query files in AVRO format, a change of `TFileType` caused the originally queried `FILE_S3` to become `FILE_LOCAL`, making the query fail (a sample query is sketched below).
2. The parameters `s3.virtual.key` and `s3.virtual.bucket` are both removed; a new `S3Utils` class in jni-avro parses the bucket and key from the S3 URI.
The main purpose of this change is to unify the S3 parameters.
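For reference, a query of the shape affected by issue 1; the endpoint, path, and credentials are placeholders, and the parameter names may vary by version:
```sql
-- Placeholder endpoint, path, and credentials.
SELECT *
FROM s3(
    'uri' = 'https://my-bucket.s3.example.com/path/data.avro',
    'format' = 'avro',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
)
LIMIT 10;
```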
When the data stream sender is doing a broadcast shuffle, it accumulates blocks up to the batch size and then sends them to the destinations; but for local receivers, it ONLY sends the current block, which causes data loss.
This issue was introduced by #22218.
If #22218 is picked to the 2.0 branch, this PR also needs to be picked.
Assume there is a hive catalog named hive_ctl, a hive db named db1, and a table named tbl1. If we connect to a slave FE and execute the following commands:
1. `switch hive_ctl`
2. `show partitions from db1.tbl1`
we will get an error like this:
```
MySQL [(none)]> show partitions from db1.tbl1;
ERROR 1049 (42000): errCode = 2, detailMessage = Unknown database 'default_cluster:db1'
```
The reason is that the slave FE forwards the `ShowPartitionStmt` to the master FE but does not sync the current catalog information, so the parser cannot find the db and throws this exception. This is just one case; some other similar cases fail too.
The FE log grows large on a busy Doris cluster, and preserving historical logs costs a lot of disk space. Enabling compression is a good way to save space, and a gzip-compressed text file can still be viewed without decompressing it first.
Upgrade guava to 32.1.2-jre
Set ck dependency scope to provided
Upgrade okio to 3.4.0
Upgrade snake yaml to 1.33
Upgrade aws-java-sdk to 1.12.519
Upgrade hadoop to 3.3.6
1. If derived from an origin column, e.g. `create table tbl1 as select col1 from tbl2`, the length will be the same as the origin column's.
2. If derived from a function, e.g. `create table tbl1 as select func(col1) from tbl2`, the length will be 65533.
3. If derived from a constant value, e.g. `create table tbl1 as select "abc" from tbl2`, the length will be 65533. (The three cases are sketched below.)
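A small sketch of the three cases, assuming `tbl2.col1` is declared as `varchar(10)`; the table names are illustrative:
```sql
-- Assume: create table tbl2 (col1 varchar(10), ...);

-- Case 1: derived from an origin column -> col1 is varchar(10), same as tbl2.col1.
create table t1 as select col1 from tbl2;

-- Case 2: derived from a function -> the result column is varchar(65533).
create table t2 as select upper(col1) from tbl2;

-- Case 3: derived from a constant value -> the result column is varchar(65533).
create table t3 as select "abc" from tbl2;
```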
* Refactor thirdparty.cmake by adding an add_thirdparty function
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
* Refactor thirdparty.cmake
- Rename add_library NOADD to NOTADD
- Fix tcmalloc not being added to COMMON_THIRDPARTY
- Fix libintl in OS_MACOSX
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>