Support match syntax in Nereids.
The match syntax is used as follows:
```sql
select * from test where msg match "hello";
select * from test where msg match_any "hello";
select * from test where msg match_all "hello hi";
select * from test where msg match_phrase "hello world";
```
`match` is the same as `match_any`.
The PR that added match syntax to the original planner: https://github.com/apache/doris/pull/14211
The hudi serializer `org.apache.hudi.common.util.SerializationUtils$KryoInstantiator.newKryo` throws an error like `java.lang.IllegalArgumentException: classLoader cannot be null`. This PR sets the default class loader for the scan thread.
```java
public Kryo newKryo() {
Kryo kryo = new Kryo();
...
// Thread.currentThread().getContextClassLoader() returns null
kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
...
return kryo;
}
```
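A minimal sketch of the described fix, assuming the scan runs in a task whose thread we control (the class and method names below are illustrative, not the actual Doris code): give the scan thread a non-null context class loader before any Hudi Kryo deserialization runs on it.
```java
// Hypothetical sketch: ensure the scan thread has a non-null context
// class loader so KryoInstantiator.newKryo() does not fail.
public class HudiScanTask implements Runnable {
    @Override
    public void run() {
        Thread current = Thread.currentThread();
        if (current.getContextClassLoader() == null) {
            // Fall back to the loader that loaded this class.
            current.setContextClassLoader(HudiScanTask.class.getClassLoader());
        }
        // ... perform the Hudi scan, which may deserialize via Kryo ...
    }
}
```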
The changes in this PR:
1. rename BatchRewriteJob to AbstractBatchJobExecutor
2. add a new rewrite job type, CostBasedRewriteJob. It receives a RewriteJob as input, compares the costs of the two candidate plans produced with and without the input RewriteJob, and returns the lower-cost plan as the rewrite result.
3. do some small refactoring on NereidsPlanner for better abstraction
4. refactor the directory structure of Nereids
Usage of the cost-based rewrite framework:
If you want a rule or rule list to run in the cost-based rewrite framework, just wrap the rule / rule list with the costBased function of the Rewriter class, for example:
```java
...
costBased(
custom(RuleType.AGG_SCALAR_SUBQUERY_TO_WINDOW_FUNCTION,
AggScalarSubQueryToWindowFunction::new)
),
...
```
As we know, log4j2 can sometimes be a bottleneck in Doris FE when many logs are output in sync mode, while asynchronous logging performs much better. We also find that capturing caller location has a similar impact across all logging libraries, slowing down asynchronous logging by about 30-100x. So here we provide three log modes for log4j2 to meet the needs of different users.
Refer to https://logging.apache.org/log4j/2.x/performance.html for details.
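For context, a minimal sketch of how fully asynchronous loggers are enabled in stock log4j2 (the standard log4j2 mechanism, not necessarily the exact wiring used by Doris FE; it requires the LMAX Disruptor on the classpath):
```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class AsyncLoggingDemo {
    public static void main(String[] args) {
        // Must be set before the first logger is created; switches all
        // loggers to the asynchronous implementation.
        System.setProperty("log4j2.contextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
        Logger logger = LogManager.getLogger(AsyncLoggingDemo.class);
        // Note: using %C, %M, %L or %location in the layout pattern forces
        // a stack walk per event, which erases most of the async speedup.
        logger.info("asynchronous logging enabled");
    }
}
```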
1. cast from string literal to date-like types should not be an implicit cast
2. the string representation of float-like types should not use scientific notation
3. the data type of the like function's regex expr should be string type even if it is a null literal
4. add -Xss4m in fe.conf to prevent stack overflow in some cases
This PR refactors common methods that were originally located in the ODBC classes but were used by the JDBC classes. These methods have been moved into the JDBC classes to improve code readability and maintainability.
In addition, we have disabled the creation of ODBC external tables by default. This does not affect existing ODBC usage: you can still enable ODBC external tables through the enable_odbc_table setting. Please be aware that we plan to remove ODBC external tables completely in a future version, so we recommend using the JDBC Catalog instead.
The java-udf module has become increasingly large and difficult to manage, making it inconvenient to package and use as needed. It needs to be split into multiple sub-modules, such as: java-common, java-udf, jdbc-scanner, hudi-scanner, paimon-scanner.
Co-authored-by: lexluo <lexluo@tencent.com>
After supporting insert-only transactional hive tables in #19518 and #19419, this PR supports transactional hive full acid tables.
Hive3 transactional full acid tables are supported.
Hive2 transactional full acid tables need to have major compactions run first.
When executing a routine load job, a StackOverflowException may be thrown.
This is because the exprs in the column setting list are analyzed for each routine load sub-task,
and there is a self-reference bug that may cause an endless loop when analyzing an expr.
The following columns expr list may trigger this bug:
```
columns(col1, col2,
col2=null_or_empty(col2),
col1=null_or_empty(col2))
```
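Here `col2` is redefined in terms of itself, so naive expansion of the column mapping never terminates. A minimal standalone sketch of the loop (hypothetical code, not the actual Doris analyzer):
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo: expanding col2 = null_or_empty(col2) substitutes
// col2 with an expression that again contains col2, so the recursion
// never terminates.
public class SelfReferenceDemo {
    static String expand(String col, Map<String, String> mapping) {
        String expr = mapping.get(col);
        if (expr == null) {
            return col; // plain source column, nothing to expand
        }
        for (String ref : mapping.keySet()) {
            if (expr.contains(ref)) {
                // Recurses forever when expr for col2 references col2 itself.
                expr = expr.replace(ref, expand(ref, mapping));
            }
        }
        return expr;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("col2", "null_or_empty(col2)");
        expand("col2", mapping); // throws StackOverflowError, mirroring the bug
    }
}
```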
This fix has been verified by the user, but I can't add a regression test for this case, because I can't submit a routine load job
in our regression tests, and this bug can only be triggered by routine load.
This PR impacts the TPC-H q16 Agg strategy, but causes no performance issue.
This PR improves TPC-DS sf100:
before:
- cold: 141 sec
- hot: 133 sec
after:
- cold: 137 sec
- hot: 128 sec
Issue Number: close #20669
RewriteInPredicateRule may cast an InPredicate expr's two children to the same type. For example, for `where cast(age as char) in ('11')` where the type of age is int, RewriteInPredicateRule will cast both children of the expr to int. In the example above, child 0 will have this structure:
```
child 0: type: int
|--- child: type : char
|-- child: type : int
```
Because RewriteInPredicateRule casts the type of the expr to int, the stmt is reanalyzed; but the stmt is reset before reanalysis, and the reset operation changes child 0 to this structure:
```
child: type : char
|-- child: type : int
```
As a result, both children are cast to varchar in the function castAllToCompatibleType, and the logic of RewriteInPredicateRule becomes useless.
In 1.1-lts and 1.2-lts, a case like `where cast(age as char) in ('11')` can't work at all, because castAllToCompatibleType casts int to char, and int can't be cast to char (master works because castAllToCompatibleType casts int to varchar in this case):
```
MySQL [test]> select user_id from test_cast where cast(age as char) in ('45');
ERROR 1105 (HY000): errCode = 2, detailMessage = type not match, originType=INT, targeType=CHAR(*)
```
Currently, many profiles use add child profile to organize the profile into blocks. But this is wrong: a child profile carries a total time counter, while what we actually need is just a label, for example:
- MemoryUsage:
- HashTable: 23.98 KB
- SerializeKeyArena: 446.75 KB
This PR adds a new macro, ADD_LABEL_COUNTER, to add just a label to the profile.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Currently, we check that an OLAP table's state is NORMAL outside the write lock scope, so the table state may change to an abnormal one while we perform the alter operation.
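A minimal sketch of the fix pattern (a hypothetical Table class, not the actual Doris code): the state check must happen inside the write lock, otherwise another thread can change the state between the check and the alter.
```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of checking table state under the write lock.
class Table {
    enum State { NORMAL, SCHEMA_CHANGE }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.NORMAL;

    void alter() {
        lock.writeLock().lock();
        try {
            // Check under the lock: no other thread can change the
            // state between this check and the alter below.
            if (state != State.NORMAL) {
                throw new IllegalStateException("table state is " + state);
            }
            state = State.SCHEMA_CHANGE; // ... perform the alter ...
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```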
---------
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
The formula used to compute ndv after a filter implies that the new rowCount is smaller than the original rowCount. When we apply this formula to a join, we should add a branch for the case where the new row count is bigger than the original row count:
when the new row count is bigger, the ndv is not changed.
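A minimal sketch of the guarded scaling (the concrete formula in Doris may differ; the proportional shrink below is only illustrative):
```java
// Hypothetical illustration: ndv is only scaled down when the row count
// shrinks; a join can enlarge the row count, in which case ndv is kept.
static double ndvAfter(double ndv, double oldRowCount, double newRowCount) {
    if (newRowCount >= oldRowCount) {
        return ndv; // row count grew (e.g. under a join): ndv unchanged
    }
    return ndv * (newRowCount / oldRowCount); // shrink under a filter
}
```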
Support collecting statistics for HMS external tables with specific partitions. Add session variables to limit the partitions scanned when collecting whole-table row count and column statistics.
PR https://github.com/apache/doris/pull/19909 implemented the framework of the hudi reader for MOR tables. This PR completes all functions of reading MOR tables and enables end-to-end queries.
Key implementations:
1. Use hudi meta information to generate the table schema, instead of getting it from the hive client.
2. Use the hive client to list hudi partitions, so this strongly depends on the sync tool (https://hudi.apache.org/docs/syncing_metastore/) which syncs the partitions of hudi into the hive metastore. However, we may later get the hudi partitions directly from the .hoodie directory.
3. Remove `HudiHMSExternalCatalog`, because other catalogs like glue are compatible with the hive catalog.
4. COW tables are still read natively in C++ as before.
5. The hudi RecordReader uses ProcessBuilder to start a hotspot debugger process, which may get stuck when attaching to the original JNI process, so I use a tricky method to kill this useless process.
We have some pruning-path logic in the cascades framework. However, it does not work as we expected: if we prune one Group, we may need to run optimization on its parent thousands of times without any successful result. This PR removes this pruning provisionally. We will add pruning back when we re-design it.
If a BE crashed, the error would be logged and the analysis task would be marked as finished, which is incorrect.
In this PR, the analysis task is updated according to the query state.