Previously, delete statement with conditions on value columns are only supported on duplicate tables. After we introduce delete sign mechanism to do batch delete, a delete statement with conditions on value columns on unique tables will be transformed into the corresponding insert into ..., __DELETE_SIGN__ select ... statement. However, for unique table with merge-on-write enabled, the overhead of inserting these data can be eliminated. So this PR add the ability to allow delete predicate on value columns for merge-on-write unique tables.
## Proposed changes
Refactor thoughts: close#22383
Descriptions about `enclose` and `escape`: #22385
## Further comments
2023-08-09:
It's a pity that experiment shows that the original way for parsing plain CSV is faster. Therefor, the refactor is only applied on enclose related code. The plain CSV parser use the original logic.
Fallback of performance is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the write column behavior, proved by the flame graph.
Trimming escape will be enable after fix: #22411 is merged
Cases should be discussed:
1. When an incomplete enclose appears in the beginning of a large scale data, the line delimiter will be unreachable till the EOF, will the buffer become extremely large?
2. What if an infinite line occurs in the case? Essentially, `1.` is equivalent to this.
Only support stream load as trial in this PR, avoid too many unrelated changes. Docs will be added when `enclose` and `escape` is available for all kinds of load.
Truncate char or varchar columns if size is smaller than file columns or not found in the file column schema by session var `truncate_char_or_varchar_columns`.
This function can be used to replace bitmap_union(to_bitmap(expr)), because bitmap_union(to_bitmap(expr)) need create many many small bitmaps firstly and then merge them into a single bitmap.
bitmap_agg will convert the column value into a bitmap directly. Its performance is better than bitmap_union(to_bitmap(expr)) . In our test , there is about 30% improvement.
1. make mv matched when preagg have value column predicate contained in mv
'where clause
2. fix `org.apache.doris.common.AnalysisException: errCode = 2, detailMessage = BITMAP_UNION need input a bitmap column, but input INVALID_TYPE`
3. make the error message more detailed when create mv stmt parse failed
This PR was originally #16940 , but it has not been updated for a long time due to the original author @Cai-Yao . At present, we will merge some of the code into the master first.
thanks @Cai-Yao @yiguolei
when pushing down constant conjunct into set operation node, we should assign the conjunct to agg node if there is one. This is consistant with pushing constant conjunct into inlineview.
For load request, there are 2 tuples on scan node, input tuple and output tuple.
The input tuple is for reading file, and it will be converted to output tuple based on user specified column mappings.
And the broker load support different column mapping in different data description to same table(or partition).
So for each scanner, the output tuples are same but the input tuple can be different.
The previous implements save the input tuple in scan node level, causing different scanner using same input tuple,
which is incorrect.
This PR remove the input tuple from scan node and save them in each scanners.
sort out the test cases of external table.
After modify, there are 2 directories:
1. `external_table_p0`: all p0 cases of external tables: hive, es, jdbc and tvf
2. `external_table_p2`: all p2 cases of external tables: hive, es, mysql, pg, iceberg and tvf
So that we can run it with one line command like:
```
sh run-regression-test.sh --run -d external_table_p0,external_table_p2
```
count_by_enum(expr1, expr2, ... , exprN);
Treats the data in a column as an enumeration and counts the number of values in each enumeration. Returns the number of enumerated values for each column, and the number of non-null values versus the number of null values.
If a column is defined as: col VARCHAR/CHAR NULL and no default value. Then we load json data which misses column col, the result queried is not correct:
+------+
| col |
+------+
| 1 |
+------+
But expect:
+------+
| col |
+------+
| NULL |
+------+
---------
Co-authored-by: duanxujian <duanxujian@jd.com>
return empty result instead of error for empty match query as follows:
`SELECT * FROM t WHERE msg MATCH ''`
`SELECT * FROM t WHERE msg MATCH 'stop_word'`