At present, the use of VLOG in the code is inconsistent.
Some call sites use the VLOG_XX format inherited from Impala, while others use the VLOG(number) format.
The VLOG(number) format has no unified specification, so this PR standardizes the use of VLOG.
For #4674
This is a UDAF for approximate topN using the Space-Saving algorithm. At present, it can only calculate
the frequent items and their frequencies in a given column; based on this, we can implement topN functions
similar to those supported by Kylin in the future.
I have also added a test to measure the accuracy of this algorithm. The following is a rough running result.
The data set has 1 million rows and follows a Zipfian distribution. "Element cardinality" is the number of
distinct values in the data, and the columns 20X, 50X, and 100X correspond to space_expand_rate values of
20, 50, and 100, which set the number of counters used by the Space-Saving algorithm.
```
zf exponent = 0.5
Element cardinality    20X     50X     100X
1000                   100%    100%    100%
10000                  100%    100%    100%
100000                 100%    100%    100%
500000                 94%     98%     99%

zf exponent = 0.6,1
Element cardinality    20X     50X     100X
1000                   100%    100%    100%
10000                  100%    100%    100%
100000                 100%    100%    100%
500000                 100%    100%    100%
```
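For readers unfamiliar with Space-Saving, the counter-replacement idea can be sketched in a few lines of Python. This is an illustrative sketch, not the code from this PR: the class name, the `capacity` parameter, and the linear scan for the smallest counter are assumptions, and how the PR derives the counter number from space_expand_rate is not shown here.

```python
class SpaceSaving:
    """Approximate frequent-item counter with a fixed number of counters."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity  # hypothetical name for the counter number
        self.counts = {}          # item -> estimated frequency

    def add(self, item, weight=1):
        if item in self.counts:
            # Item is already monitored: just bump its counter.
            self.counts[item] += weight
        elif len(self.counts) < self.capacity:
            # A free counter is available.
            self.counts[item] = weight
        else:
            # Evict the item with the smallest counter and let the new
            # item inherit that count, so estimates never undercount.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            self.counts[item] = inherited + weight

    def top(self, n):
        # Highest estimated frequency first; ties broken by item text.
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
        return ranked[:n]
```

With more counters than distinct items the counts are exact; with fewer, the reported frequencies are upper bounds, which is consistent with accuracy dropping only at the highest cardinality in the table above.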
The return type of str_to_date depends on whether the time part is included in the format.
If included, it is DATETIME, otherwise it is DATE.
If the format parameter is not constant, the return type will be DATETIME.
The above judgment has been completed in the FE query planning stage,
so here we directly set the value type to the return type set in the query plan.
For example:
A table tbl has one column k1 of type varchar and contains 2 rows:
"%Y-%m-%d"
"%Y-%m-%d %H:%i:%s"
Query:
SELECT str_to_date("2020-09-01", k1) from tbl;
Result will be:
2020-09-01 00:00:00
2020-09-01 00:00:00
Query:
SELECT str_to_date("2020-09-01", "%Y-%m-%d");
Return type is DATE
Query:
SELECT str_to_date("2020-09-01", "%Y-%m-%d %H:%i:%s");
Return type is DATETIME
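The planning-time rule described above can be sketched as follows. This is a Python illustration only, not the FE code: the function name is hypothetical, `None` stands in for a non-constant format argument, and the set of specifiers treated as "time part" is an assumption based on MySQL-style format strings.

```python
# Assumed set of MySQL-style specifiers that denote a time component.
TIME_SPECIFIERS = set("HhIiklprSsTf")


def str_to_date_return_type(fmt):
    """Return "DATE" or "DATETIME" for a str_to_date format argument.

    fmt is the constant format string, or None when the format is not a
    constant (for example, when it comes from a column).
    """
    if fmt is None:
        # Non-constant format: the type cannot be narrowed at planning time.
        return "DATETIME"
    i = 0
    while i < len(fmt) - 1:
        if fmt[i] == "%":
            if fmt[i + 1] in TIME_SPECIFIERS:
                return "DATETIME"
            i += 2  # skip the specifier (also handles the literal "%%")
        else:
            i += 1
    return "DATE"
```

Called with the three cases from the example, it gives DATETIME for the column-valued format, DATE for "%Y-%m-%d", and DATETIME for "%Y-%m-%d %H:%i:%s".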
#4619
Add time_round functions that provide `time_floor` and `time_ceil` for each time unit.
Fix two related bugs:
- #4618
- Change `struct TimeInterval` to use `int64_t` instead of `int32_t`, so that a difference in seconds does not overflow
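As a rough illustration of what flooring and ceiling to a time unit means, here is a Python sketch. It is not the BE implementation: the `period` and `origin` parameters, the origin of 0001-01-01, and the restriction to fixed-length units are assumptions made for the example. It also shows why a 32-bit second difference is too small.

```python
from datetime import datetime, timedelta

# Only fixed-length units are covered in this sketch.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}
ORIGIN = datetime(1, 1, 1)  # assumed alignment origin


def whole_seconds(ts, origin=ORIGIN):
    """Whole seconds between origin and ts, as an unbounded Python int."""
    delta = ts - origin
    return delta.days * 86400 + delta.seconds


def time_floor(ts, unit, period=1, origin=ORIGIN):
    """Largest boundary of `period` units that is <= ts."""
    step = UNIT_SECONDS[unit] * period
    diff = whole_seconds(ts, origin)
    return origin + timedelta(seconds=diff - diff % step)


def time_ceil(ts, unit, period=1, origin=ORIGIN):
    """Smallest boundary of `period` units that is >= ts."""
    floored = time_floor(ts, unit, period, origin)
    if floored == ts:
        return ts
    return floored + timedelta(seconds=UNIT_SECONDS[unit] * period)
```

For a timestamp in 2020, `whole_seconds` is on the order of 6e10, far beyond the 32-bit signed maximum of 2147483647, which is the kind of overflow the `int64_t` change guards against.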
Use a static local variable instead of creating it on every call.
The running time of the newly added unit benchmark test drops
from about 60 seconds to about 10 seconds.
The 'part' parameter of the parse_url function does not accept lower-case values, and the protocol is parsed incorrectly.
The function also does not support parsing 'port'.
This PR tries to make the parse_url function case-insensitive and adds support for parsing 'port'.
The issue: #4451
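The intended behavior can be illustrated with Python's standard `urllib.parse`. This is only a sketch of a case-insensitive `part` lookup that includes the port: the list of part names here is an assumption, and the real function's full set of parts and edge-case handling may differ.

```python
from urllib.parse import urlsplit


def parse_url(url, part):
    """Return the requested part of url, or None if absent or unknown.

    The part name is matched case-insensitively.
    """
    pieces = urlsplit(url)
    # Lambdas keep the lookup lazy, so only the requested part is evaluated.
    extractors = {
        "PROTOCOL": lambda: pieces.scheme,
        "HOST": lambda: pieces.hostname,
        "PORT": lambda: None if pieces.port is None else str(pieces.port),
        "PATH": lambda: pieces.path,
        "QUERY": lambda: pieces.query,
    }
    extractor = extractors.get(part.upper())
    if extractor is None:
        return None
    value = extractor()
    return value if value else None
```

For example, `parse_url("http://www.baidu.com:9090/a?b=1", "port")` and the same call with `"PORT"` both yield the string `9090`.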
replace is a user-defined function that replaces all occurrences of an old substring with a new substring in a string, as follows:
mysql> select replace("http://www.baidu.com:9090", "9090", "");
+------------------------------------------------------+
| replace('http://www.baidu.com:9090', '9090', '')     |
+------------------------------------------------------+
| http://www.baidu.com:                                |
+------------------------------------------------------+
(1) Add LargeInt cast to date and datetime, see #3864
LargeInt can now be cast to date and datetime. This fixes the error:
Unable to find _ZN5doris13CastFunctions16cast_to_date_valEPN9doris_udf15FunctionContextERKNS1_11LargeIntValE
(2) Add local timezone info to stale_version_path_json_doc rest api
Add timezone to "last create time" field.
{
"path id": "1",
"last create time": "1970-01-01 10:46:40 +0800",
"path list": "1 -> [2-3] -> [4-5]"
},
The timezone is also added to the unit test, see #4121.
Fix a BE crash caused by casting decimal to date. The crash was triggered by the error: Unable to find _ZN5doris18DecimalV2Operators16cast_to_date_val.
Also see #4281.
We make all MemTrackers shared in order to show real-time MemTracker consumption on the web page.
The changes are as follows:
1. Nearly all MemTracker raw pointers become shared_ptr.
2. Use CreateTracker() to create a new MemTracker, so that it adds itself to its parent.
3. RowBatch and MemPool still use raw pointers to MemTracker, because it is easy to ensure that their destructors run
before the MemTracker's destructor. So this code is unchanged.
4. MemTracker can use a RuntimeProfile counter to calculate consumption, so that counter needs to be shared
too. We add a shared counter pool to store the shared counters and leave the other RuntimeProfile counters unchanged.
Note that this PR does not change the MemTracker tree structure, so there are still some orphan trackers, e.g. RowBlockV2's MemTracker. If you find that some shared MemTrackers track little memory but cost too much time, you can make them orphans, and then it is fine to use raw pointers.
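The tracker tree idea can be sketched in Python as below. This is a conceptual illustration, not the C++ MemTracker: the method names other than the factory are hypothetical, and the choice of strong child-to-parent references with weak parent-to-child references is an assumption made to mirror shared ownership without reference cycles.

```python
import weakref


class MemTracker:
    """Node in a tracker tree; consumption propagates to all ancestors."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent    # strong reference keeps ancestors alive
        self.consumption = 0
        self._children = []     # weak references, so a parent never pins a child

    @classmethod
    def create_tracker(cls, label, parent=None):
        # Factory that registers the new tracker with its parent, which is
        # what lets a status page walk the tree from the root.
        tracker = cls(label, parent)
        if parent is not None:
            parent._children.append(weakref.ref(tracker))
        return tracker

    def consume(self, nbytes):
        node = self
        while node is not None:
            node.consumption += nbytes
            node = node.parent

    def release(self, nbytes):
        self.consume(-nbytes)

    def live_children(self):
        # Skip children that have already been destroyed.
        return [c for c in (ref() for ref in self._children) if c is not None]
```

An orphan tracker in this sketch is simply one created with no parent: it still counts its own consumption but is not reachable from the root.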
from/to_base64 may return an incorrect value when the value is null #4130
Remove the duplicated base64 code.
Fix the wrong length of the base64-encoded string, which could cause a memory error.
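For reference, the length of a padded base64 encoding follows directly from the input length: every 3 input bytes, or final partial group, become 4 output characters. The Python sketch below checks that formula against the standard library; the helper name is hypothetical, and it does not reproduce the buffer-sizing code changed in this PR.

```python
import base64


def base64_encoded_length(input_length):
    """Length of the padded base64 encoding of input_length bytes."""
    # Round up to a whole number of 3-byte groups, 4 characters per group.
    return 4 * ((input_length + 2) // 3)


def check_lengths(max_length=64):
    """Compare the formula with the standard-library encoder."""
    for n in range(max_length + 1):
        if len(base64.b64encode(b"x" * n)) != base64_encoded_length(n):
            return False
    return True
```

Sizing an output buffer with any smaller value lets the encoder write past the end of the allocation, which matches the memory error described above.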
This CL mainly changes:
1. Reorganized the code logic to limit the supported JSON formats to two, so that the import behavior is more consistent.
2. Modified how error rows are counted when loading in JSON format, so that they are counted correctly.
3. See `load-json-format.md` for details of loading JSON format.
Fix: #3946
CL:
1. Add a prepare phase for the `from_unixtime()`, `date_format()` and `convert_tz()` functions, to handle the format string once instead of once per row.
2. Look up the cctz timezone when initializing the `runtime state`, so that the timezone does not need to be looked up for each row.
3. Add a constant rewrite rule for `utc_timestamp()`
4. Add doc for `to_date()`
5. Comment out the `push_handler_test`; it cannot run in DEBUG mode and will be fixed later.
6. Remove `timezone_db.h/cpp` and add `timezone_utils.h/cpp`
The performance is shown below:
11,000,000 rows
SQL1: `select count(from_unixtime(k1)) from tbl1;`
Before: 8.85s
After: 2.85s
SQL2: `select count(from_unixtime(k1, '%Y-%m-%d %H:%i:%s')) from tbl1 limit 1;`
Before: 10.73s
After: 4.85s
Date string formatting still seems slow; it may need further optimization.
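The prepare-phase idea from the list above can be sketched as follows: do the format-string work once, then reuse the result for every row. This Python illustration is not the BE code. The class and method names are hypothetical, only a handful of specifiers are supported, and UTC is assumed as the session timezone.

```python
from datetime import datetime, timezone

# Assumed subset of MySQL-style specifiers and their strftime equivalents.
_MYSQL_TO_PY = {"Y": "%Y", "m": "%m", "d": "%d", "H": "%H", "i": "%M", "s": "%S"}


class FromUnixtime:
    def __init__(self):
        self.prepare_calls = 0
        self._py_format = None

    def prepare(self, mysql_format):
        """Translate the format string once, before any row is processed."""
        self.prepare_calls += 1
        out = []
        i = 0
        while i < len(mysql_format):
            ch = mysql_format[i]
            if ch == "%":
                if i + 1 >= len(mysql_format):
                    raise ValueError("dangling % in format")
                spec = mysql_format[i + 1]
                if spec not in _MYSQL_TO_PY:
                    raise ValueError("unsupported specifier %" + spec)
                out.append(_MYSQL_TO_PY[spec])
                i += 2
            else:
                out.append(ch)
                i += 1
        self._py_format = "".join(out)

    def evaluate(self, unix_seconds):
        """Per-row work: only the conversion itself, no format parsing."""
        moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
        return moment.strftime(self._py_format)
```

A caller would invoke `prepare("%Y-%m-%d %H:%i:%s")` once per query and `evaluate` once per row, so the parsing cost no longer scales with the row count.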
* Fix large string val allocation failure
A large bitmap needs StringVal to allocate a large block of memory, whose size can be larger than MAX_INT.
The resulting overflow causes bitmap serialization to fail.
Fixed #3600
* Support bitmap_intersect
Support the aggregate function bitmap_intersect, which is mainly used to take the intersection of grouped data.
The function 'bitmap_intersect(expr)' calculates the intersection of bitmap columns and returns a bitmap object.
The definition is as follows:
FunctionName: bitmap_intersect,
InputType: bitmap,
OutputType: bitmap
The scenario is as follows:
Query the users who have all three tags a, b, and c at the same time.
```
select bitmap_to_string(bitmap_intersect(user_id)) from
(
select bitmap_union(user_id) user_id from bitmap_intersect_test
where tag in ('a', 'b', 'c')
group by tag
) a
```
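The semantics of the query above can be mimicked with Python sets standing in for bitmaps. This is a sketch of the aggregation logic only: the helper names are hypothetical, and skipping NULL inputs is an assumption about how nulls are handled, not something stated in this description.

```python
from collections import defaultdict


def bitmap_union_by_tag(rows, tags):
    """Inner query: one set of user ids per tag (GROUP BY tag)."""
    groups = defaultdict(set)
    for tag, user_id in rows:
        if tag in tags:
            groups[tag].add(user_id)
    return groups


def bitmap_intersect(bitmaps):
    """Outer aggregate: intersection of all non-NULL input bitmaps."""
    result = None
    for bitmap in bitmaps:
        if bitmap is None:
            continue  # assumption: NULL inputs are ignored
        result = set(bitmap) if result is None else result & bitmap
    return result


def users_with_all_tags(rows, tags):
    groups = bitmap_union_by_tag(rows, tags)
    return bitmap_intersect(groups.values())
```

As in the SQL, a tag with no matching rows produces no group, so the intersection is taken only over the tags that actually appear in the data.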
Closed #3552.
* Add docs of bitmap_union and bitmap_intersect
* Support null of bitmap_intersect