Commit Graph

28 Commits

Author SHA1 Message Date
650536d53e [Feature] Add Topn udaf (#4803)
For #4674 
This is a udaf for approximate topn using Space-Saving algorithm.  At present, we can only calculate
the frequent items and their frequencies in a certain column, based on which we can implement similar
topN functions supported by Kylin in the future. 

I have also added a test to calculate the accuracy of this algorithm. The following is a rough running result.
The total amount of data is 1 million lines and follows the Zipfian distribution, where Element Cardinality
represents the data cardinality, 20X, 50X.. The value representing space_expand_rate is 20,50, which is
used to set the counter number in the space-saving algorithm

```
zf exponent = 0.5
Element cardinality	        20X        50X          100X
               1000		100%	   100%         100%
               10000		100%	   100%		100%
	       100000		100%	   100%		100%
	       500000		 94%	    98%		 99%

zf exponent = 0.6,1
Element cardinality	        20X        50X          100X
		1000		100%	   100%         100%
		10000		100%	   100%		100%
		100000		100%	   100%		100%
		500000		100%	   100%		100%

```
2020-12-16 21:58:34 +08:00
6fedf5881b [CodeFormat] Clang-format cpp sources (#4965)
Clang-format all c++ source files.
2020-11-28 18:36:49 +08:00
09f97f8a05 [Refactor] Fixes some be typo part 2 (#4747) 2020-10-20 09:28:57 +08:00
b780df697a [refactor] Optimize threads usage mode in BE (#4440)
BE can not graceful exit because some threads are running in endless
loop. This patch do the following optimization:
- Use the well encapsulated Thread and ThreadPool instead of std::thread
  and std::vector<std::thread>
- Use CountDownLatch in thread's loop condition to avoid endless loop
- Introduce a new class Daemon for daemon works, like tcmalloc_gc,
  memory_maintenance and calculate_metrics
- Decouple statistics type TaskWorkerPool and StorageEngine notification
  by submit tasks to TaskWorkerPool's queue
- Reorder objects' stop and deconstruct in main(), i.e. stop network
  services at first, then internal services
- Use libevent in pthreads mode, by calling evthread_use_pthreads(),
  then EvHttpServer can exit gracefully in multi-threads
- Call brpc::Server's Stop() and ClearServices() explicitly
2020-09-06 20:19:14 +08:00
498b06fbe2 [Metrics] Support tablet level metrics (#4428)
Sometimes we want to detect the hotspot of a cluster, for example, hot scanned tablet, hot wrote tablet,
but we have no insight about tablets in the cluster.
This patch introduce tablet level metrics to help to achieve this object, now support 4 metrics on tablets: `query_scan_bytes `, `query_scan_rows `, `flush_bytes `, `flush_count `. 
However, one BE may holds hundreds of thousands of tablets, so I add a parameter for the metrics HTTP request,
and not return tablet level metrics by default.
2020-09-02 10:39:41 +08:00
e71152132c [metrics] Redesign metrics to 3 layers (#4115)
Redesign metrics to 3 layers:
    MetricRegistry - MetricEntity - Metrics
    MetricRegistry : the register center
    MetricEntity : the entity registered on MetricRegistry. Generally a MetricRegistry can be registered on several 
        MetricEntities, each of MetricEntity is an independent entity, such as server, disk_devices, data_directories, thrift 
        clients and servers, and so on. 
    Metric : metrics of an entity. Such as fragment_requests_total on server entity, disk_bytes_read on a disk_device entity, 
        thrift_opened_clients on a thrift_client entity.
    MetricPrototype: the type of a metric. MetricPrototype is a global variable, can be shared by the same metrics across 
        different MetricEntities.
2020-08-08 11:23:01 +08:00
af1beb6ce4 [Enhance] Add prepare phase for some timestamp functions (#3947)
Fix: #3946 

CL:
1. Add prepare phase for `from_unixtime()`, `date_format()` and `convert_tz()` functions, to handle the format string once for all.
2. Find the cctz timezone when init `runtime state`, so that don't need to find timezone for each rows.
3. Add constant rewrite rule for `utc_timestamp()`
4. Add doc for `to_date()`
5. Comment out the `push_handler_test`, it can not run in DEBUG mode, will be fixed later.
6. Remove `timezone_db.h/cpp` and add `timezone_utils.h/cpp`

The performance shows bellow:

11,000,000 rows

SQL1: `select count(from_unixtime(k1)) from tbl1;`
Before: 8.85s
After: 2.85s

SQL2: `select count(from_unixtime(k1, '%Y-%m-%d %H:%i:%s')) from tbl1 limit 1;`
Before: 10.73s
After: 4.85s

The date string format seems still slow, we may need a further enhancement about it.
2020-06-29 19:15:09 +08:00
b58b1b3953 [metrics] Make DorisMetrics to be a real singleton (#3417) 2020-05-04 09:20:53 +08:00
fc55423032 [SQL] Support Grouping Sets, Rollup and Cube to extend group by statement
Support Grouping Sets, Rollup and Cube to extend group by statement
support GROUPING SETS syntax 
```
SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a), (b), ( ) );
```
cube  or rollup like 
```
SELECT a, b,c, SUM( d ) FROM tab1 GROUP BY ROLLUP|CUBE(a,b,c)
```

[ADD] support grouping functions in expr like grouping(a) + grouping(b) (#2039)
[FIX] fix analyzer error in window function(#2039)
2020-01-17 16:24:02 +08:00
a99a49a444 Add bitamp_to_string function (#2731)
This CL changes:

1. add function bitmap_to_string and bitmap_from_string, which will
 convert a bitmap to/from string which contains all bit in bitmap
2. add function murmur_hash3_32, which will compute murmur hash for
input strings
3. make the function cast float to string the same with user result
logic
2020-01-13 12:31:37 +08:00
11eafe524f Add ChunkAllocator to accelerate chunk allocation (#1792)
I add ChunkAllocator in this CL to put unused memory chunk to a chunk
pool other than return it to system allocator. Now we only change
MemPool's chunk allocation and free to this.

And two configuration are introduduced too. 'chunk_reserved_bytes_limit'
is the limit of how many bytes this chunk pool can reserve in total and
its default value is 2147483648(2GB). 'use_mmap_allocate_chunk': if
chunk is allocated via mmap and default value is false.

And in my test case with default configuration a simple like
"select * from table limit 10", this can improve throughput from 280 QPS
to to 650 QPS. And when I config 'chunk_reserved_bytes_limit' to 0,
which means this is disabled, the throughput is the same with origin's.
2019-09-13 08:27:24 +08:00
cd5cfea5cc Encapsulate HLL logic (#1756) 2019-09-09 15:52:10 +08:00
1e4dd77d2a Add bitmap agg type and udaf (#1610) 2019-08-26 14:24:42 +08:00
69af50aa8c Time zone related BE function (#1598)
Details can be found in time-zone.md document
2019-08-12 20:57:59 +08:00
4aedaea84e Support TIME type and timediff function (#1505) 2019-07-23 13:42:39 +08:00
7cdaba66dc Add spatial func (#1213)
Support some spatial functions, such as ST_Contains.
2019-05-31 14:23:09 +08:00
afa3aa9069 Add some pre-calculated metrics (#1079)
1. max io util of disks
2. max network send/receive bytes rate of all network devices
3. base/cumulative compaction request counter and failure counter
2019-04-30 11:12:23 +08:00
d3251a19f7 Modify the method to obtain some metrics (#904) 2019-04-10 19:37:48 +08:00
c34b306b4f Decimal optimize branch #695 (#727) 2019-03-22 17:22:16 +08:00
90d71508ff Add UserFunctionCache to cache UDF's library (#453)
* Add UserFunctionCache to cache UDF's library

This patch replace LibCache with UserFunctionCache. LibCache use HDFS
URL to identify a UDF's Library, and when BE process restart all of
downloaded library should be loaded another time. We use function id
corresponding to a library, and when process restart, all downloaded
libraries can be loaded without another downloading.

* update
2018-12-21 22:07:21 +08:00
a2b299e3b9 Reduce UT binary size (#314)
* Reduce UT binary size

Almost every module depend on ExecEnv, and ExecEnv contains all
singleton, which make UT binary contains all object files.

This patch seperate ExecEnv's initial and destory to anthor file to
avoid other file's dependence. And status.cc include debug_util.h which
depend tuple.h tuple_row.h, and I move get_stack_trace() to
stack_util.cpp to reduce status.cc's dependence.

I add USE_RTTI=1 to build rocksdb to avoid linking librocksdb.a

Issue: #292

* Update
2018-11-15 16:17:23 +08:00
37b4cafe87 Change variable and namespace name in BE (#268)
Change 'palo' to 'doris'
2018-11-02 10:22:32 +08:00
2868793b6b Change license to Apache License 2.0 (#262) 2018-11-01 09:06:01 +08:00
051aced48d Missing many files in last commit
In last commit, a lot of files has been missed
2018-10-31 16:19:21 +08:00
3c9f2ae669 remove aes encrypt and decrypt which used GPL 2018-06-12 17:26:21 +08:00
ac770c33d4 remove useless aes 2018-06-09 15:03:51 +08:00
2419384e8a push 3.3.19 to github (#193)
* push 3.3.19 to github

* merge to 20ed420122a8283200aa37b0a6179b6a571d2837
2018-05-15 20:38:22 +08:00
e2311f656e baidu palo 2017-08-11 17:51:21 +08:00