For #4674

This is a UDAF for approximate top-N using the Space-Saving algorithm. At present it can only calculate the frequent items in a given column and their frequencies; in the future we can build on it to implement topN functions similar to those supported by Kylin.

I have also added a test that measures the accuracy of this algorithm. A rough run is shown below. The test data is 1 million rows following a Zipfian distribution. Element cardinality is the number of distinct values, and the columns 20X, 50X, and 100X correspond to space_expand_rate values of 20, 50, and 100, which set the number of counters used by the Space-Saving algorithm.

```
zf exponent = 0.5

Element cardinality    20X     50X     100X
1000                   100%    100%    100%
10000                  100%    100%    100%
100000                 100%    100%    100%
500000                 94%     98%     99%

zf exponent = 0.6,1

Element cardinality    20X     50X     100X
1000                   100%    100%    100%
10000                  100%    100%    100%
100000                 100%    100%    100%
500000                 100%    100%    100%
```
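For readers who want to reproduce numbers of this kind, here is a minimal Python sketch of one way such an accuracy figure could be computed. It is not the test added in this PR. The helper names (`zipf_sample`, `exact_topn`, `topn_accuracy`) are hypothetical, and it assumes that accuracy means the fraction of the true top-N items that the implementation under test returns.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def zipf_sample(cardinality: int, exponent: float, size: int, seed: int = 0) -> List[int]:
    """Draw `size` items from 0..cardinality-1 with P(item k) proportional to 1 / (k + 1) ** exponent."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    weights = [1.0 / (rank ** exponent) for rank in range(1, cardinality + 1)]
    return rng.choices(range(cardinality), weights=weights, k=size)


def exact_topn(data: Sequence[int], n: int) -> List[int]:
    """Ground truth: the n most frequent items, found by counting everything."""
    return [item for item, _ in Counter(data).most_common(n)]


def topn_accuracy(topn_fn: Callable[[Sequence[int], int], List[int]],
                  data: Sequence[int], n: int) -> float:
    """Fraction of the true top-n items that `topn_fn` also reports.

    Assumption: ties at the n-th position are resolved by Counter's ordering,
    so a correct implementation that breaks ties differently can score below 1.
    """
    exact = set(exact_topn(data, n))
    approx = set(topn_fn(data, n))
    return len(exact & approx) / len(exact)


if __name__ == "__main__":
    data = zipf_sample(cardinality=1000, exponent=1.0, size=20000, seed=1)
    # Plug any top-N implementation in place of exact_topn to score it.
    print(topn_accuracy(exact_topn, data, 10))
```

Any approximate implementation that takes `(data, n)` and returns a list of items can be passed as `topn_fn`.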
---
{
    "title": "TOPN",
    "language": "en"
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

# TOPN
## description
### Syntax

`topn(expr, INT top_num[, INT space_expand_rate])`

The topn function uses the Space-Saving algorithm to find the top_num most frequent items in expr. The result is the frequent items and their occurrence counts, which are approximations.

The space_expand_rate parameter is optional and sets the number of counters used by the Space-Saving algorithm:
```
counter numbers = top_num * space_expand_rate
```
The higher the value of space_expand_rate, the more accurate the result. The default value is 50, so `topn(keyword, 10)` uses 10 * 50 = 500 counters.

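To illustrate why more counters give a more accurate result, here is a minimal Python sketch of the textbook Space-Saving algorithm. It is not the implementation behind topn: the class name `SpaceSaving` and its methods are hypothetical, and only the counter formula `top_num * space_expand_rate` and the default of 50 are taken from the description above.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


class SpaceSaving:
    """Textbook Space-Saving sketch (hypothetical helper, not the topn source code)."""

    def __init__(self, top_num: int, space_expand_rate: int = 50) -> None:
        self.top_num = top_num
        # Counter budget, following: counter numbers = top_num * space_expand_rate
        self.capacity = top_num * space_expand_rate
        self.counters: Dict[Hashable, int] = {}

    def add(self, item: Hashable) -> None:
        if item in self.counters:
            # Already monitored: just count it.
            self.counters[item] += 1
        elif len(self.counters) < self.capacity:
            # A free counter is available: start monitoring the item.
            self.counters[item] = 1
        else:
            # No free counter: evict the item with the smallest count and let the
            # newcomer inherit that count plus one. This is where the error comes from.
            # The linear scan keeps the sketch short; it is not an efficient lookup.
            victim = min(self.counters, key=lambda k: self.counters[k])
            inherited = self.counters.pop(victim)
            self.counters[item] = inherited + 1

    def update(self, items: Iterable[Hashable]) -> None:
        for item in items:
            self.add(item)

    def top(self) -> List[Tuple[Hashable, int]]:
        """Return up to top_num (item, estimated count) pairs, largest count first."""
        ranked = sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[: self.top_num]


if __name__ == "__main__":
    sketch = SpaceSaving(top_num=2, space_expand_rate=2)  # 4 counters
    sketch.update(["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"])
    print(sketch.top())  # [('a', 5), ('b', 3)]
```

With only 2 counters (`top_num=1, space_expand_rate=2`), the stream `a, a, a, b, c` ends with the counters `a:3, c:2`: `c` takes over the counter of `b` and inherits its count plus one, so its count is overestimated. A larger counter budget makes such evictions rarer, which is why a higher space_expand_rate improves accuracy.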
## example
```
MySQL [test]> select topn(keyword, 10) from keyword_table where date >= '2020-06-01' and date <= '2020-06-19';
+----------------------------------------------------------------------+
| topn(`keyword`, 10)                                                  |
+----------------------------------------------------------------------+
| a:157, b:138, c:133, d:133, e:131, f:127, g:124, h:122, i:117, k:117 |
+----------------------------------------------------------------------+

MySQL [test]> select date, topn(keyword, 10, 100) from keyword_table where date >= '2020-06-17' and date <= '2020-06-19' group by date;
+------------+---------------------------------------------------+
| date       | topn(`keyword`, 10, 100)                          |
+------------+---------------------------------------------------+
| 2020-06-19 | a:11, b:8, c:8, d:7, e:7, f:7, g:7, h:7, i:7, j:7 |
| 2020-06-18 | a:10, b:8, c:7, f:7, g:7, i:7, k:7, l:7, m:6, d:6 |
| 2020-06-17 | a:9, b:8, c:8, j:8, d:7, e:7, f:7, h:7, i:7, k:7  |
+------------+---------------------------------------------------+
```
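The result comes back as a single string of `item:count` pairs, so a client usually has to split it before using the counts. A minimal Python sketch follows. It assumes the `item:count, item:count` layout shown in the example output; `parse_topn` is a hypothetical helper, not part of any client library, and it will misparse items that themselves contain `, `.

```python
from typing import List, Tuple


def parse_topn(result: str) -> List[Tuple[str, int]]:
    """Split a topn result string such as 'a:157, b:138' into (item, count) pairs."""
    if not result.strip():
        return []
    pairs = []
    for entry in result.split(", "):
        # Split on the last ':' so that items containing ':' keep their full text.
        item, _, count = entry.rpartition(":")
        pairs.append((item, int(count)))
    return pairs


if __name__ == "__main__":
    print(parse_topn("a:157, b:138, c:133"))  # [('a', 157), ('b', 138), ('c', 133)]
```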
## keyword
TOPN