Commit Graph

42 Commits

Author SHA1 Message Date
89dc461f91 Fix UT and remove unused code (#2160) 2019-11-08 08:47:48 +08:00
188d97c215 Add null bit verification for row_batch transformation (#2139) 2019-11-07 14:05:23 +08:00
f53f188c5d Add arrow IPC serialization for Doris-Spark-Connector (#2013) 2019-10-31 10:32:06 +08:00
41e55cfca9 Modify fixed partition feature (#1989)
1. Not support MAVALUE in multi partition column.
2. Fix the incorrect show create table stmt.
2019-10-16 16:03:46 +08:00
62acf5d098 Limit the memory usage of Loading process (#1954) 2019-10-15 09:26:20 +08:00
262c7f4834 Make All BE UT pass in debug mode (#1913)
Fix OrdinalPageIndexTest
Fix ColumnReaderWriterTest
Fix binary_dict_page_test
Fix routine_load_task_executor_test
2019-09-29 19:37:51 +08:00
d3a445ee09 Fix memory_scratch_sink_test in debug mode (#1906) 2019-09-28 10:33:24 +08:00
c643cbd30c Optimize the load performance for large file (#1798)
The current load process is:

Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk

In the path of Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed:

Insert tuple into different memtables according to tablet ID
When the memtable size reaches the threshold, it is written to disk.
The above operations are equivalent to single thread execution for a single load task.
In fact, the insertion of memtable and the flush of memtable can be executed synchronously.
Perform these operation in single thread prevents the insertion of memtable from being delayed due to slow disk writing.

In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads.
By default, each data directory uses two worker threads for flush, which can be modified by the parameter flush_thread_num_per_store of BE.
DeltaWriter will push the full memtable to MemTableFlushExecutor for flush operation and generate a new memtable for receiving new data.

This design can improve the performance of load large files.
In single host testing, the time to load a 1GB text file is reduced from 48 seconds to 29 seconds.
2019-09-25 13:49:32 +08:00
11eafe524f Add ChunkAllocator to accelerate chunk allocation (#1792)
I add ChunkAllocator in this CL to put unused memory chunk to a chunk
pool other than return it to system allocator. Now we only change
MemPool's chunk allocation and free to this.

And two configuration are introduduced too. 'chunk_reserved_bytes_limit'
is the limit of how many bytes this chunk pool can reserve in total and
its default value is 2147483648(2GB). 'use_mmap_allocate_chunk': if
chunk is allocated via mmap and default value is false.

And in my test case with default configuration a simple like
"select * from table limit 10", this can improve throughput from 280 QPS
to to 650 QPS. And when I config 'chunk_reserved_bytes_limit' to 0,
which means this is disabled, the throughput is the same with origin's.
2019-09-13 08:27:24 +08:00
b4f6f755f1 Add exchange in MemPool to reduce alloc/free operation (#1732)
Reuse allocated chunks when storage read operation.
2019-09-02 19:29:30 +08:00
76987275b9 Fix result of unix_timestamp() (#1727) 2019-08-30 21:39:16 +08:00
58801c6ab0 Support converting RowBatch and RowBlockV2 to/from Arrow (#1699) 2019-08-27 11:30:00 +08:00
032d0b41bb Fix compile error (#1630) 2019-08-13 10:00:18 +08:00
69af50aa8c Time zone related BE function (#1598)
Details can be found in time-zone.md document
2019-08-12 20:57:59 +08:00
e3348c46a9 Expose data pruned-filter-scan ability (#1527) 2019-08-11 12:59:24 +08:00
0694b6a6fa Fix bugs of Broker load (#1546)
Use same UUID as query ID and load ID of a load execution plan.
Each load execution plan has a load ID, and as a plan, there is also a query ID.
We can use same UUID as query ID and load ID, for tracing the load process more easily.

Change the load ID when retrying a load execution plan.
When a load execution plan retry, the load ID should be changed, otherwise BE can not
distinguish the old and new load requests.

Cancel the running loading task when cancelling the broker load.
When user cancel a broker load, the running loading task should also be cancelled, or
it may occupies the worker thread for a long time.

Remove the unnecessary query report when doing load execution plan.
Only the last query report is needed.

Add a new BE config tablet_writer_rpc_timeout_sec.
It is used for RPC of tablet sink. The default is 600 seconds. which is long enough for flushing
about 6GB data. The long timeout config will reduce the possibility of encountering fail to send batch error when loading.

Use streaming_load_max_mb instead of mini_load_max_mb in BE config.

Add more logs for tracing a broker load process easily.
2019-07-27 20:17:05 +08:00
0d48a3961c Refactor Storage Engine (#1478)
NOTE: This patch would modify all Backend's data.
And this will cause a very long time to restart be.
So if you want to interferer your product environment,
you should upgrade backend one by one.

1. Refactoring be is to clarify the structure the codes.
2. Use unique id to indicate a rowset.
   Nameing rowset with tablet_id and version will lead to
   many conflicts among compaction, clone, restore.
3. Extract an rowset interface to encapsulate rowsets
   with different format.
2019-07-15 21:18:22 +08:00
9d03ba236b Uniform Status (#1317) 2019-06-14 23:38:31 +08:00
ff0dd0d2da Support SSL authentication with Kafka in routine load job (#1235) 2019-06-07 16:29:01 +08:00
6231fe0abc Fix FragmentMgrTest crash sometimes (#1232) 2019-06-01 18:10:24 +08:00
722a9e71c7 Optimize json functions (#1177)
1. get_json_xxx() now support using quoto to escape dot
2. Implement json_path_prepare() function to preprocess json_path

Performance of get_json_string() on 1000000 rows reduces from 2.27s to 0.27s
2019-05-21 09:13:12 +08:00
e8b360d193 Merge master and fix BE ut 2019-04-28 10:33:50 +08:00
192c8c5820 Fix bug that data consumer should be removed from pool when being got (#723) 2019-04-28 10:33:50 +08:00
567d5de2de Add a data consumer pool to reuse the data consumer (#691) 2019-04-28 10:33:50 +08:00
9618d20a72 Add unit test (#675) 2019-04-28 10:33:50 +08:00
0820a29b8d Implement the routine load process of Kafka on Backend (#671) 2019-04-28 10:33:50 +08:00
1cc2926a40 Fix wrong result in decimal type multiplication when carry is required (#980) 2019-04-21 17:12:58 +08:00
c34b306b4f Decimal optimize branch #695 (#727) 2019-03-22 17:22:16 +08:00
b3b86731cb Remove useless check, not need lsb_release any more (#526) 2019-01-11 14:59:20 +08:00
742dc796b2 Fix inconsistency of three replicas belongs to one tablet (#523)
There are A, B, C replicas of one tablet.
A has 0 - 10 version.
B has 0 - 5, 6, 7, 9, 10 version.
1. B has missed versions, so it clones 0 - 10 from A, and remove overlapped versions in its header.
2. Coincidentally, 6 is a version for delete predicate (delete where day = 20181221).
   When removing overlapped versions, version 6 is removed but delete predicate is not be removed.
3. Unfortunately, 0-10 cloned from A has data indicated at 20181221.
4. B performs compaction, and data generated by 20181221 is be removed falsely.
2019-01-10 18:51:12 +08:00
5b1e3d3f40 Optimize backup & restore process (#460)
1. Print broker address for debug.
2. Do not letting backup job cancelled if it already in state UPLOAD_INFO.
3. Cancel task on Backends when job is cancelled.
4. Show detail progress of backup and restore job.
5. Make 'show snapshot' result more readable.
6. Change upload and download thread num of backup and restore in Backend to 1.
2018-12-24 16:49:16 +08:00
90d71508ff Add UserFunctionCache to cache UDF's library (#453)
* Add UserFunctionCache to cache UDF's library

This patch replace LibCache with UserFunctionCache. LibCache use HDFS
URL to identify a UDF's Library, and when BE process restart all of
downloaded library should be loaded another time. We use function id
corresponding to a library, and when process restart, all downloaded
libraries can be loaded without another downloading.

* update
2018-12-21 22:07:21 +08:00
37b4cafe87 Change variable and namespace name in BE (#268)
Change 'palo' to 'doris'
2018-11-02 10:22:32 +08:00
2868793b6b Change license to Apache License 2.0 (#262) 2018-11-01 09:06:01 +08:00
051aced48d Missing many files in last commit
In last commit, a lot of files has been missed
2018-10-31 16:19:21 +08:00
5d3fc80067 Added:
* Add streaming load feature. You can execute 'help stream load;' to see more information.

Changed:
* Loading phase of a certain table can be parallelized, to reduce the load job execution time when multi load jobs to a single table.
* Using RocksDB to save the header info of tablets in Backends, to reduce the IO operations and increate speeding of restarting.

Fixed:
* A lot of bugs fixed.
2018-10-31 14:46:22 +08:00
ae9ce81453 Changed: change build.sh to use environment variable to get thirdparty's
path, and change PALO_HOME to DORIS_HOME
2018-10-30 16:29:06 +08:00
cc74efb3c5 merge to ddb65b69f9c788e359e191889cb31f15279c41ec (#224)
1. Apache HDFS broker support HDFS HA and Hadoop kerberos authentication.
2. New Backup and Restore function. Use Fs Broker to backup your data to HDFS or restore them from HDFS.
3. Table-Level Privileges. Grant fine-grained privileges on table-level to specified user.
4. A lot of bugs fixed.
5. Performance improvement.
2018-08-24 17:12:26 +08:00
2419384e8a push 3.3.19 to github (#193)
* push 3.3.19 to github

* merge to 20ed420122a8283200aa37b0a6179b6a571d2837
2018-05-15 20:38:22 +08:00
1e951e5c1c fix ut compile. set timeout to pull load task. fix export sink bug (#175) 2018-01-09 10:42:57 +08:00
6486be64c3 fix license statement (#29)
* change picture to word

* change picture to word

* SHOW FULL TABLES WHERE Table_type != VIEW sql can not execute

* change license description
2017-08-18 19:16:23 +08:00
e2311f656e baidu palo 2017-08-11 17:51:21 +08:00