commit e2311f656e
Author: cyongli
Date: 2017-08-11 17:51:21 +08:00

    baidu palo

1988 changed files with 586941 additions and 0 deletions

LICENSE.txt
@@ -0,0 +1,364 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
be/src/gutil (some portions): Apache 2.0, and 3-clause BSD
Some portions of this module are derived from code in the Chromium project,
copyright (c) Google inc and (c) The Chromium Authors and licensed under the
Apache 2.0 License or under the 3-clause BSD license:
Copyright (c) 2013 The Chromium Authors. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list
of conditions and the following disclaimer in the documentation and/or other
materials provided with the distribution.
* Neither the name of Google Inc. nor the names of its contributors may be used to
endorse or promote products derived from this software without specific prior
written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
be/src/gutil/utf: licensed under the following terms:
UTF-8 Library
The authors of this software are Rob Pike and Ken Thompson.
Copyright (c) 1998-2002 by Lucent Technologies.
Permission to use, copy, modify, and distribute this software for any purpose without
fee is hereby granted, provided that this entire notice is included in all copies of any
software which is or includes a copy or modification of this software and in all copies
of the supporting documentation for such software. THIS SOFTWARE IS BEING PROVIDED "AS
IS", WITHOUT ANY EXPRESS OR IMPLIED WARRANTY. IN PARTICULAR, NEITHER THE AUTHORS NOR
LUCENT TECHNOLOGIES MAKE ANY REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE
MERCHANTABILITY OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.
--------------------------------------------------------------------------------
be/src/gutil/valgrind.h: licensed under the following terms:
This file is part of Valgrind, a dynamic binary instrumentation
framework.
Copyright (C) 2000-2008 Julian Seward. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. The origin of this software must not be misrepresented; you must
not claim that you wrote the original software. If you use this
software in a product, an acknowledgment in the product
documentation would be appreciated but is not required.
3. Altered source versions must be plainly marked as such, and must
not be misrepresented as being the original software.
4. The name of the author may not be used to endorse or promote
products derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
Parts of be/src/runtime/string-search.h: Python Software License V2
Copyright (c) 2001 - 2016 Python Software Foundation; All Rights Reserved
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
--------------------------------------------
1. This LICENSE AGREEMENT is between the Python Software Foundation ("PSF"), and the
Individual or Organization ("Licensee") accessing and otherwise using this software
("Python") in source or binary form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, PSF hereby grants
Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test,
perform and/or display publicly, prepare derivative works, distribute, and otherwise use
Python alone or in any derivative version, provided, however, that PSF's License
Agreement and PSF's notice of copyright, i.e., "Copyright (c) 2001, 2002, 2003, 2004,
2005, 2006 Python Software Foundation; All Rights Reserved" are retained in Python alone
or in any derivative version prepared by Licensee.
3. In the event Licensee prepares a derivative work that is based on or incorporates
Python or any part thereof, and wants to make the derivative work available to others as
provided herein, then Licensee hereby agrees to include in any such work a brief summary
of the changes made to Python.
4. PSF is making Python available to Licensee on an "AS IS" basis. PSF MAKES NO
REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT
LIMITATION, PSF MAKES NO AND DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY
OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT INFRINGE ANY
THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON FOR ANY INCIDENTAL,
SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS A RESULT OF MODIFYING, DISTRIBUTING, OR
OTHERWISE USING PYTHON, OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY
THEREOF.
6. This License Agreement will automatically terminate upon a material breach of its
terms and conditions.
7. Nothing in this License Agreement shall be deemed to create any relationship of
agency, partnership, or joint venture between PSF and Licensee. This License Agreement
does not grant permission to use PSF trademarks or trade name in a trademark sense to
endorse or promote products or services of Licensee, or any third party.
8. By copying, installing or otherwise using Python, Licensee agrees to be bound by the
terms and conditions of this License Agreement.
--------------------------------------------------------------------------------
Parts of be/src/util/coding-util.cc: Boost Software License V1.0
Boost Software License - Version 1.0 - August 17th, 2003
Permission is hereby granted, free of charge, to any person or organization obtaining a
copy of the software and accompanying documentation covered by this license (the
"Software") to use, reproduce, display, distribute, execute, and transmit the Software,
and to prepare derivative works of the Software, and to permit third-parties to whom the
Software is furnished to do so, all subject to the following:
The copyright notices in the Software and this entire statement, including the above
license grant, this restriction and the following disclaimer, must be included in all
copies of the Software, in whole or in part, and all derivative works of the Software,
unless such copies or derivative works are solely in the form of machine-executable
object code generated by a source language processor.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE
DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
THE USE OR OTHER DEALINGS IN THE SOFTWARE.

NOTICE.txt
@@ -0,0 +1,11 @@
Apache Impala (incubating)
Copyright 2017 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
Portions of this software were developed at
Cloudera, Inc (http://www.cloudera.com/).
Palo
Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved

README.md
@@ -0,0 +1,177 @@
# Introduction to Palo
Palo is an MPP-based interactive SQL data warehouse for reporting and analysis. Palo mainly integrates the technology of Google Mesa and Cloudera Impala. Unlike other popular SQL-on-Hadoop systems, Palo is designed to be a simple, single, tightly coupled system that does not depend on other systems. Palo provides not only highly concurrent, low-latency point queries, but also high-throughput ad-hoc analytical queries. It provides both batch data loading and near real-time mini-batch data loading, along with high availability, reliability, fault tolerance, and scalability. Simplicity (of developing, deploying, and using) and the ability to meet many data-serving requirements in a single system are Palo's main features.
## 1. Background
In Baidu, the largest Chinese search engine, we run a two-tiered data warehousing system for data processing, reporting, and analysis. Similar to the Lambda architecture, the whole data warehouse comprises data processing and data serving. Data processing does the heavy lifting of big data: cleaning data, merging and transforming it, analyzing it, and preparing it for use by end-user queries; data serving is designed to serve queries against that data for different use cases. Currently, data processing relies on batch and stream processing technologies such as Hadoop, Spark, and Storm; Palo is the SQL data warehouse that serves online, interactive data reporting and analysis queries.
Prior to Palo, different tools were deployed to solve diverse requirements in many ways. For example, the advertising platform needs to provide detailed statistics associated with each served ad for every advertiser. The platform must support continuous updates, both new rows and incremental updates to existing rows, within minutes. It must support latency-sensitive users serving live customer reports with very low latency requirements, as well as batch ad-hoc multi-dimensional analysis requiring very high throughput. In the past, this platform was built on top of sharded MySQL, but as data grew, MySQL could no longer meet the requirements. Then, based on our existing KV system, we developed our own proprietary distributed statistical database. However, the simple KV storage was not efficient for scans, and because the system depended on many other systems, it was very complex to operate and maintain. With an RPC API, more complex querying usually required programming, but users wanted an MPP SQL engine. Beyond the advertising system, a large number of internal BI reporting and analysis use cases also employed a variety of tools: some used combinations of SparkSQL/Impala with HDFS/HBase, some used MySQL to store results prepared by distributed MapReduce jobs, and some even bought commercial databases.
However, when a use case requires the simultaneous availability of capabilities that cannot all be provided by a single tool, users are forced to build hybrid architectures that stitch multiple tools together. Users often choose to ingest and update data in one storage system, but later reorganize this data to optimize for an analytical reporting use case served from another. Our users had been successfully deploying and maintaining these hybrid architectures, but we believe they shouldn't need to accept their inherent complexity. A storage system built to provide great performance across a broad range of workloads is a more elegant solution to the problems that hybrid architectures aim to solve. Palo is that solution. Palo is designed to be a simple, single, tightly coupled system that does not depend on other systems. Palo provides not only highly concurrent, low-latency point queries, but also high-throughput ad-hoc analytical queries; it provides both bulk-batch data loading and near real-time mini-batch data loading, along with high availability, reliability, fault tolerance, and scalability.
Generally speaking, Palo is the technology combination of Google Mesa and Cloudera Impala. Mesa is a highly scalable analytic data storage system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and system requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. By virtue of its superior performance and rich functionality, Impala is now comparable to many commercial MPP database query engines. Mesa satisfies many of our storage requirements, but it does not provide a SQL query engine; Impala is a very good MPP SQL query engine, but it lacks a mature distributed storage engine. So in the end we chose the combination of these two technologies.
Learning from Mesa's data model, we developed a distributed storage engine. Unlike Mesa, this storage engine does not rely on any distributed file system. We then deeply integrated this storage engine with the Impala query engine: query compilation, query execution coordination, and storage-engine catalog management are integrated into the frontend daemon; query execution and data storage are integrated into the backend daemon. With this integration we implemented a single, full-featured, high-performance, state-of-the-art MPP database, while maintaining simplicity.
## 2. System Overview
Palo's implementation consists of two daemons: frontend (FE) and backend (BE). The following figures give an overview of the architecture and usage.
![](./docs/resources/palo_architecture.jpg)
![](./docs/resources/palo_usage.jpg)
The frontend daemon consists of a query coordinator and a catalog manager. The query coordinator is responsible for receiving users' SQL queries, compiling them, and managing their execution. The catalog manager is responsible for managing metadata such as databases, tables, partitions, and replicas. Several frontend daemons can be deployed to guarantee fault tolerance and load balancing.
The backend daemon stores the data and executes the query fragments. Many backend daemons can be deployed to provide scalability and fault tolerance.
A typical Palo cluster generally consists of several frontend daemons and dozens to hundreds of backend daemons.
Clients can use MySQL-compatible tools to connect to any frontend daemon and submit SQL queries. The frontend receives the query and compiles it into query plans executable by the backends, then sends the query plan fragments to the backends. The backends build a query-execution DAG, and data is fetched and pipelined through the DAG; the final result is sent back to the client via the frontend. The distribution of query-fragment execution aims primarily at minimizing data movement and maximizing scan locality. Because Palo is designed for interactive analysis, the average execution time of queries is short; considering this, we adopt query re-execution to provide fault tolerance for query execution.
A table is split into many tablets, which are managed by the backends. A backend daemon can be configured to use multiple directories, and an IO failure in any single directory does not affect its normal operation. Palo recovers and rebalances the whole cluster automatically when necessary.
## 3. Frontend
An in-memory catalog, multiple frontends, the MySQL networking protocol, consistency guarantees, and two-level table partitioning are the main features of Palo's frontend design.
#### 3.1 In-Memory Catalog
Traditional data warehouses often use an RDBMS to store their catalog metadata. To produce a query execution plan, the frontend needs to look up the catalog metadata. This kind of catalog storage may be sufficient for low-concurrency ad-hoc analysis queries, but for online high-concurrency queries its performance is very poor, resulting in increased response latency. For example, Hive metadata query latency sometimes reaches tens of seconds or even minutes. To speed up metadata access, we adopt in-memory catalog storage.
![](./docs/resources/log_replication.jpg)
In-memory catalog storage has three functional modules: real-time in-memory data structures, memory checkpoints on local disk, and an operation replay log. When the catalog is modified, the mutation operation is first written to the log file and then applied to the in-memory data structures. Periodically, a thread takes a checkpoint that dumps an image of the in-memory data structures to local disk. The checkpoint mechanism enables fast frontend startup and reduces disk usage. The in-memory catalog also simplifies the implementation of multiple frontends.
#### 3.2 Multiple Frontends
Many data warehouses support only a single frontend-like node, and some systems support master/slave deployment. But for online data serving, high availability is an essential feature; further, the number of queries per second may be very large, so high scalability is also needed. In Palo, we provide multiple frontends using replicated-state-machine technology.
Frontends can be configured with one of three roles: leader, follower, and observer. Through a voting protocol, the follower frontends first elect a leader frontend. All metadata write requests are forwarded to the leader, which writes the operation into the replicated log file. Once the new log entry has been replicated to at least a quorum of followers, the leader commits the operation into memory and responds to the write request. Followers continuously replay the replicated logs to apply them to their in-memory metadata. If the leader crashes, a new leader is elected from the remaining followers. The leader/follower scheme mainly solves write availability and partly solves read scalability.
Usually one leader frontend and several follower frontends meet most applications' write-availability and read-scalability requirements. For very high-concurrency reading, however, continuing to increase the number of followers is not good practice: the leader replicates the log stream to followers synchronously, so adding more followers increases write latency. Like ZooKeeper, we introduced a new type of frontend node called the observer, which helps address this problem and further improves metadata read scalability. The leader replicates the log stream to observers asynchronously, and observers do not participate in leader election.
The replicated state machine is implemented on top of the Java edition of BerkeleyDB (BDB-JE), which achieves high availability by implementing a Paxos-like consensus algorithm. We use BDB-JE to implement Palo's log replication and leader election.
#### 3.3 Consistency Guarantee
If a client process connects to the leader, it sees up-to-date metadata, so strong consistency semantics are guaranteed. If the client connects to followers or observers, it sees metadata lagging slightly behind the leader, but monotonic consistency is guaranteed. In most of Palo's use cases, monotonic consistency is acceptable.
If the client always connects to the same frontend, monotonic consistency semantics are obviously guaranteed; if the client connects to a different frontend due to failover, however, the semantics may be violated. Palo therefore provides a SYNC command to guarantee monotonic metadata consistency across failover. When failover happens, the client can send a SYNC command to the newly connected frontend, which fetches the latest operation log number from the leader. The SYNC command does not return to the client as long as the locally applied log number is still less than the fetched operation log number. This mechanism guarantees that the metadata on the connected frontend is at least as new as what the client saw during its last connection.
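For illustration, a failover sequence might look like the following (a minimal sketch; `SYNC` is named in the text, while the follow-up query is hypothetical):

```sql
-- After reconnecting to a different frontend, force it to catch up
-- to the leader's latest operation log before reading metadata.
SYNC;
-- Metadata reads now reflect at least what the client saw before failover.
SHOW TABLES;
```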
#### 3.4 MySQL Networking Protocol
A MySQL-compatible networking protocol is implemented in Palo's frontend. First, a SQL interface is preferred by engineers; second, compatibility with the MySQL protocol makes integration with existing BI software, such as Tableau, easier; last, the rich ecosystem of MySQL client libraries and tools reduces both our development cost and users' adoption cost.
Through the SQL interface, administrators can adjust system configuration, add and remove frontend or backend nodes, and create new databases for users; users can create tables, load data, and submit SQL queries.
Online help documentation and a Linux /proc-like mechanism are also exposed through SQL. Users can submit queries to get help on related SQL statements or to show Palo's internal running state.
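A rough sketch of this administrative surface, assuming statement forms from the later Doris-style dialect (the host/port value is hypothetical):

```sql
-- Administrator: add a backend node to the cluster.
ALTER SYSTEM ADD BACKEND "host1:9050";
-- Linux /proc-like introspection of internal running state.
SHOW PROC '/backends';
-- Online help for a SQL statement.
HELP CREATE TABLE;
```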
In the frontend, a small response buffer is allocated to every MySQL connection, with a maximum size of 1MB. The buffer holds query response data; only when the response is finished, or the buffer reaches 1MB, does the response data begin to be sent to the client. Through this small trick, the frontend can re-execute most queries if an error occurs during query execution.
#### 3.5 Two-Level Partitioning
Like most distributed database systems, Palo partitions data horizontally. However, a single-level partitioning rule (hash partitioning or range partitioning) may not suit all scenarios. For example, consider a user-based fact table that stores rows of the form (date, userid, metric). Hash partitioning only by the userid column may lead to uneven data distribution when one user's data is very large; range partitioning only by the date column may also lead to uneven distribution due to likely data explosion in a certain period of time.
Therefore we support a two-level partitioning rule. The first level is range partitioning: users specify ranges of values over a column (usually a time-series column) to partition the data. Within each partition, users can further specify one or more columns and a number of buckets for hash partitioning. Users can combine different partitioning rules to better divide the data. The figure below gives an example of two-level partitioning, and a hedged DDL sketch follows it.
Three benefits are gained from the two-level partitioning mechanism. First, old and new data can be separated and stored on different storage media. Second, the backend storage engine can avoid the IO and CPU cost of unnecessary data merging, because data in older partitions is no longer updated. Last, each partition's bucket count can differ and be adjusted as the data size changes.
![](./docs/resources/two_level_partition.jpg)
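A hedged DDL sketch of such a two-level definition, assuming Doris-style syntax (the table name, partition bounds, and bucket count are hypothetical; the schema follows the (date, userid, metric) fact table above):

```sql
CREATE TABLE user_metrics (
    dt     DATE,
    userid BIGINT,
    metric BIGINT SUM          -- value column, summed when keys collide
)
AGGREGATE KEY (dt, userid)
-- First level: range partitioning on the time-series column.
PARTITION BY RANGE (dt) (
    PARTITION p201601 VALUES LESS THAN ("2016-02-01"),
    PARTITION p201602 VALUES LESS THAN ("2016-03-01")
)
-- Second level: hash partitioning into buckets within each partition.
DISTRIBUTED BY HASH (userid) BUCKETS 32;
```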
## 4. Backend
#### 4.1 Data Storage Model
Palo combines Google Mesa’s data model and ORCFile / Parquet storage technology.
Data in Mesa is an inherently multi-dimensional fact table. The facts in a table typically consist of two types of attributes: dimensional attributes (which we call keys) and measure attributes (which we call values). The table schema also specifies an aggregation function F: V × V → V that is used to aggregate the values corresponding to the same key. To achieve high update throughput, Mesa loads data in batches; each batch of data is converted into a delta file. Mesa uses an MVCC approach to manage these delta files and thereby enforce update atomicity. Mesa also supports materialized rollups, which contain a column subset of the schema to gain a better aggregation effect.
Mesa's data model performs well in many interactive data services, but it also has some drawbacks:
1. Users have difficulty understanding the key and value spaces, as well as the aggregation function, especially when they rarely have such aggregation demands in analytical query scenarios.
2. To preserve the aggregation semantics, a count operation on a single column must read all columns in the key space, resulting in a large amount of additional read overhead. Predicates on value columns also cannot be pushed down to the storage engine, which likewise leads to additional reads.
3. Essentially, it is still a key-value model. To aggregate the values corresponding to the same key, all key columns must be stored in sorted order; when a table contains hundreds of columns, sorting cost becomes the bottleneck of the ETL process.
To solve these problems, we introduce the ORCFile/Parquet technology widely used in the open-source community (for example, MapReduce + ORCFile and SparkSQL + Parquet), mainly for low-concurrency ad-hoc analysis over large amounts of data. These formats do not distinguish between keys and values. In addition, compared with row-oriented databases, a column-oriented organization is more efficient when an aggregate must be computed over many rows but only a small subset of all columns, because reading that smaller subset can be much faster than reading everything; columnar storage is also space-friendly thanks to the high compression ratio of each column. Furthermore, columnar formats support block-level indexing techniques such as min/max indexes and bloom filter indexes, so the query executor can filter out many blocks that do not satisfy the predicates, further improving query performance. However, because the underlying storage does not keep data sorted, query time grows linearly with data volume.
Like traditional databases, Palo stores structured data represented as tables. Each table has a well-defined schema consisting of a finite number of columns. We combine the Mesa data model with ORCFile/Parquet technology to build a distributed analytical database. Users can create two types of tables to meet different needs in interactive query scenarios.
In the non-aggregation type of table, columns are not divided into dimensions and metrics, but users should specify the sort columns used to order all rows. Palo sorts the table data by the sort columns without any aggregation. The following figure gives an example of creating a non-aggregation table.
![](./docs/resources/duplicate_key.jpg)
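In case the figure does not render, a minimal sketch in Doris-style syntax (all names here are invented for illustration):

```sql
-- Non-aggregation table: rows are stored as-is, sorted only by the
-- declared sort columns; equal keys are not merged.
CREATE TABLE visit_log (
    dt       DATE,
    userid   BIGINT,
    page     VARCHAR(128),
    duration INT
)
DUPLICATE KEY (dt, userid)
DISTRIBUTED BY HASH (userid) BUCKETS 8;
```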
For aggregation-style data analysis, we borrow Mesa's data model: columns are divided into keys and values, and each value column is declared with an aggregation method such as SUM or REPLACE. In the following figure, we create an aggregation table similar to the non-aggregation table, including two SUM aggregation columns (clicks, cost). Unlike the non-aggregation table, the data in this table must be sorted on all key columns for delta compaction and value aggregation.
![](./docs/resources/aggregate_key.jpg)
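A corresponding sketch in Doris-style syntax (the clicks and cost SUM columns come from the text; the other names are invented):

```sql
-- Aggregation table: rows with equal key columns are merged by applying
-- the declared aggregation method to each value column.
CREATE TABLE ad_stats (
    dt         DATE,
    advertiser VARCHAR(64),
    clicks     BIGINT SUM,
    cost       BIGINT SUM
)
AGGREGATE KEY (dt, advertiser)
DISTRIBUTED BY HASH (advertiser) BUCKETS 8;
```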
A rollup is a materialized view that contains a column subset of a table's schema in Palo. A table may contain multiple rollups with columns in different orders. Based on the sort-key index and the column coverage of the rollups, Palo selects the best rollup for each query. Because most rollups contain only a few columns, the size of the aggregated data is typically much smaller, and query performance can be greatly improved. All rollups in the same table are updated atomically. Because rollups are materialized, users should weigh query latency against storage space when using them.
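For example, a rollup keyed on advertiser alone could serve per-advertiser totals from far less data (a hypothetical sketch continuing the ad_stats table above, assuming Doris-style syntax):

```sql
-- Materialize a column subset with its own sort order; the planner
-- transparently routes covered queries to the smallest suitable rollup.
ALTER TABLE ad_stats ADD ROLLUP r_advertiser (advertiser, clicks, cost);
```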
To achieve high update throughput, Palo applies updates only in batches, at a minimum interval of one minute. Each update batch is assigned an increasing version number and generates a delta data file; the version is committed once updates to a quorum of replicas are complete. Queries read all committed data at the committed version, and uncommitted versions are never used in queries. All update versions are strictly increasing. If an update spans more than one table, the versions of those tables are committed atomically. This MVCC mechanism allows Palo to guarantee multi-table atomic updates and query consistency. In addition, Palo uses compaction policies to merge delta files, reducing both the number of deltas and the cost of delta merging during queries.
Palo's data files are stored column by column. Rows are kept in sorted order by the sort columns within each delta data file and are organized into row blocks; each block is compressed with type-specific columnar encodings, such as run-length encoding for integer columns, and then stored in separate streams. To improve the performance of queries that look up a specific key, we also store a sparse sort-key index file alongside each delta data file. An index entry contains the short key for a row block, which is a fixed-size prefix of the first sort columns of the block, plus the block id in the data file. Index files are usually loaded directly into memory, as they are very small. Querying a specific key takes two steps: first, a binary search on the sort-key index finds the blocks that may contain the key; then a binary search within the compressed blocks in the data files locates the desired key. We also store block-level min/max indexes in separate index streams, which queries can use to skip unwanted blocks. On top of these basic columnar features, we also offer an optional block-level bloom filter index for queries with IN or EQUAL conditions to filter unwanted blocks further; the bloom filter index is stored in a separate stream and loaded on demand.
#### 4.2 Data Loading
Palo applies updates in batches. Three types of data loading are supported: Hadoop-batch loading, broker loading, and mini-batch loading.
1. Hadoop-batch loading. When a large volume of data needs to be loaded into Palo, Hadoop-batch loading is recommended to achieve high loading throughput. The data batches themselves are produced by an external Hadoop system, typically at a frequency of every few minutes. Unlike traditional data warehouses that use their own computing resources for the heavy data preparation, Palo can use Hadoop to prepare the data (shuffle, sort, aggregate, etc.). With this approach the most time-consuming computation is handed over to Hadoop, which not only improves computational efficiency but also reduces the load on the Palo cluster and safeguards the stability of the query service; the stability of the online data service is the most important consideration.
2. Broker loading. After deploying the fs-brokers, Palo's own query engine can be used to import data. This type of loading is recommended for incremental data loading.
3. Mini-batch loading. When a small amount of data needs to be loaded into Palo, mini-batch loading is recommended to achieve low loading latency. Raw data is pushed to a backend through an HTTP interface, and the backend performs the data-preparation computation and completes the final loading. HTTP clients can connect to either a frontend or a backend; if a frontend is contacted, it redirects the request randomly to a backend.
All loading work is handled asynchronously. When a load request is submitted, a label must be provided. Using the load label, users can submit a show-load request to query the loading status or a cancel-load request to cancel the loading. If a loading task succeeded or is still in progress, its label cannot be reused; the label of a failed task can be reused.
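A hedged sketch of the label lifecycle, assuming later Doris-style broker-load statements (the path, broker name, and label are hypothetical):

```sql
-- Submit an asynchronous load under a user-chosen label.
LOAD LABEL example_db.label_20170811
(
    DATA INFILE ("hdfs://namenode:9000/path/to/file")
    INTO TABLE ad_stats
)
WITH BROKER 'hdfs_broker';
-- Poll the loading status by label ...
SHOW LOAD WHERE LABEL = "label_20170811";
-- ... or cancel the task while it is still in progress.
CANCEL LOAD WHERE LABEL = "label_20170811";
```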
#### 4.3 Resource Isolation
1. Multi-tenancy isolation: multiple virtual clusters can be created in one physical Palo cluster. Every backend node can host multiple backend processes, and every backend process belongs to exactly one virtual cluster. A virtual cluster is one tenant.
2. User isolation: there are many users in one virtual cluster. Resources can be allocated among different users so that all users' tasks execute within limited resource quotas, as sketched below.
3. Priority isolation: each user has three priority isolation groups and can control the resources allocated to the different tasks they submit; for example, a user's query tasks and loading tasks may require different resource quotas.
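A hedged sketch of per-user quotas, assuming the SET PROPERTY forms of the later Doris dialect (the user name, property names, and values are hypothetical):

```sql
-- Cap the CPU share allocated to one user's tasks.
SET PROPERTY FOR 'jack' 'resource.cpu_share' = '1000';
-- Weight the user's three priority groups relative to each other.
SET PROPERTY FOR 'jack' 'quota.high'   = '800';
SET PROPERTY FOR 'jack' 'quota.normal' = '400';
SET PROPERTY FOR 'jack' 'quota.low'    = '100';
```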
#### 4.4 Multi-Medium Storage
Most machines in a modern datacenter are equipped with both SSDs and HDDs. SSDs have good random-read capability and are the ideal medium for queries that need a large number of random reads; however, SSD capacity is small and the medium is very expensive, so we cannot deploy it at large scale. HDDs are cheap and have huge capacity, suitable for storing large-scale data, but with high read latency. In OLAP scenarios, we find that users usually submit many queries over the latest data (hot data) and expect low latency, while occasionally running queries over historical data (cold data) that scan large amounts of data and tolerate higher latency. Multi-medium storage lets users control the storage medium of their data to match these query scenarios and reduce latency. For example, a user can put the latest data on SSD and infrequently used historical data on HDD, getting low latency when querying the latest data while accepting higher latency on historical data, which is expected since those queries scan much more data.
In the following figure, the user sets the storage_medium of partition 'p201601' to SSD and its storage_cooldown_time to '2016-07-01 00:00:00'. This means the data in this partition is placed on SSD and begins migrating to HDD once the storage_cooldown_time passes.
![](./docs/resources/multi_medium_storage.jpg)
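The statement behind the figure might look like this, assuming Doris-style syntax (the table name is hypothetical; the partition and property values come from the text):

```sql
-- Keep hot partition p201601 on SSD until the cooldown time, after
-- which its data starts migrating to HDD automatically.
ALTER TABLE user_metrics MODIFY PARTITION p201601
SET ("storage_medium" = "SSD",
     "storage_cooldown_time" = "2016-07-01 00:00:00");
```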
#### 4.5 Vectorized Query Execution
Runtime code generation using LLVM is one of the techniques Impala employs extensively to improve query execution times. Performance gains of 5X or more are typical for representative workloads.
However, runtime code generation is not suitable for low-latency queries, because the generation overhead costs about 100ms; it is better suited to large-scale ad-hoc queries. To accelerate small queries (big queries also benefit, of course), we introduced vectorized query execution into Palo.
Vectorized query execution is a feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of many rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory.
Benchmark results show a 2x~4x speedup on our typical queries.
## 5. Backup and Recovery
A data backup function is provided to enhance data security. The minimum granularity of backup and recovery is the partition. Users can develop plug-ins to back up data to any specified remote storage, and backup data can be restored into Palo at any time to roll data back.
Currently we support only full backups rather than incremental backups, for the following reasons:
1. The remote storage system is beyond the control of Palo, so we cannot guarantee that the data has not changed between two backup operations, and data verification always comes at a high price.
2. We support backup at partition granularity, and the majority of applications are time-series applications. Partitioning data by a time column already meets the needs of most chronological incremental backups.
In addition to improving data security, the backup function also provides a way to export data, which can then be fed to other downstream systems for further processing.
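As an illustration only, partition-level backup in the later Doris dialect looks roughly like this (the repository and snapshot names are hypothetical; the repository itself would be created beforehand against the chosen remote storage plug-in):

```sql
-- Back up a single partition to a previously created remote repository.
BACKUP SNAPSHOT example_db.snapshot_20170811
TO example_repo
ON (ad_stats PARTITION (p201601));
-- The snapshot can later be restored to roll the partition back.
```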
# Install
Palo only supports Linux. Oracle JDK 8.0+ and GCC 4.8.2+ are required. See the [INSTALL](./docs/admin_guide/install.md) document.
# User Guide
See the [Tutorial](./docs/user_guide/tutorial.md) and the [SQL Reference](./docs/user_guide/sql_reference.md).
# Contact us
palo-rd@baidu.com
To join the Palo WeChat technical discussion group, add WeChat ID myh13161636186 with the note: "join the Palo technical discussion group".

be/CMakeLists.txt
@@ -0,0 +1,453 @@
# Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
cmake_minimum_required(VERSION 2.6.0)
project(palo)
# Set dirs
set(BASE_DIR "${CMAKE_CURRENT_SOURCE_DIR}")
set(ENV{PALO_HOME} "${BASE_DIR}/../")
set(BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}")
set(THIRDPARTY_DIR "${BASE_DIR}/../thirdparty/installed/")
set(GENSRC_DIR "${BASE_DIR}/../gensrc/build/")
set(SRC_DIR "${BASE_DIR}/src/")
set(TEST_DIR "${CMAKE_SOURCE_DIR}/test/")
set(OUTPUT_DIR "${BASE_DIR}/output")
option(MAKE_TEST "ON for make unit test or OFF for not" OFF)
# Check gcc
if (CMAKE_COMPILER_IS_GNUCC)
execute_process(COMMAND ${CMAKE_C_COMPILER} -dumpversion
OUTPUT_VARIABLE GCC_VERSION)
string(REGEX MATCHALL "[0-9]+" GCC_VERSION_COMPONENTS ${GCC_VERSION})
list(GET GCC_VERSION_COMPONENTS 0 GCC_MAJOR)
list(GET GCC_VERSION_COMPONENTS 1 GCC_MINOR)
message(STATUS "GCC version: ${GCC_VERSION}")
message(STATUS "GCC major version: ${GCC_MAJOR}")
message(STATUS "GCC minor version: ${GCC_MINOR}")
if(GCC_VERSION VERSION_LESS "4.8.2")
message(FATAL_ERROR "Need GCC version at least 4.8.2")
endif(GCC_VERSION VERSION_LESS "4.8.2")
else()
message(FATAL_ERROR "Compiler should be GNU")
endif(CMAKE_COMPILER_IS_GNUCC)
# Set compiler
set(CMAKE_CXX_COMPILER $ENV{CXX})
set(CMAKE_C_COMPILER $ENV{CC})
set(PIC_LIB_PATH "${THIRDPARTY_DIR}")
if(PIC_LIB_PATH)
message(STATUS "defined PIC_LIB_PATH")
set(CMAKE_SKIP_RPATH TRUE)
set(Boost_USE_STATIC_LIBS ON)
set(Boost_USE_STATIC_RUNTIME ON)
set(LIBBZ2 ${PIC_LIB_PATH}/lib/libbz2.a)
set(LIBZ ${PIC_LIB_PATH}/lib/libz.a)
set(LIBEVENT ${PIC_LIB_PATH}/lib/libevent.a)
else()
message(STATUS "undefined PIC_LIB_PATH")
set(Boost_USE_STATIC_LIBS ON)
set(Boost_USE_STATIC_RUNTIME ON)
set(LIBBZ2 -lbz2)
set(LIBZ -lz)
set(LIBEVENT event)
endif()
# Compile generated source if necessary
message(STATUS "build gensrc if necessary")
execute_process(COMMAND make -C ${BASE_DIR}/../gensrc/
RESULT_VARIABLE MAKE_GENSRC_RESULT)
if(NOT ${MAKE_GENSRC_RESULT} EQUAL 0)
message(FATAL_ERROR "Failed to build ${BASE_DIR}/../gensrc/")
endif()
# Set Boost
set(Boost_DEBUG FALSE)
set(Boost_USE_MULTITHREADED ON)
set(BOOST_ROOT ${THIRDPARTY_DIR})
find_package(Boost 1.55.0 REQUIRED COMPONENTS thread regex system filesystem date_time program_options)
include_directories(${Boost_INCLUDE_DIRS})
# Set all libraries
add_library(gflags STATIC IMPORTED)
set_target_properties(gflags PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libgflags.a)
add_library(glog STATIC IMPORTED)
set_target_properties(glog PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libglog.a)
add_library(re2 STATIC IMPORTED)
set_target_properties(re2 PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libre2.a)
add_library(pprof STATIC IMPORTED)
set_target_properties(pprof PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libprofiler.a)
add_library(tcmalloc STATIC IMPORTED)
set_target_properties(tcmalloc PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libtcmalloc.a)
add_library(unwind STATIC IMPORTED)
set_target_properties(unwind PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libunwind.a)
add_library(protobuf STATIC IMPORTED)
set_target_properties(protobuf PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libprotobuf.a)
add_library(protoc STATIC IMPORTED)
set_target_properties(protoc PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libprotoc.a)
add_library(gtest STATIC IMPORTED)
set_target_properties(gtest PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libgtest.a)
add_library(gmock STATIC IMPORTED)
set_target_properties(gmock PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libgmock.a)
add_library(snappy STATIC IMPORTED)
set_target_properties(snappy PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libsnappy.a)
add_library(curl STATIC IMPORTED)
set_target_properties(curl PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libcurl.a)
add_library(lz4 STATIC IMPORTED)
set_target_properties(lz4 PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/liblz4.a)
add_library(thrift STATIC IMPORTED)
set_target_properties(thrift PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libthrift.a)
add_library(thriftnb STATIC IMPORTED)
set_target_properties(thriftnb PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libthriftnb.a)
add_library(lzo STATIC IMPORTED)
set_target_properties(lzo PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/liblzo2.a)
add_library(mysql STATIC IMPORTED)
set_target_properties(mysql PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libmysqlclient.a)
add_library(libevent STATIC IMPORTED)
set_target_properties(libevent PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libevent.a)
add_library(LLVMSupport STATIC IMPORTED)
set_target_properties(LLVMSupport PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libLLVMSupport.a)
add_library(crypto STATIC IMPORTED)
set_target_properties(crypto PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libcrypto.a)
add_library(openssl STATIC IMPORTED)
set_target_properties(openssl PROPERTIES IMPORTED_LOCATION ${THIRDPARTY_DIR}/lib/libssl.a)
find_program(THRIFT_COMPILER thrift ${CMAKE_SOURCE_DIR}/bin)
# LLVM
set(LLVM_BIN "${THIRDPARTY_DIR}/bin")
message(STATUS ${LLVM_HOME})
# llvm-config
find_program(LLVM_CONFIG_EXECUTABLE llvm-config
PATHS
${LLVM_BIN}
NO_DEFAULT_PATH
)
if (NOT LLVM_CONFIG_EXECUTABLE)
message(FATAL_ERROR "Could not find llvm-config")
endif (NOT LLVM_CONFIG_EXECUTABLE)
# clang++
find_program(LLVM_CLANG_EXECUTABLE clang++
PATHS
${LLVM_BIN}
NO_DEFAULT_PATH
)
if (NOT LLVM_CLANG_EXECUTABLE)
message(FATAL_ERROR "Could not find clang++")
endif (NOT LLVM_CLANG_EXECUTABLE)
# opt
find_program(LLVM_OPT_EXECUTABLE opt
PATHS
${LLVM_BIN}
NO_DEFAULT_PATH
)
if (NOT LLVM_OPT_EXECUTABLE)
message(FATAL_ERROR "Could not find llvm opt")
endif (NOT LLVM_OPT_EXECUTABLE)
message(STATUS "LLVM llvm-config found at: ${LLVM_CONFIG_EXECUTABLE}")
message(STATUS "LLVM clang++ found at: ${LLVM_CLANG_EXECUTABLE}")
message(STATUS "LLVM opt found at: ${LLVM_OPT_EXECUTABLE}")
# Get all llvm depends
execute_process(
COMMAND ${LLVM_CONFIG_EXECUTABLE} --includedir
OUTPUT_VARIABLE LLVM_INCLUDE_DIR
OUTPUT_STRIP_TRAILING_WHITESPACE
)
execute_process(
COMMAND ${LLVM_CONFIG_EXECUTABLE} --libdir
OUTPUT_VARIABLE LLVM_LIBRARY_DIR
OUTPUT_STRIP_TRAILING_WHITESPACE
)
execute_process(
COMMAND ${LLVM_CONFIG_EXECUTABLE} --ldflags
OUTPUT_VARIABLE LLVM_LFLAGS
OUTPUT_STRIP_TRAILING_WHITESPACE
)
# Get the link libs we need. llvm has many and we don't want to link all of the libs
# if we don't need them.
execute_process(
COMMAND ${LLVM_CONFIG_EXECUTABLE} --libnames core jit native ipo bitreader target
OUTPUT_VARIABLE LLVM_MODULE_LIBS
OUTPUT_STRIP_TRAILING_WHITESPACE
)
# TODO: this does not work well. The config file will output -I/<include path> and
# also -DNDEBUG. I've hard-coded the #defines that are necessary, but we should make
# this better. The necessary flags are only #defines, so maybe just def/undef those
# around the #includes of llvm headers?
#execute_process(
# COMMAND ${LLVM_CONFIG_EXECUTABLE} --cppflags
# OUTPUT_VARIABLE LLVM_CFLAGS
# OUTPUT_STRIP_TRAILING_WHITESPACE
#)
set(LLVM_CFLAGS
"-D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS")
if(GCC_VERSION VERSION_LESS "5.0.0")
message(STATUS "GCC version is less than 5.0.0, no need to set -D__GLIBCXX_BITSIZE_INT_N_0=128 and -D__GLIBCXX_TYPE_INT_N_0=__int128")
else()
SET(LLVM_CFLAGS "${LLVM_CFLAGS} -D__GLIBCXX_BITSIZE_INT_N_0=128 -D__GLIBCXX_TYPE_INT_N_0=__int128")
endif()
# Set clang flags for cross-compiling to IR.
# IR_COMPILE is #defined for the cross compile to remove code that bloats the IR.
# Note that we don't enable any optimization. We want unoptimized IR since we will be
# modifying it at runtime, then re-compiling (and optimizing) the modified code. The final
# optimizations will be less effective if the initial code is also optimized.
set(CLANG_IR_CXX_FLAGS "-std=gnu++11" "-c" "-emit-llvm" "-D__STDC_CONSTANT_MACROS" "-D__STDC_FORMAT_MACROS" "-D__STDC_LIMIT_MACROS" "-DIR_COMPILE" "-DNDEBUG" "-DHAVE_INTTYPES_H" "-DHAVE_NETINET_IN_H" "-DBOOST_DATE_TIME_POSIX_TIME_STD_CONFIG" "-D__GLIBCXX_BITSIZE_INT_N_0=128" "-D__GLIBCXX_TYPE_INT_N_0=__int128" "-U_GLIBCXX_USE_FLOAT128")
# CMake really doesn't like adding link directories and wants absolute paths
# Reconstruct it with LLVM_MODULE_LIBS and LLVM_LIBRARY_DIR
string(REPLACE " " ";" LIBS_LIST ${LLVM_MODULE_LIBS})
set (LLVM_MODULE_LIBS "-ldl")
foreach (LIB ${LIBS_LIST})
set(LLVM_MODULE_LIBS ${LLVM_MODULE_LIBS} "${LLVM_LIBRARY_DIR}/${LIB}")
endforeach(LIB)
message(STATUS "LLVM include dir: ${LLVM_INCLUDE_DIR}")
message(STATUS "LLVM lib dir: ${LLVM_LIBRARY_DIR}")
message(STATUS "LLVM libs: ${LLVM_MODULE_LIBS}")
message(STATUS "LLVM compile flags: ${LLVM_CFLAGS}")
# When the toolchain is used, we use an LLVM 3.3 that was built in a different path
# than the one it is invoked from, and a GCC that resides in a different location.
# LLVM 3.3 relies on hard-coded path information to find the system headers and does
# not support the -gcc-toolchain flag for providing this information dynamically.
# Because of this we need to manually add the system C++ headers to the path when we
# compile the IR code with clang.
# Check the release version of the system to set the correct flags.
# You may have to modify ${CLANG_BASE_FLAGS} on your own.
execute_process(COMMAND lsb_release -si OUTPUT_VARIABLE LINUX_VERSION)
string(TOLOWER ${LINUX_VERSION} LINUX_VERSION_LOWER)
message(STATUS "${LINUX_VERSION_LOWER}")
if(${LINUX_VERSION_LOWER} MATCHES "ubuntu")
set(CLANG_BASE_FLAGS
"-I/usr/include/c++/5/"
"-I/usr/include/x86_64-linux-gnu/c++/5/")
elseif(${LINUX_VERSION_LOWER} MATCHES "centos")
set(CLANG_BASE_FLAGS
"-I/usr/include/c++/4.8.5/"
"-I/usr/include/c++/4.8.5/x86_64-redhat-linux/")
else()
message(FATAL_ERROR "Currently not support system ${LINUX_VERSION}")
endif()
set(CLANG_INCLUDE_FLAGS
"-I${BASE_DIR}/src"
"-I${GENSRC_DIR}"
"-I${THIRDPARTY_DIR}/include"
"-I${THIRDPARTY_DIR}/include/thrift/"
"-I${THIRDPARTY_DIR}/include/event/"
)
# Set include dirs
include_directories(${LLVM_INCLUDE_DIR})
include_directories(
${SRC_DIR}/
${TEST_DIR}/
${GENSRC_DIR}/
${THIRDPARTY_DIR}/include/
${THIRDPARTY_DIR}/include/thrift/
${THIRDPARTY_DIR}/include/event/
)
# Set libraries
set(WL_START_GROUP "-Wl,--start-group")
set(WL_END_GROUP "-Wl,--end-group")
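# --start-group/--end-group make the linker rescan the enclosed static
# archives repeatedly, resolving circular symbol references between them.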
# Set Palo libraries
set (PALO_LINK_LIBS
${WL_START_GROUP}
Agent
CodeGen
Common
Exec
Exprs
Gutil
Olap
Runtime
RPC
Service
Udf
Util
PaloGen
Webserver
TestUtil
AES
${WL_END_GROUP}
)
# Set thirdparty libraries
set (PALO_LINK_LIBS ${PALO_LINK_LIBS}
protobuf
lzo
snappy
${Boost_LIBRARIES}
${LLVM_MODULE_LIBS}
# popt
thrift
thriftnb
${WL_START_GROUP}
glog
gflags
re2
pprof
unwind
tcmalloc
unwind
lz4
libevent
${LIBZ}
${LIBBZ2}
mysql
curl
${WL_END_GROUP}
-lrt
-lbfd
-liberty
openssl
crypto
#-fsanitize=address
#-lboost_date_time
)
# Set libraries for test
set (TEST_LINK_LIBS ${PALO_LINK_LIBS} gmock LLVMSupport)
# Set CXX flags
SET(CXX_COMMON_FLAGS "-msse4.2 -Wall -Wno-sign-compare -Wno-deprecated -pthread")
SET(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -DBOOST_DATE_TIME_POSIX_TIME_STD_CONFIG -D__STDC_FORMAT_MACROS")
# Added by zhaochun: use gnu++11 for make_unsigned<__int128>
SET(CMAKE_CXX_FLAGS "-g -O2 -ggdb -Wno-unused-local-typedefs -Wno-strict-aliasing -std=gnu++11 -DPERFORMANCE -D_FILE_OFFSET_BITS=64")
# To use AddressSanitizer, enable the line below and comment out the malloc in the linker flags
# SET(CMAKE_CXX_FLAGS "-g -ggdb -Wno-unused-local-typedefs -Wno-strict-aliasing -std=gnu++11 -DPERFORMANCE -fsanitize=address -fno-omit-frame-pointer -DADDRESS_SANITIZER")
SET(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}")
MESSAGE(STATUS "Compiler Flags: ${CMAKE_CXX_FLAGS}")
# Thrift requires these two definitions for some types that we use
add_definitions(-DHAVE_INTTYPES_H -DHAVE_NETINET_IN_H)
# Only build static libs
set(BUILD_SHARED_LIBS OFF)
if (${MAKE_TEST} STREQUAL "ON")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fprofile-arcs -ftest-coverage")
SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-arcs -ftest-coverage -lgcov")
add_definitions(-DBE_TEST)
endif ()
add_subdirectory(${SRC_DIR}/codegen)
add_subdirectory(${SRC_DIR}/common)
add_subdirectory(${SRC_DIR}/util)
add_subdirectory(${SRC_DIR}/gen_cpp)
add_subdirectory(${SRC_DIR}/gutil)
add_subdirectory(${SRC_DIR}/olap)
add_subdirectory(${SRC_DIR}/agent)
add_subdirectory(${SRC_DIR}/http)
add_subdirectory(${SRC_DIR}/service)
add_subdirectory(${SRC_DIR}/exec)
add_subdirectory(${SRC_DIR}/exprs)
add_subdirectory(${SRC_DIR}/udf)
add_subdirectory(${SRC_DIR}/runtime)
add_subdirectory(${SRC_DIR}/testutil)
add_subdirectory(${SRC_DIR}/rpc)
add_subdirectory(${SRC_DIR}/aes)
if (${MAKE_TEST} STREQUAL "ON")
add_subdirectory(${TEST_DIR}/agent)
add_subdirectory(${TEST_DIR}/olap)
add_subdirectory(${TEST_DIR}/common)
add_subdirectory(${TEST_DIR}/util)
add_subdirectory(${TEST_DIR}/udf)
add_subdirectory(${TEST_DIR}/exec)
add_subdirectory(${TEST_DIR}/exprs)
add_subdirectory(${TEST_DIR}/runtime)
add_subdirectory(${TEST_DIR}/udf)
endif ()
# Install be
install(DIRECTORY DESTINATION ${OUTPUT_DIR})
install(DIRECTORY DESTINATION ${OUTPUT_DIR}/bin)
install(DIRECTORY DESTINATION ${OUTPUT_DIR}/conf)
install(FILES
${BASE_DIR}/../bin/start_be.sh
${BASE_DIR}/../bin/stop_be.sh
DESTINATION ${OUTPUT_DIR}/bin)
install(FILES
${BASE_DIR}/../conf/be.conf
DESTINATION ${OUTPUT_DIR}/conf)
# Utility CMake function to make specifying tests and benchmarks less verbose
FUNCTION(ADD_BE_TEST TEST_NAME)
# This gets the directory where the test is from (e.g. 'exprs' or 'runtime')
get_filename_component(DIR_NAME ${CMAKE_CURRENT_SOURCE_DIR} NAME)
get_filename_component(TEST_DIR_NAME ${TEST_NAME} PATH)
get_filename_component(TEST_FILE_NAME ${TEST_NAME} NAME)
ADD_EXECUTABLE(${TEST_FILE_NAME} ${TEST_NAME}.cpp)
TARGET_LINK_LIBRARIES(${TEST_FILE_NAME} ${TEST_LINK_LIBS})
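# Redefining access specifiers via the preprocessor exposes class internals
# to the test binary (a common white-box testing hack).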
SET_TARGET_PROPERTIES(${TEST_FILE_NAME} PROPERTIES COMPILE_FLAGS "-Dprivate=public -Dprotected=public")
if (NOT "${TEST_DIR_NAME}" STREQUAL "")
SET_TARGET_PROPERTIES(${TEST_FILE_NAME} PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}/${DIR_NAME}/${TEST_DIR_NAME}")
endif()
ADD_TEST(${TEST_FILE_NAME} "${BUILD_OUTPUT_ROOT_DIRECTORY}/${DIR_NAME}/${TEST_NAME}")
ENDFUNCTION()

30
be/src/aes/CMakeLists.txt Normal file
@ -0,0 +1,30 @@
# Modifications copyright (C) 2017, Baidu.com, Inc.
# Copyright 2017 The Apache Software Foundation
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# where to put generated libraries
set(LIBRARY_OUTPUT_PATH "${BUILD_DIR}/src/aes")
# where to put generated binaries
set(EXECUTABLE_OUTPUT_PATH "${BUILD_DIR}/src/aes")
add_library(AES STATIC
my_aes.cpp
my_aes_openssl.cpp
)

60
be/src/aes/my_aes.cpp Normal file
@ -0,0 +1,60 @@
/* Copyright (c) 2002, 2014, Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */
//#include <my_global.h>
//#include <m_string.h>
#include "my_aes.h"
#include "my_aes_impl.h"
#include <cstring>
/**
Transforms an arbitrarily long key into a fixed-length AES key
AES keys are of fixed length. This routine takes an arbitrarily long key,
iterates over it in AES-key-length increments, and XORs the bytes into the
AES key buffer being prepared.
The bytes from the last incomplete iteration are XORed into the start
of the key until they are exhausted.
Needed since the crypto routines expect a fixed-length key.
@param key [in] Key to use for real key creation
@param key_length [in] Length of the key
@param rkey [out] Real key (used by OpenSSL/YaSSL)
@param opmode [in] encryption mode
*/
namespace palo {
void my_aes_create_key(const unsigned char *key, uint key_length,
uint8 *rkey, enum my_aes_opmode opmode)
{
const uint key_size= my_aes_opmode_key_sizes[opmode] / 8;
uint8 *rkey_end; /* Real key boundary */
uint8 *ptr; /* Start of the real key*/
uint8 *sptr; /* Start of the working key */
uint8 *key_end= ((uint8 *)key) + key_length; /* Working key boundary*/
rkey_end= rkey + key_size;
memset(rkey, 0, key_size); /* Set initial key */
for (ptr= rkey, sptr= (uint8 *)key; sptr < key_end; ptr++, sptr++)
{
if (ptr == rkey_end)
/* Wrap around to the start of rkey until the whole key is consumed */
ptr= rkey;
*ptr^= *sptr;
}
}
}
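A short usage sketch of the folding above (illustrative only; example_fold_key is invented for the example):
// Illustrative sketch: fold a 20-byte passphrase into a 16-byte AES-128 key.
static void example_fold_key() {
    unsigned char passphrase[20] = {0};
    uint8 rkey[16];
    palo::my_aes_create_key(passphrase, 20, rkey, my_aes_128_ecb);
    // Bytes 0..15 land in rkey directly (XOR against the zeroed buffer);
    // bytes 16..19 wrap around: rkey[i] ^= passphrase[16 + i] for i in 0..3.
}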

140
be/src/aes/my_aes.h Normal file
@ -0,0 +1,140 @@
#ifndef MY_AES_INCLUDED
#define MY_AES_INCLUDED
/* Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */
/* Header file for my_aes.cpp */
/* Wrapper to give simple interface for MySQL to AES standard encryption */
//C_MODE_START
#include <stdint.h>
/** AES IV size is 16 bytes for all supported ciphers except ECB */
#define MY_AES_IV_SIZE 16
/** AES block size is fixed to be 128 bits for CBC and ECB */
#define MY_AES_BLOCK_SIZE 16
typedef uint32_t uint32;
typedef bool my_bool;
typedef uint32_t uint;
/** Supported AES cipher/block mode combos */
enum my_aes_opmode
{
my_aes_128_ecb,
my_aes_192_ecb,
my_aes_256_ecb,
my_aes_128_cbc,
my_aes_192_cbc,
my_aes_256_cbc
#ifndef HAVE_YASSL
,my_aes_128_cfb1,
my_aes_192_cfb1,
my_aes_256_cfb1,
my_aes_128_cfb8,
my_aes_192_cfb8,
my_aes_256_cfb8,
my_aes_128_cfb128,
my_aes_192_cfb128,
my_aes_256_cfb128,
my_aes_128_ofb,
my_aes_192_ofb,
my_aes_256_ofb
#endif
};
#define MY_AES_BEGIN my_aes_128_ecb
#ifdef HAVE_YASSL
#define MY_AES_END my_aes_256_cbc
#else
#define MY_AES_END my_aes_256_ofb
#endif
/* If bad data discovered during decoding */
#define MY_AES_BAD_DATA -1
/** String representations of the supported AES modes. Keep in sync with my_aes_opmode */
extern const char *my_aes_opmode_names[];
namespace palo {
/**
Encrypt a buffer using AES
@param source [in] Pointer to data for encryption
@param source_length [in] Size of encryption data
@param dest [out] Buffer to place encrypted data (must be large enough)
@param key [in] Key to be used for encryption
@param key_length [in] Length of the key. Will handle keys of any length
@param mode [in] encryption mode
@param iv [in] 16 bytes initialization vector if needed. Otherwise NULL
@param padding [in] if padding needed.
@return size of encrypted data, or negative in case of error
*/
int my_aes_encrypt(const unsigned char *source, uint32 source_length,
unsigned char *dest,
const unsigned char *key, uint32 key_length,
enum my_aes_opmode mode, const unsigned char *iv,
bool padding = true);
/**
Decrypt an AES encrypted buffer
@param source Pointer to data for decryption
@param source_length size of encrypted data
@param dest buffer to place decrypted data (must be large enough)
@param key Key to be used for decryption
@param key_length Length of the key. Will handle keys of any length
@param mode encryption mode
@param iv 16 bytes initialization vector if needed. Otherwise NULL
@param padding if padding needed.
@return size of original data.
*/
int my_aes_decrypt(const unsigned char *source, uint32 source_length,
unsigned char *dest,
const unsigned char *key, uint32 key_length,
enum my_aes_opmode mode, const unsigned char *iv,
bool padding = true);
/**
Calculate the size of a buffer large enough for encrypted data
@param source_length length of data to be encrypted
@param mode encryption mode
@return size of buffer required to store encrypted data
*/
int my_aes_get_size(uint32 source_length, enum my_aes_opmode mode);
/**
Return true if the AES cipher and block mode requires an IV
SYNOPSIS
my_aes_needs_iv()
@param mode encryption mode
@retval TRUE IV needed
@retval FALSE IV not needed
*/
my_bool my_aes_needs_iv(my_aes_opmode opmode);
}
//C_MODE_END
#endif /* MY_AES_INCLUDED */
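A minimal round-trip sketch of this API (illustrative only; the helper function is invented, and buffer sizing follows my_aes_get_size):
#include <cstring>
#include <vector>
// Illustrative sketch (not part of the original header): encrypt and
// decrypt a short message with AES-128-CBC using the API above.
static bool aes_round_trip_example() {
    unsigned char key[16];
    std::memset(key, 'k', sizeof(key));             // 16-byte key -> AES-128
    const unsigned char iv[MY_AES_IV_SIZE] = {0};   // CBC requires a 16-byte IV
    const unsigned char msg[] = "hello palo";
    int enc_size = palo::my_aes_get_size(sizeof(msg), my_aes_128_cbc);
    std::vector<unsigned char> enc(enc_size), dec(enc_size);
    int enc_len = palo::my_aes_encrypt(msg, sizeof(msg), enc.data(),
                                       key, sizeof(key), my_aes_128_cbc, iv);
    if (enc_len == MY_AES_BAD_DATA) return false;
    int dec_len = palo::my_aes_decrypt(enc.data(), enc_len, dec.data(),
                                       key, sizeof(key), my_aes_128_cbc, iv);
    return dec_len == (int)sizeof(msg) && std::memcmp(dec.data(), msg, dec_len) == 0;
}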

37
be/src/aes/my_aes_impl.h Normal file
@ -0,0 +1,37 @@
/* Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */
#ifndef BDG_PALO_BE_EXPRS_MY_AES_IMPL_H
#define BDG_PALO_BE_EXPRS_MY_AES_IMPL_H
/** Maximum supported key length */
const int MAX_AES_KEY_LENGTH = 256;
/* TODO: remove in a future version */
/* Guard against using an old export control restriction #define */
#ifdef AES_USE_KEY_BITS
#error AES_USE_KEY_BITS not supported
#endif
typedef uint32_t uint;
typedef uint8_t uint8;
namespace palo {
extern uint *my_aes_opmode_key_sizes;
void my_aes_create_key(const unsigned char *key, uint key_length,
uint8 *rkey, enum my_aes_opmode opmode);
}
#endif

218
be/src/aes/my_aes_openssl.cpp Normal file
@ -0,0 +1,218 @@
/* Copyright (c) 2015, 2016 Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */
//#include <my_global.h>
//#include <m_string.h>
#include "my_aes.h"
#include "my_aes_impl.h"
#include <string>
#include <assert.h>
#include <openssl/aes.h>
#include <openssl/evp.h>
#include <openssl/err.h>
#define DBUG_ASSERT(A) assert(A)
#define TRUE true
#define FALSE false
namespace palo {
/* keep in sync with enum my_aes_opmode in my_aes.h */
const char *my_aes_opmode_names[]=
{
"aes-128-ecb",
"aes-192-ecb",
"aes-256-ecb",
"aes-128-cbc",
"aes-192-cbc",
"aes-256-cbc",
"aes-128-cfb1",
"aes-192-cfb1",
"aes-256-cfb1",
"aes-128-cfb8",
"aes-192-cfb8",
"aes-256-cfb8",
"aes-128-cfb128",
"aes-192-cfb128",
"aes-256-cfb128",
"aes-128-ofb",
"aes-192-ofb",
"aes-256-ofb",
NULL /* needed for the type enumeration */
};
/* keep in sync with enum my_aes_opmode in my_aes.h */
static uint my_aes_opmode_key_sizes_impl[]=
{
128 /* aes-128-ecb */,
192 /* aes-192-ecb */,
256 /* aes-256-ecb */,
128 /* aes-128-cbc */,
192 /* aes-192-cbc */,
256 /* aes-256-cbc */,
128 /* aes-128-cfb1 */,
192 /* aes-192-cfb1 */,
256 /* aes-256-cfb1 */,
128 /* aes-128-cfb8 */,
192 /* aes-192-cfb8 */,
256 /* aes-256-cfb8 */,
128 /* aes-128-cfb128 */,
192 /* aes-192-cfb128 */,
256 /* aes-256-cfb128 */,
128 /* aes-128-ofb */,
192 /* aes-192-ofb */,
256 /* aes-256-ofb */
};
uint *my_aes_opmode_key_sizes= my_aes_opmode_key_sizes_impl;
static const EVP_CIPHER *
aes_evp_type(const my_aes_opmode mode)
{
switch (mode)
{
case my_aes_128_ecb: return EVP_aes_128_ecb();
case my_aes_128_cbc: return EVP_aes_128_cbc();
case my_aes_128_cfb1: return EVP_aes_128_cfb1();
case my_aes_128_cfb8: return EVP_aes_128_cfb8();
case my_aes_128_cfb128: return EVP_aes_128_cfb128();
case my_aes_128_ofb: return EVP_aes_128_ofb();
case my_aes_192_ecb: return EVP_aes_192_ecb();
case my_aes_192_cbc: return EVP_aes_192_cbc();
case my_aes_192_cfb1: return EVP_aes_192_cfb1();
case my_aes_192_cfb8: return EVP_aes_192_cfb8();
case my_aes_192_cfb128: return EVP_aes_192_cfb128();
case my_aes_192_ofb: return EVP_aes_192_ofb();
case my_aes_256_ecb: return EVP_aes_256_ecb();
case my_aes_256_cbc: return EVP_aes_256_cbc();
case my_aes_256_cfb1: return EVP_aes_256_cfb1();
case my_aes_256_cfb8: return EVP_aes_256_cfb8();
case my_aes_256_cfb128: return EVP_aes_256_cfb128();
case my_aes_256_ofb: return EVP_aes_256_ofb();
default: return NULL;
}
}
int my_aes_encrypt(const unsigned char *source, uint32 source_length,
unsigned char *dest,
const unsigned char *key, uint32 key_length,
enum my_aes_opmode mode, const unsigned char *iv,
bool padding)
{
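// Note: a stack-allocated EVP_CIPHER_CTX is OpenSSL 1.0.x style; in
// OpenSSL 1.1+ the struct is opaque and must come from EVP_CIPHER_CTX_new().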
EVP_CIPHER_CTX ctx;
const EVP_CIPHER *cipher= aes_evp_type(mode);
int u_len, f_len;
/* The real key to be used for encryption */
unsigned char rkey[MAX_AES_KEY_LENGTH / 8];
my_aes_create_key(key, key_length, rkey, mode);
if (!cipher || (EVP_CIPHER_iv_length(cipher) > 0 && !iv))
return MY_AES_BAD_DATA;
if (!EVP_EncryptInit(&ctx, cipher, rkey, iv))
goto aes_error; /* Error */
if (!EVP_CIPHER_CTX_set_padding(&ctx, padding))
goto aes_error; /* Error */
if (!EVP_EncryptUpdate(&ctx, dest, &u_len, source, source_length))
goto aes_error; /* Error */
if (!EVP_EncryptFinal(&ctx, dest + u_len, &f_len))
goto aes_error; /* Error */
EVP_CIPHER_CTX_cleanup(&ctx);
return u_len + f_len;
aes_error:
/* need to explicitly clean up the error if we want to ignore it */
ERR_clear_error();
EVP_CIPHER_CTX_cleanup(&ctx);
return MY_AES_BAD_DATA;
}
int my_aes_decrypt(const unsigned char *source, uint32 source_length,
unsigned char *dest,
const unsigned char *key, uint32 key_length,
enum my_aes_opmode mode, const unsigned char *iv,
bool padding)
{
EVP_CIPHER_CTX ctx;
const EVP_CIPHER *cipher= aes_evp_type(mode);
int u_len, f_len;
/* The real key to be used for decryption */
unsigned char rkey[MAX_AES_KEY_LENGTH / 8];
my_aes_create_key(key, key_length, rkey, mode);
if (!cipher || (EVP_CIPHER_iv_length(cipher) > 0 && !iv))
return MY_AES_BAD_DATA;
EVP_CIPHER_CTX_init(&ctx);
if (!EVP_DecryptInit(&ctx, aes_evp_type(mode), rkey, iv))
goto aes_error; /* Error */
if (!EVP_CIPHER_CTX_set_padding(&ctx, padding))
goto aes_error; /* Error */
if (!EVP_DecryptUpdate(&ctx, dest, &u_len, source, source_length))
goto aes_error; /* Error */
if (!EVP_DecryptFinal_ex(&ctx, dest + u_len, &f_len))
goto aes_error; /* Error */
EVP_CIPHER_CTX_cleanup(&ctx);
return u_len + f_len;
aes_error:
/* need to explicitly clean up the error if we want to ignore it */
ERR_clear_error();
EVP_CIPHER_CTX_cleanup(&ctx);
return MY_AES_BAD_DATA;
}
int my_aes_get_size(uint32 source_length, my_aes_opmode opmode)
{
const EVP_CIPHER *cipher= aes_evp_type(opmode);
size_t block_size;
block_size= EVP_CIPHER_block_size(cipher);
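// Worked example: with padding, output grows to the next block boundary,
// e.g. source_length = 30, block_size = 16 -> 16 * (30 / 16) + 16 = 32 bytes;
// an exact multiple still gains a full padding block (32 -> 48).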
return block_size > 1 ?
block_size * (source_length / block_size) + block_size :
source_length;
}
/**
Return true if the AES cipher and block mode requires an IV
SYNOPSIS
my_aes_needs_iv()
@param mode encryption mode
@retval TRUE IV needed
@retval FALSE IV not needed
*/
my_bool my_aes_needs_iv(my_aes_opmode opmode)
{
const EVP_CIPHER *cipher= aes_evp_type(opmode);
int iv_length;
iv_length= EVP_CIPHER_iv_length(cipher);
DBUG_ASSERT(iv_length == 0 || iv_length == MY_AES_IV_SIZE);
return iv_length != 0 ? TRUE : FALSE;
}
}

38
be/src/agent/CMakeLists.txt Normal file
@ -0,0 +1,38 @@
# Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# where to put generated libraries
set(LIBRARY_OUTPUT_PATH "${BUILD_DIR}/src/agent")
# where to put generated binaries
set(EXECUTABLE_OUTPUT_PATH "${BUILD_DIR}/src/agent")
add_library(Agent STATIC
agent_server.cpp
pusher.cpp
file_downloader.cpp
heartbeat_server.cpp
task_worker_pool.cpp
utils.cpp
cgroups_mgr.cpp
topic_subscriber.cpp
user_resource_listener.cpp
)

444
be/src/agent/agent_server.cpp Normal file
@ -0,0 +1,444 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/agent_server.h"
#include <string>
#include "boost/filesystem.hpp"
#include "boost/lexical_cast.hpp"
#include "thrift/concurrency/ThreadManager.h"
#include "thrift/concurrency/PosixThreadFactory.h"
#include "thrift/server/TThreadPoolServer.h"
#include "thrift/server/TThreadedServer.h"
#include "thrift/transport/TSocket.h"
#include "thrift/transport/TTransportUtils.h"
#include "util/thrift_server.h"
#include "agent/status.h"
#include "agent/task_worker_pool.h"
#include "agent/user_resource_listener.h"
#include "common/status.h"
#include "common/logging.h"
#include "gen_cpp/AgentService_types.h"
#include "gen_cpp/MasterService_types.h"
#include "gen_cpp/Status_types.h"
#include "olap/utils.h"
#include "olap/command_executor.h"
#include "runtime/exec_env.h"
#include "runtime/etl_job_mgr.h"
#include "util/debug_util.h"
using apache::thrift::transport::TProcessor;
using std::deque;
using std::list;
using std::map;
using std::nothrow;
using std::set;
using std::string;
using std::to_string;
using std::vector;
namespace palo {
AgentServer::AgentServer(ExecEnv* exec_env,
const TMasterInfo& master_info) :
_exec_env(exec_env),
_master_info(master_info),
_topic_subscriber(new TopicSubscriber()) {
// clean dpp download dir
_command_executor = new CommandExecutor();
vector<OLAPRootPathStat>* root_paths_stat = new vector<OLAPRootPathStat>();
_command_executor->get_all_root_path_stat(root_paths_stat);
for (auto root_path_stat : *root_paths_stat) {
try {
string dpp_download_path_str = root_path_stat.root_path + DPP_PREFIX;
boost::filesystem::path dpp_download_path(dpp_download_path_str);
if (boost::filesystem::exists(dpp_download_path)) {
boost::filesystem::remove_all(dpp_download_path);
}
} catch (...) {
OLAP_LOG_WARNING("boost exception when remove dpp download path. [path='%s']",
root_path_stat.root_path.c_str());
}
}
// create tmp dir
boost::filesystem::path tmp_path(config::agent_tmp_dir);
if (boost::filesystem::exists(tmp_path)) {
boost::filesystem::remove_all(tmp_path);
}
boost::filesystem::create_directories(config::agent_tmp_dir);
// init task worker pool
_create_table_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::CREATE_TABLE,
master_info);
_drop_table_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::DROP_TABLE,
master_info);
_push_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::PUSH,
master_info);
_delete_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::DELETE,
master_info);
_alter_table_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::ALTER_TABLE,
master_info);
_clone_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::CLONE,
master_info);
_storage_medium_migrate_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::STORAGE_MEDIUM_MIGRATE,
master_info);
_cancel_delete_data_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::CANCEL_DELETE_DATA,
master_info);
_check_consistency_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::CHECK_CONSISTENCY,
master_info);
_report_task_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::REPORT_TASK,
master_info);
_report_disk_state_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::REPORT_DISK_STATE,
master_info);
_report_olap_table_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::REPORT_OLAP_TABLE,
master_info);
_upload_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::UPLOAD,
master_info);
_restore_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::RESTORE,
master_info);
_make_snapshot_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::MAKE_SNAPSHOT,
master_info);
_release_snapshot_workers = new TaskWorkerPool(
TaskWorkerPool::TaskWorkerType::RELEASE_SNAPSHOT,
master_info);
#ifndef BE_TEST
_create_table_workers->start();
_drop_table_workers->start();
_push_workers->start();
_delete_workers->start();
_alter_table_workers->start();
_clone_workers->start();
_storage_medium_migrate_workers->start();
_cancel_delete_data_workers->start();
_check_consistency_workers->start();
_report_task_workers->start();
_report_disk_state_workers->start();
_report_olap_table_workers->start();
_upload_workers->start();
_restore_workers->start();
_make_snapshot_workers->start();
_release_snapshot_workers->start();
// Add subscriber here and register listeners
TopicListener* user_resource_listener = new UserResourceListener(exec_env, master_info);
LOG(INFO) << "Register user resource listener";
_topic_subscriber->register_listener(palo::TTopicType::type::RESOURCE, user_resource_listener);
#endif
}
AgentServer::~AgentServer() {
if (_command_executor != NULL) {
delete _command_executor;
}
if (_create_table_workers != NULL) {
delete _create_table_workers;
}
if (_drop_table_workers != NULL) {
delete _drop_table_workers;
}
if (_push_workers != NULL) {
delete _push_workers;
}
if (_delete_workers != NULL) {
delete _delete_workers;
}
if (_alter_table_workers != NULL) {
delete _alter_table_workers;
}
if (_clone_workers != NULL) {
delete _clone_workers;
}
if (_storage_medium_migrate_workers != NULL) {
delete _storage_medium_migrate_workers;
}
if (_cancel_delete_data_workers != NULL) {
delete _cancel_delete_data_workers;
}
if (_check_consistency_workers != NULL) {
delete _check_consistency_workers;
}
if (_report_task_workers != NULL) {
delete _report_task_workers;
}
if (_report_disk_state_workers != NULL) {
delete _report_disk_state_workers;
}
if (_report_olap_table_workers != NULL) {
delete _report_olap_table_workers;
}
if (_upload_workers != NULL) {
delete _upload_workers;
}
if (_restore_workers != NULL) {
delete _restore_workers;
}
if (_make_snapshot_workers != NULL) {
delete _make_snapshot_workers;
}
if (_release_snapshot_workers != NULL) {
delete _release_snapshot_workers;
}
if (_topic_subscriber !=NULL) {
delete _topic_subscriber;
}
}
void AgentServer::submit_tasks(
TAgentResult& return_value,
const vector<TAgentTaskRequest>& tasks) {
// Set result to dm
vector<string> error_msgs;
TStatusCode::type status_code = TStatusCode::OK;
// TODO: check that the requesting master is the same as the heartbeat master
if (_master_info.network_address.hostname == ""
|| _master_info.network_address.port == 0) {
error_msgs.push_back("Not get master heartbeat yet.");
return_value.status.__set_error_msgs(error_msgs);
return_value.status.__set_status_code(TStatusCode::CANCELLED);
return;
}
for (auto task : tasks) {
TTaskType::type task_type = task.task_type;
int64_t signature = task.signature;
switch (task_type) {
case TTaskType::CREATE:
if (task.__isset.create_tablet_req) {
_create_table_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::DROP:
if (task.__isset.drop_tablet_req) {
_drop_table_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::PUSH:
if (task.__isset.push_req) {
if (task.push_req.push_type == TPushType::LOAD
|| task.push_req.push_type == TPushType::LOAD_DELETE) {
_push_workers->submit_task(task);
} else if (task.push_req.push_type == TPushType::DELETE) {
_delete_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::ROLLUP:
case TTaskType::SCHEMA_CHANGE:
if (task.__isset.alter_tablet_req) {
_alter_table_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::CLONE:
if (task.__isset.clone_req) {
_clone_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::STORAGE_MEDIUM_MIGRATE:
if (task.__isset.storage_medium_migrate_req) {
_storage_medium_migrate_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::CANCEL_DELETE:
if (task.__isset.cancel_delete_data_req) {
_cancel_delete_data_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::CHECK_CONSISTENCY:
if (task.__isset.check_consistency_req) {
_check_consistency_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::UPLOAD:
if (task.__isset.upload_req) {
_upload_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::RESTORE:
if (task.__isset.restore_req) {
_restore_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::MAKE_SNAPSHOT:
if (task.__isset.snapshot_req) {
_make_snapshot_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
case TTaskType::RELEASE_SNAPSHOT:
if (task.__isset.release_snapshot_req) {
_release_snapshot_workers->submit_task(task);
} else {
status_code = TStatusCode::ANALYSIS_ERROR;
}
break;
default:
status_code = TStatusCode::ANALYSIS_ERROR;
break;
}
if (status_code == TStatusCode::ANALYSIS_ERROR) {
OLAP_LOG_WARNING("task anaysis_error, signature: %ld", signature);
error_msgs.push_back("the task signature is:" + to_string(signature) + " has wrong request.");
}
}
return_value.status.__set_error_msgs(error_msgs);
return_value.status.__set_status_code(status_code);
}
void AgentServer::make_snapshot(TAgentResult& return_value,
const TSnapshotRequest& snapshot_request) {
TStatus status;
vector<string> error_msgs;
TStatusCode::type status_code = TStatusCode::OK;
string snapshot_path;
OLAPStatus make_snapshot_status =
_command_executor->make_snapshot(snapshot_request, &snapshot_path);
if (make_snapshot_status != OLAP_SUCCESS) {
status_code = TStatusCode::RUNTIME_ERROR;
OLAP_LOG_WARNING("make_snapshot failed. tablet_id: %ld, schema_hash: %ld, status: %d",
snapshot_request.tablet_id, snapshot_request.schema_hash,
make_snapshot_status);
error_msgs.push_back("make_snapshot failed. status: " +
boost::lexical_cast<string>(make_snapshot_status));
} else {
OLAP_LOG_INFO("make_snapshot success. tablet_id: %ld, schema_hash: %ld, snapshot_path: %s",
snapshot_request.tablet_id, snapshot_request.schema_hash,
snapshot_path.c_str());
return_value.__set_snapshot_path(snapshot_path);
}
status.__set_error_msgs(error_msgs);
status.__set_status_code(status_code);
return_value.__set_status(status);
}
void AgentServer::release_snapshot(TAgentResult& return_value, const std::string& snapshot_path) {
vector<string> error_msgs;
TStatusCode::type status_code = TStatusCode::OK;
OLAPStatus release_snapshot_status =
_command_executor->release_snapshot(snapshot_path);
if (release_snapshot_status != OLAP_SUCCESS) {
status_code = TStatusCode::RUNTIME_ERROR;
OLAP_LOG_WARNING("release_snapshot failed. snapshot_path: %s, status: %d",
snapshot_path.c_str(), release_snapshot_status);
error_msgs.push_back("release_snapshot failed. status: " +
boost::lexical_cast<string>(release_snapshot_status));
} else {
OLAP_LOG_INFO("release_snapshot success. snapshot_path: %s, status: %d",
snapshot_path.c_str(), release_snapshot_status);
}
return_value.status.__set_error_msgs(error_msgs);
return_value.status.__set_status_code(status_code);
}
void AgentServer::publish_cluster_state(TAgentResult& _return, const TAgentPublishRequest& request) {
vector<string> error_msgs;
_topic_subscriber->handle_updates(request);
OLAP_LOG_INFO("AgentService receive contains %d publish updates", request.updates.size());
_return.status.__set_status_code(TStatusCode::OK);
}
void AgentServer::submit_etl_task(TAgentResult& return_value,
const TMiniLoadEtlTaskRequest& request) {
Status status = _exec_env->etl_job_mgr()->start_job(request);
if (status.ok()) {
VLOG_RPC << "start etl task successfull id="
<< request.params.params.fragment_instance_id;
} else {
VLOG_RPC << "start etl task failed id="
<< request.params.params.fragment_instance_id
<< " and err_msg=" << status.get_error_msg();
}
status.to_thrift(&return_value.status);
}
void AgentServer::get_etl_status(TMiniLoadEtlStatusResult& return_value,
const TMiniLoadEtlStatusRequest& request) {
Status status = _exec_env->etl_job_mgr()->get_job_state(request.mini_load_id, &return_value);
if (!status.ok()) {
LOG(WARNING) << "get job state failed. [id=" << request.mini_load_id << "]";
} else {
VLOG_RPC << "get job state successful. [id=" << request.mini_load_id << ",status="
<< return_value.status.status_code << ",etl_state=" << return_value.etl_state
<< ",files=";
for (auto& item : return_value.file_map) {
VLOG_RPC << item.first << ":" << item.second << ";";
}
VLOG_RPC << "]";
}
}
void AgentServer::delete_etl_files(TAgentResult& result,
const TDeleteEtlFilesRequest& request) {
Status status = _exec_env->etl_job_mgr()->erase_job(request);
if (!status.ok()) {
LOG(WARNING) << "delete etl files failed. because " << status.get_error_msg()
<< " with request " << request;
} else {
VLOG_RPC << "delete etl files successful with param " << request;
}
status.to_thrift(&result.status);
}
} // namespace palo

122
be/src/agent/agent_server.h Normal file
@ -0,0 +1,122 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_AGENT_SERVER_H
#define BDG_PALO_BE_SRC_AGENT_AGENT_SERVER_H
#include "thrift/transport/TTransportUtils.h"
#include "agent/status.h"
#include "agent/task_worker_pool.h"
#include "agent/topic_subscriber.h"
#include "agent/utils.h"
#include "gen_cpp/AgentService_types.h"
#include "gen_cpp/Types_types.h"
#include "olap/command_executor.h"
#include "olap/olap_define.h"
#include "olap/utils.h"
#include "runtime/exec_env.h"
namespace palo {
class AgentServer {
public:
explicit AgentServer(ExecEnv* exec_env, const TMasterInfo& master_info);
~AgentServer();
// Receive agent task from dm
//
// Input parameters:
// * tasks: The list of agent tasks
//
// Output parameters:
// * return_value: The result of receive agent task,
// contains return code and error messages.
void submit_tasks(
TAgentResult& return_value,
const std::vector<TAgentTaskRequest>& tasks);
// Make a snapshot for a local tablet
//
// Input parameters:
// * tablet_id: The tablet id of local tablet.
// * schema_hash: The schema hash of local tablet
//
// Output parameters:
// * return_value: The result of make snapshot,
// contains return code and error messages.
void make_snapshot(
TAgentResult& return_value,
const TSnapshotRequest& snapshot_request);
// Release useless snapshot
//
// Input parameters:
// * snapshot_path: local useless snapshot path
//
// Output parameters:
// * return_value: The result of release snapshot,
// contains return code and error messages.
void release_snapshot(TAgentResult& return_value, const std::string& snapshot_path);
// Publish state to agent
//
// Input parameters:
// request:
void publish_cluster_state(TAgentResult& return_value,
const TAgentPublishRequest& request);
// Master calls this rpc to submit an etl task
void submit_etl_task(TAgentResult& return_value,
const TMiniLoadEtlTaskRequest& request);
// Master calls this rpc to fetch the status of an etl task
void get_etl_status(TMiniLoadEtlStatusResult& return_value,
const TMiniLoadEtlStatusRequest& request);
void delete_etl_files(TAgentResult& result,
const TDeleteEtlFilesRequest& request);
private:
ExecEnv* _exec_env;
const TMasterInfo& _master_info;
CommandExecutor* _command_executor;
TaskWorkerPool* _create_table_workers;
TaskWorkerPool* _drop_table_workers;
TaskWorkerPool* _push_workers;
TaskWorkerPool* _delete_workers;
TaskWorkerPool* _alter_table_workers;
TaskWorkerPool* _clone_workers;
TaskWorkerPool* _storage_medium_migrate_workers;
TaskWorkerPool* _cancel_delete_data_workers;
TaskWorkerPool* _check_consistency_workers;
TaskWorkerPool* _report_task_workers;
TaskWorkerPool* _report_disk_state_workers;
TaskWorkerPool* _report_olap_table_workers;
TaskWorkerPool* _upload_workers;
TaskWorkerPool* _restore_workers;
TaskWorkerPool* _make_snapshot_workers;
TaskWorkerPool* _release_snapshot_workers;
DISALLOW_COPY_AND_ASSIGN(AgentServer);
TopicSubscriber* _topic_subscriber;
}; // class AgentServer
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_AGENT_SERVER_H

511
be/src/agent/cgroups_mgr.cpp Normal file
@ -0,0 +1,511 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/cgroups_mgr.h"
#include <fstream>
#include <future>
#include <linux/magic.h>
#include <map>
#include <unistd.h>
#include <asm/unistd.h>
#include <sstream>
#include <sys/stat.h>
#include <sys/vfs.h>
#include "boost/filesystem.hpp"
#include "common/logging.h"
#include "olap/olap_rootpath.h"
#include "runtime/exec_env.h"
#include "runtime/load_path_mgr.h"
using std::string;
using std::map;
using std::vector;
using std::stringstream;
using apache::thrift::TException;
using apache::thrift::transport::TTransportException;
namespace palo {
static CgroupsMgr *s_global_cg_mgr;
const std::string CgroupsMgr::_s_system_user = "system";
const std::string CgroupsMgr::_s_system_group = "normal";
std::map<TResourceType::type, std::string> CgroupsMgr::_s_resource_cgroups =
{{TResourceType::type::TRESOURCE_CPU_SHARE, "cpu.shares"},
{TResourceType::type::TRESOURCE_IO_SHARE, "blkio.weight"}};
CgroupsMgr::CgroupsMgr(ExecEnv* exec_env, const string& root_cgroups_path)
: _exec_env(exec_env),
_root_cgroups_path(root_cgroups_path),
_is_cgroups_init_success(false),
_cur_version(-1) {
if (s_global_cg_mgr == nullptr) {
s_global_cg_mgr = this;
}
}
CgroupsMgr::~CgroupsMgr() {
}
AgentStatus CgroupsMgr::update_local_cgroups(const TFetchResourceResult& new_fetched_resource) {
LOG(INFO) << "Current resource version is " << _cur_version
<< ". Resource version is " << new_fetched_resource.resourceVersion;
std::lock_guard<std::mutex> lck(_update_cgroups_mtx);
if (!_is_cgroups_init_success) {
LOG(WARNING) << "Cgroups manager initialized failed, will not update local cgroups!";
return AgentStatus::PALO_ERROR;
}
if (_cur_version >= new_fetched_resource.resourceVersion) {
return AgentStatus::PALO_SUCCESS;
}
const std::map<std::string, TUserResource>& new_user_resource
= new_fetched_resource.resourceByUser;
if (!_local_users.empty()) {
std::set<std::string>::const_iterator old_it = _local_users.begin();
for (; old_it != _local_users.end(); ++old_it) {
if (new_user_resource.count(*old_it) == 0) {
this->delete_user_cgroups(*old_it);
}
}
}
// Clear the local users set; current users will be re-inserted below
_local_users.clear();
std::map<std::string, TUserResource>::const_iterator new_it = new_user_resource.begin();
for (; new_it != new_user_resource.end(); ++new_it) {
const string& user_name = new_it->first;
const std::map<std::string, int32_t>& level_share = new_it->second.shareByGroup;
std::map<std::string, int32_t> user_share;
const std::map<TResourceType::type, int32_t>& resource_share =
new_it->second.resource.resourceByType;
std::map<TResourceType::type, int32_t>::const_iterator resource_it = resource_share.begin();
for (; resource_it != resource_share.end(); ++resource_it) {
if (_s_resource_cgroups.count(resource_it->first) > 0) {
user_share[_s_resource_cgroups[resource_it->first]] =
resource_it->second;
}
}
modify_user_cgroups(user_name, user_share, level_share);
_config_user_disk_throttle(user_name, resource_share);
// Insert user to local user's set
_local_users.insert(user_name);
}
// Using resource version, not subscribe version
_cur_version = new_fetched_resource.resourceVersion;
return AgentStatus::PALO_SUCCESS;
}
void CgroupsMgr::_config_user_disk_throttle(std::string user_name,
const std::map<TResourceType::type, int32_t>& resource_share) {
int64_t hdd_read_iops = _get_resource_value(TResourceType::type::TRESOURCE_HDD_READ_IOPS,
resource_share);
int64_t hdd_write_iops = _get_resource_value(TResourceType::type::TRESOURCE_HDD_WRITE_IOPS,
resource_share);
int64_t hdd_read_mbps = _get_resource_value(TResourceType::type::TRESOURCE_HDD_READ_MBPS,
resource_share);
int64_t hdd_write_mbps = _get_resource_value(TResourceType::type::TRESOURCE_HDD_WRITE_MBPS,
resource_share);
int64_t ssd_read_iops = _get_resource_value(TResourceType::type::TRESOURCE_SSD_READ_IOPS,
resource_share);
int64_t ssd_write_iops = _get_resource_value(TResourceType::type::TRESOURCE_SSD_WRITE_IOPS,
resource_share);
int64_t ssd_read_mbps = _get_resource_value(TResourceType::type::TRESOURCE_SSD_READ_MBPS,
resource_share);
int64_t ssd_write_mbps = _get_resource_value(TResourceType::type::TRESOURCE_SSD_WRITE_MBPS,
resource_share);
_config_disk_throttle(user_name, "", hdd_read_iops, hdd_write_iops,
hdd_read_mbps, hdd_write_mbps,
ssd_read_iops, ssd_write_iops,
ssd_read_mbps, ssd_write_mbps);
_config_disk_throttle(user_name, "low", hdd_read_iops, hdd_write_iops,
hdd_read_mbps, hdd_write_mbps,
ssd_read_iops, ssd_write_iops,
ssd_read_mbps, ssd_write_mbps);
_config_disk_throttle(user_name, "normal", hdd_read_iops, hdd_write_iops,
hdd_read_mbps, hdd_write_mbps,
ssd_read_iops, ssd_write_iops,
ssd_read_mbps, ssd_write_mbps);
_config_disk_throttle(user_name, "high", hdd_read_iops, hdd_write_iops,
hdd_read_mbps, hdd_write_mbps,
ssd_read_iops, ssd_write_iops,
ssd_read_mbps, ssd_write_mbps);
}
int64_t CgroupsMgr::_get_resource_value(const TResourceType::type resource_type,
const std::map<TResourceType::type, int32_t>& resource_share) {
int64_t resource_value = -1;
std::map<TResourceType::type, int32_t>::const_iterator it = resource_share.find(resource_type);
if (it != resource_share.end()) {
resource_value = it->second;
}
return resource_value;
}
AgentStatus CgroupsMgr::_config_disk_throttle(std::string user_name,
std::string level,
int64_t hdd_read_iops,
int64_t hdd_write_iops,
int64_t hdd_read_mbps,
int64_t hdd_write_mbps,
int64_t ssd_read_iops,
int64_t ssd_write_iops,
int64_t ssd_read_mbps,
int64_t ssd_write_mbps) {
string cgroups_path = this->_root_cgroups_path + "/" + user_name + "/" + level;
string read_bps_path = cgroups_path + "/blkio.throttle.read_bps_device";
string write_bps_path = cgroups_path + "/blkio.throttle.write_bps_device";
string read_iops_path = cgroups_path + "/blkio.throttle.read_iops_device";
string write_iops_path = cgroups_path + "/blkio.throttle.write_iops_device";
if (!is_file_exist(cgroups_path.c_str())) {
if (!boost::filesystem::create_directory(cgroups_path)) {
LOG(ERROR) << "Create cgroups: " << cgroups_path << " failed";
return AgentStatus::PALO_ERROR;
}
}
// add olap engine data path here
vector<string> data_paths(0);
OLAPRootPath::get_instance()->get_table_data_path(&data_paths);
// bulk load data path is already included in the data paths
// _exec_env->load_path_mgr()->get_load_data_path(&data_paths);
stringstream ctrl_cmd;
for (vector<string>::iterator it = data_paths.begin();
it != data_paths.end();
++it) {
// check disk type
int64_t read_iops = hdd_read_iops;
int64_t write_iops = hdd_write_iops;
int64_t read_mbps = hdd_read_mbps;
int64_t write_mbps = hdd_write_mbps;
// if the user set HDD limits but not SSD ones, fall back to the HDD values for SSD
if (OLAPRootPath::is_ssd_disk(*it)) {
read_iops = ssd_read_iops == -1 ? hdd_read_iops : ssd_read_iops;
write_iops = ssd_write_iops == -1 ? hdd_write_iops : ssd_write_iops;
read_mbps = ssd_read_mbps == -1 ? hdd_read_mbps : ssd_read_mbps;
write_mbps = ssd_write_mbps == -1 ? hdd_write_mbps : ssd_write_mbps;
}
struct stat file_stat;
if (stat(it->c_str(), &file_stat) != 0) {
continue;
}
int major_number = major(file_stat.st_dev);
int minor_number = minor(file_stat.st_dev);
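// Partitions share the whole disk's major number; rounding the minor down to
// a multiple of 16 maps a partition back to its disk (the sd driver reserves
// 16 minors per device), since blkio throttling applies per device.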
minor_number = (minor_number / 16) * 16;
if (read_iops != -1) {
ctrl_cmd << major_number << ":"
<< minor_number << " "
<< read_iops;
_echo_cmd_to_cgroup(ctrl_cmd, read_iops_path);
ctrl_cmd.clear();
ctrl_cmd.str(std::string());
}
if (write_iops != -1) {
ctrl_cmd << major_number << ":"
<< minor_number << " "
<< write_iops;
_echo_cmd_to_cgroup(ctrl_cmd, write_iops_path);
ctrl_cmd.clear();
ctrl_cmd.str(std::string());
}
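// blkio.throttle.*_bps_device expects bytes per second;
// shifting by 20 converts MB/s to B/s.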
if (read_mbps != -1) {
ctrl_cmd << major_number << ":"
<< minor_number << " "
<< (read_mbps << 20);
_echo_cmd_to_cgroup(ctrl_cmd, read_bps_path);
ctrl_cmd.clear();
ctrl_cmd.str(std::string());
}
if (write_mbps != -1) {
ctrl_cmd << major_number << ":"
<< minor_number << " "
<< (write_mbps << 20);
_echo_cmd_to_cgroup(ctrl_cmd, write_bps_path);
ctrl_cmd.clear();
ctrl_cmd.str(std::string());
}
}
return AgentStatus::PALO_SUCCESS;
}
AgentStatus CgroupsMgr::modify_user_cgroups(const string& user_name,
const map<string, int32_t>& user_share,
const map<string, int32_t>& level_share) {
// Check if the user's cgroups exist; if not, create them
string user_cgroups_path = this->_root_cgroups_path + "/" + user_name;
if (!is_file_exist(user_cgroups_path.c_str())) {
if (!boost::filesystem::create_directory(user_cgroups_path)) {
LOG(ERROR) << "Create cgroups for user " << user_name << " failed";
return AgentStatus::PALO_ERROR;
}
}
// Traverse the user resource share map to append share value to cgroup's file
for (map<string, int32_t>::const_iterator user_resource = user_share.begin();
user_resource != user_share.end(); ++user_resource){
string resource_file_name = user_resource->first;
int32_t user_share_weight = user_resource->second;
// Append the share_weight value to the file
string user_resource_path = user_cgroups_path + "/" + resource_file_name;
std::ofstream user_cgroups(user_resource_path.c_str(), std::ios::out | std::ios::app);
if (!user_cgroups.is_open()) {
return AgentStatus::PALO_ERROR;
}
user_cgroups << user_share_weight << std::endl;
user_cgroups.close();
LOG(INFO) << "Append " << user_share_weight << " to " << user_resource_path;
for (map<string, int32_t>::const_iterator level_resource = level_share.begin();
level_resource != level_share.end(); ++level_resource){
// Append resource share to level shares
string level_name = level_resource->first;
int32_t level_share_weight = level_resource->second;
// Check if the level cgroups exist
string level_cgroups_path = user_cgroups_path + "/" + level_name;
if (!is_file_exist(level_cgroups_path.c_str())) {
if (!boost::filesystem::create_directory(level_cgroups_path)) {
return AgentStatus::PALO_ERROR;
}
}
// Append the share_weight value to the file
string level_resource_path = level_cgroups_path + "/" + resource_file_name;
std::ofstream level_cgroups(level_resource_path.c_str(),
std::ios::out | std::ios::app);
if (!level_cgroups.is_open()) {
return AgentStatus::PALO_ERROR;
}
level_cgroups << level_share_weight << std::endl;
level_cgroups.close();
LOG(INFO) << "Append " << level_share_weight << " to " << level_resource_path;
}
}
return AgentStatus::PALO_SUCCESS;
}
AgentStatus CgroupsMgr::init_cgroups() {
std::string root_cgroups_tasks_path = this->_root_cgroups_path + "/tasks";
// Check if the root cgroups exists
if (is_directory(this->_root_cgroups_path.c_str())
&& is_file_exist(root_cgroups_tasks_path.c_str())) {
// Check the folder's virtual filesystem to find whether it is a cgroup file system
#ifndef BE_TEST
struct statfs fs_type;
statfs(root_cgroups_tasks_path.c_str(), &fs_type);
if (fs_type.f_type != CGROUP_SUPER_MAGIC) {
LOG(ERROR) << _root_cgroups_path << " is not a cgroups file system.";
_is_cgroups_init_success = false;
return AgentStatus::PALO_ERROR;
}
#endif
// Check if the current user has write permission to the cgroup folder
if (access(_root_cgroups_path.c_str(), W_OK) != 0) {
LOG(ERROR) << "Palo does not have write permission to "
<< _root_cgroups_path;
_is_cgroups_init_success = false;
return AgentStatus::PALO_ERROR;
}
// If root folder exists, then delete all subfolders under it
boost::filesystem::directory_iterator item_begin(this->_root_cgroups_path);
boost::filesystem::directory_iterator item_end;
for (; item_begin != item_end; item_begin++) {
if (is_directory(item_begin->path().string().c_str())) {
// Delete the sub folder
if (delete_user_cgroups(item_begin->path().filename().string())
!= AgentStatus::PALO_SUCCESS) {
LOG(ERROR) << "Could not clean subfolder "
<< item_begin->path().string();
_is_cgroups_init_success = false;
return AgentStatus::PALO_ERROR;
}
}
}
LOG(INFO) << "Initialize palo cgroups successfully under folder "
<< _root_cgroups_path;
_is_cgroups_init_success = true;
return AgentStatus::PALO_SUCCESS;
} else {
LOG(ERROR) << "Could not find a valid cgroups path for resource isolation,"
<< "current value is " << _root_cgroups_path;
_is_cgroups_init_success = false;
return AgentStatus::PALO_ERROR;
}
}
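// glibc did not provide a gettid() wrapper (until glibc 2.30), hence the raw syscall.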
#define gettid() syscall(__NR_gettid)
void CgroupsMgr::apply_cgroup(const string& user_name, const string& level) {
if (s_global_cg_mgr == nullptr) {
return;
}
s_global_cg_mgr->assign_to_cgroups(user_name, level);
}
AgentStatus CgroupsMgr::assign_to_cgroups(const string& user_name,
const string& level) {
if (!_is_cgroups_init_success) {
return AgentStatus::PALO_ERROR;
}
int64_t tid = gettid();
return assign_thread_to_cgroups(tid, user_name, level);
}
AgentStatus CgroupsMgr::assign_thread_to_cgroups(int64_t thread_id,
const string& user_name,
const string& level) {
if (!_is_cgroups_init_success) {
return AgentStatus::PALO_ERROR;
}
string tasks_path = _root_cgroups_path + "/" + user_name + "/" + level + "/tasks";
if (!is_file_exist(_root_cgroups_path + "/" + user_name)) {
tasks_path = this->_root_cgroups_path + "/"
+ _default_user_name + "/"
+ _default_level + "/tasks";
} else if (!is_file_exist(_root_cgroups_path + "/" + user_name + "/" + level)) {
tasks_path = this->_root_cgroups_path + "/" + user_name + "/tasks";
}
if (!is_file_exist(tasks_path.c_str())) {
LOG(ERROR) << "Cgroups path " << tasks_path << " not exist!";
return AgentStatus::PALO_ERROR;
}
std::ofstream tasks(tasks_path.c_str(), std::ios::out | std::ios::app);
if (!tasks.is_open()) {
// This means palo could not open this file. Maybe it does not have access to it
LOG(ERROR) << "Echo thread: " << thread_id << " to " << tasks_path << " failed!";
return AgentStatus::PALO_ERROR;
}
// Append thread id to the tasks file directly
tasks << thread_id << std::endl;
tasks.close();
return AgentStatus::PALO_SUCCESS;
}
AgentStatus CgroupsMgr::delete_user_cgroups(const string& user_name) {
string user_cgroups_path = this->_root_cgroups_path + "/" + user_name;
if (is_file_exist(user_cgroups_path.c_str())) {
// Delete sub cgroups --> level cgroups
boost::filesystem::directory_iterator item_begin(user_cgroups_path);
boost::filesystem::directory_iterator item_end;
for (; item_begin != item_end; item_begin++) {
if (is_directory(item_begin->path().string().c_str())) {
string cur_cgroups_path = item_begin->path().string();
if (this->drop_cgroups(cur_cgroups_path) < 0) {
return AgentStatus::PALO_ERROR;
}
}
}
// Delete user cgroups
if (this->drop_cgroups(user_cgroups_path) < 0) {
return AgentStatus::PALO_ERROR;
}
}
return AgentStatus::PALO_SUCCESS;
}
AgentStatus CgroupsMgr::drop_cgroups(const string& deleted_cgroups_path) {
// Try to delete the cgroups folder.
// If that fails, there may be active tasks under it; try to relocate them first.
// Currently we try up to 10 times to relocate tasks and delete the cgroups.
int32_t i = 0;
while (is_file_exist(deleted_cgroups_path)
&& rmdir(deleted_cgroups_path.c_str()) < 0
&& i < this->_drop_retry_times) {
this->relocate_tasks(deleted_cgroups_path, this->_root_cgroups_path);
++i;
#ifdef BE_TEST
boost::filesystem::remove_all(deleted_cgroups_path);
#endif
if (i == this->_drop_retry_times){
LOG(ERROR) << "drop cgroups under path: " << deleted_cgroups_path
<< " failed.";
return AgentStatus::PALO_ERROR;
}
}
return AgentStatus::PALO_SUCCESS;
}
AgentStatus CgroupsMgr::relocate_tasks(const string& src_cgroups, const string& dest_cgroups) {
string src_tasks_path = src_cgroups + "/tasks";
string dest_tasks_path = dest_cgroups + "/tasks";
std::ifstream src_tasks(src_tasks_path.c_str());
if (!src_tasks) {
return AgentStatus::PALO_ERROR;
}
std::ofstream dest_tasks(dest_tasks_path.c_str(), std::ios::out | std::ios::app);
if (!dest_tasks) {
return AgentStatus::PALO_ERROR;
}
int64_t taskid;
while (src_tasks >> taskid) {
dest_tasks << taskid << std::endl;
// If a thread id or process id no longer exists, an error occurs in the stream.
// Clear the error state after every append.
dest_tasks.clear();
}
src_tasks.close();
dest_tasks.close();
return AgentStatus::PALO_SUCCESS;
}
void CgroupsMgr::_echo_cmd_to_cgroup(stringstream& ctrl_cmd, string& cgroups_path) {
std::ofstream cgroups_stream(cgroups_path.c_str(),
std::ios::out | std::ios::app);
if (cgroups_stream.is_open()) {
cgroups_stream << ctrl_cmd.str() << std::endl;
cgroups_stream.close();
}
}
bool CgroupsMgr::is_directory(const char* file_path) {
struct stat file_stat;
if (stat(file_path, &file_stat) != 0) {
return false;
}
if (S_ISDIR(file_stat.st_mode)) {
return true;
} else {
return false;
}
}
bool CgroupsMgr::is_file_exist(const char* file_path) {
struct stat file_stat;
if (stat(file_path, &file_stat) != 0) {
return false;
}
return true;
}
bool CgroupsMgr::is_file_exist(const std::string& file_path) {
return is_file_exist(file_path.c_str());
}
} // namespace palo

185
be/src/agent/cgroups_mgr.h Normal file
@ -0,0 +1,185 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_CGROUPS_MGR_H
#define BDG_PALO_BE_SRC_AGENT_CGROUPS_MGR_H
#include <cstdint>
#include <map>
#include <mutex>
#include <set>
#include <sstream>
#include <string>
#include <sys/types.h>
#include "agent/status.h"
#include "gen_cpp/MasterService_types.h"
namespace palo {
class ExecEnv;
class CgroupsMgr {
public:
// Input parameters:
// exec_env: global variable to get global objects
// root_cgroups_path: root cgroup allocated to palo by the admin
explicit CgroupsMgr(ExecEnv* exec_env, const std::string& root_cgroups_path);
~CgroupsMgr();
// Compare the old and new user resources to find deleted users, then delete
// their cgroups, create cgroups for new users, and update all user cgroups
AgentStatus update_local_cgroups(const TFetchResourceResult& new_fetched_resource);
// Delete all existing cgroups under root path
AgentStatus init_cgroups();
// Modify cgroup resource shares under cgroups_root_path.
// Create the related cgroups if they do not exist.
//
// Input parameters:
// user_name: unique name for the user. it is a dir under cgroups_root_path
//
// user_share: a mapping from resource to share, e.g. (cpu.shares, 100)
// mapping key is resource file name in cgroup; value is share weight
//
// level_share: a mapping for shares for different levels under the user.
// mapping key is level name; value is the level's share. Currently, all resources use the same share.
AgentStatus modify_user_cgroups(const std::string& user_name,
const std::map<std::string, int32_t>& user_share,
const std::map<std::string, int32_t>& level_share);
static void apply_cgroup(const std::string& user_name,
const std::string& level);
static void apply_system_cgroup() {
apply_cgroup(_s_system_user, _s_system_group);
}
// Assign the thread calling this function to the cgroup identified by user name and level
//
// Input parameters:
// user_name&level: the user name and level used to find the cgroup
AgentStatus assign_to_cgroups(const std::string& user_name,
const std::string& level);
// Assign the thread identified by thread id to the cgroup identified by user name and level
//
// Input parameters:
// thread_id: the unique id for the thread
// user_name&level: the user name and level used to find the cgroup
AgentStatus assign_thread_to_cgroups(int64_t thread_id,
const std::string& user_name,
const std::string& level);
// Delete the user's cgroups and its sub level cgroups using DropCgroups
// Input parameters:
// user name: user name to be deleted
AgentStatus delete_user_cgroups(const std::string& user_name);
// Delete a cgroup
// If there are active tasks in this cgroups, they will be relocated
// to root cgroups.
// If there are sub cgroups, it will return error.
// Input parameters:
// deleted_cgroups_path: the absolute cgroups path to be deleted
AgentStatus drop_cgroups(const std::string& deleted_cgroups_path);
// Relocate all threads or processes in src cgroups to dest cgroups
// Ignore errors when echo to dest cgroups
// Input parameters:
// src_cgroups: absolute path for src cgroups folder
// dest_cgroups: absolute path for dest cgroups folder
AgentStatus relocate_tasks(const std::string& src_cgroups, const std::string& dest_cgroups);
int64_t get_cgroups_version() {
return _cur_version;
}
// Set the disk throttle for the user by reading the resource value from the map and echoing it to the cgroups.
// Currently, both the user and the groups under the user are set to the same value,
// because blkio throttling does not support hierarchy.
// Input parameters:
// user_name: name for the user
// resource_share: resource values fetched from the FE
void _config_user_disk_throttle(std::string user_name,
const std::map<TResourceType::type, int32_t>& resource_share);
// get user resource share value from the map
int64_t _get_resource_value(const TResourceType::type resource_type,
const std::map<TResourceType::type, int32_t>& resource_share);
// Set disk throttle according to the parameters. Currently, we set different
// values for HDD and SSD.
// Input parameters:
// hdd_read_iops: read iops for hdd disks.
// hdd_write_iops: write iops for hdd disks.
// hdd_read_mbps: read bandwidth for hdd disks, in MB/s (not bytes or KB).
// hdd_write_mbps: write bandwidth for hdd disks, in MB/s (not bytes or KB).
// ssd_read_iops: read iops for ssd disks.
// ssd_write_iops: write iops for ssd disks.
// ssd_read_mbps: read bandwidth for ssd disks, in MB/s (not bytes or KB).
// ssd_write_mbps: write bandwidth for ssd disks, in MB/s (not bytes or KB).
AgentStatus _config_disk_throttle(std::string user_name,
std::string level,
int64_t hdd_read_iops,
int64_t hdd_write_iops,
int64_t hdd_read_mbps,
int64_t hdd_write_mbps,
int64_t ssd_read_iops,
int64_t ssd_write_iops,
int64_t ssd_read_mbps,
int64_t ssd_write_mbps);
// echo command in string stream to the cgroup file
// Input parameters:
// ctrl_cmd: stringstream that contains the string to echo
// cgroups_path: target cgroup file path
void _echo_cmd_to_cgroup(std::stringstream& ctrl_cmd, std::string& cgroups_path);
// check if the path exists and it is a directory
// Input parameters:
// file_path: path to the file
bool is_directory(const char* file_path);
// check if the path exists
// Input parameters:
// file_path: path to the file
bool is_file_exist(const char* file_path);
// check if the path exists
// Input parameters:
// file_path: string value of the path
bool is_file_exist(const std::string& file_path);
public:
const static std::string _s_system_user;
const static std::string _s_system_group;
private:
ExecEnv* _exec_env;
std::string _root_cgroups_path;
int32_t _drop_retry_times = 10;
bool _is_cgroups_init_success;
std::string _default_user_name = "default";
std::string _default_level = "normal";
int64_t _cur_version;
std::set<std::string> _local_users;
std::mutex _update_cgroups_mtx;
// A static mapping from fe's resource type to cgroups file
static std::map<TResourceType::type, std::string> _s_resource_cgroups;
};
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_CGROUPS_MGR_H
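A hedged usage sketch against the interface above: pinning a worker thread to a user's level cgroup before doing work on that user's behalf. The user and level names are illustrative, and the manager is assumed to have been constructed elsewhere with a valid ExecEnv and root path.

#include "agent/cgroups_mgr.h"

// Sketch only; "example_user"/"low" are made-up names, and the level cgroup
// is assumed to have been created earlier by modify_user_cgroups().
void run_work_for_user(palo::CgroupsMgr& mgr) {
    mgr.assign_to_cgroups("example_user", "low");
    // ... do the user's work under that cgroup's cpu/io shares ...
    // System threads can hop back to the shared system group afterwards:
    palo::CgroupsMgr::apply_system_cgroup();
}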

441
be/src/agent/file_downloader.cpp Normal file
View File

@ -0,0 +1,441 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/file_downloader.h"
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#include "olap/olap_define.h"
#include "olap/file_helper.h"
#include "olap/utils.h"
using std::ofstream;
using std::ostream;
using std::string;
using std::stringstream;
namespace palo {
FileDownloader::FileDownloader(const FileDownloaderParam& param) :
_downloader_param(param) {
}
size_t FileDownloader::_write_file_callback(
void* buffer, size_t size, size_t nmemb, void* param) {
OLAPStatus status = OLAP_SUCCESS;
size_t len = size * nmemb;
if (param == NULL) {
status = OLAP_ERR_OTHER_ERROR;
OLAP_LOG_WARNING("File downloader output file handler is NULL pointer.");
return -1;
}
if (status == OLAP_SUCCESS) {
status = static_cast<FileHandler*>(param)->write(buffer, len);
if (status != OLAP_SUCCESS) {
OLAP_LOG_WARNING("File downloader callback write failed.");
}
}
return len;
}
size_t FileDownloader::_write_stream_callback(
void* buffer, size_t size, size_t nmemb, void* param) {
AgentStatus status = PALO_SUCCESS;
size_t len = size * nmemb;
if (param == NULL) {
status = PALO_FILE_DOWNLOAD_INVALID_PARAM;
OLAP_LOG_WARNING("File downloader output stream is NULL pointer.");
return -1;
}
if (status == PALO_SUCCESS) {
static_cast<ostream*>(param)->write(static_cast<const char*>(buffer), len);
}
return len;
}
AgentStatus FileDownloader::_install_opt(
OutputType output_type, CURL* curl, char* errbuf,
stringstream* output_stream, FileHandler* file_handler) {
AgentStatus status = PALO_SUCCESS;
CURLcode curl_ret = CURLE_OK;
// Set request URL
curl_ret = curl_easy_setopt(curl, CURLOPT_URL, _downloader_param.remote_file_path.c_str());
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt URL failed.[error=%s]", curl_easy_strerror(curl_ret));
}
// Set username
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_USERNAME, _downloader_param.username.c_str());
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt USERNAME failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
// Set password
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_PASSWORD, _downloader_param.password.c_str());
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt USERNAME failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
// Set process timeout
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_TIMEOUT, _downloader_param.curl_opt_timeout);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt TIMEOUT failed.[error=%s]", curl_easy_strerror(curl_ret));
}
}
// Set low speed limit and low speed time
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(
curl, CURLOPT_LOW_SPEED_LIMIT,
config::download_low_speed_limit_kbps * 1024);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING(
"curl setopt CURLOPT_LOW_SPEED_LIMIT failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(
curl, CURLOPT_LOW_SPEED_TIME, config::download_low_speed_time);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING(
"curl setopt CURLOPT_LOW_SPEED_TIME failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
// Set max recv speed(bytes/s)
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(
curl, CURLOPT_MAX_RECV_SPEED_LARGE, config::max_download_speed_kbps * 1024);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING(
"curl setopt MAX_RECV_SPEED failed.[error=%s]", curl_easy_strerror(curl_ret));
}
}
// Forbid signals
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1L);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt nosignal failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
if (strncmp(_downloader_param.remote_file_path.c_str(), "http", 4) == 0) {
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING(
"curl setopt CURLOPT_FOLLOWLOCATION failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_MAXREDIRS, 20L);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING(
"curl setopt CURLOPT_MAXREDIRS failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
}
// Set nobody
if (status == PALO_SUCCESS) {
if (output_type == OutputType::NONE) {
curl_ret = curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt CURLOPT_NOBODY failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
} else if (output_type == OutputType::STREAM) {
// Set callback function
curl_ret = curl_easy_setopt(
curl,
CURLOPT_WRITEFUNCTION,
&FileDownloader::_write_stream_callback);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt WRITEDATA failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
// Set callback function args
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_WRITEDATA,
static_cast<void*>(output_stream));
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt WRITEDATA failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
} else if (output_type == OutputType::FILE) {
// Set callback function
curl_ret = curl_easy_setopt(
curl,
CURLOPT_WRITEFUNCTION,
&FileDownloader::_write_file_callback);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt WRITEDATA failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
// Set callback function args
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_WRITEDATA,
static_cast<void*>(file_handler));
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt WRITEDATA failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
}
}
// set verbose mode
/*
if (status == PALO_SUCCESS) {
curl_easy_setopt(curl, CURLOPT_VERBOSE, config::curl_verbose_mode);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt VERBOSE MODE failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
}
*/
// set err buf
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_setopt(curl, CURLOPT_ERRORBUFFER, errbuf);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED;
OLAP_LOG_WARNING("curl setopt ERR BUF failed.[error=%s]",
curl_easy_strerror(curl_ret));
}
errbuf[0] = 0;
}
return status;
}
void FileDownloader::_get_err_info(char* errbuf, CURLcode res) {
if (res != CURLE_OK) {
size_t len = strlen(errbuf);
if (len) {
OLAP_LOG_WARNING("(%d): %s%s", res, errbuf, ((errbuf[len - 1] != '\n') ? "\n" : ""));
} else {
OLAP_LOG_WARNING("(%d): %s", res, curl_easy_strerror(res));
}
}
}
AgentStatus FileDownloader::get_length(uint64_t* length) {
AgentStatus status = PALO_SUCCESS;
CURL* curl = NULL;
CURLcode curl_ret = CURLE_OK;
curl = curl_easy_init();
// Init curl
if (curl == NULL) {
status = PALO_FILE_DOWNLOAD_CURL_INIT_FAILED;
OLAP_LOG_WARNING("internal error to get NULL curl");
}
// Set curl opt
char errbuf[CURL_ERROR_SIZE];
if (status == PALO_SUCCESS) {
status = _install_opt(OutputType::NONE, curl, errbuf, NULL, NULL);
if (PALO_SUCCESS != status) {
OLAP_LOG_WARNING("install curl opt failed.");
}
}
// Get result
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_perform(curl);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_GET_LENGTH_FAILED;
OLAP_LOG_WARNING("curl get length failed.[path=%s]",
_downloader_param.remote_file_path.c_str());
_get_err_info(errbuf, curl_ret);
} else {
double content_length = 0.0;
curl_easy_getinfo(curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD, &content_length);
*length = (uint64_t)content_length;
}
}
if (curl != NULL) {
curl_easy_cleanup(curl);
}
return status;
}
AgentStatus FileDownloader::download_file() {
AgentStatus status = PALO_SUCCESS;
CURL* curl = NULL;
CURLcode curl_ret = CURLE_OK;
curl = curl_easy_init();
if (curl == NULL) {
status = PALO_FILE_DOWNLOAD_CURL_INIT_FAILED;
OLAP_LOG_WARNING("internal error to get NULL curl");
}
FileHandler* file_handler = new FileHandler();
OLAPStatus olap_status = OLAP_SUCCESS;
// Prepare some information
if (status == PALO_SUCCESS) {
olap_status = file_handler->open_with_mode(
_downloader_param.local_file_path,
O_CREAT | O_TRUNC | O_WRONLY, S_IRUSR | S_IWUSR);
if (olap_status != OLAP_SUCCESS) {
status = PALO_FILE_DOWNLOAD_INVALID_PARAM;
OLAP_LOG_WARNING("open loacal file failed.[file_path=%s]",
_downloader_param.local_file_path.c_str());
}
}
char errbuf[CURL_ERROR_SIZE];
if (status == PALO_SUCCESS) {
status = _install_opt(OutputType::FILE, curl, errbuf, NULL, file_handler);
if (PALO_SUCCESS != status) {
OLAP_LOG_WARNING("install curl opt failed.");
}
}
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_perform(curl);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_FAILED;
OLAP_LOG_WARNING(
"curl easy perform failed.[path=%s]",
_downloader_param.remote_file_path.c_str());
_get_err_info(errbuf, curl_ret);
}
}
if (file_handler != NULL) {
file_handler->close();
delete file_handler;
file_handler = NULL;
}
if (curl != NULL) {
curl_easy_cleanup(curl);
}
return status;
}
AgentStatus FileDownloader::list_file_dir(string* file_list_string) {
AgentStatus status = PALO_SUCCESS;
CURL* curl = NULL;
CURLcode curl_ret = CURLE_OK;
curl = curl_easy_init();
// Init curl
if (curl == NULL) {
status = PALO_FILE_DOWNLOAD_CURL_INIT_FAILED;
OLAP_LOG_WARNING("internal error to get NULL curl");
}
stringstream output_string_stream;
// Set curl opt
char errbuf[CURL_ERROR_SIZE];
if (status == PALO_SUCCESS) {
status = _install_opt(OutputType::STREAM, curl, errbuf, &output_string_stream, NULL);
if (PALO_SUCCESS != status) {
OLAP_LOG_WARNING("install curl opt failed.");
}
}
// Get result
if (status == PALO_SUCCESS) {
curl_ret = curl_easy_perform(curl);
if (curl_ret != CURLE_OK) {
status = PALO_FILE_DOWNLOAD_LIST_DIR_FAIL;
OLAP_LOG_WARNING(
"curl list file dir failed.[path=%s]",
_downloader_param.remote_file_path.c_str());
_get_err_info(errbuf, curl_ret);
}
}
if (curl != NULL) {
curl_easy_cleanup(curl);
}
*file_list_string = output_string_stream.str();
return status;
}
} // namespace palo
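For reference, the write-callback wiring `_install_opt` performs is standard libcurl usage. A self-contained sketch of the same pattern, independent of palo types (the URL and output path are placeholders):

#include <cstdio>
#include <curl/curl.h>

// libcurl invokes this for each received chunk; returning fewer bytes than
// size * nmemb makes curl abort the transfer with CURLE_WRITE_ERROR.
static size_t write_to_file(void* buffer, size_t size, size_t nmemb, void* userp) {
    return fwrite(buffer, 1, size * nmemb, static_cast<FILE*>(userp));
}

int main() {
    CURL* curl = curl_easy_init();
    if (curl == NULL) {
        return 1;
    }
    FILE* out = fopen("/tmp/example.out", "wb");
    if (out == NULL) {
        curl_easy_cleanup(curl);
        return 1;
    }
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/file");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, &write_to_file);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, static_cast<void*>(out));
    curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1L);  // as above: needed in threaded code
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        fprintf(stderr, "download failed: %s\n", curl_easy_strerror(res));
    }
    fclose(out);
    curl_easy_cleanup(curl);
    return res == CURLE_OK ? 0 : 1;
}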

85
be/src/agent/file_downloader.h Normal file
View File

@ -0,0 +1,85 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_FILE_DOWNLOADER_H
#define BDG_PALO_BE_SRC_AGENT_FILE_DOWNLOADER_H
#include <iostream>
#include <pthread.h>
#include <cstdint>
#include <sstream>
#include <string>
#include "curl/curl.h"
#include "agent/status.h"
#include "olap/olap_define.h"
#include "olap/file_helper.h"
namespace palo {
const uint32_t GET_LENGTH_TIMEOUT = 10;
const uint32_t CURL_OPT_CONNECTTIMEOUT = 120;
// Downloads files from a remote server via libcurl
class FileDownloader {
public:
enum OutputType {
NONE,
STREAM,
FILE
};
struct FileDownloaderParam {
std::string username;
std::string password;
std::string remote_file_path;
std::string local_file_path;
uint32_t curl_opt_timeout;
};
explicit FileDownloader(const FileDownloaderParam& param);
virtual ~FileDownloader() {}
// Download file from remote server
virtual AgentStatus download_file();
// List remote dir file
virtual AgentStatus list_file_dir(std::string* file_list_string);
// Get file length of remote file
//
// Output parameters:
// * length: The pointer of size of remote file
virtual AgentStatus get_length(uint64_t* length);
private:
static size_t _write_file_callback(
void* buffer, size_t size, size_t nmemb, void* param);
static size_t _write_stream_callback(
void* buffer, size_t size, size_t nmemb, void* param);
AgentStatus _install_opt(
OutputType output_type, CURL* curl, char* errbuf,
std::stringstream* output_stream, FileHandler* file_handler);
void _get_err_info(char* errbuf, CURLcode res);
const FileDownloaderParam& _downloader_param;
DISALLOW_COPY_AND_ASSIGN(FileDownloader);
}; // class FileDownloader
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_FILE_DOWNLOADER_H
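A usage sketch of the class declared above; the URL, local path, and timeout value are placeholders, not values from this commit:

#include "agent/file_downloader.h"

// Sketch: download one remote file to a local path, checking its length first.
palo::AgentStatus fetch_example() {
    palo::FileDownloader::FileDownloaderParam param;
    param.username = "";
    param.password = "";
    param.remote_file_path = "http://example.host:8000/some/file";  // placeholder
    param.local_file_path = "/tmp/some_file.download";              // placeholder
    param.curl_opt_timeout = 3600;  // seconds; pick per expected file size
    palo::FileDownloader downloader(param);
    uint64_t length = 0;
    palo::AgentStatus status = downloader.get_length(&length);
    if (status != palo::PALO_SUCCESS) {
        return status;
    }
    return downloader.download_file();
}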

146
be/src/agent/heartbeat_server.cpp Normal file
View File

@ -0,0 +1,146 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/heartbeat_server.h"
#include <ctime>
#include <fstream>
#include "boost/filesystem.hpp"
#include "thrift/TProcessor.h"
#include "gen_cpp/HeartbeatService.h"
#include "gen_cpp/Status_types.h"
#include "olap/olap_rootpath.h"
#include "olap/utils.h"
using std::fstream;
using std::nothrow;
using std::string;
using std::vector;
using apache::thrift::transport::TProcessor;
namespace palo {
HeartbeatServer::HeartbeatServer(TMasterInfo* master_info) :
_master_info(master_info),
_epoch(0) {
_olap_rootpath_instance = OLAPRootPath::get_instance();
}
void HeartbeatServer::init_cluster_id() {
_master_info->cluster_id = _olap_rootpath_instance->effective_cluster_id();
}
void HeartbeatServer::heartbeat(
THeartbeatResult& heartbeat_result,
const TMasterInfo& master_info) {
AgentStatus status = PALO_SUCCESS;
TStatusCode::type status_code = TStatusCode::OK;
vector<string> error_msgs;
TStatus heartbeat_status;
OLAP_LOG_INFO("get heartbeat, host: %s, port: %d, cluster id: %d",
master_info.network_address.hostname.c_str(),
master_info.network_address.port,
master_info.cluster_id);
// Check cluster id
if (_master_info->cluster_id == -1) {
OLAP_LOG_INFO("get first heartbeat. update cluster id");
// write and update cluster id
OLAPStatus res = _olap_rootpath_instance->set_cluster_id(master_info.cluster_id);
if (res != OLAP_SUCCESS) {
OLAP_LOG_WARNING("fail to set cluster id. [res=%d]", res);
error_msgs.push_back("fail to set cluster id.");
status = PALO_ERROR;
} else {
_master_info->cluster_id = master_info.cluster_id;
OLAP_LOG_INFO("record cluster id."
"host: %s, port: %d, cluster id: %d",
master_info.network_address.hostname.c_str(),
master_info.network_address.port,
master_info.cluster_id);
}
} else {
if (_master_info->cluster_id != master_info.cluster_id) {
OLAP_LOG_WARNING("invalid cluster id: %d. ignore.", master_info.cluster_id);
error_msgs.push_back("invalid cluster id. ignore.");
status = PALO_ERROR;
}
}
if (status == PALO_SUCCESS) {
if (_master_info->network_address.hostname != master_info.network_address.hostname
|| _master_info->network_address.port != master_info.network_address.port) {
if (master_info.epoch > _epoch) {
_master_info->network_address.hostname = master_info.network_address.hostname;
_master_info->network_address.port = master_info.network_address.port;
_epoch = master_info.epoch;
OLAP_LOG_INFO("master change, new master host: %s, port: %d, epoch: %ld",
_master_info->network_address.hostname.c_str(),
_master_info->network_address.port,
_epoch);
} else {
OLAP_LOG_WARNING("epoch is not greater than local. ignore heartbeat."
"host: %s, port: %d, local epoch: %ld, received epoch: %ld",
_master_info->network_address.hostname.c_str(),
_master_info->network_address.port,
_epoch, master_info.epoch);
error_msgs.push_back("epoch is not greater than local. ignore heartbeat.");
status = PALO_ERROR;
}
}
}
TBackendInfo backend_info;
if (status == PALO_SUCCESS) {
backend_info.__set_be_port(config::be_port);
backend_info.__set_http_port(config::webserver_port);
backend_info.__set_be_rpc_port(config::be_rpc_port);
} else {
status_code = TStatusCode::RUNTIME_ERROR;
}
heartbeat_status.__set_status_code(status_code);
heartbeat_status.__set_error_msgs(error_msgs);
heartbeat_result.__set_status(heartbeat_status);
heartbeat_result.__set_backend_info(backend_info);
}
AgentStatus create_heartbeat_server(
ExecEnv* exec_env,
uint32_t server_port,
ThriftServer** thrift_server,
uint32_t worker_thread_num,
TMasterInfo* local_master_info) {
HeartbeatServer* heartbeat_server = new (nothrow) HeartbeatServer(local_master_info);
if (heartbeat_server == NULL) {
return PALO_ERROR;
}
heartbeat_server->init_cluster_id();
boost::shared_ptr<HeartbeatServer> handler(heartbeat_server);
boost::shared_ptr<TProcessor> server_processor(new HeartbeatServiceProcessor(handler));
string server_name("heartbeat");
*thrift_server = new ThriftServer(
server_name,
server_processor,
server_port,
exec_env->metrics(),
worker_thread_num);
return PALO_SUCCESS;
}
} // namespace palo
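The master-change handling above boils down to an epoch comparison: a new master address is accepted only if its epoch is strictly greater than the locally recorded one. A standalone sketch of just that rule, with the types simplified:

#include <cstdint>
#include <string>

// Sketch of the failover rule used in heartbeat(): stale epochs are ignored.
struct MasterAddr {
    std::string host;
    int port;
};

bool should_accept_master(const MasterAddr& local, int64_t local_epoch,
                          const MasterAddr& reported, int64_t reported_epoch) {
    if (local.host == reported.host && local.port == reported.port) {
        return true;  // unchanged master, nothing to decide
    }
    return reported_epoch > local_epoch;  // only a strictly newer epoch wins
}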

64
be/src/agent/heartbeat_server.h Normal file
View File

@ -0,0 +1,64 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_HEARTBEAT_SERVER_H
#define BDG_PALO_BE_SRC_AGENT_HEARTBEAT_SERVER_H
#include "thrift/transport/TTransportUtils.h"
#include "agent/status.h"
#include "gen_cpp/HeartbeatService.h"
#include "gen_cpp/Status_types.h"
#include "olap/olap_define.h"
#include "olap/olap_rootpath.h"
#include "runtime/exec_env.h"
namespace palo {
const uint32_t HEARTBEAT_INTERVAL = 10;
class HeartbeatServer : public HeartbeatServiceIf {
public:
explicit HeartbeatServer(TMasterInfo* master_info);
virtual ~HeartbeatServer() {}
virtual void init_cluster_id();
// Master send heartbeat to this server
//
// Input parameters:
// * master_info: The struct of master info, contains host ip and port
//
// Output parameters:
// * heartbeat_result: The result of the heartbeat
virtual void heartbeat(THeartbeatResult& heartbeat_result, const TMasterInfo& master_info);
private:
TMasterInfo* _master_info;
OLAPRootPath* _olap_rootpath_instance;
int64_t _epoch;
DISALLOW_COPY_AND_ASSIGN(HeartbeatServer);
}; // class HeartbeatServer
AgentStatus create_heartbeat_server(
ExecEnv* exec_env,
uint32_t heartbeat_server_port,
ThriftServer** heart_beat_server,
uint32_t worker_thread_num,
TMasterInfo* local_master_info);
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_HEARTBEAT_SERVER_H
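A start-up sketch using the factory above. The port and worker count are illustrative, and ThriftServer is assumed to expose a start() method as used elsewhere in the BE:

#include "agent/heartbeat_server.h"

// Sketch: create and start the heartbeat thrift server during BE start-up.
// exec_env and master_info are assumed to be initialized by the caller.
palo::AgentStatus start_heartbeat(palo::ExecEnv* exec_env,
                                  palo::TMasterInfo* master_info) {
    palo::ThriftServer* server = NULL;
    palo::AgentStatus status = palo::create_heartbeat_server(
            exec_env, 9050 /* illustrative port */, &server,
            1 /* worker thread */, master_info);
    if (status == palo::PALO_SUCCESS) {
        server->start();  // assumed API, mirroring other BE servers
    }
    return status;
}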

288
be/src/agent/pusher.cpp Normal file
View File

@ -0,0 +1,288 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/pusher.h"
#include <pthread.h>
#include <cstdio>
#include <ctime>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include "boost/filesystem.hpp"
#include "boost/lexical_cast.hpp"
#include "agent/cgroups_mgr.h"
#include "agent/file_downloader.h"
#include "gen_cpp/AgentService_types.h"
#include "olap/command_executor.h"
#include "olap/olap_common.h"
#include "olap/olap_define.h"
#include "olap/olap_engine.h"
#include "olap/olap_table.h"
using std::list;
using std::string;
using std::vector;
namespace palo {
Pusher::Pusher(const TPushReq& push_req) :
_push_req(push_req) {
_command_executor = new CommandExecutor();
_download_status = PALO_SUCCESS;
}
Pusher::~Pusher() {
if (_command_executor != NULL) {
delete _command_executor;
}
}
AgentStatus Pusher::init() {
AgentStatus status = PALO_SUCCESS;
if (_is_init) {
OLAP_LOG_DEBUG("has been inited");
return status;
}
// Check replica exist
SmartOLAPTable olap_table;
olap_table = _command_executor->get_table(
_push_req.tablet_id,
_push_req.schema_hash);
if (olap_table.get() == NULL) {
OLAP_LOG_WARNING("get tables failed. tablet_id: %ld, schema_hash: %ld",
_push_req.tablet_id, _push_req.schema_hash);
status = PALO_PUSH_INVALID_TABLE;
}
// Empty remote_path
if (status == PALO_SUCCESS && !_push_req.__isset.http_file_path) {
_is_init = true;
return status;
}
// Check remote path
string remote_full_path;
string tmp_file_dir;
if (status == PALO_SUCCESS) {
remote_full_path = _push_req.http_file_path;
// Get local download path
OLAP_LOG_INFO("start get file. remote_full_path:%s", remote_full_path.c_str());
string root_path = olap_table->storage_root_path_name();
status = _get_tmp_file_dir(root_path, &tmp_file_dir);
if (PALO_SUCCESS != status) {
OLAP_LOG_WARNING("get local path failed. tmp file dir: %s", tmp_file_dir.c_str());
}
}
// Set download param
if (status == PALO_SUCCESS) {
string tmp_file_name;
_get_file_name_from_path(_push_req.http_file_path, &tmp_file_name);
_downloader_param.username = "";
_downloader_param.password = "";
_downloader_param.remote_file_path = remote_full_path;
_downloader_param.local_file_path = tmp_file_dir + "/" + tmp_file_name;
_is_init = true;
}
return status;
}
// Get replica root path
AgentStatus Pusher::_get_tmp_file_dir(const string& root_path, string* download_path) {
AgentStatus status = PALO_SUCCESS;
*download_path = root_path + DPP_PREFIX;
// Check path exist
boost::filesystem::path full_path(*download_path);
if (!boost::filesystem::exists(full_path)) {
OLAP_LOG_INFO("download dir not exist: %s", download_path->c_str());
boost::system::error_code error_code;
boost::filesystem::create_directories(*download_path, error_code);
if (error_code) {
status = PALO_ERROR;
OLAP_LOG_WARNING("create download dir failed.path: %s, error code: %d",
download_path->c_str(), error_code.value());
}
}
return status;
}
AgentStatus Pusher::_download_file() {
OLAP_LOG_INFO("begin download file. tablet=%d", _push_req.tablet_id);
time_t start = time(NULL);
AgentStatus status = PALO_SUCCESS;
status = _file_downloader->download_file();
time_t cost = time(NULL) - start;
if (cost <= 0) {
cost = 1;
}
// KB/s
double rate = -1.0;
if (_push_req.__isset.http_file_size) {
rate = (double) _push_req.http_file_size / cost / 1024;
}
if (status == PALO_SUCCESS) {
OLAP_LOG_INFO("down load file success. local_file=%s, remote_file=%s, "
"tablet=%d, cost=%ld, file size: %ld B, download rate: %f KB/s",
_downloader_param.local_file_path.c_str(),
_downloader_param.remote_file_path.c_str(),
_push_req.tablet_id, cost, _push_req.http_file_size, rate);
} else {
OLAP_LOG_WARNING("down load file failed. remote_file=%s, tablet=%d, cost=%ld, "
"file size: %ld B",
_downloader_param.remote_file_path.c_str(), _push_req.tablet_id, cost,
_push_req.http_file_size);
}
// TODO: check the data length and move the temporary file to its final name
return status;
}
void Pusher::_get_file_name_from_path(const string& file_path, string* file_name) {
size_t found = file_path.find_last_of("/\\");
pthread_t tid = pthread_self();
*file_name = file_path.substr(found + 1) + "_" + boost::lexical_cast<string>(tid);
}
AgentStatus Pusher::process(vector<TTabletInfo>* tablet_infos) {
AgentStatus status = PALO_SUCCESS;
if (!_is_init) {
OLAP_LOG_WARNING("has not init yet. tablet_id: %d", _push_req.tablet_id);
return PALO_ERROR;
}
// Remote file not empty, need to download
if (_push_req.__isset.http_file_path) {
// Get file length
uint64_t file_size = 0;
uint64_t estimate_time_out = DEFAULT_DOWNLOAD_TIMEOUT;
if (_push_req.__isset.http_file_size) {
file_size = _push_req.http_file_size;
estimate_time_out =
file_size / config::download_low_speed_limit_kbps / 1024;
}
if (estimate_time_out < config::download_low_speed_time) {
estimate_time_out = config::download_low_speed_time;
}
// Download file from hdfs
for (uint32_t i = 0; i < MAX_RETRY; ++i) {
// Check timeout and set timeout
time_t now = time(NULL);
if (_push_req.timeout > 0) {
OLAP_LOG_DEBUG(
"check time out. time_out:%ld, now:%d",
_push_req.timeout, now);
if (_push_req.timeout < now) {
OLAP_LOG_WARNING("push time out");
status = PALO_PUSH_TIME_OUT;
break;
}
}
_downloader_param.curl_opt_timeout = estimate_time_out;
uint64_t timeout = _push_req.timeout > 0 ? _push_req.timeout - now : 0;
if (timeout > 0 && timeout < estimate_time_out) {
_downloader_param.curl_opt_timeout = timeout;
}
OLAP_LOG_DEBUG(
"estimate_time_out: %d, download_timeout: %d, curl_opt_timeout: %d",
estimate_time_out,
_push_req.timeout,
_downloader_param.curl_opt_timeout);
OLAP_LOG_DEBUG("download file, retry time:%d", i);
#ifndef BE_TEST
_file_downloader = new FileDownloader(_downloader_param);
_download_status = _download_file();
if (_file_downloader != NULL) {
delete _file_downloader;
_file_downloader = NULL;
}
#endif
status = _download_status;
if (_push_req.__isset.http_file_size && status == PALO_SUCCESS) {
// Check file size
boost::filesystem::path local_file_path(_downloader_param.local_file_path);
uint64_t local_file_size = boost::filesystem::file_size(local_file_path);
OLAP_LOG_DEBUG(
"file_size: %d, local_file_size: %d",
file_size, local_file_size);
if (file_size != local_file_size) {
OLAP_LOG_WARNING(
"download_file size error. file_size: %d, local_file_size: %d",
file_size, local_file_size);
status = PALO_FILE_DOWNLOAD_FAILED;
}
}
if (status == PALO_SUCCESS) {
_push_req.http_file_path = _downloader_param.local_file_path;
break;
}
#ifndef BE_TEST
sleep(config::sleep_one_second);
#endif
}
}
if (status == PALO_SUCCESS) {
// Load delta file
time_t push_begin = time(NULL);
OLAPStatus push_status = _command_executor->push(_push_req, tablet_infos);
time_t push_finish = time(NULL);
OLAP_LOG_INFO("Push finish, cost time: %ld", push_finish - push_begin);
if (push_status != OLAPStatus::OLAP_SUCCESS) {
status = PALO_ERROR;
}
}
// Delete download file
boost::filesystem::path download_file_path(_downloader_param.local_file_path);
if (boost::filesystem::exists(download_file_path)) {
if (remove(_downloader_param.local_file_path.c_str()) == -1) {
OLAP_LOG_WARNING("can not remove file: %s",
_downloader_param.local_file_path.c_str());
}
}
return status;
}
} // namespace palo
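The download timeout estimate in process() is just file size divided by the configured low-speed floor, clamped from below by the low-speed window. A worked standalone sketch with assumed config values:

#include <cstdint>
#include <cstdio>

// Sketch of the estimate used above: a transfer must at least sustain
// the low-speed floor, so size / floor bounds the expected transfer time.
uint64_t estimate_timeout_s(uint64_t file_size_bytes,
                            uint64_t low_speed_limit_kbps,
                            uint64_t low_speed_time_s) {
    uint64_t estimate = file_size_bytes / (low_speed_limit_kbps * 1024);
    if (estimate < low_speed_time_s) {
        estimate = low_speed_time_s;  // never below the low-speed window
    }
    return estimate;
}

int main() {
    // e.g. a 1 GiB file with an assumed 50 KB/s floor and 300 s minimum window:
    printf("%lu\n", (unsigned long)estimate_timeout_s(1ULL << 30, 50, 300));  // 20971
    return 0;
}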

66
be/src/agent/pusher.h Normal file
View File

@ -0,0 +1,66 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_PUSHER_H
#define BDG_PALO_BE_SRC_AGENT_PUSHER_H
#include <utility>
#include <vector>
#include "agent/file_downloader.h"
#include "agent/status.h"
#include "gen_cpp/AgentService_types.h"
#include "olap/command_executor.h"
#include "olap/olap_common.h"
#include "olap/olap_define.h"
namespace palo {
const uint32_t MAX_RETRY = 3;
const uint32_t DEFAULT_DOWNLOAD_TIMEOUT = 3600;
class Pusher {
public:
explicit Pusher(const TPushReq& push_req);
virtual ~Pusher();
// Initialize the pusher
virtual AgentStatus init();
// Push data to the OLAP engine
//
// Output parameters:
// * tablet_infos: The info of pushed tablet after push data
virtual AgentStatus process(std::vector<TTabletInfo>* tablet_infos);
private:
AgentStatus _get_tmp_file_dir(const std::string& root_path, std::string* local_path);
AgentStatus _download_file();
void _get_file_name_from_path(const std::string& file_path, std::string* file_name);
bool _is_init = false;
TPushReq _push_req;
FileDownloader::FileDownloaderParam _downloader_param;
CommandExecutor* _command_executor;
FileDownloader* _file_downloader = NULL;
AgentStatus _download_status;
DISALLOW_COPY_AND_ASSIGN(Pusher);
}; // class Pusher
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_PUSHER_H
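A usage sketch for this class: the request would normally come from the agent task queue, and error handling is elided:

#include <vector>
#include "agent/pusher.h"

// Sketch: run one push request end to end.
palo::AgentStatus run_push(const palo::TPushReq& push_req) {
    palo::Pusher pusher(push_req);
    palo::AgentStatus status = pusher.init();
    if (status != palo::PALO_SUCCESS) {
        return status;
    }
    std::vector<palo::TTabletInfo> tablet_infos;
    status = pusher.process(&tablet_infos);
    // On success, tablet_infos describes the pushed tablets.
    return status;
}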

48
be/src/agent/status.h Normal file
View File

@ -0,0 +1,48 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_STATUS_H
#define BDG_PALO_BE_SRC_AGENT_STATUS_H
namespace palo {
enum AgentStatus {
PALO_SUCCESS = 0,
PALO_ERROR = -1,
PALO_TASK_REQUEST_ERROR = -101,
PALO_FILE_DOWNLOAD_INVALID_PARAM = -201,
PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED = -202,
PALO_FILE_DOWNLOAD_CURL_INIT_FAILED = -203,
PALO_FILE_DOWNLOAD_FAILED = -204,
PALO_FILE_DOWNLOAD_GET_LENGTH_FAILED = -205,
PALO_FILE_DOWNLOAD_NOT_EXIST = -206,
PALO_FILE_DOWNLOAD_LIST_DIR_FAIL = -207,
PALO_CREATE_TABLE_EXIST = -301,
PALO_CREATE_TABLE_DIFF_SCHEMA_EXIST = -302,
PALO_CREATE_TABLE_NOT_EXIST = -303,
PALO_DROP_TABLE_NOT_EXIST = -401,
PALO_PUSH_INVALID_TABLE = -501,
PALO_PUSH_INVALID_VERSION = -502,
PALO_PUSH_TIME_OUT = -503,
PALO_PUSH_HAD_LOADED = -504,
PALO_TIMEOUT = -901,
PALO_INTERNAL_ERROR = -902,
};
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_STATUS_H
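These codes surface in logs as raw integers; a small helper like the following sketch (not part of this commit) keeps them readable:

#include "agent/status.h"

// Sketch: map a few common AgentStatus values to strings for logging.
// Extend as needed; unknown codes fall through to "UNKNOWN".
const char* agent_status_name(palo::AgentStatus status) {
    switch (status) {
    case palo::PALO_SUCCESS:              return "PALO_SUCCESS";
    case palo::PALO_ERROR:                return "PALO_ERROR";
    case palo::PALO_PUSH_TIME_OUT:        return "PALO_PUSH_TIME_OUT";
    case palo::PALO_FILE_DOWNLOAD_FAILED: return "PALO_FILE_DOWNLOAD_FAILED";
    default:                              return "UNKNOWN";
    }
}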

File diff suppressed because it is too large

166
be/src/agent/task_worker_pool.h Normal file
View File

@ -0,0 +1,166 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_TASK_WORKER_POOL_H
#define BDG_PALO_BE_SRC_TASK_WORKER_POOL_H
#include <atomic>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>
#include "agent/pusher.h"
#include "agent/status.h"
#include "agent/utils.h"
#include "gen_cpp/AgentService_types.h"
#include "gen_cpp/HeartbeatService_types.h"
#include "olap/command_executor.h"
#include "olap/olap_define.h"
#include "olap/utils.h"
namespace palo {
const uint32_t DOWNLOAD_FILE_MAX_RETRY = 3;
const uint32_t TASK_FINISH_MAX_RETRY = 3;
const uint32_t PUSH_MAX_RETRY = 1;
const uint32_t REPORT_TASK_WORKER_COUNT = 1;
const uint32_t REPORT_DISK_STATE_WORKER_COUNT = 1;
const uint32_t REPORT_OLAP_TABLE_WORKER_COUNT = 1;
const uint32_t LIST_REMOTE_FILE_TIMEOUT = 15;
const std::string HTTP_REQUEST_PREFIX = "/api/_tablet/_download?file=";
class TaskWorkerPool {
public:
enum TaskWorkerType {
CREATE_TABLE,
DROP_TABLE,
PUSH,
DELETE,
ALTER_TABLE,
QUERY_SPLIT_KEY,
CLONE,
STORAGE_MEDIUM_MIGRATE,
CANCEL_DELETE_DATA,
CHECK_CONSISTENCY,
REPORT_TASK,
REPORT_DISK_STATE,
REPORT_OLAP_TABLE,
UPLOAD,
RESTORE,
MAKE_SNAPSHOT,
RELEASE_SNAPSHOT
};
typedef void* (*CALLBACK_FUNCTION)(void*);
TaskWorkerPool(
const TaskWorkerType task_worker_type,
const TMasterInfo& master_info);
virtual ~TaskWorkerPool();
// Start the task worker callback thread
virtual void start();
// Submit task to task pool
//
// Input parameters:
// * task: the task for a callback thread to execute
virtual void submit_task(const TAgentTaskRequest& task);
private:
bool _record_task_info(
const TTaskType::type task_type, int64_t signature, const std::string& user);
void _remove_task_info(
const TTaskType::type task_type, int64_t signature, const std::string& user);
void _spawn_callback_worker_thread(CALLBACK_FUNCTION callback_func);
void _finish_task(const TFinishTaskRequest& finish_task_request);
uint32_t _get_next_task_index(int32_t thread_count, std::deque<TAgentTaskRequest>& tasks,
TPriority::type priority);
static void* _create_table_worker_thread_callback(void* arg_this);
static void* _drop_table_worker_thread_callback(void* arg_this);
static void* _push_worker_thread_callback(void* arg_this);
static void* _alter_table_worker_thread_callback(void* arg_this);
static void* _clone_worker_thread_callback(void* arg_this);
static void* _storage_medium_migrate_worker_thread_callback(void* arg_this);
static void* _cancel_delete_data_worker_thread_callback(void* arg_this);
static void* _check_consistency_worker_thread_callback(void* arg_this);
static void* _report_task_worker_thread_callback(void* arg_this);
static void* _report_disk_state_worker_thread_callback(void* arg_this);
static void* _report_olap_table_worker_thread_callback(void* arg_this);
static void* _upload_worker_thread_callback(void* arg_this);
static void* _restore_worker_thread_callback(void* arg_this);
static void* _make_snapshot_thread_callback(void* arg_this);
static void* _release_snapshot_thread_callback(void* arg_this);
AgentStatus _clone_copy(
const TCloneReq& clone_req,
int64_t signature,
const std::string& local_data_path,
TBackend* src_host,
std::string* src_file_path,
std::vector<std::string>* error_msgs);
void _alter_table(
const TAlterTabletReq& create_rollup_request,
int64_t signature,
const TTaskType::type task_type,
TFinishTaskRequest* finish_task_request);
AlterTableStatus _show_alter_table_status(
const TTabletId tablet_id,
const TSchemaHash schema_hash);
AgentStatus _drop_table(const TDropTabletReq drop_tablet_req);
AgentStatus _get_tablet_info(
const TTabletId tablet_id,
const TSchemaHash schema_hash,
int64_t signature,
TTabletInfo* tablet_info);
const TMasterInfo& _master_info;
TBackend _backend;
AgentUtils* _agent_utils;
MasterServerClient* _master_client;
CommandExecutor* _command_executor;
#ifdef BE_TEST
AgentServerClient* _agent_client;
FileDownloader* _file_downloader_ptr;
Pusher* _pusher;
#endif
std::deque<TAgentTaskRequest> _tasks;
MutexLock _worker_thread_lock;
Condition _worker_thread_condition_lock;
uint32_t _worker_count;
TaskWorkerType _task_worker_type;
CALLBACK_FUNCTION _callback_function;
static std::atomic_ulong _s_report_version;
static std::map<TTaskType::type, std::set<int64_t>> _s_task_signatures;
static std::map<TTaskType::type, std::map<std::string, uint32_t>> _s_running_task_user_count;
static std::map<TTaskType::type, std::map<std::string, uint32_t>> _s_total_task_user_count;
static std::map<TTaskType::type, uint32_t> _s_total_task_count;
static MutexLock _s_task_signatures_lock;
static MutexLock _s_running_task_user_count_lock;
static FrontendServiceClientCache _master_service_client_cache;
DISALLOW_COPY_AND_ASSIGN(TaskWorkerPool);
}; // class TaskWorkerPool
} // namespace palo
#endif // BDG_PALO_BE_SRC_TASK_WORKER_POOL_H
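A usage sketch for the pool: one pool is created per task type, start() spawns its callback workers, and submit_task() queues work for them. The surrounding lifetime management is elided:

#include "agent/task_worker_pool.h"

// Sketch: create a PUSH worker pool and feed it one task.
// master_info and task are assumed to come from the agent server.
void start_push_pool(const palo::TMasterInfo& master_info,
                     const palo::TAgentTaskRequest& task) {
    palo::TaskWorkerPool pool(palo::TaskWorkerPool::PUSH, master_info);
    pool.start();            // spawns the push worker threads
    pool.submit_task(task);  // queued, then picked up by a worker
}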

41
be/src/agent/topic_listener.h Normal file
View File

@ -0,0 +1,41 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_TOPIC_LISTENER_H
#define BDG_PALO_BE_SRC_AGENT_TOPIC_LISTENER_H
#include "gen_cpp/AgentService_types.h"
namespace palo {
class TopicListener {
public:
virtual ~TopicListener() {}
// Deal with a single update
//
// Input parameters:
// protocol_version: the protocol version of the message; listeners should handle the update according to this version
// topic_update: single update
virtual void handle_update(const TAgentServiceVersion::type& protocol_version,
const TTopicUpdate& topic_update) = 0;
};
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_TOPIC_LISTENER_H
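A minimal listener implementation against this interface; the logging body is illustrative:

#include <iostream>
#include "agent/topic_listener.h"

namespace palo {

// Sketch: a listener that just prints the topic type of each update.
class LoggingTopicListener : public TopicListener {
public:
    virtual ~LoggingTopicListener() {}
    virtual void handle_update(const TAgentServiceVersion::type& protocol_version,
                               const TTopicUpdate& topic_update) {
        std::cout << "update for topic " << topic_update.type
                  << " (protocol " << protocol_version << ")" << std::endl;
    }
};

} // namespace palo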

65
be/src/agent/topic_subscriber.cpp Normal file
View File

@ -0,0 +1,65 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/topic_subscriber.h"
#include "common/logging.h"
namespace palo {
TopicSubscriber::TopicSubscriber() {
}
TopicSubscriber::~TopicSubscriber() {
// Delete all listeners in the register
std::map<TTopicType::type, std::vector<TopicListener*>>::iterator it
= _registed_listeners.begin();
for (; it != _registed_listeners.end(); ++it) {
std::vector<TopicListener*>& listeners = it->second;
std::vector<TopicListener*>::iterator listener_it = listeners.begin();
for (; listener_it != listeners.end(); ++listener_it) {
delete *listener_it;
}
}
}
void TopicSubscriber::register_listener(TTopicType::type topic_type, TopicListener* listener) {
// Take the unique (writer) lock: registration must exclude concurrent readers
boost::unique_lock<boost::shared_mutex> lock(_listener_mtx);
this->_registed_listeners[topic_type].push_back(listener);
}
void TopicSubscriber::handle_updates(const TAgentPublishRequest& agent_publish_request) {
LOG(INFO) << "Received master's published state, begin to handle";
// Take the shared (reader) lock: handling must exclude concurrent registration
boost::shared_lock<boost::shared_mutex> lock(_listener_mtx);
// The protocol version is not handled here; each listener handles it itself
const std::vector<TTopicUpdate>& topic_updates = agent_publish_request.updates;
std::vector<TTopicUpdate>::const_iterator topic_update_it = topic_updates.begin();
for (; topic_update_it != topic_updates.end(); ++topic_update_it) {
// Use find() instead of operator[]: operator[] would insert into the map
// while only the shared (read) lock is held.
std::map<TTopicType::type, std::vector<TopicListener*>>::iterator reg_it
= _registed_listeners.find(topic_update_it->type);
if (reg_it == _registed_listeners.end()) {
continue;
}
std::vector<TopicListener*>& listeners = reg_it->second;
std::vector<TopicListener*>::iterator listener_it = listeners.begin();
// Send the update to all listeners with protocol version.
for (; listener_it != listeners.end(); ++listener_it) {
(*listener_it)->handle_update(agent_publish_request.protocol_version,
*topic_update_it);
}
}
LOG(INFO) << "Handle master's published state finished";
}
} // namespace palo

46
be/src/agent/topic_subscriber.h Normal file
View File

@ -0,0 +1,46 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_TOPIC_SUBSCRIBER_H
#define BDG_PALO_BE_SRC_AGENT_TOPIC_SUBSCRIBER_H
#include <map>
#include <boost/thread.hpp>
#include "agent/topic_listener.h"
#include "gen_cpp/AgentService_types.h"
namespace palo {
class TopicSubscriber {
public:
TopicSubscriber();
~TopicSubscriber();
// Put the topic type and listener to the map
void register_listener(TTopicType::type topic_type, TopicListener* listener);
// Handle all updates in the request
void handle_updates(const TAgentPublishRequest& agent_publish_request);
private:
std::map<TTopicType::type, std::vector<TopicListener*>> _registed_listeners;
boost::shared_mutex _listener_mtx;
};
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_TOPIC_SUBSCRIBER_H
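Wiring subscriber and listener together; LoggingTopicListener is the illustrative class sketched after topic_listener.h, and TTopicType::RESOURCE is assumed to be a defined topic. Ownership of the listener passes to the subscriber, which deletes registered listeners in its destructor:

#include "agent/topic_subscriber.h"

// Sketch: register a listener and dispatch one publish request to it.
void example_subscribe(const palo::TAgentPublishRequest& request) {
    palo::TopicSubscriber subscriber;
    subscriber.register_listener(palo::TTopicType::RESOURCE,  // assumed topic
                                 new palo::LoggingTopicListener());
    subscriber.handle_updates(request);  // fans out to registered listeners
}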

112
be/src/agent/user_resource_listener.cpp Normal file
View File

@ -0,0 +1,112 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/user_resource_listener.h"
#include <map>
#include <future>
#include <thrift/Thrift.h>
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TBufferTransports.h>
#include <thrift/protocol/TBinaryProtocol.h>
#include <thrift/TApplicationException.h>
#include "common/logging.h"
#include "gen_cpp/FrontendService.h"
namespace palo {
using std::string;
using apache::thrift::TException;
using apache::thrift::transport::TTransportException;
// Initialize the resource to cgroups file mapping
// TRESOURCE_IOPS not mapped
UserResourceListener::UserResourceListener(ExecEnv* exec_env,
const TMasterInfo& master_info)
: _master_info(master_info),
_master_client_cache(exec_env->frontend_client_cache()),
_cgroups_mgr(*(exec_env->cgroups_mgr())) {
}
UserResourceListener::~UserResourceListener() {
}
void UserResourceListener::handle_update(const TAgentServiceVersion::type& protocol_version,
const TTopicUpdate& topic_update) {
std::vector<TTopicItem> updates = topic_update.updates;
if (updates.size() > 0) {
int64_t new_version = updates[0].int_value;
// Dispatch update_users_resource via std::async. Caveat: the returned future
// is discarded, and ~future() joins an async task, so this call is effectively
// synchronous (see the standalone sketch after this file).
LOG(INFO) << "Latest version for master is " << new_version;
std::async(std::launch::async,
&UserResourceListener::update_users_resource,
this, new_version);
}
}
void UserResourceListener::update_users_resource(int64_t new_version) {
if (new_version <= _cgroups_mgr.get_cgroups_version()) {
return;
}
LOG(INFO) << "New version " << new_version
<< " is bigger than older version " << _cgroups_mgr.get_cgroups_version();
// Call fe to get latest user resource
Status master_status;
// using 500ms as default timeout value
FrontendServiceConnection client(_master_client_cache,
_master_info.network_address,
500,
&master_status);
TFetchResourceResult new_fetched_resource;
if (!master_status.ok()) {
LOG(ERROR) << "Get frontend client failed, with address:"
<< _master_info.network_address.hostname << ":"
<< _master_info.network_address.port;
return;
}
try {
try {
LOG(INFO) << "Call master to get resource";
client->fetchResource(new_fetched_resource);
LOG(INFO) << "Call master to get resource successfully";
} catch (TTransportException& e) {
// reopen the client and set timeout to 500ms
master_status = client.reopen(500);
if (!master_status.ok()) {
LOG(ERROR) << "Reopen to get frontend client failed, with address:"
<< _master_info.network_address.hostname << ":"
<< _master_info.network_address.port;
return;
}
LOG(WARNING) << "fetchResource from frontend failed, retry!";
client->fetchResource(new_fetched_resource);
}
} catch (TException& e) {
// Already try twice, log here
LOG(ERROR) << "retry to fetchResource from "
<< _master_info.network_address.hostname << ":"
<< _master_info.network_address.port << " failed:\n"
<< e.what();
return;
}
LOG(INFO) << "Begin to update user's cgroups resource";
_cgroups_mgr.update_local_cgroups(new_fetched_resource);
}
} // namespace palo
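The caveat noted in handle_update above, demonstrated standalone: a std::future returned by std::async(std::launch::async, ...) joins in its destructor, so discarding it makes the call block:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main() {
    auto start = std::chrono::steady_clock::now();
    // The temporary future is destroyed at the end of this full expression,
    // and ~future() joins the async task, so this line takes about 1 second.
    std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::seconds(1));
    });
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
    std::cout << "blocked for " << elapsed << " ms" << std::endl;  // ~1000 ms
    return 0;
}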

52
be/src/agent/user_resource_listener.h Normal file
View File

@ -0,0 +1,52 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_USER_RESOURCE_LISTENER_H
#define BDG_PALO_BE_SRC_AGENT_USER_RESOURCE_LISTENER_H
#include <string>
#include "agent/topic_listener.h"
#include "agent/cgroups_mgr.h"
#include "gen_cpp/AgentService_types.h"
#include "gen_cpp/MasterService_types.h"
#include "gen_cpp/HeartbeatService_types.h"
#include "runtime/exec_env.h"
namespace palo {
class UserResourceListener : public TopicListener {
public:
~UserResourceListener();
// Input parameters:
// exec_env: global execution environment (provides the frontend client cache and the cgroups mgr)
// master_info: address and epoch information of the current master
UserResourceListener(ExecEnv* exec_env, const TMasterInfo& master_info);
// This method should be async
virtual void handle_update(const TAgentServiceVersion::type& protocol_version,
const TTopicUpdate& topic_update);
private:
const TMasterInfo& _master_info;
FrontendServiceClientCache* _master_client_cache;
CgroupsMgr& _cgroups_mgr;
// Call cgroups mgr to update user's cgroups resource share
// Also refresh local user resource's cache
void update_users_resource(int64_t new_version);
};
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_USER_RESOURCE_LISTENER_H

399
be/src/agent/utils.cpp Normal file
View File

@ -0,0 +1,399 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "agent/utils.h"
#include <arpa/inet.h>
#include <cstdio>
#include <errno.h>
#include <fstream>
#include <iostream>
#include <netdb.h>
#include <netinet/in.h>
#include <sstream>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <boost/filesystem.hpp>
#include <thrift/Thrift.h>
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TTransportException.h>
#include <thrift/transport/TTransportUtils.h>
#include <rapidjson/document.h>
#include <rapidjson/rapidjson.h>
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>
#include "common/status.h"
#include "gen_cpp/AgentService_types.h"
#include "gen_cpp/HeartbeatService_types.h"
#include "gen_cpp/FrontendService.h"
#include "gen_cpp/Status_types.h"
#include "olap/utils.h"
#include "runtime/exec_env.h"
using std::map;
using std::pair;
using std::string;
using std::stringstream;
using std::vector;
using apache::thrift::protocol::TBinaryProtocol;
using apache::thrift::TException;
using apache::thrift::transport::TSocket;
using apache::thrift::transport::TBufferedTransport;
using apache::thrift::transport::TTransportException;
namespace palo {
AgentServerClient::AgentServerClient(const TBackend backend) :
_socket(new TSocket(backend.host, backend.be_port)),
_transport(new TBufferedTransport(_socket)),
_protocol(new TBinaryProtocol(_transport)),
_agent_service_client(_protocol) {
}
AgentServerClient::~AgentServerClient() {
if (_transport != NULL) {
_transport->close();
}
}
AgentStatus AgentServerClient::make_snapshot(
const TSnapshotRequest& snapshot_request,
TAgentResult* result) {
AgentStatus status = PALO_SUCCESS;
TAgentResult thrift_result;
try {
_transport->open();
_agent_service_client.make_snapshot(thrift_result, snapshot_request);
*result = thrift_result;
_transport->close();
} catch (TException& e) {
OLAP_LOG_WARNING("agent clinet make snapshot, "
"get exception, error: %s", e.what());
_transport->close();
status = PALO_ERROR;
}
return status;
}
AgentStatus AgentServerClient::release_snapshot(
const string& snapshot_path,
TAgentResult* result) {
AgentStatus status = PALO_SUCCESS;
try {
_transport->open();
_agent_service_client.release_snapshot(*result, snapshot_path);
_transport->close();
} catch (TException& e) {
OLAP_LOG_WARNING("agent clinet make snapshot, "
"get exception, error: %s", e.what());
_transport->close();
status = PALO_ERROR;
}
return status;
}
MasterServerClient::MasterServerClient(
const TMasterInfo& master_info,
FrontendServiceClientCache* client_cache) :
_master_info(master_info),
_client_cache(client_cache) {
}
AgentStatus MasterServerClient::finish_task(
const TFinishTaskRequest request,
TMasterResult* result) {
Status client_status;
FrontendServiceConnection client(
_client_cache,
_master_info.network_address,
MASTER_CLIENT_TIMEOUT,
&client_status);
if (!client_status.ok()) {
OLAP_LOG_WARNING("master client, get client from cache failed."
"host: %s, port: %d, code: %d",
_master_info.network_address.hostname.c_str(),
_master_info.network_address.port,
client_status.code());
return PALO_ERROR;
}
try {
try {
client->finishTask(*result, request);
} catch (TTransportException& e) {
OLAP_LOG_WARNING("master client, retry finishTask: %s", e.what());
client_status = client.reopen(MASTER_CLIENT_TIMEOUT);
if (!client_status.ok()) {
OLAP_LOG_WARNING("master client, get client from cache failed."
"host: %s, port: %d, code: %d",
_master_info.network_address.hostname.c_str(),
_master_info.network_address.port,
client_status.code());
return PALO_ERROR;
}
client->finishTask(*result, request);
}
} catch (TException& e) {
OLAP_LOG_WARNING("master client, finishTask execute failed."
"host: %s, port: %d, error: %s",
_master_info.network_address.hostname.c_str(),
_master_info.network_address.port,
e.what());
return PALO_ERROR;
}
return PALO_SUCCESS;
}
AgentStatus MasterServerClient::report(const TReportRequest request, TMasterResult* result) {
Status client_status;
FrontendServiceConnection client(
_client_cache,
_master_info.network_address,
MASTER_CLIENT_TIMEOUT,
&client_status);
if (!client_status.ok()) {
OLAP_LOG_WARNING("master client, get client from cache failed."
"host: %s, port: %d, code: %d",
_master_info.network_address.hostname.c_str(),
_master_info.network_address.port,
client_status.code());
return PALO_ERROR;
}
try {
try {
client->report(*result, request);
} catch (TTransportException& e) {
TTransportException::TTransportExceptionType type = e.getType();
if (type != TTransportException::TTransportExceptionType::TIMED_OUT) {
// if not TIMED_OUT, retry
OLAP_LOG_WARNING("master client, retry report: %s", e.what());
client_status = client.reopen(MASTER_CLIENT_TIMEOUT);
if (!client_status.ok()) {
OLAP_LOG_WARNING("master client, get client from cache failed."
"host: %s, port: %d, code: %d",
_master_info.network_address.hostname.c_str(),
_master_info.network_address.port,
client_status.code());
return PALO_ERROR;
}
client->report(*result, request);
} else {
// TIMED_OUT exception: do not retry.
// We don't actually care what the FE returns in this case.
OLAP_LOG_WARNING("master client, report failed: %s", e.what());
return PALO_ERROR;
}
}
} catch (TException& e) {
OLAP_LOG_WARNING("master client, finish report failed."
"host: %s, port: %d, code: %d",
_master_info.network_address.hostname.c_str(),
_master_info.network_address.port,
client_status.code());
return PALO_ERROR;
}
return PALO_SUCCESS;
}
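Both finish_task() and report() above share the same retry-once idiom: attempt the RPC, and on a transport error reopen the cached connection and try exactly once more (report() additionally skips the retry when the error is TIMED_OUT). The standalone sketch below distills the idiom; Status, Connection, and call_with_one_retry are illustrative stand-ins, not palo's actual types.
#include <functional>
#include <stdexcept>
struct Status { bool ok; };
// Sketch only: run 'rpc' once; on a transport-style error, reopen the
// connection and run it a second and final time.
template <typename Connection>
Status call_with_one_retry(Connection& conn, const std::function<void()>& rpc) {
    try {
        try {
            rpc();                      // first attempt
        } catch (const std::runtime_error&) {
            Status st = conn.reopen();  // transport error: reopen once
            if (!st.ok) {
                return st;              // could not reconnect; give up
            }
            rpc();                      // second and final attempt
        }
    } catch (const std::runtime_error&) {
        return Status{false};           // both attempts failed
    }
    return Status{true};
}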
AgentStatus AgentUtils::rsync_from_remote(
const string& remote_host,
const string& remote_file_path,
const string& local_file_path,
const vector<string>& exclude_file_patterns,
uint32_t transport_speed_limit_kbps,
uint32_t timeout_second) {
int ret_code = 0;
stringstream cmd_stream;
cmd_stream << "rsync -r -q -e \"ssh -o StrictHostKeyChecking=no\"";
for (const auto& exclude_file_pattern : exclude_file_patterns) {
cmd_stream << " --exclude=" << exclude_file_pattern;
}
if (transport_speed_limit_kbps != 0) {
cmd_stream << " --bwlimit=" << transport_speed_limit_kbps;
}
if (timeout_second != 0) {
cmd_stream << " --timeout=" << timeout_second;
}
cmd_stream << " " << remote_host << ":" << remote_file_path << " " << local_file_path;
OLAP_LOG_INFO("rsync cmd: %s", cmd_stream.str().c_str());
FILE* fp = NULL;
fp = popen(cmd_stream.str().c_str(), "r");
if (fp == NULL) {
return PALO_ERROR;
}
ret_code = pclose(fp);
if (ret_code != 0) {
return PALO_ERROR;
}
return PALO_SUCCESS;
}
char* AgentUtils::get_local_ip() {
char hname[128];
gethostname(hname, sizeof(hname));
// Let's hope this is not broken in the glibc we're using
struct hostent hent;
struct hostent *he = 0;
char hbuf[2048];
int err = 0;
if (gethostbyname_r(hname, &hent, hbuf, sizeof(hbuf), &he, &err) != 0
|| he == 0) {
LOG(ERROR) << "gethostbyname : " << hname << ", "
<< "error: " << err;
return NULL;
}
return inet_ntoa(*(struct in_addr*)(he->h_addr_list[0]));
}
std::string AgentUtils::print_agent_status(AgentStatus status) {
switch (status) {
case PALO_SUCCESS:
return "PALO_SUCCESS";
case PALO_ERROR:
return "PALO_ERROR";
case PALO_TASK_REQUEST_ERROR:
return "PALO_TASK_REQUEST_ERROR";
case PALO_FILE_DOWNLOAD_INVALID_PARAM:
return "PALO_FILE_DOWNLOAD_INVALID_PARAM";
case PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED:
return "PALO_FILE_DOWNLOAD_INSTALL_OPT_FAILED";
case PALO_FILE_DOWNLOAD_CURL_INIT_FAILED:
return "PALO_FILE_DOWNLOAD_CURL_INIT_FAILED";
case PALO_FILE_DOWNLOAD_FAILED:
return "PALO_FILE_DOWNLOAD_FAILED";
case PALO_FILE_DOWNLOAD_GET_LENGTH_FAILED:
return "PALO_FILE_DOWNLOAD_GET_LENGTH_FAILED";
case PALO_FILE_DOWNLOAD_NOT_EXIST:
return "PALO_FILE_DOWNLOAD_NOT_EXIST";
case PALO_FILE_DOWNLOAD_LIST_DIR_FAIL:
return "PALO_FILE_DOWNLOAD_LIST_DIR_FAIL";
case PALO_CREATE_TABLE_EXIST:
return "PALO_CREATE_TABLE_EXIST";
case PALO_CREATE_TABLE_DIFF_SCHEMA_EXIST:
return "PALO_CREATE_TABLE_DIFF_SCHEMA_EXIST";
case PALO_CREATE_TABLE_NOT_EXIST:
return "PALO_CREATE_TABLE_NOT_EXIST";
case PALO_DROP_TABLE_NOT_EXIST:
return "PALO_DROP_TABLE_NOT_EXIST";
case PALO_PUSH_INVALID_TABLE:
return "PALO_PUSH_INVALID_TABLE";
case PALO_PUSH_INVALID_VERSION:
return "PALO_PUSH_INVALID_VERSION";
case PALO_PUSH_TIME_OUT:
return "PALO_PUSH_TIME_OUT";
case PALO_PUSH_HAD_LOADED:
return "PALO_PUSH_HAD_LOADED";
case PALO_TIMEOUT:
return "PALO_TIMEOUT";
case PALO_INTERNAL_ERROR:
return "PALO_INTERNAL_ERROR";
default:
return "UNKNOWM";
}
}
bool AgentUtils::exec_cmd(const string& command, string* errmsg) {
// The exit status of the command.
int rc = 0;
// Redirect stderr to stdout to get error message.
string cmd = command + " 2>&1";
// Execute command.
FILE *fp = popen(cmd.c_str(), "r");
if (fp == NULL) {
stringstream err_stream;
err_stream << "popen failed. " << strerror(errno) << ", with errno: " << errno << ".\n";
*errmsg = err_stream.str();
return false;
}
// Get command output.
char result[1024] = {'\0'};
while (fgets(result, sizeof(result), fp) != NULL) {
*errmsg += result;
}
// Waits for the associated process to terminate and returns.
rc = pclose(fp);
if (rc == -1) {
if (errno == ECHILD) {
*errmsg += "pclose cannot obtain the child status.\n";
} else {
stringstream err_stream;
err_stream << "Close popen failed. " << strerror(errno) << ", with errno: "
<< errno << "\n";
*errmsg += err_stream.str();
}
return false;
}
// Get return code of command.
int32_t status_child = WEXITSTATUS(rc);
if (status_child == 0) {
return true;
} else {
return false;
}
}
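A hypothetical caller of exec_cmd(), illustrating that both stdout and stderr are captured into the output string (because of the "2>&1" redirect above) and that the boolean result reflects the child's exit status via WEXITSTATUS:
#include <iostream>
#include <string>
#include "agent/utils.h"
// Sketch only: run a shell command and print whatever it wrote.
void exec_cmd_example(palo::AgentUtils& utils) {
    std::string msg;
    if (utils.exec_cmd("ls /tmp", &msg)) {
        std::cout << "command output:\n" << msg;
    } else {
        std::cout << "command failed: " << msg;
    }
}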
bool AgentUtils::write_json_to_file(const map<string, string>& info, const string& path) {
rapidjson::Document json_info(rapidjson::kObjectType);
for (auto &it : info) {
json_info.AddMember(
rapidjson::Value(it.first.c_str(), json_info.GetAllocator()).Move(),
rapidjson::Value(it.second.c_str(), json_info.GetAllocator()).Move(),
json_info.GetAllocator());
}
rapidjson::StringBuffer json_info_str;
rapidjson::Writer<rapidjson::StringBuffer> writer(json_info_str);
json_info.Accept(writer);
std::ofstream fp(path);
if (!fp) {
return false;
}
fp << json_info_str.GetString() << std::endl;
fp.close();
return true;
}
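A hypothetical caller of write_json_to_file(); with a std::map the keys are iterated in sorted order, so the file below would contain {"cluster_id":"1234","token":"abc"} followed by a newline (the path and values are illustrative):
#include <map>
#include <string>
#include "agent/utils.h"
// Sketch only: serialize a small string map as a flat JSON object.
bool write_json_example(palo::AgentUtils& utils) {
    std::map<std::string, std::string> info;
    info["cluster_id"] = "1234";
    info["token"] = "abc";
    return utils.write_json_to_file(info, "/tmp/palo_example.json");
}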
} // namespace palo

152
be/src/agent/utils.h Normal file
View File

@ -0,0 +1,152 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_AGENT_UTILS_H
#define BDG_PALO_BE_SRC_AGENT_UTILS_H
#include <pthread.h>
#include <memory>
#include "thrift/transport/TSocket.h"
#include "thrift/transport/TTransportUtils.h"
#include "agent/status.h"
#include "gen_cpp/BackendService.h"
#include "gen_cpp/FrontendService.h"
#include "gen_cpp/AgentService_types.h"
#include "gen_cpp/HeartbeatService_types.h"
#include "gen_cpp/Status_types.h"
#include "gen_cpp/Types_types.h"
#include "olap/olap_define.h"
#include "runtime/client_cache.h"
namespace palo {
const uint32_t MASTER_CLIENT_TIMEOUT = 500;
// client cache
// All service clients should be defined in client_cache.h
//class MasterServiceClient;
//typedef ClientCache<MasterServiceClient> MasterServiceClientCache;
//typedef ClientConnection<MasterServiceClient> MasterServiceConnection;
class AgentServerClient {
public:
explicit AgentServerClient(const TBackend backend);
virtual ~AgentServerClient();
// Make a snapshot of a tablet
//
// Input parameters:
// * snapshot_request: specifies the tablet id and schema hash of the
//   tablet to snapshot
//
// Output parameters:
// * result: The result of make snapshot
virtual AgentStatus make_snapshot(
const TSnapshotRequest& snapshot_request,
TAgentResult* result);
// Release the snapshot
//
// Input parameters:
// * snapshot_path: The path of snapshot
//
// Output parameters:
// * result: The result of release snapshot
virtual AgentStatus release_snapshot(const std::string& snapshot_path, TAgentResult* result);
private:
boost::shared_ptr<apache::thrift::transport::TTransport> _socket;
boost::shared_ptr<apache::thrift::transport::TTransport> _transport;
boost::shared_ptr<apache::thrift::protocol::TProtocol> _protocol;
BackendServiceClient _agent_service_client;
DISALLOW_COPY_AND_ASSIGN(AgentServerClient);
}; // class AgentServerClient
class MasterServerClient {
public:
MasterServerClient(const TMasterInfo& master_info, FrontendServiceClientCache* client_cache);
virtual ~MasterServerClient() {};
// Report the finished task to the master server
//
// Input parameters:
// * request: The information of the finished task
//
// Output parameters:
// * result: The result of report task
virtual AgentStatus finish_task(const TFinishTaskRequest request, TMasterResult* result);
// Report tasks/olap tablet/disk state to the master server
//
// Input parameters:
// * request: The information to report
//
// Output parameters:
// * result: The result of report task
virtual AgentStatus report(const TReportRequest request, TMasterResult* result);
private:
const TMasterInfo& _master_info;
FrontendServiceClientCache* _client_cache;
DISALLOW_COPY_AND_ASSIGN(MasterServerClient);
}; // class MasterServerClient
class AgentUtils {
public:
AgentUtils() {};
virtual ~AgentUtils() {};
// Use rsync to synchronize a folder from a remote agent to a local folder
//
// Input parameters:
// * remote_host: the host of remote server
// * remote_file_path: remote file folder path
// * local_file_path: local file folder path
// * exclude_file_patterns: the patterns of the exclude file
// * transport_speed_limit_kbps: speed limit of the transfer (KB/s)
// * timeout_second: timeout of the synchronization, in seconds
virtual AgentStatus rsync_from_remote(
const std::string& remote_host,
const std::string& remote_file_path,
const std::string& local_file_path,
const std::vector<std::string>& exclude_file_patterns,
const uint32_t transport_speed_limit_kbps,
const uint32_t timeout_second);
// Get the IP of the local host
virtual char* get_local_ip();
// Print AgentStatus as string
virtual std::string print_agent_status(AgentStatus status);
// Execute shell cmd
virtual bool exec_cmd(const std::string& command, std::string* errmsg);
// Write a map to a file in JSON format
virtual bool write_json_to_file(
const std::map<std::string, std::string>& info,
const std::string& path);
private:
DISALLOW_COPY_AND_ASSIGN(AgentUtils);
}; // class AgentUtils
} // namespace palo
#endif // BDG_PALO_BE_SRC_AGENT_UTILS_H

105
be/src/codegen/CMakeLists.txt Normal file
View File

@ -0,0 +1,105 @@
# Modifications copyright (C) 2017, Baidu.com, Inc.
# Copyright 2017 The Apache Software Foundation
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# where to put generated libraries
set(LIBRARY_OUTPUT_PATH "${BUILD_DIR}/src/codegen")
# where to put generated binaries
set(EXECUTABLE_OUTPUT_PATH "${BUILD_DIR}/src/codegen")
# Generated C files for IR
set(IR_SSE_C_FILE ${GENSRC_DIR}/palo_ir/palo_sse_ir.cpp)
set(IR_NO_SSE_C_FILE ${GENSRC_DIR}/palo_ir/palo_no_sse_ir.cpp)
add_library(CodeGen STATIC
codegen_anyval.cpp
llvm_codegen.cpp
subexpr_elimination.cpp
${IR_SSE_C_FILE}
${IR_NO_SSE_C_FILE}
)
add_dependencies(CodeGen gen_ir_descriptions compile_to_ir_sse compile_to_ir_no_sse)
# output cross compile to ir metadata
set(IR_DESC_GEN_OUTPUT
${GENSRC_DIR}/palo_ir/palo_ir_names.h
${GENSRC_DIR}/palo_ir/palo_ir_functions.h
)
add_custom_command(
OUTPUT ${IR_DESC_GEN_OUTPUT}
COMMAND python ${CMAKE_CURRENT_SOURCE_DIR}/gen_ir_descriptions.py
DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/gen_ir_descriptions.py
COMMENT "Generating ir cross compile metadata."
VERBATIM
)
add_custom_target(gen_ir_descriptions ALL DEPENDS ${IR_DESC_GEN_OUTPUT})
set(IR_INPUT_FILES ${CMAKE_CURRENT_SOURCE_DIR}/palo_ir.cpp)
set(IR_SSE_TMP_OUTPUT_FILE "${GENSRC_DIR}/palo_ir/palo_sse_tmp.bc")
set(IR_NO_SSE_TMP_OUTPUT_FILE "${GENSRC_DIR}/palo_ir/palo_no_sse_tmp.bc")
set(IR_SSE_OUTPUT_FILE "${GENSRC_DIR}/palo_ir/palo_sse.bc")
set(IR_NO_SSE_OUTPUT_FILE "${GENSRC_DIR}/palo_ir/palo_no_sse.bc")
set(IR_SSE_TMP_C_FILE ${IR_SSE_C_FILE}.tmp)
set(IR_NO_SSE_TMP_C_FILE ${IR_NO_SSE_C_FILE}.tmp)
# Run the clang compiler to generate IR. Then run the opt tool to remove
# unnamed instructions. This makes the IR verifiable and more readable.
# We need to compile to IR twice, once with SSE enabled and once without. At runtime
# palo will pick the correct file to load.
add_custom_command(
OUTPUT ${IR_SSE_OUTPUT_FILE}
COMMAND ${LLVM_CLANG_EXECUTABLE} ${CLANG_IR_CXX_FLAGS} "-msse4.2" ${CLANG_INCLUDE_FLAGS} ${IR_INPUT_FILES} -o ${IR_SSE_TMP_OUTPUT_FILE}
COMMAND ${LLVM_OPT_EXECUTABLE} --instnamer < ${IR_SSE_TMP_OUTPUT_FILE} > ${IR_SSE_OUTPUT_FILE}
COMMAND rm ${IR_SSE_TMP_OUTPUT_FILE}
DEPENDS Util Exec Exprs Udf ${IR_INPUT_FILES}
)
# Compile without sse enabled.
add_custom_command(
OUTPUT ${IR_NO_SSE_OUTPUT_FILE}
COMMAND ${LLVM_CLANG_EXECUTABLE} ${CLANG_IR_CXX_FLAGS} ${CLANG_INCLUDE_FLAGS} ${IR_INPUT_FILES} -o ${IR_NO_SSE_TMP_OUTPUT_FILE}
COMMAND ${LLVM_OPT_EXECUTABLE} --instnamer < ${IR_NO_SSE_TMP_OUTPUT_FILE} > ${IR_NO_SSE_OUTPUT_FILE}
COMMAND rm ${IR_NO_SSE_TMP_OUTPUT_FILE}
DEPENDS Util Exec Exprs Udf ${IR_INPUT_FILES}
)
add_custom_target(compile_to_ir_sse DEPENDS ${IR_SSE_OUTPUT_FILE})
add_custom_target(compile_to_ir_no_sse DEPENDS ${IR_NO_SSE_OUTPUT_FILE})
# Convert LLVM bytecode to C array.
add_custom_command(
OUTPUT ${IR_SSE_C_FILE}
COMMAND $ENV{PALO_HOME}/gensrc/script/file2array.sh -n -v palo_sse_llvm_ir ${IR_SSE_OUTPUT_FILE} > ${IR_SSE_TMP_C_FILE}
COMMAND mv ${IR_SSE_TMP_C_FILE} ${IR_SSE_C_FILE}
DEPENDS $ENV{PALO_HOME}/gensrc/script/file2array.sh
DEPENDS ${IR_SSE_OUTPUT_FILE}
)
# Convert LLVM bytecode to C array.
add_custom_command(
OUTPUT ${IR_NO_SSE_C_FILE}
COMMAND $ENV{PALO_HOME}/gensrc/script/file2array.sh -n -v palo_no_sse_llvm_ir ${IR_NO_SSE_OUTPUT_FILE} > ${IR_NO_SSE_TMP_C_FILE}
COMMAND mv ${IR_NO_SSE_TMP_C_FILE} ${IR_NO_SSE_C_FILE}
DEPENDS $ENV{PALO_HOME}/gensrc/script/file2array.sh
DEPENDS ${IR_NO_SSE_OUTPUT_FILE}
)
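At runtime the engine loads one of the two embedded IR modules depending on CPU support for SSE4.2. A hedged C++ sketch of such a selection, assuming file2array.sh emits a byte array named by its -v argument plus a companion *_len length symbol (both symbol names are assumptions here, not verified output of the script):
#include <cstddef>
#include <utility>
// Assumed symbols generated by file2array.sh for the two .bc payloads.
extern const unsigned char palo_sse_llvm_ir[];
extern const size_t palo_sse_llvm_ir_len;
extern const unsigned char palo_no_sse_llvm_ir[];
extern const size_t palo_no_sse_llvm_ir_len;
// Sketch only: pick the SSE module when the CPU supports SSE4.2.
std::pair<const unsigned char*, size_t> pick_ir_module() {
#if defined(__GNUC__)
    if (__builtin_cpu_supports("sse4.2")) {
        return {palo_sse_llvm_ir, palo_sse_llvm_ir_len};
    }
#endif
    return {palo_no_sse_llvm_ir, palo_no_sse_llvm_ir_len};
}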

711
be/src/codegen/codegen_anyval.cpp Normal file
View File

@ -0,0 +1,711 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "codegen/codegen_anyval.h"
#include "runtime/multi_precision.h"
using llvm::Function;
using llvm::Type;
using llvm::Value;
using llvm::ConstantInt;
using llvm::Constant;
namespace palo {
const char* CodegenAnyVal::_s_llvm_booleanval_name = "struct.palo_udf::BooleanVal";
const char* CodegenAnyVal::_s_llvm_tinyintval_name = "struct.palo_udf::TinyIntVal";
const char* CodegenAnyVal::_s_llvm_smallintval_name = "struct.palo_udf::SmallIntVal";
const char* CodegenAnyVal::_s_llvm_intval_name = "struct.palo_udf::IntVal";
const char* CodegenAnyVal::_s_llvm_bigintval_name = "struct.palo_udf::BigIntVal";
const char* CodegenAnyVal::_s_llvm_largeintval_name = "struct.palo_udf::LargeIntVal";
const char* CodegenAnyVal::_s_llvm_floatval_name = "struct.palo_udf::FloatVal";
const char* CodegenAnyVal::_s_llvm_doubleval_name = "struct.palo_udf::DoubleVal";
const char* CodegenAnyVal::_s_llvm_stringval_name = "struct.palo_udf::StringVal";
const char* CodegenAnyVal::_s_llvm_datetimeval_name = "struct.palo_udf::DateTimeVal";
const char* CodegenAnyVal::_s_llvm_decimalval_name = "struct.palo_udf::DecimalVal";
Type* CodegenAnyVal::get_lowered_type(LlvmCodeGen* cg, const TypeDescriptor& type) {
switch (type.type) {
case TYPE_BOOLEAN: // i16
return cg->smallint_type();
case TYPE_TINYINT: // i16
return cg->smallint_type();
case TYPE_SMALLINT: // i32
return cg->int_type();
case TYPE_INT: // i64
return cg->bigint_type();
case TYPE_BIGINT: // { i8, i64 }
return llvm::StructType::get(cg->tinyint_type(), cg->bigint_type(), NULL);
case TYPE_LARGEINT: // %"struct.palo_udf::LargeIntVal" (isn't lowered)
// = { {i8}, [15 x i8], i128 }
return cg->get_type(_s_llvm_largeintval_name);
case TYPE_FLOAT: // i64
return cg->bigint_type();
case TYPE_DOUBLE: // { i8, double }
return llvm::StructType::get(cg->tinyint_type(), cg->double_type(), NULL);
case TYPE_VARCHAR: // { i64, i8* }
case TYPE_CHAR:
case TYPE_HLL:
return llvm::StructType::get(cg->bigint_type(), cg->ptr_type(), NULL);
case TYPE_DATE:
case TYPE_DATETIME: // %"struct.palo_udf::DateTimeVal" (isn't lowered)
// = { {i8}, i64, i32 }
return cg->get_type(_s_llvm_datetimeval_name);
case TYPE_DECIMAL: // %"struct.palo_udf::DecimalVal" (isn't lowered)
// = { {i8}, i8, i8, i8, [9 x i32] }
return cg->get_type(_s_llvm_decimalval_name);
default:
DCHECK(false) << "Unsupported type: " << type;
return NULL;
}
}
Type* CodegenAnyVal::get_lowered_ptr_type(LlvmCodeGen* cg, const TypeDescriptor& type) {
return get_lowered_type(cg, type)->getPointerTo();
}
Type* CodegenAnyVal::get_unlowered_type(LlvmCodeGen* cg, const TypeDescriptor& type) {
Type* result = NULL;
switch (type.type) {
case TYPE_BOOLEAN:
result = cg->get_type(_s_llvm_booleanval_name);
break;
case TYPE_TINYINT:
result = cg->get_type(_s_llvm_tinyintval_name);
break;
case TYPE_SMALLINT:
result = cg->get_type(_s_llvm_smallintval_name);
break;
case TYPE_INT:
result = cg->get_type(_s_llvm_intval_name);
break;
case TYPE_BIGINT:
result = cg->get_type(_s_llvm_bigintval_name);
break;
case TYPE_LARGEINT:
result = cg->get_type(_s_llvm_largeintval_name);
break;
case TYPE_FLOAT:
result = cg->get_type(_s_llvm_floatval_name);
break;
case TYPE_DOUBLE:
result = cg->get_type(_s_llvm_doubleval_name);
break;
case TYPE_CHAR:
case TYPE_VARCHAR:
case TYPE_HLL:
result = cg->get_type(_s_llvm_stringval_name);
break;
case TYPE_DATE:
case TYPE_DATETIME:
result = cg->get_type(_s_llvm_datetimeval_name);
break;
case TYPE_DECIMAL:
result = cg->get_type(_s_llvm_decimalval_name);
break;
default:
DCHECK(false) << "Unsupported type: " << type;
return NULL;
}
DCHECK(result != NULL) << type;
return result;
}
Type* CodegenAnyVal::get_unlowered_ptr_type(LlvmCodeGen* cg, const TypeDescriptor& type) {
return get_unlowered_type(cg, type)->getPointerTo();
}
Value* CodegenAnyVal::create_call(
LlvmCodeGen* cg, LlvmCodeGen::LlvmBuilder* builder, llvm::Function* fn,
llvm::ArrayRef<Value*> args, const char* name, Value* result_ptr) {
if (fn->getReturnType()->isVoidTy()) {
// Void return type indicates that this function returns a DecimalVal via the first
// argument (which should be a DecimalVal*).
llvm::Function::arg_iterator ret_arg = fn->arg_begin();
DCHECK(ret_arg->getType()->isPointerTy());
Type* ret_type = ret_arg->getType()->getPointerElementType();
// We need to pass a DecimalVal pointer to 'fn' that will be populated with the result
// value. Use 'result_ptr' if specified, otherwise alloca one.
Value* ret_ptr = (result_ptr == NULL) ?
cg->create_entry_block_alloca(*builder, ret_type, name) : result_ptr;
std::vector<Value*> new_args = args.vec();
new_args.insert(new_args.begin(), ret_ptr);
builder->CreateCall(fn, new_args);
// If 'result_ptr' was specified, we're done. Otherwise load and return the result.
if (result_ptr != NULL) {
return NULL;
}
return builder->CreateLoad(ret_ptr, name);
} else {
// Function returns *Val normally (note that it could still be returning a DecimalVal,
// since we generate non-compliant functions)
Value* ret = builder->CreateCall(fn, args, name);
if (result_ptr == NULL) {
return ret;
}
builder->CreateStore(ret, result_ptr);
return NULL;
}
}
CodegenAnyVal CodegenAnyVal::create_call_wrapped(
LlvmCodeGen* cg, LlvmCodeGen::LlvmBuilder* builder, const TypeDescriptor& type,
llvm::Function* fn, llvm::ArrayRef<Value*> args,
const char* name, Value* result_ptr) {
Value* v = create_call(cg, builder, fn, args, name, result_ptr);
return CodegenAnyVal(cg, builder, type, v, name);
}
CodegenAnyVal::CodegenAnyVal(
LlvmCodeGen* codegen, LlvmCodeGen::LlvmBuilder* builder,
const TypeDescriptor& type, Value* value, const char* name) :
_type(type),
_value(value),
_name(name),
_codegen(codegen),
_builder(builder) {
Type* value_type = get_lowered_type(codegen, type);
if (_value == NULL) {
// No Value* was specified, so allocate one on the stack and load it.
Value* ptr = _codegen->create_entry_block_alloca(*builder, value_type, "");
_value = _builder->CreateLoad(ptr, _name);
}
DCHECK_EQ(_value->getType(), value_type);
}
Value* CodegenAnyVal::get_is_null(const char* name) {
switch (_type.type) {
case TYPE_BIGINT:
case TYPE_DOUBLE: {
// Lowered type is of form { i8, * }. Get the i8 value.
Value* is_null_i8 = _builder->CreateExtractValue(_value, 0);
DCHECK(is_null_i8->getType() == _codegen->tinyint_type());
return _builder->CreateTrunc(is_null_i8, _codegen->boolean_type(), name);
}
case TYPE_DATE:
case TYPE_DATETIME:
case TYPE_LARGEINT:
case TYPE_DECIMAL: {
// Lowered type is of the form { {i8}, ... }
uint32_t idxs[] = {0, 0};
Value* is_null_i8 = _builder->CreateExtractValue(_value, idxs);
DCHECK(is_null_i8->getType() == _codegen->tinyint_type());
return _builder->CreateTrunc(is_null_i8, _codegen->boolean_type(), name);
}
case TYPE_VARCHAR:
case TYPE_HLL:
case TYPE_CHAR: {
// Lowered type is of form { i64, *}. Get the first byte of the i64 value.
Value* v = _builder->CreateExtractValue(_value, 0);
DCHECK(v->getType() == _codegen->bigint_type());
return _builder->CreateTrunc(v, _codegen->boolean_type(), name);
}
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT:
case TYPE_FLOAT:
// Lowered type is an integer. Get the first byte.
return _builder->CreateTrunc(_value, _codegen->boolean_type(), name);
default:
DCHECK(false);
return NULL;
}
}
void CodegenAnyVal::set_is_null(Value* is_null) {
switch (_type.type) {
case TYPE_BIGINT:
case TYPE_DOUBLE: {
// Lowered type is of form { i8, * }. Set the i8 value to 'is_null'.
Value* is_null_ext =
_builder->CreateZExt(is_null, _codegen->tinyint_type(), "is_null_ext");
_value = _builder->CreateInsertValue(_value, is_null_ext, 0, _name);
break;
}
case TYPE_DATE:
case TYPE_DATETIME:
case TYPE_LARGEINT:
case TYPE_DECIMAL: {
// Lowered type is of form { {i8}, [15 x i8], {i128} }. Set the i8 value to
// 'is_null'.
Value* is_null_ext =
_builder->CreateZExt(is_null, _codegen->tinyint_type(), "is_null_ext");
// Index into the {i8} struct as well as the outer struct.
uint32_t idxs[] = {0, 0};
_value = _builder->CreateInsertValue(_value, is_null_ext, idxs, _name);
break;
}
case TYPE_VARCHAR:
case TYPE_HLL:
case TYPE_CHAR: {
// Lowered type is of the form { i64, * }. Set the first byte of the i64 value to
// 'is_null'
Value* v = _builder->CreateExtractValue(_value, 0);
v = _builder->CreateAnd(v, -0x100LL, "masked");
Value* is_null_ext = _builder->CreateZExt(is_null, v->getType(), "is_null_ext");
v = _builder->CreateOr(v, is_null_ext);
_value = _builder->CreateInsertValue(_value, v, 0, _name);
break;
}
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT:
case TYPE_FLOAT: {
// Lowered type is an integer. Set the first byte to 'is_null'.
_value = _builder->CreateAnd(_value, -0x100LL, "masked");
Value* is_null_ext = _builder->CreateZExt(is_null, _value->getType(), "is_null_ext");
_value = _builder->CreateOr(_value, is_null_ext, _name);
break;
}
default:
DCHECK(false) << "NYI: " << _type.debug_string();
}
}
Value* CodegenAnyVal::get_val(const char* name) {
DCHECK(_type.type != TYPE_VARCHAR) << "Use get_ptr and get_len for Varchar";
DCHECK(_type.type != TYPE_HLL) << "Use get_ptr and get_len for Hll";
DCHECK(_type.type != TYPE_CHAR) << "Use get_ptr and get_len for Char";
switch (_type.type) {
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT: {
// Lowered type is an integer. Get the high bytes.
int num_bits = _type.get_byte_size() * 8;
Value* val = get_high_bits(num_bits, _value, name);
if (_type.type == TYPE_BOOLEAN) {
// Return booleans as i1 (vs. i8)
val = _builder->CreateTrunc(val, _builder->getInt1Ty(), name);
}
return val;
}
case TYPE_FLOAT: {
// Same as above, but we must cast the value to a float.
Value* val = get_high_bits(32, _value);
return _builder->CreateBitCast(val, _codegen->float_type());
}
case TYPE_BIGINT:
case TYPE_DOUBLE:
// Lowered type is of form { i8, * }. Get the second value.
return _builder->CreateExtractValue(_value, 1, name);
case TYPE_LARGEINT:
// Lowered type is of form { i8, [], * }. Get the third value.
return _builder->CreateExtractValue(_value, 2, name);
case TYPE_DATE:
case TYPE_DATETIME:
/// TYPE_DATETIME/DateTimeVal: { {i8}, i64, i32 } Not Lowered
return _builder->CreateExtractValue(_value, 1, name);
default:
DCHECK(false) << "Unsupported type: " << _type;
return NULL;
}
}
void CodegenAnyVal::set_val(Value* val) {
DCHECK(_type.type != TYPE_VARCHAR) << "Use set_ptr and set_len for StringVals";
DCHECK(_type.type != TYPE_HLL) << "Use set_ptr and set_len for StringVals";
DCHECK(_type.type != TYPE_CHAR) << "Use set_ptr and set_len for StringVals";
switch (_type.type) {
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT: {
// Lowered type is an integer. Set the high bytes to 'val'.
int num_bits = _type.get_byte_size() * 8;
_value = set_high_bits(num_bits, val, _value, _name);
break;
}
case TYPE_FLOAT:
// Same as above, but we must cast 'val' to an integer type.
val = _builder->CreateBitCast(val, _codegen->int_type());
_value = set_high_bits(32, val, _value, _name);
break;
case TYPE_BIGINT:
case TYPE_DOUBLE:
// Lowered type is of form { i8, * }. Set the second value to 'val'.
_value = _builder->CreateInsertValue(_value, val, 1, _name);
break;
case TYPE_LARGEINT:
// Lowered type is of form { i8, [], * }. Set the third value to 'val'.
_value = _builder->CreateInsertValue(_value, val, 2, _name);
break;
case TYPE_DATE:
case TYPE_DATETIME:
/// TYPE_DATETIME/DateTimeVal: { {i8}, i64, i32 } Not Lowered
_value = _builder->CreateInsertValue(_value, val, 1, _name);
break;
default:
DCHECK(false) << "Unsupported type: " << _type;
}
}
void CodegenAnyVal::set_val(bool val) {
DCHECK_EQ(_type.type, TYPE_BOOLEAN);
set_val(_builder->getInt1(val));
}
void CodegenAnyVal::set_val(int8_t val) {
DCHECK_EQ(_type.type, TYPE_TINYINT);
set_val(_builder->getInt8(val));
}
void CodegenAnyVal::set_val(int16_t val) {
DCHECK_EQ(_type.type, TYPE_SMALLINT);
set_val(_builder->getInt16(val));
}
void CodegenAnyVal::set_val(int32_t val) {
DCHECK(_type.type == TYPE_INT);
set_val(_builder->getInt32(val));
}
void CodegenAnyVal::set_val(int64_t val) {
DCHECK(_type.type == TYPE_BIGINT);
set_val(_builder->getInt64(val));
}
void CodegenAnyVal::set_val(__int128 val) {
DCHECK_EQ(_type.type, TYPE_LARGEINT);
// TODO: is there a better way to do this?
// Set high bits
Value* ir_val = llvm::ConstantInt::get(_codegen->i128_type(), high_bits(val));
ir_val = _builder->CreateShl(ir_val, 64, "tmp");
// Set low bits
ir_val = _builder->CreateOr(ir_val, low_bits(val), "tmp");
set_val(ir_val);
}
void CodegenAnyVal::set_val(float val) {
DCHECK_EQ(_type.type, TYPE_FLOAT);
set_val(llvm::ConstantFP::get(_builder->getFloatTy(), val));
}
void CodegenAnyVal::set_val(double val) {
DCHECK_EQ(_type.type, TYPE_DOUBLE);
set_val(llvm::ConstantFP::get(_builder->getDoubleTy(), val));
}
Value* CodegenAnyVal::get_ptr() {
// Get the second value, which is the string pointer.
DCHECK(_type.is_string_type());
return _builder->CreateExtractValue(_value, 1, _name);
}
Value* CodegenAnyVal::get_len() {
// Get the high bytes of the first value.
DCHECK(_type.is_string_type());
Value* v = _builder->CreateExtractValue(_value, 0);
return get_high_bits(32, v);
}
void CodegenAnyVal::set_ptr(Value* ptr) {
// Set the second pointer value to 'ptr'.
DCHECK(_type.is_string_type());
_value = _builder->CreateInsertValue(_value, ptr, 1, _name);
}
void CodegenAnyVal::set_len(Value* len) {
// Set the high bytes of the first value to 'len'.
DCHECK(_type.is_string_type());
Value* v = _builder->CreateExtractValue(_value, 0);
v = set_high_bits(32, len, v);
_value = _builder->CreateInsertValue(_value, v, 0, _name);
}
Value* CodegenAnyVal::get_unlowered_ptr() {
Value* value_ptr = _codegen->create_entry_block_alloca(*_builder, _value->getType(), "");
_builder->CreateStore(_value, value_ptr);
return _builder->CreateBitCast(value_ptr, get_unlowered_ptr_type(_codegen, _type));
}
void CodegenAnyVal::set_from_raw_ptr(Value* raw_ptr) {
Value* val_ptr =
_builder->CreateBitCast(raw_ptr, _codegen->get_ptr_type(_type), "val_ptr");
Value* val = _builder->CreateLoad(val_ptr);
set_from_raw_value(val);
}
void CodegenAnyVal::set_from_raw_value(Value* raw_val) {
DCHECK_EQ(raw_val->getType(), _codegen->get_type(_type))
<< std::endl << LlvmCodeGen::print(raw_val)
<< std::endl << _type << " => " << LlvmCodeGen::print(_codegen->get_type(_type));
switch (_type.type) {
case TYPE_VARCHAR:
case TYPE_CHAR:
case TYPE_HLL: {
// Convert StringValue to StringVal
set_ptr(_builder->CreateExtractValue(raw_val, 0, "ptr"));
set_len(_builder->CreateExtractValue(raw_val, 1, "len"));
break;
}
case TYPE_DATE:
case TYPE_DATETIME: {
Function* fn = _codegen->get_function(IRFunction::TO_DATETIME_VAL);
Value* val_ptr = _builder->CreateAlloca(get_lowered_type(_codegen, _type), 0, "val_ptr");
_builder->CreateCall2(fn, _codegen->get_ptr_to(_builder, raw_val), val_ptr);
_value = _builder->CreateLoad(val_ptr);
break;
}
case TYPE_DECIMAL: {
Function* fn = _codegen->get_function(IRFunction::TO_DECIMAL_VAL);
Value* val_ptr = _builder->CreateAlloca(get_lowered_type(_codegen, _type), 0, "val_ptr");
_builder->CreateCall2(fn, _codegen->get_ptr_to(_builder, raw_val), val_ptr);
_value = _builder->CreateLoad(val_ptr);
break;
}
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT:
case TYPE_BIGINT:
case TYPE_LARGEINT:
case TYPE_FLOAT:
case TYPE_DOUBLE:
// raw_val is a native type
set_val(raw_val);
break;
default:
DCHECK(false) << "NYI: " << _type;
break;
}
}
Value* CodegenAnyVal::to_native_value() {
Type* raw_type = _codegen->get_type(_type);
Value* raw_val = llvm::Constant::getNullValue(raw_type);
switch (_type.type) {
case TYPE_CHAR:
case TYPE_VARCHAR:
case TYPE_HLL: {
// Convert StringVal to StringValue
raw_val = _builder->CreateInsertValue(raw_val, get_ptr(), 0);
raw_val = _builder->CreateInsertValue(raw_val, get_len(), 1);
break;
}
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT:
case TYPE_BIGINT:
case TYPE_LARGEINT:
case TYPE_FLOAT:
case TYPE_DOUBLE:
// raw_val is a native type
raw_val = get_val();
break;
case TYPE_DATE:
case TYPE_DATETIME: {
Function* func = _codegen->get_function(IRFunction::FROM_DATETIME_VAL);
Value* raw_val_ptr = _codegen->create_entry_block_alloca(
*_builder, raw_type, "raw_val_ptr");
_builder->CreateCall2(func, raw_val_ptr, _value);
raw_val = _builder->CreateLoad(raw_val_ptr, "result");
break;
}
case TYPE_DECIMAL: {
Function* func = _codegen->get_function(IRFunction::FROM_DECIMAL_VAL);
Value* raw_val_ptr = _codegen->create_entry_block_alloca(
*_builder, _codegen->get_type(TYPE_DECIMAL), "raw_val_ptr");
_builder->CreateCall2(func, raw_val_ptr, _value);
raw_val = _builder->CreateLoad(raw_val_ptr, "result");
break;
}
default:
DCHECK(false) << "NYI: " << _type;
break;
}
return raw_val;
}
Value* CodegenAnyVal::to_native_ptr(Value* native_ptr) {
Value* v = to_native_value();
if (native_ptr == NULL) {
native_ptr = _codegen->create_entry_block_alloca(*_builder, v->getType(), "");
}
_builder->CreateStore(v, native_ptr);
return native_ptr;
}
Value* CodegenAnyVal::eq(CodegenAnyVal* other) {
DCHECK_EQ(_type, other->_type);
switch (_type.type) {
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT:
case TYPE_BIGINT:
case TYPE_LARGEINT:
return _builder->CreateICmpEQ(get_val(), other->get_val(), "eq");
case TYPE_FLOAT:
case TYPE_DOUBLE:
return _builder->CreateFCmpUEQ(get_val(), other->get_val(), "eq");
case TYPE_CHAR:
case TYPE_VARCHAR:
case TYPE_HLL: {
llvm::Function* eq_fn = _codegen->get_function(IRFunction::CODEGEN_ANYVAL_STRING_VAL_EQ);
return _builder->CreateCall2(
eq_fn, get_unlowered_ptr(), other->get_unlowered_ptr(), "eq");
}
case TYPE_DATE:
case TYPE_DATETIME: {
llvm::Function* eq_fn = _codegen->get_function(
IRFunction::CODEGEN_ANYVAL_DATETIME_VAL_EQ);
return _builder->CreateCall2(
eq_fn, get_unlowered_ptr(), other->get_unlowered_ptr(), "eq");
}
case TYPE_DECIMAL: {
llvm::Function* eq_fn = _codegen->get_function(IRFunction::CODEGEN_ANYVAL_DECIMAL_VAL_EQ);
return _builder->CreateCall2(
eq_fn, get_unlowered_ptr(), other->get_unlowered_ptr(), "eq");
}
default:
DCHECK(false) << "NYI: " << _type;
return NULL;
}
}
Value* CodegenAnyVal::eq_to_native_ptr(Value* native_ptr) {
Value* val = NULL;
if (!_type.is_string_type()) {
val = _builder->CreateLoad(native_ptr);
}
switch (_type.type) {
case TYPE_NULL:
return _codegen->false_value();
case TYPE_BOOLEAN:
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT:
case TYPE_BIGINT:
case TYPE_LARGEINT:
return _builder->CreateICmpEQ(get_val(), val, "cmp_raw");
case TYPE_FLOAT:
case TYPE_DOUBLE:
return _builder->CreateFCmpUEQ(get_val(), val, "cmp_raw");
case TYPE_CHAR:
case TYPE_VARCHAR:
case TYPE_HLL: {
llvm::Function* eq_fn = _codegen->get_function(
IRFunction::CODEGEN_ANYVAL_STRING_VALUE_EQ);
return _builder->CreateCall2(eq_fn, get_unlowered_ptr(), native_ptr, "cmp_raw");
}
case TYPE_DATE:
case TYPE_DATETIME: {
llvm::Function* eq_fn = _codegen->get_function(
IRFunction::CODEGEN_ANYVAL_DATETIME_VALUE_EQ);
return _builder->CreateCall2(eq_fn, get_unlowered_ptr(), native_ptr, "cmp_raw");
}
case TYPE_DECIMAL: {
llvm::Function* eq_fn = _codegen->get_function(
IRFunction::CODEGEN_ANYVAL_DECIMAL_VALUE_EQ);
return _builder->CreateCall2(eq_fn, get_unlowered_ptr(), native_ptr, "cmp_raw");
}
default:
DCHECK(false) << "NYI: " << _type;
return NULL;
}
}
Value* CodegenAnyVal::compare(CodegenAnyVal* other, const char* name) {
DCHECK_EQ(_type, other->_type);
Value* v1 = to_native_ptr();
Value* void_v1 = _builder->CreateBitCast(v1, _codegen->ptr_type());
Value* v2 = other->to_native_ptr();
Value* void_v2 = _builder->CreateBitCast(v2, _codegen->ptr_type());
Value* type_ptr = _codegen->get_ptr_to(_builder, _type.to_ir(_codegen), "type");
llvm::Function* compare_fn = _codegen->get_function(IRFunction::RAW_VALUE_COMPARE);
Value* args[] = { void_v1, void_v2, type_ptr };
return _builder->CreateCall(compare_fn, args, name);
}
Value* CodegenAnyVal::get_high_bits(int num_bits, Value* v, const char* name) {
DCHECK_EQ(v->getType()->getIntegerBitWidth(), num_bits * 2);
Value* shifted = _builder->CreateAShr(v, num_bits);
return _builder->CreateTrunc(
shifted, llvm::IntegerType::get(_codegen->context(), num_bits));
}
// Example output: (num_bits = 8)
// %1 = zext i1 %src to i16
// %2 = shl i16 %1, 8
// %3 = and i16 %dst1, 255 ; clear the top half of dst
// %dst2 = or i16 %3, %2 ; set the top half of dst to src
Value* CodegenAnyVal::set_high_bits(
int num_bits, Value* src, Value* dst, const char* name) {
DCHECK_LE(src->getType()->getIntegerBitWidth(), num_bits);
DCHECK_EQ(dst->getType()->getIntegerBitWidth(), num_bits * 2);
Value* extended_src = _builder->CreateZExt(
src, llvm::IntegerType::get(_codegen->context(), num_bits * 2));
Value* shifted_src = _builder->CreateShl(extended_src, num_bits);
Value* masked_dst = _builder->CreateAnd(dst, (1LL << num_bits) - 1);
return _builder->CreateOr(masked_dst, shifted_src, name);
}
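get_high_bits() and set_high_bits() above generate IR for the packing arithmetic that the lowered *Val types rely on: the low half of the integer carries is_null and the high half carries the payload. A minimal standalone illustration of the same arithmetic for the 16-bit BooleanVal case, using ordinary C++ integers rather than the codegen path:
#include <cassert>
#include <cstdint>
// Pack a 16-bit BooleanVal-style value: low byte = is_null, high byte = val.
uint16_t set_high_bits8(uint16_t dst, uint8_t src) {
    uint16_t masked = dst & 0x00FF;            // clear the top half of dst
    return masked | (uint16_t)(src << 8);      // set the top half to src
}
int8_t get_high_bits8(uint16_t v) {
    return (int8_t)((int16_t)v >> 8);          // arithmetic shift, then truncate
}
int main() {
    uint16_t lowered = 0;                      // all zeros => is_null = false
    lowered = set_high_bits8(lowered, 1);      // val = true
    assert((lowered & 0xFF) == 0);             // still non-null
    assert(get_high_bits8(lowered) == 1);      // payload round-trips
    return 0;
}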
Value* CodegenAnyVal::get_null_val(LlvmCodeGen* codegen, const TypeDescriptor& type) {
Type* val_type = get_lowered_type(codegen, type);
return get_null_val(codegen, val_type);
}
Value* CodegenAnyVal::get_null_val(LlvmCodeGen* codegen, Type* val_type) {
if (val_type->isStructTy()) {
llvm::StructType* struct_type = llvm::cast<llvm::StructType>(val_type);
std::vector<Constant*> elements;
if (struct_type->getElementType(0)->isStructTy()) {
// Return the struct { {1}, 0, 0 } (the 'is_null' byte, i.e. the first value's first
// byte, is set to 1, the other bytes don't matter)
llvm::StructType* anyval_struct_type = llvm::cast<llvm::StructType>(
struct_type->getElementType(0));
Type* is_null_type = anyval_struct_type->getElementType(0);
llvm::Constant* null_anyval = llvm::ConstantStruct::get(
anyval_struct_type, llvm::ConstantInt::get(is_null_type, 1));
elements.push_back(null_anyval);
} else {
Type* type1 = struct_type->getElementType(0);
elements.push_back(llvm::ConstantInt::get(type1, 1));
}
for (int i = 1; i < struct_type->getNumElements(); ++i) {
Type* ele_type = struct_type->getElementType(i);
elements.push_back(llvm::Constant::getNullValue(ele_type));
}
return llvm::ConstantStruct::get(struct_type, elements);
}
// Return the int 1 ('is_null' byte is 1, other bytes don't matter)
DCHECK(val_type->isIntegerTy());
return llvm::ConstantInt::get(val_type, 1);
}
CodegenAnyVal CodegenAnyVal::get_non_null_val(
LlvmCodeGen* codegen, LlvmCodeGen::LlvmBuilder* builder,
const TypeDescriptor& type, const char* name) {
Type* val_type = get_lowered_type(codegen, type);
// All zeros => 'is_null' = false
Value* value = llvm::Constant::getNullValue(val_type);
return CodegenAnyVal(codegen, builder, type, value, name);
}
}

282
be/src/codegen/codegen_anyval.h Normal file
View File

@ -0,0 +1,282 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef IMPALA_CODEGEN_CODEGEN_ANYVAL_H
#define IMPALA_CODEGEN_CODEGEN_ANYVAL_H
#include "codegen/llvm_codegen.h"
namespace llvm {
class Type;
class Value;
}
namespace palo {
/// Class for handling AnyVal subclasses during codegen. Codegen functions should use this
/// wrapper instead of creating or manipulating *Val values directly in most cases. This is
/// because the struct types must be lowered to integer types in many cases in order to
/// conform to the standard calling convention (e.g., { i8, i32 } => i64). This class wraps
/// the lowered types for each *Val struct.
//
/// This class conceptually represents a single *Val that is mutated, but operates by
/// generating IR instructions involving _value (each of which generates a new Value*,
/// since IR uses SSA), and then setting _value to the most recent Value* generated. The
/// generated instructions perform the integer manipulation equivalent to setting the
/// fields of the original struct type.
//
/// Lowered types:
/// TYPE_BOOLEAN/BooleanVal: i16
/// TYPE_TINYINT/TinyIntVal: i16
/// TYPE_SMALLINT/SmallIntVal: i32
/// TYPE_INT/INTVal: i64
/// TYPE_BIGINT/BigIntVal: { i8, i64 }
/// TYPE_LARGEINT/LargeIntVal: { {i8}, [15 x i8], i128 } Not Lowered
/// TYPE_FLOAT/FloatVal: i64
/// TYPE_DOUBLE/DoubleVal: { i8, double }
/// TYPE_STRING/StringVal: { i64, i8* }
/// TYPE_DATETIME/DateTimeVal: { {i8}, i64, i32 } Not Lowered
/// TYPE_DECIMAL/DecimalVal: { {i8}, i8, i8, i8, [9 x i32] } Not Lowered
//
/// TODO:
/// - unit tests
class CodegenAnyVal {
public:
static const char* _s_llvm_booleanval_name;
static const char* _s_llvm_tinyintval_name;
static const char* _s_llvm_smallintval_name;
static const char* _s_llvm_intval_name;
static const char* _s_llvm_bigintval_name;
static const char* _s_llvm_largeintval_name;
static const char* _s_llvm_floatval_name;
static const char* _s_llvm_doubleval_name;
static const char* _s_llvm_stringval_name;
static const char* _s_llvm_datetimeval_name;
static const char* _s_llvm_decimalval_name;
/// Creates a call to 'fn', which should return a (lowered) *Val, and returns the result.
/// This abstracts over the x64 calling convention, in particular for functions returning
/// a DecimalVal, which pass the return value as an output argument.
//
/// If 'result_ptr' is non-NULL, it should be a pointer to the lowered return type of
/// 'fn' (e.g. if 'fn' returns a BooleanVal, 'result_ptr' should have type i16*). The
/// result of calling 'fn' will be stored in 'result_ptr' and this function will return
/// NULL. If 'result_ptr' is NULL, this function will return the result (note that the
/// result will not be a pointer in this case).
//
/// 'name' optionally specifies the name of the returned value.
static llvm::Value* create_call(
LlvmCodeGen* cg, LlvmCodeGen::LlvmBuilder* builder,
llvm::Function* fn, llvm::ArrayRef<llvm::Value*> args,
const char* name,
llvm::Value* result_ptr);
static llvm::Value* create_call(
LlvmCodeGen* cg, LlvmCodeGen::LlvmBuilder* builder,
llvm::Function* fn, llvm::ArrayRef<llvm::Value*> args,
const char* name) {
return create_call(cg, builder, fn, args, name, NULL);
}
/// Same as above but wraps the result in a CodegenAnyVal.
static CodegenAnyVal create_call_wrapped(LlvmCodeGen* cg,
LlvmCodeGen::LlvmBuilder* builder, const TypeDescriptor& type, llvm::Function* fn,
llvm::ArrayRef<llvm::Value*> args, const char* name,
llvm::Value* result_ptr);
/// Same as above but wraps the result in a CodegenAnyVal.
static CodegenAnyVal create_call_wrapped(LlvmCodeGen* cg,
LlvmCodeGen::LlvmBuilder* builder, const TypeDescriptor& type, llvm::Function* fn,
llvm::ArrayRef<llvm::Value*> args, const char* name) {
return create_call_wrapped(cg, builder, type, fn, args, name, NULL);
}
/// Returns the lowered AnyVal type associated with 'type'.
/// E.g.: TYPE_BOOLEAN (which corresponds to a BooleanVal) => i16
static llvm::Type* get_lowered_type(LlvmCodeGen* cg, const TypeDescriptor& type);
/// Returns the lowered AnyVal pointer type associated with 'type'.
/// E.g.: TYPE_BOOLEAN => i16*
static llvm::Type* get_lowered_ptr_type(LlvmCodeGen* cg, const TypeDescriptor& type);
/// Returns the unlowered AnyVal type associated with 'type'.
/// E.g.: TYPE_BOOLEAN => %"struct.impala_udf::BooleanVal"
static llvm::Type* get_unlowered_type(LlvmCodeGen* cg, const TypeDescriptor& type);
/// Returns the unlowered AnyVal pointer type associated with 'type'.
/// E.g.: TYPE_BOOLEAN => %"struct.impala_udf::BooleanVal"*
static llvm::Type* get_unlowered_ptr_type(LlvmCodeGen* cg, const TypeDescriptor& type);
/// Return the constant type-lowered value corresponding to a null *Val.
/// E.g.: passing TYPE_DOUBLE (corresponding to the lowered DoubleVal { i8, double })
/// returns the constant struct { 1, 0.0 }
static llvm::Value* get_null_val(LlvmCodeGen* codegen, const TypeDescriptor& type);
/// Return the constant type-lowered value corresponding to a null *Val.
/// 'val_type' must be a lowered type (i.e. one of the types returned by GetType)
static llvm::Value* get_null_val(LlvmCodeGen* codegen, llvm::Type* val_type);
/// Return the constant type-lowered value corresponding to a non-null *Val.
/// E.g.: TYPE_DOUBLE (lowered DoubleVal: { i8, double }) => { 0, 0 }
/// This returns a CodegenAnyVal, rather than the unwrapped Value*, because the actual
/// value still needs to be set.
static CodegenAnyVal get_non_null_val(
LlvmCodeGen* codegen, LlvmCodeGen::LlvmBuilder* builder,
const TypeDescriptor& type, const char* name);
static CodegenAnyVal get_non_null_val(
LlvmCodeGen* codegen, LlvmCodeGen::LlvmBuilder* builder,
const TypeDescriptor& type) {
return get_non_null_val(codegen, builder, type, "");
}
/// Creates a wrapper around a lowered *Val value.
//
/// Instructions for manipulating the value are generated using 'builder'. The insert
/// point of 'builder' is not modified by this class, and it is safe to call
/// 'builder'.SetInsertPoint() after passing 'builder' to this class.
//
/// 'type' identifies the type of wrapped value (e.g., TYPE_INT corresponds to IntVal,
/// which is lowered to i64).
//
/// If 'value' is NULL, a new value of the lowered type is alloca'd. Otherwise 'value'
/// must be of the correct lowered type.
//
/// If 'name' is specified, it will be used when generating instructions that set the value.
CodegenAnyVal(LlvmCodeGen* codegen, LlvmCodeGen::LlvmBuilder* builder,
const TypeDescriptor& type, llvm::Value* value = NULL, const char* name = "");
~CodegenAnyVal() { }
/// Returns the current type-lowered value.
llvm::Value* value() { return _value; }
/// Gets the 'is_null' field of the *Val.
llvm::Value* get_is_null(const char* name = "is_null");
/// Get the 'val' field of the *Val. Do not call if this represents a string type
/// (use get_ptr()/get_len() instead) or a DecimalVal. The returned value will have
/// variable name 'name'.
llvm::Value* get_val(const char* name = "val");
/// Sets the 'is_null' field of the *Val.
void set_is_null(llvm::Value* is_null);
/// Sets the 'val' field of the *Val. Do not call if this represents a string type
/// or a DecimalVal.
void set_val(llvm::Value* val);
/// Sets the 'val' field of the *Val. The *Val must correspond to the argument type.
void set_val(bool val);
void set_val(int8_t val);
void set_val(int16_t val);
void set_val(int32_t val);
void set_val(int64_t val);
void set_val(__int128 val);
void set_val(float val);
void set_val(double val);
/// Getters for StringVals.
llvm::Value* get_ptr();
llvm::Value* get_len();
/// Setters for StringVals.
void set_ptr(llvm::Value* ptr);
void set_len(llvm::Value* len);
/// Allocas and stores this value in an unlowered pointer, and returns the pointer. This
/// *Val should be non-null.
llvm::Value* get_unlowered_ptr();
/// Set this *Val's value based on 'raw_val'. 'raw_val' should be a native type,
/// StringValue, or DateTimeValue.
void set_from_raw_value(llvm::Value* raw_val);
/// Set this *Val's value based on void* 'raw_ptr'. 'raw_ptr' should be a pointer to a
/// native type, StringValue, or DateTimeValue (i.e. the value returned by an
/// interpreted compute fn).
void set_from_raw_ptr(llvm::Value* raw_ptr);
/// Converts this *Val's value to a native type, StringValue, DateTimeValue, etc.
/// This should only be used if this *Val is not null.
llvm::Value* to_native_value();
/// Sets 'native_ptr' to this *Val's value. If non-NULL, 'native_ptr' should be a
/// pointer to a native type, StringValue, DateTimeValue, etc. If NULL, a pointer is
/// alloca'd. In either case the pointer is returned. This should only be used if this
/// *Val is not null.
llvm::Value* to_native_ptr(llvm::Value* native_ptr = NULL);
/// Returns the i1 result of this == other. this and other must be non-null.
llvm::Value* eq(CodegenAnyVal* other);
/// Compares this *Val to the value of 'native_ptr'. 'native_ptr' should be a pointer to
/// a native type, StringValue, or DateTimeValue. This *Val should match 'native_ptr's
/// type (e.g. if this is an IntVal, 'native_ptr' should have type i32*). Returns the i1
/// result of the equality comparison.
llvm::Value* eq_to_native_ptr(llvm::Value* native_ptr);
/// Returns the i32 result of comparing this value to 'other' (similar to
/// RawValue::Compare()). This and 'other' must be non-null. Return value is < 0 if
/// this < 'other', 0 if this == 'other', > 0 if this > 'other'.
llvm::Value* compare(CodegenAnyVal* other, const char* name);
llvm::Value* compare(CodegenAnyVal* other) {
return compare(other, "result");
}
/// Ctor for creating an uninitialized CodegenAnyVal that can be assigned to later.
CodegenAnyVal() :
_type(INVALID_TYPE), _value(NULL), _name(NULL), _codegen(NULL), _builder(NULL) {
}
private:
TypeDescriptor _type;
llvm::Value* _value;
const char* _name;
LlvmCodeGen* _codegen;
LlvmCodeGen::LlvmBuilder* _builder;
/// Helper function for getting the top (most significant) half of 'v'.
/// 'v' should have width = 'num_bits' * 2 and be an integer type.
llvm::Value* get_high_bits(int num_bits, llvm::Value* v, const char* name);
llvm::Value* get_high_bits(int num_bits, llvm::Value* v) {
return get_high_bits(num_bits, v, "");
}
/// Helper function for setting the top (most significant) half of a 'dst' to 'src'.
/// 'src' must have width <= 'num_bits' and 'dst' must have width = 'num_bits' * 2.
/// Both 'dst' and 'src' should be integer types.
llvm::Value* set_high_bits(
int num_bits, llvm::Value* src, llvm::Value* dst, const char* name);
llvm::Value* set_high_bits(
int num_bits, llvm::Value* src, llvm::Value* dst) {
return set_high_bits(num_bits, src, dst, "");
}
};
}
#endif

63
be/src/codegen/codegen_anyval_ir.cpp Normal file
View File

@ -0,0 +1,63 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifdef IR_COMPILE
#include "runtime/string_value.hpp"
#include "runtime/datetime_value.h"
#include "runtime/decimal_value.h"
#include "udf/udf.h"
namespace palo {
// Note: we explicitly pass by reference because passing by value has special ABI rules
// Used by CodegenAnyVal::eq()
bool string_val_eq(const StringVal& x, const StringVal& y) {
return x == y;
}
bool datetime_val_eq(const DateTimeVal& x, const DateTimeVal& y) {
return x == y;
}
bool decimal_val_eq(const DecimalVal& x, const DecimalVal& y) {
return x == y;
}
// Used by CodegenAnyVal::eq_to_native_ptr()
bool string_value_eq(const StringVal& x, const StringValue& y) {
StringValue sv = StringValue::from_string_val(x);
return sv.eq(y);
}
bool datetime_value_eq(const DateTimeVal& x, const DateTimeValue& y) {
DateTimeValue tv = DateTimeValue::from_datetime_val(x);
return tv == y;
}
bool decimal_value_eq(const DecimalVal& x, const DecimalValue& y) {
DecimalValue tv = DecimalValue::from_decimal_val(x);
return tv == y;
}
}
#else
#error "This file should only be used for cross compiling to IR."
#endif

199
be/src/codegen/gen_ir_descriptions.py Normal file
View File

@ -0,0 +1,199 @@
# Modifications copyright (C) 2017, Baidu.com, Inc.
# Copyright 2017 The Apache Software Foundation
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""
This script will generate two headers that describe all of the clang cross compiled
functions.
The script outputs (run: 'be/src/codegen/gen_ir_descriptions.py')
- $PALO_HOME/gensrc/build/palo_ir/palo_ir_functions.h
This file contains enums for all of the cross compiled functions.
- $PALO_HOME/gensrc/build/palo_ir/palo_ir_names.h
This file contains a mapping of <function name string, enum>: the compiled
function name only has to be a substring of the actual, mangled,
compiler-generated name.
TODO: should we work out the mangling rules?
"""
import string
import os
ir_functions = [
["AGG_NODE_PROCESS_ROW_BATCH_WITH_GROUPING", "process_row_batch_with_grouping"],
["AGG_NODE_PROCESS_ROW_BATCH_NO_GROUPING", "process_row_batch_no_grouping"],
# ["EXPR_GET_VALUE", "IrExprGetValue"],
# ["HASH_CRC", "IrCrcHash"],
# ["HASH_FVN", "IrFvnHash"],
["HASH_JOIN_PROCESS_BUILD_BATCH", "12HashJoinNode19process_build_batch"],
["HASH_JOIN_PROCESS_PROBE_BATCH", "12HashJoinNode19process_probe_batch"],
["EXPR_GET_BOOLEAN_VAL", "4Expr15get_boolean_val"],
["EXPR_GET_TINYINT_VAL", "4Expr16get_tiny_int_val"],
["EXPR_GET_SMALLINT_VAL", "4Expr17get_small_int_val"],
["EXPR_GET_INT_VAL", "4Expr11get_int_val"],
["EXPR_GET_BIGINT_VAL", "4Expr15get_big_int_val"],
["EXPR_GET_LARGEINT_VAL", "4Expr17get_large_int_val"],
["EXPR_GET_FLOAT_VAL", "4Expr13get_float_val"],
["EXPR_GET_DOUBLE_VAL", "4Expr14get_double_val"],
["EXPR_GET_STRING_VAL", "4Expr14get_string_val"],
["EXPR_GET_DATETIME_VAL", "4Expr16get_datetime_val"],
["EXPR_GET_DECIMAL_VAL", "4Expr15get_decimal_val"],
["HASH_CRC", "ir_crc_hash"],
["HASH_FNV", "ir_fnv_hash"],
["FROM_DECIMAL_VAL", "16from_decimal_val"],
["TO_DECIMAL_VAL", "14to_decimal_val"],
["FROM_DATETIME_VAL", "17from_datetime_val"],
["TO_DATETIME_VAL", "15to_datetime_val"],
["IR_STRING_COMPARE", "ir_string_compare"],
# ["STRING_VALUE_EQ", "StringValueEQ"],
# ["STRING_VALUE_NE", "StringValueNE"],
# ["STRING_VALUE_GE", "StringValueGE"],
# ["STRING_VALUE_GT", "StringValueGT"],
# ["STRING_VALUE_LT", "StringValueLT"],
# ["STRING_VALUE_LE", "StringValueLE"],
# ["STRING_TO_BOOL", "IrStringToBool"],
# ["STRING_TO_INT8", "IrStringToInt8"],
# ["STRING_TO_INT16", "IrStringToInt16"],
# ["STRING_TO_INT32", "IrStringToInt32"],
# ["STRING_TO_INT64", "IrStringToInt64"],
# ["STRING_TO_FLOAT", "IrStringToFloat"],
# ["STRING_TO_DOUBLE", "IrStringToDouble"],
# ["STRING_IS_NULL", "IrIsNullString"],
["HLL_UPDATE_BOOLEAN", "hll_updateIN8palo_udf10BooleanVal"],
["HLL_UPDATE_TINYINT", "hll_updateIN8palo_udf10TinyIntVal"],
["HLL_UPDATE_SMALLINT", "hll_updateIN8palo_udf11SmallIntVal"],
["HLL_UPDATE_INT", "hll_updateIN8palo_udf6IntVal"],
["HLL_UPDATE_BIGINT", "hll_updateIN8palo_udf9BigIntVal"],
["HLL_UPDATE_FLOAT", "hll_updateIN8palo_udf8FloatVal"],
["HLL_UPDATE_DOUBLE", "hll_updateIN8palo_udf9DoubleVal"],
["HLL_UPDATE_STRING", "hll_updateIN8palo_udf9StringVal"],
["HLL_UPDATE_TIMESTAMP", "hll_updateIN8palo_udf11DateTimeVal"],
["HLL_UPDATE_DECIMAL", "hll_updateIN8palo_udf10DecimalVal"],
["HLL_MERGE", "hll_merge"],
["CODEGEN_ANYVAL_DATETIME_VAL_EQ", "datetime_val_eq"],
["CODEGEN_ANYVAL_STRING_VAL_EQ", "string_val_eq"],
["CODEGEN_ANYVAL_DECIMAL_VAL_EQ", "decimal_val_eq"],
["CODEGEN_ANYVAL_DATETIME_VALUE_EQ", "datetime_value_eq"],
["CODEGEN_ANYVAL_STRING_VALUE_EQ", "string_value_eq"],
["CODEGEN_ANYVAL_DECIMAL_VALUE_EQ", "decimal_value_eq"],
["RAW_VALUE_COMPARE", "8RawValue7compare"],
]
enums_preamble = '\
// Modifications copyright (C) 2017, Baidu.com, Inc.\n\
// Copyright 2017 The Apache Software Foundation\n\
//\n\
// Licensed under the Apache License, Version 2.0 (the "License");\n\
// you may not use this file except in compliance with the License.\n\
// You may obtain a copy of the License at\n\
//\n\
// http://www.apache.org/licenses/LICENSE-2.0\n\
//\n\
// Unless required by applicable law or agreed to in writing, software\n\
// distributed under the License is distributed on an "AS IS" BASIS,\n\
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n\
// See the License for the specific language governing permissions and\n\
// limitations under the License.\n\
\n\
// This is a generated file, DO NOT EDIT IT.\n\
// To add new functions, see be/src/codegen/gen_ir_descriptions.py.\n\
\n\
#ifndef PALO_IR_FUNCTIONS_H\n\
#define PALO_IR_FUNCTIONS_H\n\
\n\
namespace palo {\n\
\n\
class IRFunction {\n\
public:\n\
enum Type {\n'
enums_epilogue = '\
};\n\
};\n\
\n\
}\n\
\n\
#endif\n'
names_preamble = '\
// Modifications copyright (C) 2017, Baidu.com, Inc.\n\
// Copyright 2017 The Apache Software Foundation\n\
//\n\
// Licensed under the Apache License, Version 2.0 (the "License");\n\
// you may not use this file except in compliance with the License.\n\
// You may obtain a copy of the License at\n\
//\n\
// http://www.apache.org/licenses/LICENSE-2.0\n\
//\n\
// Unless required by applicable law or agreed to in writing, software\n\
// distributed under the License is distributed on an "AS IS" BASIS,\n\
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n\
// See the License for the specific language governing permissions and\n\
// limitations under the License.\n\
\n\
// This is a generated file, DO NOT EDIT IT.\n\
// To add new functions, see be/src/codegen/gen_ir_descriptions.py.\n\
\n\
#ifndef PALO_IR_FUNCTION_NAMES_H\n\
#define PALO_IR_FUNCTION_NAMES_H\n\
\n\
#include "palo_ir/palo_ir_functions.h"\n\
\n\
namespace palo {\n\
\n\
static struct {\n\
std::string fn_name; \n\
IRFunction::Type fn; \n\
} FN_MAPPINGS[] = {\n'
names_epilogue = '\
};\n\
\n\
}\n\
\n\
#endif\n'
BE_PATH = os.environ['PALO_HOME'] + "/gensrc/build/palo_ir/"
if not os.path.exists(BE_PATH):
os.makedirs(BE_PATH)
if __name__ == "__main__":
print "Generating IR description files"
enums_file = open(BE_PATH + 'palo_ir_functions.h', 'w')
enums_file.write(enums_preamble)
names_file = open(BE_PATH + 'palo_ir_names.h', 'w')
names_file.write(names_preamble)
idx = 0
enums_file.write(" FN_START = " + str(idx) + ",\n")
for fn in ir_functions:
enum = fn[0]
fn_name = fn[1]
enums_file.write(" " + enum + " = " + str(idx) + ",\n")
names_file.write(" { \"" + fn_name + "\", IRFunction::" + enum + " },\n")
idx = idx + 1
enums_file.write(" FN_END = " + str(idx) + "\n")
enums_file.write(enums_epilogue)
enums_file.close()
names_file.write(names_epilogue)
names_file.close()
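For reference, the generated headers look roughly like the following abridged excerpt (illustrative; the enum values assume the ir_functions list above as-is). Note that each mapped string, e.g. "4Expr11get_int_val", only needs to be a substring of the mangled symbol the compiler actually emits (such as _ZN4palo4Expr11get_int_val... under Itanium mangling):

// palo_ir_functions.h (abridged)
namespace palo {
class IRFunction {
public:
    enum Type {
        FN_START = 0,
        AGG_NODE_PROCESS_ROW_BATCH_WITH_GROUPING = 0,
        AGG_NODE_PROCESS_ROW_BATCH_NO_GROUPING = 1,
        // ... one entry per row of ir_functions ...
        RAW_VALUE_COMPARE = 39,
        FN_END = 40
    };
};
}

// palo_ir_names.h (abridged)
namespace palo {
static struct {
    std::string fn_name;
    IRFunction::Type fn;
} FN_MAPPINGS[] = {
    { "process_row_batch_with_grouping", IRFunction::AGG_NODE_PROCESS_ROW_BATCH_WITH_GROUPING },
    // ...
    { "8RawValue7compare", IRFunction::RAW_VALUE_COMPARE },
};
}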

File diff suppressed because it is too large

View File

@ -0,0 +1,644 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_CODEGEN_LLVM_CODEGEN_H
#define BDG_PALO_BE_SRC_QUERY_CODEGEN_LLVM_CODEGEN_H
#include <map>
#include <string>
#include <vector>
#include <boost/scoped_ptr.hpp>
#include <boost/thread/mutex.hpp>
#include <llvm/IR/DerivedTypes.h>
#include <llvm/IR/Intrinsics.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/Analysis/Verifier.h>
#include <llvm/Support/raw_ostream.h>
#include <llvm/Support/MemoryBuffer.h>
#include "common/status.h"
#include "runtime/primitive_type.h"
#include "exprs/expr.h"
#include "util/runtime_profile.h"
#include "palo_ir/palo_ir_functions.h"
// Forward declare all llvm classes to avoid namespace pollution.
namespace llvm {
class AllocaInst;
class BasicBlock;
class ConstantFolder;
class ExecutionEngine;
class Function;
class FunctionPassManager;
class LLVMContext;
class Module;
class NoFolder;
class PassManager;
class PointerType;
class StructType;
class TargetData;
class Type;
class Value;
template<bool B, typename T, typename I>
class IRBuilder;
template<bool preserveName>
class IRBuilderDefaultInserter;
}
namespace palo {
class SubExprElimination;
// LLVM code generator. This is the top level object to generate jitted code.
//
// LLVM provides a c++ IR builder interface so IR does not need to be written
// manually. The interface is very low level so each line of IR that needs to
// be output maps 1:1 with calls to the interface.
// The llvm documentation is not fantastic and a lot of this was figured out
// by experimenting. Thankfully, their API is pretty well designed so it's
// possible to get by without great documentation. The llvm tutorial is very
// helpful, http://llvm.org/docs/tutorial/LangImpl1.html. In this tutorial, they
// go over how to JIT an AST for a toy language they create.
// It is also helpful to use their online app that lets you compile c/c++ to IR.
// http://llvm.org/demo/index.cgi.
//
// This class provides two interfaces, one for testing and one for the query
// engine. The interface for the query engine will load the cross-compiled
// IR module (output during the build) and extract all of functions that will
// be called directly. The test interface can be used to load any precompiled
// module or none at all (but this class will not validate the module).
//
// This class is mostly not threadsafe. During the Prepare() phase of the fragment
// execution, nodes should codegen functions.
// Afterward, optimize_module() should be called at which point all codegened functions
// are optimized.
// Subsequently, nodes can get at the jit compiled function pointer (typically during the
// Open() call). Getting the jit compiled function (jit_function()) is the only thread
// safe function.
//
// Currently, each query will create and initialize one of these
// objects. This requires loading and parsing the cross compiled modules.
// TODO: we should be able to do this once per process and let llvm compile
// functions from across modules.
//
// LLVM has a nontrivial memory management scheme and objects will take
// ownership of others. The document is pretty good about being explicit with this
// but it is not very intuitive.
// TODO: look into diagnostic output and debuggability
// TODO: confirm that the multi-threaded usage is correct
class LlvmCodeGen {
public:
// This function must be called once per process before any llvm API calls are
// made. LLVM needs to allocate data structures for multi-threading support and
// to enable dynamic linking of jitted code.
// If 'load_backend' is true, load the backend static object for llvm. This is
// needed when libbackend.so is loaded from java: llvm will by default only look
// in the current object and will not be able to find the backend symbols.
// TODO: this can probably be removed after Palo refactor where the java
// side is not loading the be explicitly anymore.
static void initialize_llvm(bool load_backend = false);
// Loads and parses the precompiled palo IR module
// codegen will contain the created object on success.
static Status load_palo_ir(
ObjectPool*, const std::string& id, boost::scoped_ptr<LlvmCodeGen>* codegen);
// Removes all jit compiled dynamically linked functions from the process.
~LlvmCodeGen();
RuntimeProfile* runtime_profile() {
return &_profile;
}
RuntimeProfile::Counter* codegen_timer() {
return _codegen_timer;
}
// Turns on/off optimization passes
void enable_optimizations(bool enable);
// For debugging. Returns the IR that was generated. If 'full_module' is true,
// the entire module is dumped, including what was loaded from precompiled IR.
// If false, only the IR for functions generated at runtime is output.
std::string get_ir(bool full_module) const;
// Typedef builder in case we want to change the template arguments later
typedef llvm::IRBuilder<> LlvmBuilder;
// Utility struct that wraps a variable name and llvm type.
struct NamedVariable {
std::string name;
llvm::Type* type;
NamedVariable(const std::string& name = "", llvm::Type* type = NULL) {
this->name = name;
this->type = type;
}
};
// Abstraction over function prototypes. Contains helpers to build prototypes and
// generate IR for the types.
class FnPrototype {
public:
// Create a function prototype object, specifying the name of the function and
// the return type.
FnPrototype(LlvmCodeGen*, const std::string& name, llvm::Type* ret_type);
// Returns name of function
const std::string& name() const {
return _name;
}
// Add argument
void add_argument(const NamedVariable& var) {
_args.push_back(var);
}
void add_argument(const std::string& name, llvm::Type* type) {
_args.push_back(NamedVariable(name, type));
}
// Generate LLVM function prototype.
// If a non-null builder is passed, this function will also create the entry block
// and set the builder's insert point to there.
// If params is non-null, this function will also return the arguments
// values (params[0] is the first arg, etc).
// In that case, params should be preallocated to be number of arguments
llvm::Function* generate_prototype(LlvmBuilder* builder = NULL,
llvm::Value** params = NULL);
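// Example use (an illustrative sketch; the names are hypothetical):
//   LlvmCodeGen::FnPrototype prototype(codegen, "MyFn", codegen->int_type());
//   prototype.add_argument("x", codegen->int_type());
//   llvm::Value* args[1];
//   llvm::Function* fn = prototype.generate_prototype(&builder, args);
//   // 'builder' now points at the entry block and args[0] holds 'x'.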
private:
friend class LlvmCodeGen;
LlvmCodeGen* _codegen;
std::string _name;
llvm::Type* _ret_type;
std::vector<NamedVariable> _args;
};
/// Codegens IR to load array[idx] and returns the loaded value. 'array' should be a
/// C-style array (e.g. i32*) or an IR array (e.g. [10 x i32]). This function does not
/// do bounds checking.
llvm::Value* codegen_array_at(
LlvmBuilder*, llvm::Value* array, int idx, const char* name);
/// Return a pointer type to 'type'
llvm::PointerType* get_ptr_type(llvm::Type* type);
// Returns llvm type for the primitive type
llvm::Type* get_type(const PrimitiveType& type);
// Returns llvm type for the primitive type
llvm::Type* get_type(const TypeDescriptor& type);
// Return a pointer type to 'type' (e.g. int16_t*)
llvm::PointerType* get_ptr_type(const TypeDescriptor& type);
llvm::PointerType* get_ptr_type(const PrimitiveType& type);
// Returns the type with 'name'. This is used to pull types from clang
// compiled IR. The types we generate at runtime are unnamed.
// The name is generated by the clang compiler in this form:
// <class/struct>.<namespace>::<class name>. For example:
// "class.palo::AggregationNode"
llvm::Type* get_type(const std::string& name);
/// Returns the pointer type of the type returned by GetType(name)
llvm::PointerType* get_ptr_type(const std::string& name);
/// Alloca's an instance of the appropriate pointer type and sets it to point at 'v'
llvm::Value* get_ptr_to(LlvmBuilder* builder, llvm::Value* v, const char* name);
/// Alloca's an instance of the appropriate pointer type and sets it to point at 'v'
llvm::Value* get_ptr_to(LlvmBuilder* builder, llvm::Value* v) {
return get_ptr_to(builder, v, "");
}
// Returns reference to llvm context object. Each LlvmCodeGen has its own
// context to allow multiple threads to be calling into llvm at the same time.
llvm::LLVMContext& context() {
return *_context.get();
}
// Returns execution engine interface
llvm::ExecutionEngine* execution_engine() {
return _execution_engine.get();
}
// Returns the underlying llvm module
llvm::Module* module() {
return _module;
}
// Register a expr function with unique id. It can be subsequently retrieved via
// get_registered_expr_fn with that id.
void register_expr_fn(int64_t id, llvm::Function* function) {
DCHECK(_registered_exprs_map.find(id) == _registered_exprs_map.end());
_registered_exprs_map[id] = function;
_registered_exprs.insert(function);
}
// Returns a registered expr function for id or NULL if it does not exist.
llvm::Function* get_registered_expr_fn(int64_t id) {
std::map<int64_t, llvm::Function*>::iterator it = _registered_exprs_map.find(id);
if (it == _registered_exprs_map.end()) {
return NULL;
}
return it->second;
}
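// Example (illustrative): register a codegen'd expr fn once, look it up later.
//   codegen->register_expr_fn(expr_id, fn);
//   llvm::Function* cached = codegen->get_registered_expr_fn(expr_id);  // == fn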
/// Optimize and compile the module. This should be called after all functions to JIT
/// have been added to the module via AddFunctionToJit(). If optimizations_enabled_ is
/// false, the module will not be optimized before compilation.
Status finalize_module();
// Optimize the entire module. LLVM is better suited to running its optimization
// passes over the entire module (all of the functions) rather than over
// individual functions.
void optimize_module();
// Replaces all instructions that call 'target_name' with a call instruction
// to the new_fn. Returns the modified function.
// - target_name is the unmangled function name that should be replaced.
// The name is assumed to be unmangled so all call sites that contain the
// replace_name substring will be replaced. target_name is case-sensitive
// TODO: be more strict than substring? work out the mangling rules?
// - If update_in_place is true, the caller function will be modified in place.
//   Otherwise, the caller function will be cloned and the original function is
//   left unmodified. If update_in_place is false and the function has already
//   been dynamically linked, the existing function will be unlinked. Note that
//   this is not thread-safe: if any thread is executing the function being
//   unlinked, bad things will happen.
// - 'num_replaced' returns the number of call sites updated
//
// Most of our use cases will likely not be in place. We will have one 'template'
// version of the function loaded for each type of Node (e.g. AggregationNode).
// Each instance of the node will clone the function, replacing the inner loop
// body with the codegened version. The codegened bodies differ from instance
// to instance since they are specific to the node's tuple desc.
llvm::Function* replace_call_sites(llvm::Function* caller, bool update_in_place,
llvm::Function* new_fn, const std::string& target_name, int* num_replaced);
/// Returns a copy of fn. The copy is added to the module.
llvm::Function* clone_function(llvm::Function* fn);
// Verify and optimize function. This should be called at the end for each
// codegen'd function. If the function does not verify, it will return NULL,
// otherwise, it will optimize, mark the function for inlining and return the
// function object.
llvm::Function* finalize_function(llvm::Function* function);
// Inline all function calls for 'fn'. 'fn' is modified in place. Returns
// the number of functions inlined. This is *not* called recursively
// (i.e. second level function calls are not inlined). This can be called
// again to inline those until this returns 0.
int inline_call_sites(llvm::Function* fn, bool skip_registered_fns);
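// Example (sketch): inline to a fixed point, as SubExprElimination::run() does:
//   while (inline_call_sites(fn, true) > 0) {}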
// Optimizes the function in place. This uses a combination of llvm optimization
// passes as well as some custom heuristics. This should be called for all
// functions which call Exprs. The exprs will be inlined as much as possible,
// and will do basic sub expression elimination.
// This should be called before optimize_module for functions that want to remove
// redundant exprs. This should be called at the highest level possible to
// maximize the number of redundant exprs that can be found.
// TODO: we need to spend more time to output better IR. Asking llvm to
// remove redundant codeblocks on its own is too difficult for it.
// TODO: this should implement the llvm FunctionPass interface and be
// integrated with the llvm optimization passes.
llvm::Function* optimize_function_with_exprs(llvm::Function* fn);
/// Adds the function to be automatically jit compiled after the module is optimized.
/// That is, after FinalizeModule(), this will do *result_fn_ptr = JitFunction(fn);
//
/// This is useful since it is not valid to call JitFunction() before every part of the
/// query has finished adding their IR and it's convenient to not have to rewalk the
/// objects. This provides the same behavior as walking each of those objects and calling
/// JitFunction().
//
/// In addition, any functions not registered with AddFunctionToJit() are marked as
/// internal in FinalizeModule() and may be removed as part of optimization.
//
/// This will also wrap functions returning DecimalVals in an ABI-compliant wrapper (see
/// the comment in the .cc file for details). This is so we don't accidentally try to
/// call non-compliant code from native code.
void add_function_to_jit(llvm::Function* fn, void** fn_ptr);
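// Example of the intended lifecycle (an illustrative sketch; error handling
// omitted):
//   llvm::Function* fn = ...;  // built during Prepare()
//   fn = codegen->finalize_function(fn);
//   void* jitted_fn = NULL;
//   codegen->add_function_to_jit(fn, &jitted_fn);
//   // after finalize_module() succeeds, jitted_fn points at executable code.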
// Jit compile the function. This will run optimization passes and verify
// the function. The result is a function pointer that is dynamically linked
// into the process.
// Returns NULL if the function is invalid.
// scratch_size will be set to the buffer size required to call the function
// scratch_size is the total size from all LlvmCodeGen::get_scratch_buffer
// calls (with some additional bytes for alignment)
// This function is thread safe.
void* jit_function(llvm::Function* function, int* scratch_size = NULL);
// Verifies the function if the verifier is enabled. Returns false if the
// function is invalid.
bool verify_function(llvm::Function* function);
// This will generate a printf call instruction to output 'message' at the
// builder's insert point. Only for debugging.
void codegen_debug_trace(LlvmBuilder* builder, const char* message);
/// Returns the string representation of a llvm::Value* or llvm::Type*
template <typename T>
static std::string print(T* value_or_type) {
std::string str;
llvm::raw_string_ostream stream(str);
value_or_type->print(stream);
return str;
}
// Returns the libc function, adding it to the module if it has not already been.
llvm::Function* get_lib_c_function(FnPrototype* prototype);
// Returns the cross compiled function. IRFunction::Type is an enum which is
// defined in 'palo_ir/palo_ir_functions.h'
llvm::Function* get_function(IRFunction::Type);
// Returns the hash function with signature:
// int32_t Hash(int8_t* data, int len, int32_t seed);
// If num_bytes is non-zero, the returned function will be codegen'd to only
// work for that number of bytes. It is invalid to call that function with a
// different 'len'.
llvm::Function* get_hash_function(int num_bytes = -1);
// Allocate stack storage for local variables. This is similar to traditional c, where
// all the variables must be declared at the top of the function. This helper can be
// called from anywhere and will add a stack allocation for 'var' at the beginning of
// the function. This would be used, for example, if a function needed a temporary
// struct allocated. The allocated variable is scoped to the function.
// This is not related to get_scratch_buffer which is used for structs that are returned
// to the caller.
llvm::AllocaInst* create_entry_block_alloca(llvm::Function* f, const NamedVariable& var);
llvm::AllocaInst* create_entry_block_alloca(
const LlvmBuilder& builder, llvm::Type* type, const char* name);
// Utility to create two blocks in 'fn' for if/else codegen. if_block and else_block
// are return parameters. insert_before is optional and if set, the two blocks
// will be inserted before that block; otherwise, they will be inserted at the end
// of 'fn'. Being able to place blocks is useful for debugging so the IR has a
// better looking control flow.
void create_if_else_blocks(llvm::Function* fn, const std::string& if_name,
const std::string& else_name,
llvm::BasicBlock** if_block, llvm::BasicBlock** else_block,
llvm::BasicBlock* insert_before = NULL);
// Returns offset into scratch buffer: offset points to area of size 'byte_size'
// Called by expr generation to request scratch buffer. This is used for struct
// types (i.e. StringValue) where data cannot be returned by registers.
// For example, to jit the expr "strlen(str_col)", we need a temporary StringValue
// struct from the inner SlotRef expr node. The SlotRef node would call
// get_scratch_buffer(sizeof(StringValue)) and output the intermediate struct at
// scratch_buffer (passed in as argument to compute function) + offset.
int get_scratch_buffer(int byte_size);
// Create a llvm pointer value from 'ptr'. This is used to pass pointers between
// c-code and code-generated IR. The resulting value will be of 'type'.
llvm::Value* cast_ptr_to_llvm_ptr(llvm::Type* type, void* ptr);
// Returns the constant 'val' of 'type'
llvm::Value* get_int_constant(PrimitiveType type, int64_t val);
// Returns true/false constants (bool type)
llvm::Value* true_value() {
return _true_value;
}
llvm::Value* false_value() {
return _false_value;
}
llvm::Value* null_ptr_value() {
return llvm::ConstantPointerNull::get(ptr_type());
}
// Simple wrappers to reduce code verbosity
llvm::Type* boolean_type() {
return get_type(TYPE_BOOLEAN);
}
llvm::Type* tinyint_type() {
return get_type(TYPE_TINYINT);
}
llvm::Type* smallint_type() {
return get_type(TYPE_SMALLINT);
}
llvm::Type* int_type() {
return get_type(TYPE_INT);
}
llvm::Type* bigint_type() {
return get_type(TYPE_BIGINT);
}
llvm::Type* largeint_type() {
return get_type(TYPE_LARGEINT);
}
llvm::Type* float_type() {
return get_type(TYPE_FLOAT);
}
llvm::Type* double_type() {
return get_type(TYPE_DOUBLE);
}
llvm::Type* string_val_type() const {
return _string_val_type;
}
llvm::Type* datetime_val_type() const {
return _datetime_val_type;
}
llvm::Type* decimal_val_type() const {
return _decimal_val_type;
}
llvm::PointerType* ptr_type() {
return _ptr_type;
}
llvm::Type* void_type() {
return _void_type;
}
llvm::Type* i128_type() {
return llvm::Type::getIntNTy(context(), 128);
}
// Fills 'functions' with all the functions that are defined in the module.
// Note: this does not include functions that are just declared
void get_functions(std::vector<llvm::Function*>* functions);
// Generates function to return min/max(v1, v2)
llvm::Function* codegen_min_max(const TypeDescriptor& type, bool min);
// Codegen to call llvm memcpy intrinsic at the current builder location
// dst & src must be pointer types. size is the number of bytes to copy.
void codegen_memcpy(LlvmBuilder*, llvm::Value* dst, llvm::Value* src, int size);
// Codegen for do *dst = src. For native types, this is just a store, for structs
// we need to assign the fields one by one
void codegen_assign(LlvmBuilder*, llvm::Value* dst, llvm::Value* src, PrimitiveType);
llvm::Instruction::CastOps get_cast_op(
const TypeDescriptor& from_type, const TypeDescriptor& to_type);
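// Example (illustrative; 'int_desc'/'double_desc' are hypothetical descriptors):
//   llvm::Instruction::CastOps op = get_cast_op(int_desc, double_desc);  // SIToFP
//   llvm::Value* as_double = builder.CreateCast(op, int_val, double_type());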
private:
friend class LlvmCodeGenTest;
friend class SubExprElimination;
// Top level codegen object. 'module_name' is only used for debugging when
// outputting the IR. Modules loaded from disk will be named after the file
// path.
LlvmCodeGen(ObjectPool* pool, const std::string& module_name);
// Initializes the jitter and execution engine.
Status init();
// Load a pre-compiled IR module from 'file'. This creates a top level
// codegen object. This is used by tests to load custom modules.
// codegen will contain the created object on success.
static Status load_from_file(ObjectPool*, const std::string& file,
boost::scoped_ptr<LlvmCodeGen>* codegen);
/// Load a pre-compiled IR module from module_ir. This creates a top level codegen
/// object. codegen will contain the created object on success.
static Status load_from_memory(ObjectPool* pool, llvm::MemoryBuffer* module_ir,
const std::string& module_name, const std::string& id,
boost::scoped_ptr<LlvmCodeGen>* codegen);
/// Loads an LLVM module. 'module_ir' should be a reference to a memory buffer containing
/// LLVM bitcode. module_name is the name of the module to use when reporting errors.
/// The caller is responsible for cleaning up module.
static Status load_module_from_memory(LlvmCodeGen* codegen, llvm::MemoryBuffer* module_ir,
const std::string& module_name, llvm::Module** module);
// Load the intrinsics palo needs. This is a one time initialization.
// Values are stored in '_llvm_intrinsics'
Status load_intrinsics();
// Clears generated hash fns. This is only used for testing.
void clear_hash_fns();
// Name of the JIT module. Useful for debugging.
std::string _name;
// Codegen counters
RuntimeProfile _profile;
RuntimeProfile::Counter* _load_module_timer;
RuntimeProfile::Counter* _prepare_module_timer;
RuntimeProfile::Counter* _module_file_size;
RuntimeProfile::Counter* _codegen_timer;
RuntimeProfile::Counter* _optimization_timer;
RuntimeProfile::Counter* _compile_timer;
// whether or not optimizations are enabled
bool _optimizations_enabled;
// If true, the module is corrupt and we cannot codegen this query.
// TODO: we could consider just removing the offending function and attempting to
// codegen the rest of the query. This requires more testing though to make sure
// that the error is recoverable.
bool _is_corrupt;
// If true, the module has been compiled. It is not valid to add additional
// functions after this point.
bool _is_compiled;
// Error string that llvm will write to
std::string _error_string;
// Top level llvm object. Objects from different contexts do not share anything.
// We can have multiple instances of the LlvmCodeGen object in different threads
boost::scoped_ptr<llvm::LLVMContext> _context;
// Top level codegen object. Contains everything to jit one 'unit' of code.
// Owned by the _execution_engine.
llvm::Module* _module;
// Execution/Jitting engine.
boost::scoped_ptr<llvm::ExecutionEngine> _execution_engine;
// current offset into scratch buffer
int _scratch_buffer_offset;
// Keeps track of all the functions that have been jit compiled and linked into
// the process. Special care needs to be taken if we need to modify these functions.
// bool is unused.
std::map<llvm::Function*, bool> _jitted_functions;
// Lock protecting _jitted_functions
boost::mutex _jitted_functions_lock;
// Keeps track of the external functions that have been included in this module
// e.g libc functions or non-jitted palo functions.
// TODO: this should probably be FnPrototype->Functions mapping
std::map<std::string, llvm::Function*> _external_functions;
// Functions parsed from the pre-compiled module. Indexed by the IRFunction::Type enum.
std::vector<llvm::Function*> _loaded_functions;
// Stores functions codegen'd by palo. This does not contain cross compiled
// functions, only function that were generated at runtime. Does not overlap
// with _loaded_functions.
std::vector<llvm::Function*> _codegend_functions;
// A mapping of unique id to registered expr functions
std::map<int64_t, llvm::Function*> _registered_exprs_map;
// A set of all the functions in '_registered_exprs_map' for quick lookup.
std::set<llvm::Function*> _registered_exprs;
// A cache of loaded llvm intrinsics
std::map<llvm::Intrinsic::ID, llvm::Function*> _llvm_intrinsics;
// This is a cache of generated hash functions by byte size. It is common
// for the caller to know the number of bytes to hash (e.g. tuple width) and
// we can codegen a loop unrolled hash function.
std::map<int, llvm::Function*> _hash_fns;
/// The locations of modules that have been linked. Used to avoid linking the same module
/// twice, which causes symbol collision errors.
std::set<std::string> _linked_modules;
/// The vector of functions to automatically JIT compile after FinalizeModule().
std::vector<std::pair<llvm::Function*, void**> > _fns_to_jit_compile;
// Debug utility that will insert a printf-like function into the generated
// IR. Useful for debugging the IR. This is lazily created.
llvm::Function* _debug_trace_fn;
// Debug strings that will be outputted by jitted code. This is a copy of all
// strings passed to codegen_debug_trace.
std::vector<std::string> _debug_strings;
// llvm representation of a few common types. Owned by context.
llvm::PointerType* _ptr_type; // int8_t*
llvm::Type* _void_type; // void
llvm::Type* _string_val_type; // StringVal
llvm::Type* _decimal_val_type; // DecimalVal
llvm::Type* _datetime_val_type; // DateTimeVal
// llvm constants to help with code gen verbosity
llvm::Value* _true_value;
llvm::Value* _false_value;
};
}
#endif

View File

@ -0,0 +1,458 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include <string>
#include <gtest/gtest.h>
#include <boost/thread/thread.hpp>
#include "codegen/llvm-codegen.h"
#include "runtime/raw-value.h"
#include "util/cpu-info.h"
#include "util/disk-info.h"
#include "util/hash-util.h"
#include "util/mem-info.h"
#include "util/path-builder.h"
using namespace std;
using namespace boost;
using namespace llvm;
namespace palo {
class LlvmCodeGenTest : public testing::Test {
private:
static void LifetimeTest() {
ObjectPool pool;
Status status;
for (int i = 0; i < 10; ++i) {
LlvmCodeGen object1(&pool, "Test");
LlvmCodeGen object2(&pool, "Test");
LlvmCodeGen object3(&pool, "Test");
status = object1.init();
ASSERT_TRUE(status.ok());
status = object2.init();
ASSERT_TRUE(status.ok());
status = object3.init();
ASSERT_TRUE(status.ok());
}
}
// Wrapper to call private test-only methods on LlvmCodeGen object
static Status load_from_file(ObjectPool* pool, const string& filename,
scoped_ptr<LlvmCodeGen>* codegen) {
return LlvmCodeGen::load_from_file(pool, filename, codegen);
}
static LlvmCodeGen* CreateCodegen(ObjectPool* pool) {
LlvmCodeGen* codegen = pool->add(new LlvmCodeGen(pool, "Test"));
if (codegen != NULL) {
Status status = codegen->init();
if (!status.ok()) {
return NULL;
}
}
return codegen;
}
static void clear_hash_fns(LlvmCodeGen* codegen) {
codegen->clear_hash_fns();
}
};
// Simple test to just make and destroy llvmcodegen objects. LLVM
// has non-obvious object ownership transfers and this sanity checks that.
TEST_F(LlvmCodeGenTest, BasicLifetime) {
LifetimeTest();
}
// Same as above but multithreaded
TEST_F(LlvmCodeGenTest, MultithreadedLifetime) {
const int NUM_THREADS = 10;
thread_group thread_group;
for (int i = 0; i < NUM_THREADS; ++i) {
thread_group.add_thread(new thread(&LifetimeTest));
}
thread_group.join_all();
}
// Test loading a non-existent file
TEST_F(LlvmCodeGenTest, BadIRFile) {
ObjectPool pool;
string module_file = "NonExistentFile.ir";
scoped_ptr<LlvmCodeGen> codegen;
Status status = LlvmCodeGenTest::load_from_file(&pool, module_file.c_str(), &codegen);
EXPECT_TRUE(!status.ok());
}
// IR for the generated inner loop
// define void @JittedInnerLoop() {
// entry:
// call void @DebugTrace(i8* inttoptr (i64 18970856 to i8*))
// %0 = load i64* inttoptr (i64 140735197627800 to i64*)
// %1 = add i64 %0, <delta>
// store i64 %1, i64* inttoptr (i64 140735197627800 to i64*)
// ret void
// }
// The random int in there is the address of jitted_counter
Function* CodegenInnerLoop(LlvmCodeGen* codegen, int64_t* jitted_counter, int delta) {
LLVMContext& context = codegen->context();
LlvmCodeGen::LlvmBuilder builder(context);
LlvmCodeGen::FnPrototype fn_prototype(codegen, "JittedInnerLoop", codegen->void_type());
Function* jitted_loop_call = fn_prototype.generate_prototype();
BasicBlock* entry_block = BasicBlock::Create(context, "entry", jitted_loop_call);
builder.SetInsertPoint(entry_block);
codegen->codegen_debug_trace(&builder, "Jitted");
// Store &jitted_counter as a constant.
Value* const_delta = ConstantInt::get(context, APInt(64, delta));
Value* counter_ptr = codegen->cast_ptr_to_llvm_ptr(codegen->get_ptr_type(TYPE_BIGINT),
jitted_counter);
Value* loaded_counter = builder.CreateLoad(counter_ptr);
Value* incremented_value = builder.CreateAdd(loaded_counter, const_delta);
builder.CreateStore(incremented_value, counter_ptr);
builder.CreateRetVoid();
return jitted_loop_call;
}
// This test loads a precompiled IR file (compiled from testdata/llvm/test-loop.cc).
// The test contains two functions, an outer loop function and an inner loop function.
// The outer loop calls the inner loop function.
// The test will
// 1. create a LlvmCodegen object from the precompiled file
// 2. add another function to the module with the same signature as the inner
// loop function.
// 3. Replace the call instruction in the outer loop to a call to the new inner loop
// function.
// 4. Run the loop and make sure the inner loop is called.
// 5. Update the jitted loop in place with another jitted inner loop function
// 6. Run the loop and make sure the updated version is called.
TEST_F(LlvmCodeGenTest, ReplaceFnCall) {
ObjectPool pool;
const char* loop_call_name = "DefaultImplementation";
const char* loop_name = "TestLoop";
typedef void (*TestLoopFn)(int);
string module_file;
PathBuilder::GetFullPath("llvm-ir/test-loop.ir", &module_file);
// Part 1: Load the module and make sure everything is loaded correctly.
scoped_ptr<LlvmCodeGen> codegen;
Status status = LlvmCodeGenTest::load_from_file(&pool, module_file.c_str(), &codegen);
EXPECT_TRUE(codegen.get() != NULL);
EXPECT_TRUE(status.ok());
vector<Function*> functions;
codegen->get_functions(&functions);
EXPECT_EQ(functions.size(), 2);
Function* loop_call = functions[0];
Function* loop = functions[1];
EXPECT_TRUE(loop_call->getName().find(loop_call_name) != string::npos);
EXPECT_TRUE(loop_call->arg_empty());
EXPECT_TRUE(loop->getName().find(loop_name) != string::npos);
EXPECT_EQ(loop->arg_size(), 1);
int scratch_size;
void* original_loop = codegen->jit_function(loop, &scratch_size);
EXPECT_EQ(scratch_size, 0);
EXPECT_TRUE(original_loop != NULL);
TestLoopFn original_loop_fn = reinterpret_cast<TestLoopFn>(original_loop);
original_loop_fn(5);
// Part 2: Generate a new inner loop function.
//
// The c++ version of the code is
// static int64_t* counter;
// void JittedInnerLoop() {
// printf("LLVM Trace: Jitted\n");
// ++*counter;
// }
//
int64_t jitted_counter = 0;
Function* jitted_loop_call = CodegenInnerLoop(codegen.get(), &jitted_counter, 1);
// Part 3: Replace the call instruction to the normal function with a call to the
// jitted one
int num_replaced;
Function* jitted_loop = codegen->replace_call_sites(
loop, false, jitted_loop_call, loop_call_name, &num_replaced);
EXPECT_EQ(num_replaced, 1);
EXPECT_TRUE(codegen->verify_function(jitted_loop));
// Part 4: Call the new loop and verify results
void* new_loop = codegen->jit_function(jitted_loop, &scratch_size);
EXPECT_EQ(scratch_size, 0);
EXPECT_TRUE(new_loop != NULL);
TestLoopFn new_loop_fn = reinterpret_cast<TestLoopFn>(new_loop);
EXPECT_EQ(jitted_counter, 0);
new_loop_fn(5);
EXPECT_EQ(jitted_counter, 5);
new_loop_fn(5);
EXPECT_EQ(jitted_counter, 10);
// Part 5: Generate a new inner loop function and a new loop function in place
Function* jitted_loop_call2 = CodegenInnerLoop(codegen.get(), &jitted_counter, -2);
Function* jitted_loop2 = codegen->replace_call_sites(loop, true, jitted_loop_call2,
loop_call_name, &num_replaced);
EXPECT_EQ(num_replaced, 1);
EXPECT_TRUE(codegen->verify_function(jitted_loop2));
// Part 6: Call the new loop
void* new_loop2 = codegen->jit_function(jitted_loop2, &scratch_size);
EXPECT_EQ(scratch_size, 0);
EXPECT_TRUE(new_loop2 != NULL);
TestLoopFn new_loop_fn2 = reinterpret_cast<TestLoopFn>(new_loop2);
new_loop_fn2(5);
EXPECT_EQ(jitted_counter, 0);
}
// Test function for c++/ir interop for strings. Function will do:
// int StringTest(StringValue* strval) {
// strval->ptr[0] = 'A';
// int len = strval->len;
// strval->len = 1;
// return len;
// }
// Corresponding IR is:
// define i32 @StringTest(%StringValue* %str) {
// entry:
// %str_ptr = getelementptr inbounds %StringValue* %str, i32 0, i32 0
// %ptr = load i8** %str_ptr
// %first_char_ptr = getelementptr i8* %ptr, i32 0
// store i8 65, i8* %first_char_ptr
// %len_ptr = getelementptr inbounds %StringValue* %str, i32 0, i32 1
// %len = load i32* %len_ptr
// store i32 1, i32* %len_ptr
// ret i32 %len
// }
Function* CodegenStringTest(LlvmCodeGen* codegen) {
PointerType* string_val_ptr_type = codegen->get_ptr_type(TYPE_VARCHAR);
EXPECT_TRUE(string_val_ptr_type != NULL);
LlvmCodeGen::FnPrototype prototype(codegen, "StringTest", codegen->get_type(TYPE_INT));
prototype.add_argument(LlvmCodeGen::NamedVariable("str", string_val_ptr_type));
LlvmCodeGen::LlvmBuilder builder(codegen->context());
Value* str = NULL;
Function* interop_fn = prototype.generate_prototype(&builder, &str);
// strval->ptr[0] = 'A'
Value* str_ptr = builder.CreateStructGEP(str, 0, "str_ptr");
Value* ptr = builder.CreateLoad(str_ptr, "ptr");
Value* first_char_offset[] = { codegen->get_int_constant(TYPE_INT, 0) };
Value* first_char_ptr = builder.CreateGEP(ptr, first_char_offset, "first_char_ptr");
builder.CreateStore(codegen->get_int_constant(TYPE_TINYINT, 'A'), first_char_ptr);
// Update and return old len
Value* len_ptr = builder.CreateStructGEP(str, 1, "len_ptr");
Value* len = builder.CreateLoad(len_ptr, "len");
builder.CreateStore(codegen->get_int_constant(TYPE_INT, 1), len_ptr);
builder.CreateRet(len);
return interop_fn;
}
// This test validates that the llvm StringValue struct matches the c++ StringValue
// struct. Just create a simple StringValue struct and make sure the IR can read it
// and modify it.
TEST_F(LlvmCodeGenTest, StringValue) {
ObjectPool pool;
scoped_ptr<LlvmCodeGen> codegen;
Status status = LlvmCodeGen::load_palo_ir(&pool, "test", &codegen);
EXPECT_TRUE(status.ok());
EXPECT_TRUE(codegen.get() != NULL);
string str("Test");
StringValue str_val;
// Call memset to make sure padding bits are zero.
memset(&str_val, 0, sizeof(str_val));
str_val.ptr = const_cast<char*>(str.c_str());
str_val.len = str.length();
Function* string_test_fn = CodegenStringTest(codegen.get());
EXPECT_TRUE(string_test_fn != NULL);
EXPECT_TRUE(codegen->verify_function(string_test_fn));
// Jit compile function
void* jitted_fn = codegen->jit_function(string_test_fn);
EXPECT_TRUE(jitted_fn != NULL);
// Call IR function
typedef int (*TestStringInteropFn)(StringValue*);
TestStringInteropFn fn = reinterpret_cast<TestStringInteropFn>(jitted_fn);
int result = fn(&str_val);
// Validate
EXPECT_EQ(str.length(), result);
EXPECT_EQ('A', str_val.ptr[0]);
EXPECT_EQ(1, str_val.len);
EXPECT_EQ(static_cast<void*>(str_val.ptr), static_cast<const void*>(str.c_str()));
// Validate padding bytes are unchanged
int32_t* bytes = reinterpret_cast<int32_t*>(&str_val);
EXPECT_EQ(1, bytes[2]); // str_val.len
EXPECT_EQ(0, bytes[3]); // padding
}
// Test calling memcpy intrinsic
TEST_F(LlvmCodeGenTest, MemcpyTest) {
ObjectPool pool;
scoped_ptr<LlvmCodeGen> codegen;
Status status = LlvmCodeGen::load_palo_ir(&pool, "test", &codegen);
ASSERT_TRUE(status.ok());
ASSERT_TRUE(codegen.get() != NULL);
LlvmCodeGen::FnPrototype prototype(codegen.get(), "MemcpyTest", codegen->void_type());
prototype.add_argument(LlvmCodeGen::NamedVariable("dest", codegen->ptr_type()));
prototype.add_argument(LlvmCodeGen::NamedVariable("src", codegen->ptr_type()));
prototype.add_argument(LlvmCodeGen::NamedVariable("n", codegen->get_type(TYPE_INT)));
LlvmCodeGen::LlvmBuilder builder(codegen->context());
char src[] = "abcd";
char dst[] = "aaaa";
Value* args[3];
Function* fn = prototype.generate_prototype(&builder, &args[0]);
codegen->codegen_memcpy(&builder, args[0], args[1], sizeof(src));
builder.CreateRetVoid();
fn = codegen->finalize_function(fn);
ASSERT_TRUE(fn != NULL);
void* jitted_fn = codegen->jit_function(fn);
ASSERT_TRUE(jitted_fn != NULL);
typedef void (*TestMemcpyFn)(char*, char*, int64_t);
TestMemcpyFn test_fn = reinterpret_cast<TestMemcpyFn>(jitted_fn);
test_fn(dst, src, 4);
EXPECT_EQ(memcmp(src, dst, 4), 0);
}
// Test codegen for hash
TEST_F(LlvmCodeGenTest, HashTest) {
ObjectPool pool;
// Values to compute hash on
const char* data1 = "test string";
const char* data2 = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
scoped_ptr<LlvmCodeGen> codegen;
Status status = LlvmCodeGen::load_palo_ir(&pool, "test", &codegen);
ASSERT_TRUE(status.ok());
ASSERT_TRUE(codegen.get() != NULL);
bool restore_sse_support = false;
Value* llvm_data1 = codegen->cast_ptr_to_llvm_ptr(codegen->ptr_type(),
const_cast<char*>(data1));
Value* llvm_data2 = codegen->cast_ptr_to_llvm_ptr(codegen->ptr_type(),
const_cast<char*>(data2));
Value* llvm_len1 = codegen->get_int_constant(TYPE_INT, strlen(data1));
Value* llvm_len2 = codegen->get_int_constant(TYPE_INT, strlen(data2));
// Loop to test both the sse4 on/off paths
for (int i = 0; i < 2; ++i) {
uint32_t expected_hash = 0;
expected_hash = HashUtil::Hash(data1, strlen(data1), expected_hash);
expected_hash = HashUtil::Hash(data2, strlen(data2), expected_hash);
expected_hash = HashUtil::Hash(data1, strlen(data1), expected_hash);
// Create a codegen'd function that hashes all the types and returns the results.
// The tuple/values to hash are baked into the codegen for simplicity.
LlvmCodeGen::FnPrototype prototype(codegen.get(), "HashTest",
codegen->get_type(TYPE_INT));
LlvmCodeGen::LlvmBuilder builder(codegen->context());
// Test both byte-size specific hash functions and the generic loop hash function
Function* fn_fixed = prototype.generate_prototype(&builder, NULL);
Function* data1_hash_fn = codegen->get_hash_function(strlen(data1));
Function* data2_hash_fn = codegen->get_hash_function(strlen(data2));
Function* generic_hash_fn = codegen->get_hash_function();
ASSERT_TRUE(data1_hash_fn != NULL);
ASSERT_TRUE(data2_hash_fn != NULL);
ASSERT_TRUE(generic_hash_fn != NULL);
Value* seed = codegen->get_int_constant(TYPE_INT, 0);
seed = builder.CreateCall3(data1_hash_fn, llvm_data1, llvm_len1, seed);
seed = builder.CreateCall3(data2_hash_fn, llvm_data2, llvm_len2, seed);
seed = builder.CreateCall3(generic_hash_fn, llvm_data1, llvm_len1, seed);
builder.CreateRet(seed);
fn_fixed = codegen->finalize_function(fn_fixed);
ASSERT_TRUE(fn_fixed != NULL);
void* jitted_fn = codegen->jit_function(fn_fixed);
ASSERT_TRUE(jitted_fn != NULL);
typedef uint32_t (*TestHashFn)();
TestHashFn test_fn = reinterpret_cast<TestHashFn>(jitted_fn);
uint32_t result = test_fn();
// Validate that the hashes are identical
EXPECT_EQ(result, expected_hash);
if (i == 0 && CpuInfo::is_supported(CpuInfo::SSE4_2)) {
CpuInfo::enable_feature(CpuInfo::SSE4_2, false);
restore_sse_support = true;
LlvmCodeGenTest::clear_hash_fns(codegen.get());
} else {
// System doesn't have sse, no reason to test non-sse path again.
break;
}
}
// Restore hardware feature for next test
CpuInfo::enable_feature(CpuInfo::SSE4_2, restore_sse_support);
}
}
int main(int argc, char** argv) {
palo::CpuInfo::init();
palo::DiskInfo::init();
palo::MemInfo::init();
::testing::InitGoogleTest(&argc, argv);
palo::LlvmCodeGen::initialize_llvm();
return RUN_ALL_TESTS();
}

View File

@ -0,0 +1,45 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifdef IR_COMPILE
struct __float128;
#include "codegen/codegen_anyval_ir.cpp"
#include "exec/aggregation_node_ir.cpp"
#include "exec/hash_join_node_ir.cpp"
#include "exprs/aggregate_functions.cpp"
#include "exprs/cast_functions.cpp"
#include "exprs/conditional_functions_ir.cpp"
#include "exprs/decimal_operators.cpp"
#include "exprs/expr_ir.cpp"
#include "exprs/is_null_predicate.cpp"
#include "exprs/like_predicate.cpp"
#include "exprs/math_functions.cpp"
#include "exprs/operators.cpp"
#include "exprs/string_functions.cpp"
#include "exprs/timestamp_functions.cpp"
#include "exprs/utility_functions.cpp"
#include "runtime/raw_value_ir.cpp"
#include "runtime/string_value_ir.cpp"
#include "udf/udf_ir.cpp"
#include "util/hash_util_ir.cpp"
#else
#error "This file should only be used for cross compiling to IR."
#endif

37
be/src/codegen/palo_ir.h Normal file
View File

@ -0,0 +1,37 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_CODGEN_PALO_IR_H
#define BDG_PALO_BE_SRC_QUERY_CODGEN_PALO_IR_H
#ifdef IR_COMPILE
// For cross compiling to IR, we need functions decorated in specific ways. For
// functions that we will replace with codegen, we need them not inlined (otherwise
// we can't find the function by name). For functions where the non-codegen'd version
// is too long for the compiler to inline, we might still want to inline it since
// the codegen'd version is suitable for inlining.
// In the non-ir case (g++), we will just default to whatever the compiler thought
// best at that optimization setting.
#define IR_NO_INLINE __attribute__((noinline))
#define IR_ALWAYS_INLINE __attribute__((always_inline))
#else
#define IR_NO_INLINE
#define IR_ALWAYS_INLINE
#endif
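// Example usage (an illustrative sketch; these functions are hypothetical):
//   IR_NO_INLINE bool eval_row(TupleRow* row);  // kept out-of-line so codegen can
//                                               // find and replace it by name
//   IR_ALWAYS_INLINE int add_one(int x) { return x + 1; }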
#endif

View File

@ -0,0 +1,33 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_CODEGEN_PALO_IR_DATA_H
#define BDG_PALO_BE_SRC_QUERY_CODEGEN_PALO_IR_DATA_H
#include <cstddef>
/// Header with declarations of Palo IR data. Definitions of the arrays are generated
/// separately.
extern const unsigned char palo_sse_llvm_ir[];
extern const size_t palo_sse_llvm_ir_len;
extern const unsigned char palo_no_sse_llvm_ir[];
extern const size_t palo_no_sse_llvm_ir_len;
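// Illustrative sketch of how this data can be consumed (the real loading path is
// LlvmCodeGen::load_from_memory()): wrap the embedded bitcode in a MemoryBuffer.
//   llvm::MemoryBuffer* buffer = llvm::MemoryBuffer::getMemBuffer(
//       llvm::StringRef(reinterpret_cast<const char*>(palo_sse_llvm_ir),
//                       palo_sse_llvm_ir_len), "palo-sse-ir");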
#endif

View File

@ -0,0 +1,231 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "codegen/subexpr_elimination.h"
#include <fstream>
#include <iostream>
#include <sstream>
#include <boost/thread/mutex.hpp>
#include <llvm/Analysis/Dominators.h>
#include <llvm/Analysis/Passes.h>
#include <llvm/Analysis/InstructionSimplify.h>
#include <llvm/Support/DynamicLibrary.h>
#include <llvm/IRReader/IRReader.h>
#include <llvm/Support/MemoryBuffer.h>
#include <llvm/Support/InstIterator.h>
#include <llvm/Support/NoFolder.h>
#include <llvm/Support/TargetSelect.h>
#include <llvm/Support/raw_ostream.h>
#include <llvm/Support/system_error.h>
#include "llvm/Transforms/IPO.h"
#include <llvm/Transforms/Scalar.h>
#include <llvm/Transforms/Utils/SSAUpdater.h>
#include "common/logging.h"
#include "codegen/subexpr_elimination.h"
#include "palo_ir/palo_ir_names.h"
#include "util/cpu_info.h"
#include "util/path_builder.h"
using llvm::CallInst;
using llvm::BitCastInst;
using llvm::Instruction;
using llvm::LoadInst;
using llvm::StoreInst;
using llvm::Function;
using llvm::Value;
using llvm::DominatorTree;
namespace palo {
SubExprElimination::SubExprElimination(LlvmCodeGen* codegen) : _codegen(codegen) {
}
// Before running the standard llvm optimization passes, first remove redundant calls
// to slotref expression. SlotRefs are more heavyweight due to the null handling that
// is required and after they are inlined, llvm is unable to eliminate the redundant
// inlined code blocks.
// For example:
// select colA + colA would generate an inner loop with 2 calls to the colA slot ref,
// rather than doing subexpression elimination. To handle this, we will:
// 1. inline all call sites in the original function except calls to SlotRefs
// 2. for all call sites to SlotRefs except the first to that SlotRef, replace the
// results from the secondary calls with the result from the first and remove
// the call instruction.
// 3. Inline calls to the SlotRefs (there should only be one for each slot ref).
//
// In the above example, the input function would look something like:
// int ArithmeticAdd(TupleRow* row, bool* is_null) {
// bool lhs_is_null, rhs_is_null;
// int lhs_value = SlotRef(row, &lhs_is_null);
// if (lhs_is_null) { *is_null = true; return 0; }
// int rhs_value = SlotRef(row, &rhs_is_null);
// if (rhs_is_null) { *is_null = true; return 0; }
// *is_null = false; return lhs_value + rhs_value;
// }
// During step 2, we'd substitute the second call to SlotRef with the results from
// the first call.
// int ArithmeticAdd(TupleRow* row, bool* is_null) {
// bool lhs_is_null, rhs_is_null;
// int lhs_value = SlotRef(row, &lhs_is_null);
// if (lhs_is_null) { *is_null = true; return 0; }
// int rhs_value = lhs_value;
// rhs_is_null = lhs_is_null;
// if (rhs_is_null) { *is_null = true; return 0; }
// *is_null = false; return lhs_value + rhs_value;
// }
// And then rely on llvm to finish removing the redundant code, resulting in:
// int ArithmeticAdd(TupleRow* row, bool* is_null) {
// bool lhs_is_null, rhs_is_null;
// int lhs_value = SlotRef(row, &lhs_is_null);
// if (lhs_is_null) { *is_null = true; return 0; }
// *is_null = false; return lhs_value + lhs_value;
// }
// Details on how to do this:
// http://llvm.org/docs/ProgrammersManual.html#replacing-an-instruction-with-another-value
// Step 2 requires more manipulation to ensure the resulting IR is still valid IR.
// The call to the expr returns two things, both of which need to be replaced.
// The value of the function as the return argument and whether or not the result was
// null as a function output argument.
// 1. The return value is trivial since with SSA, it is easy to identify all uses
//    of it. We simply replace the subsequent call instructions with the value.
// 2. For the is_null result ptr, we replace the call to the expr with a store
// instruction of the cached value.
// i.e:
// val1 = Call(is_null_ptr);
// is_null1 = *is_null_ptr
// ...
// val2 = Call(is_null_ptr);
// is_null2 = *is_null_ptr
// Becomes:
// val1 = Call(is_null_ptr);
// is_null1 = *is_null_ptr
// ...
// val2 = val1;
// *is_null_ptr = is_null1;
// is_null2 = *is_null_ptr
// We do this because the is_null ptr is not in SSA form, which makes manipulating
// it directly complex. The above approach exactly preserves the call's behavior,
// including all writes to ptrs. We then rely on the llvm load/store removal pass,
// which removes the redundant loads (this is tricky, since it has to track all
// other instructions that wrote to the ptr, etc).
// When doing the eliminations, we need to consult the dominator tree to make sure
// the instruction we are replacing with dominates the instruction being replaced;
// that is, we need to guarantee the instruction we are replacing with always executes
// before the replacee instruction on all code paths.
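// For example, if the two calls sit in sibling branches:
//   if (cond) { v1 = SlotRef(row, &n1); } else { v2 = SlotRef(row, &n2); }
// neither call dominates the other, so no substitution is performed.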
// TODO: remove all this with expr refactoring. Everything will be SSA form then.
struct CachedExprResult {
// First function call result. Subsequent calls will be replaced with this value
CallInst* result;
// First is null result. Subsequent calls will be replaced with this value.
Instruction* is_null_value;
};
bool SubExprElimination::run(Function* fn) {
// Step 1:
int num_inlined = 0;
do {
// This assumes that all redundant exprs have been registered.
num_inlined = _codegen->inline_call_sites(fn, true);
} while (num_inlined > 0);
// Mapping of (expr eval function, its 'row' arg) to cached result. We want to remove
// redundant calls to the same function with the same argument.
std::map<std::pair<Function*, Value*>, CachedExprResult> cached_slot_ref_results;
// Step 2:
DominatorTree dom_tree;
dom_tree.runOnFunction(*fn);
llvm::inst_iterator fn_end = llvm::inst_end(fn);
llvm::inst_iterator instr_iter = llvm::inst_begin(fn);
// Loop over every instruction in the function.
while (instr_iter != fn_end) {
Instruction* instr = &*instr_iter;
++instr_iter;
// Look for call instructions
if (!CallInst::classof(instr)) {
continue;
}
CallInst* call_instr = reinterpret_cast<CallInst*>(instr);
Function* called_fn = call_instr->getCalledFunction();
if (_codegen->_registered_exprs.find(called_fn) ==
_codegen->_registered_exprs.end()) {
continue;
}
// Found a registered expr function. We generate the IR in a very specific way
// when calling the expr. The call instruction is always followed by loading the
// resulting is_null result. We need to update both.
// TODO: we need to update this to do more analysis since we are relying on a very
// specific code structure to do this.
// Arguments are (row, scratch_buffer, is_null);
DCHECK_EQ(call_instr->getNumArgOperands(), 3);
Value* row_arg = call_instr->getArgOperand(0);
DCHECK(BitCastInst::classof(row_arg));
BitCastInst* row_cast = reinterpret_cast<BitCastInst*>(row_arg);
// Get at the underlying row arg. We need to differentiate between
// call Fn(row1) and call Fn(row2). (identical fns but different input).
row_arg = row_cast->getOperand(0);
instr = &*instr_iter;
++instr_iter;
if (!LoadInst::classof(instr)) {
continue;
}
LoadInst* is_null_value = reinterpret_cast<LoadInst*>(instr);
Value* loaded_ptr = is_null_value->getPointerOperand();
// Subexpr elimination requires the IR to be in a very specific form:
// call SlotRef(row, NULL, is_null_ptr)
// load is_null_ptr
// Since we generate this IR currently, we can enforce this logic in our exprs
// TODO: this should be removed/generalized with expr refactoring
DCHECK_EQ(loaded_ptr, call_instr->getArgOperand(2));
std::pair<Function*, Value*> call_desc = std::make_pair(called_fn, row_arg);
if (cached_slot_ref_results.find(call_desc) == cached_slot_ref_results.end()) {
CachedExprResult cache_entry;
cache_entry.result = call_instr;
cache_entry.is_null_value = is_null_value;
cached_slot_ref_results[call_desc] = cache_entry;
} else {
// Reuse the result.
CachedExprResult& cache_entry = cached_slot_ref_results[call_desc];
if (dom_tree.dominates(cache_entry.result, call_instr)) {
new StoreInst(cache_entry.is_null_value, loaded_ptr, call_instr);
call_instr->replaceAllUsesWith(cache_entry.result);
call_instr->eraseFromParent();
}
}
}
// Step 3:
_codegen->inline_call_sites(fn, false);
return true;
}
}

be/src/codegen/subexpr_elimination.h Normal file
@ -0,0 +1,47 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_CODEGEN_SUBEXPR_ELIMINATION_H
#define BDG_PALO_BE_SRC_QUERY_CODEGEN_SUBEXPR_ELIMINATION_H
#include "common/status.h"
#include "codegen/llvm_codegen.h"
namespace palo {
// Optimization pass to remove redundant exprs.
// TODO: make this into a llvm function pass (i.e. implement FunctionPass interface).
class SubExprElimination {
public:
SubExprElimination(LlvmCodeGen* codegen);
~SubExprElimination() { }
// Perform subexpr elimination on function.
bool run(llvm::Function* function);
private:
// Parent codegen object.
LlvmCodeGen* _codegen;
};
}
#endif

be/src/common/CMakeLists.txt Normal file
@ -0,0 +1,32 @@
# Modifications copyright (C) 2017, Baidu.com, Inc.
# Copyright 2017 The Apache Software Foundation
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# where to put generated libraries
set(LIBRARY_OUTPUT_PATH "${BUILD_DIR}/src/common")
add_library(Common STATIC
daemon.cpp
status.cpp
resource_tls.cpp
logconfig.cpp
configbase.cpp
)
#ADD_BE_TEST(resource_tls_test)

be/src/common/atomic.h Normal file
@ -0,0 +1,209 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_ATOMIC_H
#define BDG_PALO_BE_SRC_COMMON_ATOMIC_H
#include <algorithm>
#include "common/compiler_util.h"
#include "gutil/atomicops.h"
#include "gutil/macros.h"
namespace palo {
class AtomicUtil {
public:
// Issues an instruction to have the CPU wait; this generates less bus
// traffic etc. than just spinning.
// For example:
// while (1);
// should be:
// while (1) CpuWait();
static ALWAYS_INLINE void cpu_wait() {
asm volatile("pause\n": : :"memory");
}
/// Provides "barrier" semantics (see below) without a memory access.
static ALWAYS_INLINE void memory_barrier() {
__sync_synchronize();
}
/// Provides a compiler barrier. The compiler is not allowed to reorder memory
/// accesses across this (but the CPU can). This generates no instructions.
static ALWAYS_INLINE void compiler_barrier() {
__asm__ __volatile__("" : : : "memory");
}
};
// Wrapper for atomic integers. This should be switched to C++11 <atomic>
// when we can switch.
// This class overloads operators to behave like a regular integer type
// but all operators and functions are thread safe.
template<typename T>
class AtomicInt {
public:
AtomicInt(T initial) : _value(initial) {}
AtomicInt() : _value(0) {}
operator T() const { return _value; }
AtomicInt& operator=(T val) {
_value = val;
return *this;
}
AtomicInt& operator=(const AtomicInt<T>& val) {
_value = val._value;
return *this;
}
AtomicInt& operator+=(T delta) {
__sync_add_and_fetch(&_value, delta);
return *this;
}
AtomicInt& operator-=(T delta) {
__sync_add_and_fetch(&_value, -delta);
return *this;
}
AtomicInt& operator|=(T v) {
__sync_or_and_fetch(&_value, v);
return *this;
}
AtomicInt& operator&=(T v) {
__sync_and_and_fetch(&_value, v);
return *this;
}
// These define the pre-increment/pre-decrement (i.e. ++value, --value) operators.
AtomicInt& operator++() {
__sync_add_and_fetch(&_value, 1);
return *this;
}
AtomicInt& operator--() {
__sync_add_and_fetch(&_value, -1);
return *this;
}
// These are post-increment/post-decrement, which need to return a new object.
AtomicInt<T> operator++(int) {
T prev = __sync_fetch_and_add(&_value, 1);
return AtomicInt<T>(prev);
}
AtomicInt<T> operator--(int) {
T prev = __sync_fetch_and_add(&_value, -1);
return AtomicInt<T>(prev);
}
// Safe read of the value
T read() {
return __sync_fetch_and_add(&_value, 0);
}
/// Atomic load with "acquire" memory-ordering semantic.
ALWAYS_INLINE T load() const {
return base::subtle::Acquire_Load(&_value);
}
/// Atomic store with "release" memory-ordering semantic.
ALWAYS_INLINE void store(T x) {
base::subtle::Release_Store(&_value, x);
}
/// Atomic add with "barrier" memory-ordering semantic. Returns the new value.
ALWAYS_INLINE T add(T x) {
return base::subtle::Barrier_AtomicIncrement(&_value, x);
}
// Increments by delta (i.e. += delta) and returns the new val
T update_and_fetch(T delta) {
return __sync_add_and_fetch(&_value, delta);
}
// Increment by delta and returns the old val
T fetch_and_update(T delta) {
return __sync_fetch_and_add(&_value, delta);
}
// Updates the int to 'value' if value is larger
void update_max(T value) {
while (true) {
T old_value = _value;
T new_value = std::max(old_value, value);
if (LIKELY(compare_and_swap(old_value, new_value))) {
break;
}
}
}
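// Updates the int to 'value' if value is smaller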
void update_min(T value) {
while (true) {
T old_value = _value;
T new_value = std::min(old_value, value);
if (LIKELY(compare_and_swap(old_value, new_value))) {
break;
}
}
}
// Returns true if the atomic compare-and-swap was successful.
// If _value == old_val, set _value = new_val and return true; otherwise return false.
bool compare_and_swap(T old_val, T new_val) {
return __sync_bool_compare_and_swap(&_value, old_val, new_val);
}
// Returns the content of _value before the operation.
// If the returned value == old_val, then the atomic compare-and-swap was successful.
T compare_and_swap_val(T old_val, T new_val) {
return __sync_val_compare_and_swap(&_value, old_val, new_val);
}
// Atomically updates _value with new_val. Returns the old _value.
T swap(const T& new_val) {
return __sync_lock_test_and_set(&_value, new_val);
}
private:
T _value;
};
/// Supported atomic types. Use these types rather than referring to AtomicInt<>
/// directly.
typedef AtomicInt<int32_t> AtomicInt32;
typedef AtomicInt<int64_t> AtomicInt64;
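/// A minimal usage sketch (variable names are illustrative):
///   AtomicInt32 num_rows;                // zero-initialized
///   num_rows += 10;                      // atomic add
///   num_rows.update_max(42);             // CAS loop keeps the larger value
///   int32_t snapshot = num_rows.read();  // safe read via fetch-and-add(0)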
/// Atomic pointer. Operations have the same semantics as AtomicInt.
template<typename T>
class AtomicPtr {
public:
AtomicPtr(T* initial = nullptr) : _ptr(reinterpret_cast<intptr_t>(initial)) {}
/// Atomic load with "acquire" memory-ordering semantic.
inline T* load() const { return reinterpret_cast<T*>(_ptr.load()); }
/// Atomic store with "release" memory-ordering semantic.
inline void store(T* val) { _ptr.store(reinterpret_cast<intptr_t>(val)); }
private:
AtomicInt<intptr_t> _ptr;
};
} // end namespace palo
#endif // BDG_PALO_BE_SRC_COMMON_ATOMIC_H

be/src/common/compiler_util.h Normal file
@ -0,0 +1,49 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_COMMON_COMPILER_UTIL_H
#define BDG_PALO_BE_SRC_COMMON_COMMON_COMPILER_UTIL_H
// Compiler hints that a branch is likely or unlikely to
// be taken. Taken from the "What Every Programmer Should Know
// About Memory" paper.
// example: if (LIKELY(size > 0)) { ... }
// example: if (UNLIKELY(!status.ok())) { ... }
#ifdef LIKELY
#undef LIKELY
#endif
#ifdef UNLIKELY
#undef UNLIKELY
#endif
#define LIKELY(expr) __builtin_expect(!!(expr), 1)
#define UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#define PREFETCH(addr) __builtin_prefetch(addr)
/// Force inlining. The 'inline' keyword is treated by most compilers as a hint,
/// not a command. This should be used sparingly for cases when either the function
/// needs to be inlined for a specific reason or the compiler's heuristics make a bad
/// decision, e.g. not inlining a small function on a hot path.
#define ALWAYS_INLINE __attribute__((always_inline))
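// A minimal sketch ('add_one' and the loop are illustrative, not defined here):
//   static ALWAYS_INLINE int add_one(int x) { return x + 1; }
//   // In a strided loop, prefetch a few elements ahead of the current index:
//   //   PREFETCH(&data[i + 8]); process(data[i]);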
#endif

be/src/common/config.h Normal file
@ -0,0 +1,319 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_CONFIG_H
#define BDG_PALO_BE_SRC_COMMON_CONFIG_H
#include "configbase.h"
namespace palo {
namespace config {
// cluster id
CONF_Int32(cluster_id, "-1");
// port on which the backend (BE) internal service is exported
CONF_Int32(be_port, "9060");
CONF_Int32(be_rpc_port, "10060");
////
//// tcmalloc gc parameter
////
// min memory for tcmalloc; when the memory in use is smaller than this,
// free memory is not returned to the OS
CONF_Int64(tc_use_memory_min, "10737418240");
// the percentage of allocated memory allowed to remain free, [0-100]
CONF_Int64(tc_free_memory_rate, "20");
// process memory limit specified as number of bytes
// ('<int>[bB]?'), megabytes ('<float>[mM]'), gigabytes ('<float>[gG]'),
// or percentage of the physical memory ('<int>%').
// defaults to bytes if no unit is given
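// e.g. "8589934592" (bytes), "8192M", "8.5G", or "80%"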
CONF_String(mem_limit, "80%");
// the port used by the heartbeat service
CONF_Int32(heartbeat_service_port, "9050");
// the number of heartbeat service threads
CONF_Int32(heartbeat_service_thread_count, "1");
// the number of threads used to create tables
CONF_Int32(create_table_worker_count, "3");
// the number of threads used to drop tables
CONF_Int32(drop_table_worker_count, "3");
// the number of threads used for normal-priority batch load
CONF_Int32(push_worker_count_normal_priority, "3");
// the number of threads used for high-priority batch load
CONF_Int32(push_worker_count_high_priority, "3");
// the number of threads used to delete data
CONF_Int32(delete_worker_count, "3");
// the number of threads used to alter tables
CONF_Int32(alter_table_worker_count, "3");
// the number of threads used to clone
CONF_Int32(clone_worker_count, "3");
// the number of threads used to migrate between storage mediums
CONF_Int32(storage_medium_migrate_count, "1");
// the number of threads used to cancel data deletion
CONF_Int32(cancel_delete_data_worker_count, "3");
// the number of threads used to check consistency
CONF_Int32(check_consistency_worker_count, "1");
// the number of threads used to upload
CONF_Int32(upload_worker_count, "3");
// the number of threads used to restore
CONF_Int32(restore_worker_count, "3");
// the number of threads used to make snapshots
CONF_Int32(make_snapshot_worker_count, "5");
// the number of threads used to release snapshots
CONF_Int32(release_snapshot_worker_count, "5");
// the interval (seconds) at which the agent reports task signatures to dm
CONF_Int32(report_task_interval_seconds, "10");
// the interval (seconds) at which the agent reports disk state to dm
CONF_Int32(report_disk_state_interval_seconds, "600");
// the interval (seconds) at which the agent reports olap tables to dm
CONF_Int32(report_olap_table_interval_seconds, "600");
// the timeout(seconds) for alter table
CONF_Int32(alter_table_timeout_seconds, "86400");
// the timeout(seconds) for make snapshot
CONF_Int32(make_snapshot_timeout_seconds, "600");
// the timeout(seconds) for release snapshot
CONF_Int32(release_snapshot_timeout_seconds, "600");
// the max download speed(KB/s)
CONF_Int32(max_download_speed_kbps, "50000");
// download low speed limit(KB/s)
CONF_Int32(download_low_speed_limit_kbps, "50");
// download low speed time(seconds)
CONF_Int32(download_low_speed_time, "300");
// curl verbose mode
CONF_Int64(curl_verbose_mode, "1");
// seconds to sleep between each check of table status
CONF_Int32(check_status_sleep_time_seconds, "10");
// sleep time for one second
CONF_Int32(sleep_one_second, "1");
// sleep time for five seconds
CONF_Int32(sleep_five_seconds, "5");
// trans file tools dir
CONF_String(trans_file_tool_path, "${PALO_HOME}/tools/trans_file_tool/trans_files.sh");
// agent tmp dir
CONF_String(agent_tmp_dir, "${PALO_HOME}/tmp");
// log dir
CONF_String(sys_log_dir, "${PALO_HOME}/log");
// INFO, WARNING, ERROR, FATAL
CONF_String(sys_log_level, "INFO");
// TIME-DAY, TIME-HOUR, SIZE-MB-nnn
CONF_String(sys_log_roll_mode, "SIZE-MB-1024");
// log roll num
CONF_Int32(sys_log_roll_num, "10");
// verbose log
CONF_Strings(sys_log_verbose_modules, "");
// log buffer level
CONF_String(log_buffer_level, "");
// Pull load task dir
CONF_String(pull_load_task_dir, "${PALO_HOME}/var/pull_load");
// the maximum number of bytes to display on the debug webserver's log page
CONF_Int64(web_log_bytes, "1048576");
// number of threads available to serve backend execution requests
CONF_Int32(be_service_threads, "64");
// key=value pair of default query options for Palo, separated by ','
CONF_String(default_query_options, "");
// If non-zero, Palo will output memory usage every log_mem_usage_interval'th fragment completion.
CONF_Int32(log_mem_usage_interval, "0");
// if non-empty, enable heap profiling and output to specified directory.
CONF_String(heap_profile_dir, "");
// cgroups allocated for palo
CONF_String(palo_cgroups, "");
// Controls the number of threads to run work per core. It's common to pick 2x
// or 3x the number of cores. This keeps the cores busy without causing excessive
// thrashing.
CONF_Int32(num_threads_per_core, "3");
// if true, compresses tuple data in Serialize
CONF_Bool(compress_rowbatches, "true");
// serialize and deserialize each returned row batch
CONF_Bool(serialize_batch, "false");
// interval between profile reports; in seconds
CONF_Int32(status_report_interval, "5");
// Local directory to copy UDF libraries from HDFS into
CONF_String(local_library_dir, "${UDF_RUNTIME_DIR}");
// size of the olap scanner thread pool
CONF_Int32(palo_scanner_thread_pool_thread_num, "48");
// size of the olap scanner thread pool queue
CONF_Int32(palo_scanner_thread_pool_queue_size, "102400");
// size of the etl thread pool
CONF_Int32(etl_thread_pool_size, "8");
// size of the etl thread pool queue
CONF_Int32(etl_thread_pool_queue_size, "256");
// port on which to run Palo test backend
CONF_Int32(port, "20001");
// default thrift client connect timeout(in seconds)
CONF_Int32(thrift_connect_timeout_seconds, "3");
// max row count for a single scan range
CONF_Int32(palo_scan_range_row_count, "524288");
// size of scanner queue between scanner thread and compute thread
CONF_Int32(palo_scanner_queue_size, "1024");
// single read execute fragment row size
CONF_Int32(palo_scanner_row_num, "16384");
// number of max scan keys
CONF_Int32(palo_max_scan_key_num, "1024");
// return_row / total_row
CONF_Int32(palo_max_pushdown_conjuncts_return_rate, "90");
// (Advanced) Maximum size of per-query receive-side buffer
CONF_Int32(exchg_node_buffer_size_bytes, "10485760");
// insertion sort threshold for the sorter
CONF_Int32(insertion_threadhold, "16");
// the size of each block the sorter allocates
CONF_Int32(sorter_block_size, "8388608");
// push_write_mbytes_per_sec
CONF_Int32(push_write_mbytes_per_sec, "10");
CONF_Int32(base_expansion_write_mbytes_per_sec, "5");
CONF_Int64(column_dictionary_key_ration_threshold, "0");
CONF_Int64(column_dictionary_key_size_threshold, "0");
// if true, output IR after optimization passes
CONF_Bool(dump_ir, "false");
// if set, saves the generated IR to the output file.
CONF_String(module_output, "");
// memory limit per thread for schema change, unit: GB
CONF_Int32(memory_limiation_per_thread_for_schema_change, "2");
CONF_Int64(max_unpacked_row_block_size, "104857600");
CONF_Int32(file_descriptor_cache_clean_interval, "3600");
CONF_Int32(base_expansion_trigger_interval, "1");
CONF_Int32(cumulative_check_interval, "1");
CONF_Int32(disk_stat_monitor_interval, "5");
CONF_Int32(unused_index_monitor_interval, "30");
CONF_String(storage_root_path, "${PALO_HOME}/storage");
CONF_Int32(min_percentage_of_error_disk, "50");
CONF_Int32(default_num_rows_per_data_block, "1024");
CONF_Int32(default_num_rows_per_column_file_block, "1024");
CONF_Int32(max_tablet_num_per_shard, "1024");
// garbage sweep policy
CONF_Int32(max_garbage_sweep_interval, "86400");
CONF_Int32(min_garbage_sweep_interval, "200");
CONF_Int32(snapshot_expire_time_sec, "864000");
// This is only a suggested value; when disk space is insufficient, files under trash may be kept for a shorter time than this.
CONF_Int32(trash_file_expire_time_sec, "259200");
CONF_Int32(disk_capacity_insufficient_percentage, "90");
// check row counts for BE/CE and schema change; true enables the check, false disables it
CONF_Bool(row_nums_check, "true");
// be policy
CONF_Int32(base_expansion_thread_num, "1");
CONF_Int64(be_policy_start_time, "20");
CONF_Int64(be_policy_end_time, "7");
// file descriptor cache; by default, caches 30720 descriptors
CONF_Int32(file_descriptor_cache_capacity, "30720");
CONF_Int64(index_stream_cache_capacity, "10737418240");
CONF_Int64(max_packed_row_block_size, "20971520");
CONF_Int32(cumulative_write_mbytes_per_sec, "100");
CONF_Int64(ce_policy_delta_files_number, "5");
// ce policy: max delta file size, unit: B
CONF_Int32(cumulative_thread_num, "1");
CONF_Int64(ce_policy_max_delta_file_size, "104857600");
CONF_Int64(be_policy_cumulative_files_number, "5");
CONF_Double(be_policy_cumulative_base_ratio, "0.3");
CONF_Int64(be_policy_be_interval_seconds, "604800");
CONF_Int32(cumulative_source_overflow_ratio, "5");
CONF_Int32(delete_delta_expire_time, "1440");
// Port to start debug webserver on
CONF_Int32(webserver_port, "8040");
// Interface to start debug webserver on. If blank, webserver binds to 0.0.0.0
CONF_String(webserver_interface, "");
CONF_String(webserver_doc_root, "${PALO_HOME}");
// If true, webserver may serve static files from the webserver_doc_root
CONF_Bool(enable_webserver_doc_root, "true");
// The number of times to retry connecting to an RPC server. If zero or less,
// connections will be retried until successful
CONF_Int32(rpc_retry_times, "10");
// The interval, in ms, between retrying connections to an RPC server
CONF_Int32(rpc_retry_interval_ms, "30000");
// the number of rpc reactor threads
CONF_Int32(rpc_reactor_threads, "10");
// Period to update rate counters and sampling counters in ms.
CONF_Int32(periodic_counter_update_period_ms, "500");
// Used for mini Load
CONF_Int64(load_data_reserve_hours, "24");
CONF_Int64(mini_load_max_mb, "2048");
// Fragment thread pool
CONF_Int32(fragment_pool_thread_num, "64");
CONF_Int32(fragment_pool_queue_size, "1024");
// for cast
CONF_Bool(cast, "true");
// Spill to disk when query
// Writable scratch directories, separated by ";"
CONF_String(query_scratch_dirs, "${PALO_HOME}");
// Control the number of disks on the machine. If 0, this comes from the system settings.
CONF_Int32(num_disks, "0");
// The maximum number of the threads per disk is also the max queue depth per disk.
CONF_Int32(num_threads_per_disk, "0");
// The read size is the size of the reads sent to os.
// There is a trade-off between latency and throughput: we try to keep disks busy
// without introducing seeks. The literature seems to agree that with 8 MB reads,
// random io and sequential io perform similarly.
CONF_Int32(read_size, "8388608"); // 8 * 1024 * 1024, Read Size (in bytes)
CONF_Int32(min_buffer_size, "1024"); // 1024, The minimum read buffer size (in bytes)
// For each io buffer size, the maximum number of buffers the IoMgr will hold onto.
// With 1024B through 8MB buffers, this is up to ~2GB of buffers.
CONF_Int32(max_free_io_buffers, "128");
CONF_Bool(disable_mem_pools, "false");
// The probing algorithm of partitioned hash table.
// Enable quadratic probing hash table
CONF_Bool(enable_quadratic_probing, "false");
// for pprof
CONF_String(pprof_profile_dir, "${PALO_HOME}/log");
// for partition
CONF_Bool(enable_partitioned_hash_join, "false");
CONF_Bool(enable_partitioned_aggregation, "false");
// for kudu
// the maximum size of the row batch queue, for Kudu scanners
CONF_Int32(kudu_max_row_batches, "0");
// the period at which Kudu scanners should send keep-alive requests to the
// tablet server to ensure that scanners do not time out
// 15 * 1000 * 1000
CONF_Int32(kudu_scanner_keep_alive_period_us, "15000000");
// (Advanced) Sets the Kudu scan ReadMode. Supported Kudu read modes are
// READ_LATEST and READ_AT_SNAPSHOT. Invalid values result in using READ_LATEST.
CONF_String(kudu_read_mode, "READ_LATEST");
// whether to pick only leader replicas, for test purposes only
CONF_Bool(pick_only_leaders_for_tests, "false");
// the period (seconds) at which Kudu scanners should send keep-alive requests
// to the tablet server to ensure that scanners do not time out
CONF_Int32(kudu_scanner_keep_alive_period_sec, "15");
CONF_Int32(kudu_operation_timeout_ms, "5000");
// if true, Kudu features will be disabled
CONF_Bool(disable_kudu, "false");
} // namespace config
} // namespace palo
#endif // BDG_PALO_BE_SRC_COMMON_CONFIG_H

be/src/common/configbase.cpp Normal file
@ -0,0 +1,278 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include <algorithm>
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#define __IN_CONFIGBASE_CPP__
#include "common/config.h"
#undef __IN_CONFIGBASE_CPP__
namespace palo {
namespace config {
std::list<Register::Field>* Register::_s_fieldlist = NULL;
std::map<std::string, std::string>* confmap = NULL;
Properties props;
// load conf file
bool Properties::load(const char* filename) {
// if filename is null, use the empty props
if (filename == 0) {
return true;
}
// open the conf file
std::ifstream input(filename);
if (!input.is_open()) {
std::cerr << "config::load() failed to open the file:" << filename << std::endl;
return false;
}
// load properties
std::string line;
std::string key;
std::string value;
line.reserve(512);
while (input) {
// read one line at a time
std::getline(input, line);
// remove left and right spaces
trim(line);
// ignore comments
if (line.empty() || line[0] == '#') {
continue;
}
// read key and value
splitkv(line, key, value);
trim(key);
trim(value);
// insert into propmap
propmap[key] = value;
}
// close the conf file
input.close();
return true;
}
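// An example conf file accepted by load() (illustrative contents):
//   # comments and blank lines are skipped
//   be_port = 9060
//   sys_log_dir = ${PALO_HOME}/log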
template <typename T>
bool Properties::get(const char* key, const char* defstr, T& retval) const {
std::map<std::string, std::string>::const_iterator it = propmap.find(std::string(key));
std::string valstr = it != propmap.end() ? it->second: std::string(defstr);
trim(valstr);
if (!replaceenv(valstr)) {
return false;
}
return strtox(valstr, retval);
}
template <typename T>
bool Properties::strtox(const std::string& valstr, std::vector<T>& retval) {
std::stringstream ss(valstr);
std::string item;
T t;
while (std::getline(ss, item, ',')) {
if (!strtox(trim(item), t)) {
return false;
}
retval.push_back(t);
}
return true;
}
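// e.g. for a std::vector<int32_t> field, "1, 2,3" parses to {1, 2, 3};
// each comma-separated item is trimmed before conversion.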
const std::map<std::string, std::string>& Properties::getmap() const {
return propmap;
}
// trim string
std::string& Properties::trim(std::string& s) {
// rtrim
s.erase(std::find_if(s.rbegin(), s.rend(), std::not1(std::ptr_fun<int, int>(std::isspace))).base(), s.end());
// ltrim
s.erase(s.begin(), std::find_if(s.begin(), s.end(), std::not1(std::ptr_fun<int, int>(std::isspace))));
return s;
}
// split string by '='
void Properties::splitkv(const std::string& s, std::string& k, std::string& v) {
const char sep = '=';
std::string::size_type start = 0;
std::string::size_type end = 0;
if ((end = s.find(sep, start)) != std::string::npos) {
k = s.substr(start, end - start);
v = s.substr(end + 1);
} else {
k = s;
v = "";
}
}
// replace env variables
bool Properties::replaceenv(std::string& s) {
std::size_t pos = 0;
std::size_t start = 0;
while ((start = s.find("${", pos)) != std::string::npos) {
std::size_t end = s.find("}", start + 2);
if (end == std::string::npos) {
return false;
}
std::string envkey = s.substr(start + 2, end - start - 2);
const char* envval = std::getenv(envkey.c_str());
if (envval == NULL) {
return false;
}
s.erase(start, end - start + 1);
s.insert(start, envval);
pos = start + strlen(envval);
}
return true;
}
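// e.g. with PALO_HOME=/home/palo, "${PALO_HOME}/log" expands to
// "/home/palo/log"; an unset variable or an unterminated "${" fails the load.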
bool Properties::strtox(const std::string& valstr, bool& retval) {
if (valstr.compare("true") == 0) {
retval = true;
} else if (valstr.compare("false") == 0) {
retval = false;
} else {
return false;
}
return true;
}
template<typename T>
bool Properties::strtointeger(const std::string& valstr, T& retval) {
if (valstr.length() == 0) {
return false; // empty-string is only allowed for string type.
}
char* end;
errno = 0;
const char* valcstr = valstr.c_str();
int64_t ret64 = strtoll(valcstr, &end, 10);
if (errno || end != valcstr + strlen(valcstr)) {
return false; // bad parse
}
retval = static_cast<T>(ret64);
if (retval != ret64) {
return false;
}
return true;
}
bool Properties::strtox(const std::string& valstr, int16_t& retval) {
return strtointeger(valstr, retval);
}
bool Properties::strtox(const std::string& valstr, int32_t& retval) {
return strtointeger(valstr, retval);
}
bool Properties::strtox(const std::string& valstr, int64_t& retval) {
return strtointeger(valstr, retval);
}
bool Properties::strtox(const std::string& valstr, double& retval) {
if (valstr.length() == 0) {
return false; // empty-string is only allowed for string type.
}
char* end = NULL;
errno = 0;
const char* valcstr = valstr.c_str();
retval = strtod(valcstr, &end);
if (errno || end != valcstr + strlen(valcstr)) {
return false; // bad parse
}
return true;
}
bool Properties::strtox(const std::string& valstr, std::string& retval) {
retval = valstr;
return true;
}
template<typename T>
std::ostream& operator<< (std::ostream& out, const std::vector<T>& v) {
size_t last = v.size() - 1;
for (size_t i = 0; i < v.size(); ++i) {
out << v[i];
if (i != last) {
out << ", ";
}
}
return out;
}
#define SET_FIELD(FIELD, TYPE, FILL_CONFMAP)\
if (strcmp((FIELD).type, #TYPE) == 0) {\
if (!props.get((FIELD).name, (FIELD).defval, *reinterpret_cast<TYPE*>((FIELD).storage))) {\
std::cerr << "config field error: " << (FIELD).name << std::endl;\
return false;\
}\
if (FILL_CONFMAP) {\
std::ostringstream oss;\
oss << (*reinterpret_cast<TYPE*>((FIELD).storage));\
(*confmap)[(FIELD).name] = oss.str();\
}\
continue;\
}
// init conf fields
bool init(const char* filename, bool fillconfmap) {
// load properties file
if (!props.load(filename)) {
return false;
}
// fill confmap ?
if (fillconfmap && confmap == NULL) {
confmap = new std::map<std::string, std::string>();
}
// set conf fields
for (std::list<Register::Field>::iterator it = Register::_s_fieldlist->begin();
it != Register::_s_fieldlist->end(); ++it) {
SET_FIELD(*it, bool, fillconfmap);
SET_FIELD(*it, int16_t, fillconfmap);
SET_FIELD(*it, int32_t, fillconfmap);
SET_FIELD(*it, int64_t, fillconfmap);
SET_FIELD(*it, double, fillconfmap);
SET_FIELD(*it, std::string, fillconfmap);
SET_FIELD(*it, std::vector<bool>, fillconfmap);
SET_FIELD(*it, std::vector<int16_t>, fillconfmap);
SET_FIELD(*it, std::vector<int32_t>, fillconfmap);
SET_FIELD(*it, std::vector<int64_t>, fillconfmap);
SET_FIELD(*it, std::vector<double>, fillconfmap);
SET_FIELD(*it, std::vector<std::string>, fillconfmap);
}
return true;
}
} // namespace config
} // namespace palo

be/src/common/configbase.h Normal file
@ -0,0 +1,128 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_CONFIGBASE_H
#define BDG_PALO_BE_SRC_COMMON_CONFIGBASE_H
#include <stdint.h>
#include <list>
#include <map>
#include <vector>
namespace palo {
namespace config {
class Register {
public:
struct Field {
const char* type;
const char* name;
void* storage;
const char* defval;
Field(const char* ftype, const char* fname, void* fstorage, const char* fdefval) :
type(ftype),
name(fname),
storage(fstorage),
defval(fdefval) {}
};
public:
static std::list<Field>* _s_fieldlist;
public:
Register(const char* ftype, const char* fname, void* fstorage, const char* fdefval) {
if (_s_fieldlist == NULL) {
_s_fieldlist = new std::list<Field>();
}
Field field(ftype, fname, fstorage, fdefval);
_s_fieldlist->push_back(field);
}
};
#define DEFINE_FIELD(FIELD_TYPE, FIELD_NAME, FIELD_DEFAULT)\
FIELD_TYPE FIELD_NAME;\
static Register reg_##FIELD_NAME(#FIELD_TYPE, #FIELD_NAME, &FIELD_NAME, FIELD_DEFAULT);
#define DECLARE_FIELD(FIELD_TYPE, FIELD_NAME) extern FIELD_TYPE FIELD_NAME;
#ifdef __IN_CONFIGBASE_CPP__
#define CONF_Bool(name, defaultstr) DEFINE_FIELD(bool, name, defaultstr)
#define CONF_Int16(name, defaultstr) DEFINE_FIELD(int16_t, name, defaultstr)
#define CONF_Int32(name, defaultstr) DEFINE_FIELD(int32_t, name, defaultstr)
#define CONF_Int64(name, defaultstr) DEFINE_FIELD(int64_t, name, defaultstr)
#define CONF_Double(name, defaultstr) DEFINE_FIELD(double, name, defaultstr)
#define CONF_String(name, defaultstr) DEFINE_FIELD(std::string, name, defaultstr)
#define CONF_Bools(name, defaultstr) DEFINE_FIELD(std::vector<bool>, name, defaultstr)
#define CONF_Int16s(name, defaultstr) DEFINE_FIELD(std::vector<int16_t>, name, defaultstr)
#define CONF_Int32s(name, defaultstr) DEFINE_FIELD(std::vector<int32_t>, name, defaultstr)
#define CONF_Int64s(name, defaultstr) DEFINE_FIELD(std::vector<int64_t>, name, defaultstr)
#define CONF_Doubles(name, defaultstr) DEFINE_FIELD(std::vector<double>, name, defaultstr)
#define CONF_Strings(name, defaultstr) DEFINE_FIELD(std::vector<std::string>, name, defaultstr)
#else
#define CONF_Bool(name, defaultstr) DECLARE_FIELD(bool, name)
#define CONF_Int16(name, defaultstr) DECLARE_FIELD(int16_t, name)
#define CONF_Int32(name, defaultstr) DECLARE_FIELD(int32_t, name)
#define CONF_Int64(name, defaultstr) DECLARE_FIELD(int64_t, name)
#define CONF_Double(name, defaultstr) DECLARE_FIELD(double, name)
#define CONF_String(name, defaultstr) DECLARE_FIELD(std::string, name)
#define CONF_Bools(name, defaultstr) DECLARE_FIELD(std::vector<bool>, name)
#define CONF_Int16s(name, defaultstr) DECLARE_FIELD(std::vector<int16_t>, name)
#define CONF_Int32s(name, defaultstr) DECLARE_FIELD(std::vector<int32_t>, name)
#define CONF_Int64s(name, defaultstr) DECLARE_FIELD(std::vector<int64_t>, name)
#define CONF_Doubles(name, defaultstr) DECLARE_FIELD(std::vector<double>, name)
#define CONF_Strings(name, defaultstr) DECLARE_FIELD(std::vector<std::string>, name)
#endif
class Properties {
public:
bool load(const char* filename);
template<typename T>
bool get(const char* key, const char* defstr, T& retval) const;
const std::map<std::string, std::string>& getmap() const;
private:
template <typename T>
static bool strtox(const std::string& valstr, std::vector<T>& retval);
template<typename T>
static bool strtointeger(const std::string& valstr, T& retval);
static bool strtox(const std::string& valstr, bool& retval);
static bool strtox(const std::string& valstr, int16_t& retval);
static bool strtox(const std::string& valstr, int32_t& retval);
static bool strtox(const std::string& valstr, int64_t& retval);
static bool strtox(const std::string& valstr, double& retval);
static bool strtox(const std::string& valstr, std::string& retval);
static std::string& trim(std::string& s);
static void splitkv(const std::string& s, std::string& k, std::string& v);
static bool replaceenv(std::string& s);
private:
std::map<std::string, std::string> propmap;
};
extern Properties props;
extern std::map<std::string, std::string>* confmap;
bool init(const char* filename, bool fillconfmap = false);
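// Usage sketch ('my_port' and the conf path are illustrative, not defined in
// this codebase):
//   CONF_Int32(my_port, "9999");               // field with default "9999"
//   if (!palo::config::init("conf/be.conf", true)) { /* bad config file */ }
//   // config::my_port now holds the parsed (or default) value; with
//   // fillconfmap == true, (*confmap)["my_port"] holds its string form.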
} // namespace config
} // namespace palo
#endif // BDG_PALO_BE_SRC_COMMON_CONFIGBASE_H

be/src/common/daemon.cpp Normal file
@ -0,0 +1,110 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "common/daemon.h"
#include <gperftools/malloc_extension.h>
#include "util/cpu_info.h"
#include "util/debug_util.h"
#include "util/disk_info.h"
#include "util/logging.h"
#include "util/mem_info.h"
#include "util/network_util.h"
#include "util/thrift_util.h"
#include "runtime/lib_cache.h"
#include "exprs/operators.h"
#include "exprs/is_null_predicate.h"
#include "exprs/like_predicate.h"
#include "exprs/compound_predicate.h"
#include "exprs/new_in_predicate.h"
#include "exprs/string_functions.h"
#include "exprs/cast_functions.h"
#include "exprs/math_functions.h"
#include "exprs/encryption_functions.h"
#include "exprs/timestamp_functions.h"
#include "exprs/decimal_operators.h"
#include "exprs/utility_functions.h"
#include "exprs/json_functions.h"
#include "exprs/hll_hash_function.h"
namespace palo {
void* tcmalloc_gc_thread(void* dummy) {
while (1) {
sleep(10);
size_t used_size = 0;
size_t free_size = 0;
#ifndef ADDRESS_SANITIZER
MallocExtension::instance()->GetNumericProperty("generic.current_allocated_bytes", &used_size);
MallocExtension::instance()->GetNumericProperty("tcmalloc.pageheap_free_bytes", &free_size);
#endif
size_t alloc_size = used_size + free_size;
if (alloc_size > config::tc_use_memory_min) {
size_t max_free_size = alloc_size * config::tc_free_memory_rate / 100;
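// Worked example (illustrative numbers): with tc_use_memory_min = 10GB and
// tc_free_memory_rate = 20, if 8GB are in use and 4GB are free, then
// alloc_size = 12GB > 10GB and max_free_size = 2.4GB, so ~1.6GB of free
// memory is released back to the OS below.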
#ifndef ADDRESS_SANITIZER
if (free_size > max_free_size) {
MallocExtension::instance()->ReleaseToSystem(free_size - max_free_size);
}
#endif
}
}
return NULL;
}
void init_daemon(int argc, char** argv) {
// google::SetVersionString(get_build_version(false));
// google::ParseCommandLineFlags(&argc, &argv, true);
init_glog("be", true);
LOG(INFO) << get_version_string(false);
init_thrift_logging();
CpuInfo::init();
DiskInfo::init();
MemInfo::init();
LibCache::init();
Operators::init();
IsNullPredicate::init();
LikePredicate::init();
StringFunctions::init();
CastFunctions::init();
InPredicate::init();
MathFunctions::init();
EncryptionFunctions::init();
TimestampFunctions::init();
DecimalOperators::init();
UtilityFunctions::init();
CompoundPredicate::init();
JsonFunctions::init();
HllHashFunctions::init();
pthread_t id;
pthread_create(&id, NULL, tcmalloc_gc_thread, NULL);
LOG(INFO) << CpuInfo::debug_string();
LOG(INFO) << DiskInfo::debug_string();
LOG(INFO) << MemInfo::debug_string();
}
}

be/src/common/daemon.h Normal file
@ -0,0 +1,33 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_COMMON_DAEMON_H
#define BDG_PALO_BE_SRC_COMMON_COMMON_DAEMON_H
namespace palo {
// Initialises logging, flags etc. Callers that want to override default gflags
// variables should do so before calling this method; no logging should be
// performed until after this method returns.
void init_daemon(int argc, char** argv);
}
#endif

be/src/common/global_types.h Normal file
@ -0,0 +1,36 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_COMMON_GLOBAL_TYPES_H
#define BDG_PALO_BE_SRC_COMMON_COMMON_GLOBAL_TYPES_H
namespace palo {
// for now, these are simply ints; if we find we need to generate ids in the
// backend, we can also introduce separate classes for these to make them
// assignment-incompatible
typedef int TupleId;
typedef int SlotId;
typedef int TableId;
typedef int PlanNodeId;
}
#endif

be/src/common/hdfs.h Normal file
@ -0,0 +1,35 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef IMPALA_COMMON_HDFS_H
#define IMPALA_COMMON_HDFS_H
// This is a wrapper around the hdfs header. When we are compiling to IR,
// we don't want to pull in the hdfs headers; we only need them for the
// typedefs, which we replicate here.
// TODO: is this the cleanest way?
#ifdef IR_COMPILE
typedef void* hdfsFS;
typedef void* hdfsFile;
#else
#endif
#endif

be/src/common/logconfig.cpp Normal file
@ -0,0 +1,153 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include <iostream>
#include <cerrno>
#include <cstring>
#include <cstdlib>
#include <mutex>
#include <glog/logging.h>
#include <glog/vlog_is_on.h>
#include "common/config.h"
namespace palo {
static bool logging_initialized = false;
static std::mutex logging_mutex;
static bool iequals(const std::string& a, const std::string& b)
{
unsigned int sz = a.size();
if (b.size() != sz) {
return false;
}
for (unsigned int i = 0; i < sz; ++i) {
if (tolower(a[i]) != tolower(b[i])) {
return false;
}
}
return true;
}
bool init_glog(const char* basename, bool install_signal_handler) {
std::lock_guard<std::mutex> logging_lock(logging_mutex);
if (logging_initialized) {
return true;
}
if (install_signal_handler) {
google::InstallFailureSignalHandler();
}
// don't log to stderr
FLAGS_stderrthreshold = 5;
// set glog log dir
FLAGS_log_dir = config::sys_log_dir;
// 0 means buffer INFO only
FLAGS_logbuflevel = 0;
// buffer log messages for at most this many seconds
FLAGS_logbufsecs = 30;
// set roll num
FLAGS_log_filenum_quota = config::sys_log_roll_num;
// set log level
std::string& loglevel = config::sys_log_level;
if (iequals(loglevel, "INFO")) {
FLAGS_minloglevel = 0;
} else if (iequals(loglevel, "WARNING")) {
FLAGS_minloglevel = 1;
} else if (iequals(loglevel, "ERROR")) {
FLAGS_minloglevel = 2;
} else if (iequals(loglevel, "FATAL")) {
FLAGS_minloglevel = 3;
} else {
std::cerr << "sys_log_level needs to be INFO, WARNING, ERROR, FATAL" << std::endl;
return false;
}
// set log buffer level
// default is 0
std::string& logbuflevel = config::log_buffer_level;
if (iequals(logbuflevel, "-1")) {
FLAGS_logbuflevel = -1;
} else if (iequals(logbuflevel, "0")) {
FLAGS_logbuflevel = 0;
}
// set log roll mode
std::string& rollmode = config::sys_log_roll_mode;
std::string sizeflag = "SIZE-MB-";
bool ok = false;
if (rollmode.compare("TIME-DAY") == 0) {
FLAGS_log_split_method = "day";
ok = true;
} else if (rollmode.compare("TIME-HOUR") == 0) {
FLAGS_log_split_method = "hour";
ok = true;
} else if (rollmode.substr(0, sizeflag.length()).compare(sizeflag) == 0) {
FLAGS_log_split_method = "size";
std::string sizestr = rollmode.substr(sizeflag.size(), rollmode.size() - sizeflag.size());
if (sizestr.size() != 0) {
char* end = NULL;
errno = 0;
const char* sizecstr = sizestr.c_str();
int64_t ret64 = strtoll(sizecstr, &end, 10);
if ((errno == 0) && (end == sizecstr + strlen(sizecstr))) {
int32_t retval = static_cast<int32_t>(ret64);
if (retval == ret64) {
FLAGS_max_log_size = retval;
ok = true;
}
}
}
} else {
ok = false;
}
if (!ok) {
std::cerr << "sys_log_roll_mode needs to be TIME-DAY, TIME-HOUR, SIZE-MB-nnn" << std::endl;
return false;
}
// set verbose modules. only use vlog(0)
FLAGS_v = -1;
std::vector<std::string>& verbose_modules = config::sys_log_verbose_modules;
for (size_t i = 0; i < verbose_modules.size(); i++) {
if (verbose_modules[i].size() != 0) {
google::SetVLOGLevel(verbose_modules[i].c_str(), 10);
}
}
google::InitGoogleLogging(basename);
logging_initialized = true;
return true;
}
void shutdown_logging() {
std::lock_guard<std::mutex> logging_lock(logging_mutex);
google::ShutdownGoogleLogging();
}
} // namespace palo

be/src/common/logging.h Normal file
@ -0,0 +1,68 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef IMPALA_COMMON_LOGGING_H
#define IMPALA_COMMON_LOGGING_H
// This is a wrapper around the glog header. When we are compiling to IR,
// we don't want to pull in the glog headers. Pulling them in causes linking
// issues when we try to dynamically link the codegen'd functions.
#ifdef IR_COMPILE
#include <iostream>
#define DCHECK(condition) while (false) std::cout
#define DCHECK_EQ(a, b) while(false) std::cout
#define DCHECK_NE(a, b) while(false) std::cout
#define DCHECK_GT(a, b) while(false) std::cout
#define DCHECK_LT(a, b) while(false) std::cout
#define DCHECK_GE(a, b) while(false) std::cout
#define DCHECK_LE(a, b) while(false) std::cout
// Similar to how glog defines DCHECK for release.
#define LOG(level) while (false) std::cout
#define VLOG(level) while (false) std::cout
#else
// GLOG defines this based on the system but doesn't check if it's already
// been defined. undef it first to avoid warnings.
// glog MUST be included before gflags. Rather than including them directly,
// our files should include this file.
#undef _XOPEN_SOURCE
// This is including a glog internal file. We want this to expose the
// function to get the stack trace.
#include <glog/logging.h>
#include <common/config.h>
#undef MutexLock
#endif
// Define VLOG levels. We want display per-row info less than per-file which
// is less than per-query. For now per-connection is the same as per-query.
#define VLOG_CONNECTION VLOG(1)
#define VLOG_RPC VLOG(2)
#define VLOG_QUERY VLOG(1)
#define VLOG_FILE VLOG(2)
#define VLOG_ROW VLOG(3)
#define VLOG_PROGRESS VLOG(2)
#define VLOG_CONNECTION_IS_ON VLOG_IS_ON(1)
#define VLOG_RPC_IS_ON VLOG_IS_ON(2)
#define VLOG_QUERY_IS_ON VLOG_IS_ON(1)
#define VLOG_FILE_IS_ON VLOG_IS_ON(2)
#define VLOG_ROW_IS_ON VLOG_IS_ON(3)
#define VLOG_PROGRESS_IS_ON VLOG_IS_ON(2)
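// Usage sketch: VLOG_QUERY << "fragment started"; emits only when the calling
// module's vlog level is at least 1, e.g. after logconfig.cpp raises it for
// modules listed in config::sys_log_verbose_modules.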
#endif

be/src/common/object_pool.h Normal file
@ -0,0 +1,84 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_COMMON_OBJECT_POOL_H
#define BDG_PALO_BE_SRC_COMMON_COMMON_OBJECT_POOL_H
#include <vector>
#include <boost/thread/mutex.hpp>
#include <boost/thread/locks.hpp>
#include "util/spinlock.h"
namespace palo {
// An ObjectPool maintains a list of C++ objects which are deallocated
// by destroying the pool.
// Thread-safe.
class ObjectPool {
public:
ObjectPool(): _objects() {}
~ObjectPool() {
clear();
}
template <class T>
T* add(T* t) {
// Create the object to be pushed to the shared vector outside the critical section.
// TODO: Consider using a lock-free structure.
SpecificElement<T>* obj = new SpecificElement<T>(t);
DCHECK(obj != NULL);
boost::lock_guard<SpinLock> l(_lock);
_objects.push_back(obj);
return t;
}
void clear() {
boost::lock_guard<SpinLock> l(_lock);
for (auto i = _objects.rbegin(); i != _objects.rend(); ++i) {
delete *i;
}
_objects.clear();
}
private:
struct GenericElement {
virtual ~GenericElement() {}
};
template <class T>
struct SpecificElement : GenericElement {
SpecificElement(T* t): t(t) {}
~SpecificElement() {
delete t;
}
T* t;
};
typedef std::vector<GenericElement*> ElementVector;
ElementVector _objects;
SpinLock _lock;
};
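// Usage sketch ('MyExpr' is an illustrative type, not part of this header):
//   ObjectPool pool;
//   MyExpr* e = pool.add(new MyExpr());
//   // ... use e; it is deleted when the pool is cleared or destroyed,
//   // with elements destroyed in reverse insertion order.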
}
#endif

be/src/common/resource_tls.cpp Normal file
@ -0,0 +1,70 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "common/resource_tls.h"
#include <pthread.h>
#include "common/logging.h"
#include "gen_cpp/Types_types.h"
namespace palo {
static pthread_key_t s_resource_key;
static bool s_is_init = false;
static void resource_destructor(void* value) {
TResourceInfo* info = (TResourceInfo*)value;
if (info != nullptr) {
delete info;
}
}
void ResourceTls::init() {
int ret = pthread_key_create(&s_resource_key, resource_destructor);
if (ret != 0) {
LOG(ERROR) << "create pthread key for resource failed.";
return;
}
s_is_init = true;
}
TResourceInfo* ResourceTls::get_resource_tls() {
if (!s_is_init) {
return nullptr;
}
return (TResourceInfo*)pthread_getspecific(s_resource_key);
}
int ResourceTls::set_resource_tls(TResourceInfo* info) {
if (!s_is_init) {
return -1;
}
TResourceInfo* old_info = (TResourceInfo*)pthread_getspecific(s_resource_key);
int ret = pthread_setspecific(s_resource_key, info);
if (ret == 0) {
// OK, now we delete old one
delete old_info;
}
return ret;
}
}

36
be/src/common/resource_tls.h Normal file
@@ -0,0 +1,36 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_COMMON_RESOURCE_TLS_H
#define BDG_PALO_BE_SRC_COMMON_COMMON_RESOURCE_TLS_H
namespace palo {
class TResourceInfo;
class ResourceTls {
public:
static void init();
static TResourceInfo* get_resource_tls();
static int set_resource_tls(TResourceInfo*);
};
}
#endif
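A hedged usage sketch (call site hypothetical): init() must run once at process startup before any thread uses the slot, and a successful set_resource_tls() deletes the previously stored pointer.

// Hypothetical caller of ResourceTls.
#include "common/resource_tls.h"
#include "gen_cpp/Types_types.h"

static void resource_tls_example() {
    palo::ResourceTls::init();  // once, at startup
    palo::ResourceTls::set_resource_tls(new palo::TResourceInfo());
    palo::TResourceInfo* info = palo::ResourceTls::get_resource_tls();
    (void)info;  // this thread's resource info, or nullptr before init()
}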

139
be/src/common/status.cpp Normal file
@@ -0,0 +1,139 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "common/status.h"
#include <boost/algorithm/string/join.hpp>
#include "common/logging.h"
#include "util/debug_util.h"
namespace palo {
// NOTE: this is statically initialized and we must be very careful what
// functions these constructors call. In particular, we cannot call
// glog functions which also rely on static initializations.
// TODO: is there a more controlled way to do this?
const Status Status::OK;
const Status Status::CANCELLED(TStatusCode::CANCELLED, "Cancelled", true);
const Status Status::MEM_LIMIT_EXCEEDED(
TStatusCode::MEM_LIMIT_EXCEEDED, "Memory limit exceeded", true);
const Status Status::THRIFT_RPC_ERROR(
TStatusCode::THRIFT_RPC_ERROR, "Thrift RPC failed", true);
Status::ErrorDetail::ErrorDetail(const TStatus& status) :
error_code(status.status_code),
error_msgs(status.error_msgs) {
DCHECK_NE(error_code, TStatusCode::OK);
}
Status::Status(const std::string& error_msg) :
_error_detail(new ErrorDetail(TStatusCode::INTERNAL_ERROR, error_msg)) {
LOG(INFO) << error_msg << std::endl << get_stack_trace();
}
Status::Status(TStatusCode::type code, const std::string& error_msg)
: _error_detail(new ErrorDetail(code, error_msg)) {
}
Status::Status(const std::string& error_msg, bool quiet) :
_error_detail(new ErrorDetail(TStatusCode::INTERNAL_ERROR, error_msg)) {
if (!quiet) {
LOG(INFO) << error_msg << std::endl << get_stack_trace();
}
}
Status::Status(const TStatus& status) :
_error_detail(status.status_code == TStatusCode::OK ? NULL : new ErrorDetail(status)) {
}
Status& Status::operator=(const TStatus& status) {
delete _error_detail;
if (status.status_code == TStatusCode::OK) {
_error_detail = NULL;
} else {
_error_detail = new ErrorDetail(status);
}
return *this;
}
void Status::add_error_msg(TStatusCode::type code, const std::string& msg) {
if (_error_detail == NULL) {
_error_detail = new ErrorDetail(code, msg);
} else {
_error_detail->error_msgs.push_back(msg);
}
VLOG(2) << msg;
}
void Status::add_error_msg(const std::string& msg) {
add_error_msg(TStatusCode::INTERNAL_ERROR, msg);
}
void Status::add_error(const Status& status) {
if (status.ok()) {
return;
}
add_error_msg(status.code(), status.get_error_msg());
}
void Status::get_error_msgs(std::vector<std::string>* msgs) const {
msgs->clear();
if (_error_detail != NULL) {
*msgs = _error_detail->error_msgs;
}
}
void Status::get_error_msg(std::string* msg) const {
msg->clear();
if (_error_detail != NULL) {
*msg = boost::join(_error_detail->error_msgs, "\n");
}
}
std::string Status::get_error_msg() const {
std::string msg;
get_error_msg(&msg);
return msg;
}
void Status::to_thrift(TStatus* status) const {
status->error_msgs.clear();
if (_error_detail == NULL) {
status->status_code = TStatusCode::OK;
} else {
status->status_code = _error_detail->error_code;
for (int i = 0; i < _error_detail->error_msgs.size(); ++i) {
status->error_msgs.push_back(_error_detail->error_msgs[i]);
}
status->__isset.error_msgs = !_error_detail->error_msgs.empty();
}
}
}

206
be/src/common/status.h Normal file
@@ -0,0 +1,206 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_COMMON_COMMON_STATUS_H
#define BDG_PALO_BE_SRC_COMMON_COMMON_STATUS_H
#include <string>
#include <vector>
#include "common/logging.h"
#include "common/compiler_util.h"
#include "gen_cpp/Status_types.h" // for TStatus
namespace palo {
// Status is used as a function return type to indicate success, failure or cancellation
// of the function. In case of successful completion, it only occupies sizeof(void*)
// statically allocated memory. In the error case, it records a stack of error messages.
//
// example:
// Status fnB(int x) {
//     Status status = fnA(x);
//     if (!status.ok()) {
//         status.add_error_msg("fnA(x) went wrong");
//         return status;
//     }
//     return Status::OK;
// }
//
// TODO: macros:
// RETURN_IF_ERROR(status) << "msg"
// MAKE_ERROR() << "msg"
class Status {
public:
Status(): _error_detail(NULL) {}
static const Status OK;
static const Status CANCELLED;
static const Status MEM_LIMIT_EXCEEDED;
static const Status THRIFT_RPC_ERROR;
// copy c'tor makes copy of error detail so Status can be returned by value
Status(const Status& status) : _error_detail(
status._error_detail != NULL
? new ErrorDetail(*status._error_detail)
: NULL) {
}
// c'tor for error case - is this useful for anything other than CANCELLED?
Status(TStatusCode::type code) : _error_detail(new ErrorDetail(code)) {
}
// c'tor for error case
Status(TStatusCode::type code, const std::string& error_msg, bool quiet) :
_error_detail(new ErrorDetail(code, error_msg)) {
if (!quiet) {
VLOG(2) << error_msg;
}
}
Status(TStatusCode::type code, const std::string& error_msg);
// c'tor for internal error
Status(const std::string& error_msg);
Status(const std::string& error_msg, bool quiet);
~Status() {
if (_error_detail != NULL) {
delete _error_detail;
}
}
// same as copy c'tor
Status& operator=(const Status& status) {
delete _error_detail;
if (LIKELY(status._error_detail == NULL)) {
_error_detail = NULL;
} else {
_error_detail = new ErrorDetail(*status._error_detail);
}
return *this;
}
// "Copy" c'tor from TStatus.
Status(const TStatus& status);
// same as previous c'tor
Status& operator=(const TStatus& status);
// assign from stringstream
Status& operator=(const std::stringstream& stream);
bool ok() const {
return _error_detail == NULL;
}
bool is_cancelled() const {
return _error_detail != NULL
&& _error_detail->error_code == TStatusCode::CANCELLED;
}
bool is_mem_limit_exceeded() const {
return _error_detail != NULL
&& _error_detail->error_code == TStatusCode::MEM_LIMIT_EXCEEDED;
}
bool is_thrift_rpc_error() const {
return _error_detail != NULL
&& _error_detail->error_code == TStatusCode::THRIFT_RPC_ERROR;
}
// Add an error message and set the code if no code has been set yet.
// If a code has already been set, 'code' is ignored.
void add_error_msg(TStatusCode::type code, const std::string& msg);
// Add an error message and set the code to INTERNAL_ERROR if no code has been
// set yet. If a code has already been set, it is left unchanged.
void add_error_msg(const std::string& msg);
// Does nothing if status.ok().
// Otherwise: if 'this' is an error status, adds the error msg from 'status';
// otherwise assigns 'status'.
void add_error(const Status& status);
// Return all accumulated error msgs.
void get_error_msgs(std::vector<std::string>* msgs) const;
// Convert into TStatus. Call this if 'status_container' contains an optional
// TStatus field named 'status'. This also sets __isset.status.
template <typename T> void set_t_status(T* status_container) const {
to_thrift(&status_container->status);
status_container->__isset.status = true;
}
// Convert into TStatus.
void to_thrift(TStatus* status) const;
// Return all accumulated error msgs in a single string.
void get_error_msg(std::string* msg) const;
std::string get_error_msg() const;
TStatusCode::type code() const {
return _error_detail == NULL ? TStatusCode::OK : _error_detail->error_code;
}
private:
struct ErrorDetail {
TStatusCode::type error_code; // anything other than OK
std::vector<std::string> error_msgs;
ErrorDetail(const TStatus& status);
ErrorDetail(TStatusCode::type code)
: error_code(code) {}
ErrorDetail(TStatusCode::type code, const std::string& msg)
: error_code(code), error_msgs(1, msg) {}
};
ErrorDetail* _error_detail;
};
// some generally useful macros
#define RETURN_IF_ERROR(stmt) \
do { \
Status _status_ = (stmt); \
if (UNLIKELY(!_status_.ok())) { \
return _status_; \
} \
} while (false)
#define EXIT_IF_ERROR(stmt) \
do { \
Status _status_ = (stmt); \
if (UNLIKELY(!_status_.ok())) { \
string msg; \
_status_.get_error_msg(&msg); \
LOG(ERROR) << msg; \
exit(1); \
} \
} while (false)
}
#define WARN_UNUSED_RESULT __attribute__((warn_unused_result))
#endif
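A minimal sketch of the intended calling convention (both function names hypothetical): failures propagate up through RETURN_IF_ERROR, and a caller can annotate a failed Status before returning it.

// Placed in namespace palo so the unqualified 'Status' inside the
// RETURN_IF_ERROR macro body resolves.
#include "common/status.h"

namespace palo {

Status do_step();  // assumed to be defined elsewhere

Status do_work() {
    Status status = do_step();
    if (!status.ok()) {
        status.add_error_msg("do_step failed in do_work");
        return status;
    }
    RETURN_IF_ERROR(do_step());  // the equivalent one-line propagation
    return Status::OK;
}

}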

110
be/src/exec/CMakeLists.txt Normal file
@@ -0,0 +1,110 @@
# Modifications copyright (C) 2017, Baidu.com, Inc.
# Copyright 2017 The Apache Software Foundation
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
cmake_minimum_required(VERSION 2.6)
# where to put generated libraries
set(LIBRARY_OUTPUT_PATH "${BUILD_DIR}/src/exec")
# where to put generated binaries
set(EXECUTABLE_OUTPUT_PATH "${BUILD_DIR}/src/exec")
add_library(Exec STATIC
aggregation_node.cpp
#pre_aggregation_node.cpp
aggregation_node_ir.cpp
analytic_eval_node.cpp
blocking_join_node.cpp
broker_scan_node.cpp
broker_reader.cpp
broker_scanner.cpp
cross_join_node.cpp
data_sink.cpp
decompressor.cpp
empty_set_node.cpp
exec_node.cpp
exchange_node.cpp
hash_join_node.cpp
hash_join_node_ir.cpp
hash_table.cpp
local_file_reader.cpp
merge_node.cpp
merge_join_node.cpp
scan_node.cpp
select_node.cpp
text_converter.cpp
topn_node.cpp
sort_exec_exprs.cpp
sort_node.cpp
olap_rewrite_node.cpp
olap_scan_node.cpp
olap_scanner.cpp
olap_meta_reader.cpp
olap_common.cpp
plain_text_line_reader.cpp
mysql_scan_node.cpp
mysql_scanner.cpp
csv_scan_node.cpp
csv_scanner.cpp
spill_sort_node.cc
union_node.cpp
union_node_ir.cpp
schema_scanner.cpp
schema_scan_node.cpp
schema_scanner/schema_tables_scanner.cpp
schema_scanner/schema_dummy_scanner.cpp
schema_scanner/schema_schemata_scanner.cpp
schema_scanner/schema_variables_scanner.cpp
schema_scanner/schema_columns_scanner.cpp
schema_scanner/schema_charsets_scanner.cpp
schema_scanner/schema_collations_scanner.cpp
schema_scanner/frontend_helper.cpp
partitioned_hash_table.cc
partitioned_hash_table_ir.cc
partitioned_aggregation_node.cc
partitioned_aggregation_node_ir.cc
local_file_writer.cpp
broker_writer.cpp
)
# TODO: why are these tests disabled?
#ADD_BE_TEST(new_olap_scan_node_test)
#ADD_BE_TEST(pre_aggregation_node_test)
#ADD_BE_TEST(hash_table_test)
#ADD_BE_TEST(olap_scanner_test)
#ADD_BE_TEST(olap_meta_reader_test)
#ADD_BE_TEST(olap_common_test)
#ADD_BE_TEST(olap_scan_node_test)
#ADD_BE_TEST(mysql_scan_node_test)
#ADD_BE_TEST(mysql_scanner_test)
#ADD_BE_TEST(schema_scan_node_test)
#ADD_BE_TEST(schema_scanner_test)
##ADD_BE_TEST(set_executor_test)
#ADD_BE_TEST(schema_scanner/schema_authors_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_columns_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_create_table_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_open_tables_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_schemata_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_table_names_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_tables_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_variables_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_engines_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_collations_scanner_test)
#ADD_BE_TEST(schema_scanner/schema_charsets_scanner_test)

File diff suppressed because it is too large

161
be/src/exec/aggregation_node.h Normal file
@@ -0,0 +1,161 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_AGGREGATION_NODE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_AGGREGATION_NODE_H
#include <boost/scoped_ptr.hpp>
#include <functional>
#include "exec/exec_node.h"
#include "exec/hash_table.h"
#include "runtime/descriptors.h"
#include "runtime/free_list.hpp"
#include "runtime/mem_pool.h"
#include "runtime/string_value.h"
namespace llvm {
class Function;
}
namespace palo {
class AggFnEvaluator;
class LlvmCodeGen;
class RowBatch;
class RuntimeState;
struct StringValue;
class Tuple;
class TupleDescriptor;
class SlotDescriptor;
// Node for in-memory hash aggregation.
// The node creates a hash set of aggregation output tuples, which
// contain slots for all grouping and aggregation exprs (the grouping
// slots precede the aggregation expr slots in the output tuple descriptor).
//
// For string aggregation, we need to append additional data to the tuple object
// to reduce the number of string allocations (since we cannot know the length of
// the output string beforehand). For each string slot in the output tuple, an
// int32 is appended to the end of the normal tuple data; it stores the size of
// the buffer for that string slot. This also results in the correct alignment
// because StringValue slots are 8-byte aligned and form the tail end of the tuple.
class AggregationNode : public ExecNode {
public:
AggregationNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual ~AggregationNode();
virtual Status init(const TPlanNode& tnode);
virtual Status prepare(RuntimeState* state);
virtual Status open(RuntimeState* state);
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos);
virtual Status close(RuntimeState* state);
virtual void debug_string(int indentation_level, std::stringstream* out) const;
virtual void push_down_predicate(
RuntimeState *state, std::list<ExprContext*> *expr_ctxs);
static const char* _s_llvm_class_name;
private:
boost::scoped_ptr<HashTable> _hash_tbl;
HashTable::Iterator _output_iterator;
std::vector<AggFnEvaluator*> _aggregate_evaluators;
/// FunctionContext for each agg fn and backing pool.
std::vector<palo_udf::FunctionContext*> _agg_fn_ctxs;
boost::scoped_ptr<MemPool> _agg_fn_pool;
// Exprs used to evaluate input rows
std::vector<ExprContext*> _probe_expr_ctxs;
// Exprs used to insert constructed aggregation tuple into the hash table.
// All the exprs are simply SlotRefs for the agg tuple.
std::vector<ExprContext*> _build_expr_ctxs;
/// Tuple into which Update()/Merge()/Serialize() results are stored.
TupleId _intermediate_tuple_id;
TupleDescriptor* _intermediate_tuple_desc;
/// Tuple into which Finalize() results are stored. Possibly the same as
/// the intermediate tuple.
TupleId _output_tuple_id;
TupleDescriptor* _output_tuple_desc;
Tuple* _singleton_output_tuple; // result of aggregation w/o GROUP BY
boost::scoped_ptr<MemPool> _tuple_pool;
/// IR for process row batch. NULL if codegen is disabled.
llvm::Function* _codegen_process_row_batch_fn;
typedef void (*ProcessRowBatchFn)(AggregationNode*, RowBatch*);
// Jitted ProcessRowBatch function pointer. Null if codegen is disabled.
ProcessRowBatchFn _process_row_batch_fn;
// Certain aggregates require a finalize step, which is the final step of the
// aggregate after consuming all input rows. The finalize step converts the
// aggregate value into its final form. This is true if this node contains an
// aggregate that requires a finalize step.
bool _needs_finalize;
// Time spent processing the child rows
RuntimeProfile::Counter* _build_timer;
// Time spent returning the aggregated rows
RuntimeProfile::Counter* _get_results_timer;
// Num buckets in hash table
RuntimeProfile::Counter* _hash_table_buckets_counter;
// Load factor in hash table
RuntimeProfile::Counter* _hash_table_load_factor_counter;
// Constructs a new aggregation output tuple (allocated from _tuple_pool),
// initialized to grouping values computed over '_current_row'.
// Aggregation expr slots are set to their initial values.
Tuple* construct_intermediate_tuple();
// Updates the aggregation output tuple 'tuple' with aggregation values
// computed over 'row'.
void update_tuple(Tuple* tuple, TupleRow* row);
// Called when all rows have been aggregated for the aggregation tuple to compute final
// aggregate values
Tuple* finalize_tuple(Tuple* tuple, MemPool* pool);
// Do the aggregation for all tuple rows in the batch
void process_row_batch_no_grouping(RowBatch* batch, MemPool* pool);
void process_row_batch_with_grouping(RowBatch* batch, MemPool* pool);
/// Codegen the process row batch loop. The loop has already been compiled to
/// IR and loaded into the codegen object. UpdateAggTuple has also been
/// codegen'd to IR. This function will modify the loop, substituting the
/// UpdateAggTuple function call with the (inlined) codegen'd 'update_tuple_fn'.
llvm::Function* codegen_process_row_batch(
RuntimeState* state, llvm::Function* update_tuple_fn);
/// Codegen for updating aggregate_exprs at slot_idx. Returns NULL if unsuccessful.
/// slot_idx is the idx into aggregate_exprs_ (does not include grouping exprs).
llvm::Function* codegen_update_slot(
RuntimeState* state, AggFnEvaluator* evaluator, SlotDescriptor* slot_desc);
/// Codegen UpdateTuple(). Returns NULL if codegen is unsuccessful.
llvm::Function* codegen_update_tuple(RuntimeState* state);
};
}
#endif

55
be/src/exec/aggregation_node_ir.cpp Normal file
@@ -0,0 +1,55 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/aggregation_node.h"
#include "exec/hash_table.hpp"
#include "runtime/row_batch.h"
#include "runtime/runtime_state.h"
#include "runtime/tuple.h"
#include "runtime/tuple_row.h"
namespace palo {
void AggregationNode::process_row_batch_no_grouping(RowBatch* batch, MemPool* pool) {
for (int i = 0; i < batch->num_rows(); ++i) {
update_tuple(_singleton_output_tuple, batch->get_row(i));
}
}
void AggregationNode::process_row_batch_with_grouping(RowBatch* batch, MemPool* pool) {
for (int i = 0; i < batch->num_rows(); ++i) {
TupleRow* row = batch->get_row(i);
Tuple* agg_tuple = NULL;
HashTable::Iterator it = _hash_tbl->find(row);
if (it.at_end()) {
agg_tuple = construct_intermediate_tuple();
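// The pool-allocated agg tuple's address is reinterpreted as a
// one-tuple TupleRow before being inserted into the hash table.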
_hash_tbl->insert(reinterpret_cast<TupleRow*>(&agg_tuple));
} else {
agg_tuple = it.get_row()->get_tuple(0);
}
update_tuple(agg_tuple, row);
}
}
}
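A behaviorally equivalent sketch of the grouped path above, with std::unordered_map standing in for the custom HashTable and a (key, value) pair standing in for a TupleRow; this illustrates the probe-or-insert pattern only, not the node's real data layout.

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

static void hash_agg_sketch(const std::vector<std::pair<std::string, long>>& rows) {
    std::unordered_map<std::string, long> sums;  // group key -> running SUM
    for (const auto& row : rows) {
        sums[row.first] += row.second;  // probe; insert 0 on miss, then update
    }
}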

924
be/src/exec/analytic_eval_node.cpp Normal file
@@ -0,0 +1,924 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/analytic_eval_node.h"
#include "exprs/agg_fn_evaluator.h"
#include "exprs/anyval_util.h"
#include "runtime/buffered_tuple_stream.hpp"
#include "runtime/descriptors.h"
#include "runtime/row_batch.h"
#include "runtime/runtime_state.h"
#include "udf/udf_internal.h"
static const int MAX_TUPLE_POOL_SIZE = 8 * 1024 * 1024; // 8MB
namespace palo {
using palo_udf::BigIntVal;
AnalyticEvalNode::AnalyticEvalNode(ObjectPool* pool, const TPlanNode& tnode,
const DescriptorTbl& descs) :
ExecNode(pool, tnode, descs),
_window(tnode.analytic_node.window),
_intermediate_tuple_desc(
descs.get_tuple_descriptor(tnode.analytic_node.intermediate_tuple_id)),
_result_tuple_desc(
descs.get_tuple_descriptor(tnode.analytic_node.output_tuple_id)),
_buffered_tuple_desc(NULL),
_partition_by_eq_expr_ctx(NULL),
_order_by_eq_expr_ctx(NULL),
_rows_start_offset(0),
_rows_end_offset(0),
_has_first_val_null_offset(false),
_first_val_null_offset(0),
_last_result_idx(-1),
_prev_pool_last_result_idx(-1),
_prev_pool_last_window_idx(-1),
_curr_tuple(NULL),
_dummy_result_tuple(NULL),
_curr_partition_idx(-1),
_prev_input_row(NULL),
_input_eos(false),
_evaluation_timer(NULL) {
if (tnode.analytic_node.__isset.buffered_tuple_id) {
_buffered_tuple_desc = descs.get_tuple_descriptor(
tnode.analytic_node.buffered_tuple_id);
}
if (!tnode.analytic_node.__isset.window) {
_fn_scope = AnalyticEvalNode::PARTITION;
} else if (tnode.analytic_node.window.type == TAnalyticWindowType::RANGE) {
_fn_scope = AnalyticEvalNode::RANGE;
DCHECK(!_window.__isset.window_start)
<< "RANGE windows must have UNBOUNDED PRECEDING";
DCHECK(!_window.__isset.window_end ||
_window.window_end.type == TAnalyticWindowBoundaryType::CURRENT_ROW)
<< "RANGE window end bound must be CURRENT ROW or UNBOUNDED FOLLOWING";
} else {
DCHECK_EQ(tnode.analytic_node.window.type, TAnalyticWindowType::ROWS);
_fn_scope = AnalyticEvalNode::ROWS;
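// Example: ROWS BETWEEN 2 PRECEDING AND CURRENT ROW yields
// _rows_start_offset = -2 and _rows_end_offset = 0 below, since
// PRECEDING bounds are stored as negative offsets.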
if (_window.__isset.window_start) {
TAnalyticWindowBoundary b = _window.window_start;
if (b.__isset.rows_offset_value) {
_rows_start_offset = b.rows_offset_value;
if (b.type == TAnalyticWindowBoundaryType::PRECEDING) {
_rows_start_offset *= -1;
}
} else {
DCHECK_EQ(b.type, TAnalyticWindowBoundaryType::CURRENT_ROW);
_rows_start_offset = 0;
}
}
if (_window.__isset.window_end) {
TAnalyticWindowBoundary b = _window.window_end;
if (b.__isset.rows_offset_value) {
_rows_end_offset = b.rows_offset_value;
if (b.type == TAnalyticWindowBoundaryType::PRECEDING) {
_rows_end_offset *= -1;
}
} else {
DCHECK_EQ(b.type, TAnalyticWindowBoundaryType::CURRENT_ROW);
_rows_end_offset = 0;
}
}
}
VLOG_ROW << "tnode=" << apache::thrift::ThriftDebugString(tnode);
}
Status AnalyticEvalNode::init(const TPlanNode& tnode) {
RETURN_IF_ERROR(ExecNode::init(tnode));
const TAnalyticNode& analytic_node = tnode.analytic_node;
bool has_lead_fn = false;
for (int i = 0; i < analytic_node.analytic_functions.size(); ++i) {
AggFnEvaluator* evaluator = NULL;
RETURN_IF_ERROR(AggFnEvaluator::create(
_pool, analytic_node.analytic_functions[i], true, &evaluator));
_evaluators.push_back(evaluator);
const TFunction& fn = analytic_node.analytic_functions[i].nodes[0].fn;
_is_lead_fn.push_back("lead" == fn.name.function_name);
has_lead_fn = has_lead_fn || _is_lead_fn.back();
}
DCHECK(!has_lead_fn || !_window.__isset.window_start);
DCHECK(_fn_scope != PARTITION || analytic_node.order_by_exprs.empty());
DCHECK(_window.__isset.window_end || !_window.__isset.window_start)
<< "UNBOUNDED FOLLOWING is only supported with UNBOUNDED PRECEDING.";
if (analytic_node.__isset.partition_by_eq) {
DCHECK(analytic_node.__isset.buffered_tuple_id);
RETURN_IF_ERROR(Expr::create_expr_tree(
_pool, analytic_node.partition_by_eq, &_partition_by_eq_expr_ctx));
}
if (analytic_node.__isset.order_by_eq) {
DCHECK(analytic_node.__isset.buffered_tuple_id);
RETURN_IF_ERROR(Expr::create_expr_tree(
_pool, analytic_node.order_by_eq, &_order_by_eq_expr_ctx));
}
return Status::OK;
}
Status AnalyticEvalNode::prepare(RuntimeState* state) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_ERROR(ExecNode::prepare(state));
DCHECK(child(0)->row_desc().is_prefix_of(row_desc()));
_child_tuple_desc = child(0)->row_desc().tuple_descriptors()[0];
_curr_tuple_pool.reset(new MemPool(mem_tracker()));
_prev_tuple_pool.reset(new MemPool(mem_tracker()));
_mem_pool.reset(new MemPool(mem_tracker(), 0));
_evaluation_timer = ADD_TIMER(runtime_profile(), "EvaluationTime");
DCHECK_EQ(_result_tuple_desc->slots().size(), _evaluators.size());
for (int i = 0; i < _evaluators.size(); ++i) {
palo_udf::FunctionContext* ctx;
RETURN_IF_ERROR(_evaluators[i]->prepare(state, child(0)->row_desc(), _mem_pool.get(),
_intermediate_tuple_desc->slots()[i], _result_tuple_desc->slots()[i],
mem_tracker(), &ctx));
_fn_ctxs.push_back(ctx);
state->obj_pool()->add(ctx);
}
if (_partition_by_eq_expr_ctx != NULL || _order_by_eq_expr_ctx != NULL) {
DCHECK(_buffered_tuple_desc != NULL);
vector<TTupleId> tuple_ids;
tuple_ids.push_back(child(0)->row_desc().tuple_descriptors()[0]->id());
tuple_ids.push_back(_buffered_tuple_desc->id());
RowDescriptor cmp_row_desc(state->desc_tbl(), tuple_ids, vector<bool>(2, false));
if (_partition_by_eq_expr_ctx != NULL) {
RETURN_IF_ERROR(
_partition_by_eq_expr_ctx->prepare(state, cmp_row_desc, expr_mem_tracker()));
//AddExprCtxToFree(_partition_by_eq_expr_ctx);
}
if (_order_by_eq_expr_ctx != NULL) {
RETURN_IF_ERROR(
_order_by_eq_expr_ctx->prepare(state, cmp_row_desc, expr_mem_tracker()));
//AddExprCtxToFree(_order_by_eq_expr_ctx);
}
}
_child_tuple_cmp_row = reinterpret_cast<TupleRow*>(
_mem_pool->allocate(sizeof(Tuple*) * 2));
return Status::OK;
}
Status AnalyticEvalNode::open(RuntimeState* state) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_ERROR(ExecNode::open(state));
RETURN_IF_CANCELLED(state);
//RETURN_IF_ERROR(QueryMaintenance(state));
RETURN_IF_ERROR(child(0)->open(state));
// RETURN_IF_ERROR(state->block_mgr()->RegisterClient(2, mem_tracker(), state, &client_));
_input_stream.reset(new BufferedTupleStream(state, child(0)->row_desc(), state->block_mgr()));
RETURN_IF_ERROR(_input_stream->init(runtime_profile()));
DCHECK_EQ(_evaluators.size(), _fn_ctxs.size());
for (int i = 0; i < _evaluators.size(); ++i) {
RETURN_IF_ERROR(_evaluators[i]->open(state, _fn_ctxs[i]));
if ("first_value_rewrite" == _evaluators[i]->fn_name() &&
_fn_ctxs[i]->get_num_args() == 2) {
DCHECK(!_has_first_val_null_offset);
_first_val_null_offset =
reinterpret_cast<BigIntVal*>(_fn_ctxs[i]->get_constant_arg(1))->val;
VLOG_FILE << id() << " FIRST_VAL rewrite null offset: " << _first_val_null_offset;
_has_first_val_null_offset = true;
}
}
if (_partition_by_eq_expr_ctx != NULL) {
RETURN_IF_ERROR(_partition_by_eq_expr_ctx->open(state));
}
if (_order_by_eq_expr_ctx != NULL) {
RETURN_IF_ERROR(_order_by_eq_expr_ctx->open(state));
}
// An intermediate tuple is only allocated once and is reused.
_curr_tuple = Tuple::create(_intermediate_tuple_desc->byte_size(), _mem_pool.get());
AggFnEvaluator::init(_evaluators, _fn_ctxs, _curr_tuple);
_dummy_result_tuple = Tuple::create(_result_tuple_desc->byte_size(), _mem_pool.get());
// Initialize state for the first partition.
init_next_partition(0);
// Fetch the first input batch so that some _prev_input_row can be set here to avoid
// special casing in GetNext().
_prev_child_batch.reset(new RowBatch(child(0)->row_desc(), state->batch_size(), mem_tracker()));
_curr_child_batch.reset(new RowBatch(child(0)->row_desc(), state->batch_size(), mem_tracker()));
while (!_input_eos && _prev_input_row == NULL) {
RETURN_IF_ERROR(child(0)->get_next(state, _curr_child_batch.get(), &_input_eos));
if (_curr_child_batch->num_rows() > 0) {
_prev_input_row = _curr_child_batch->get_row(0);
process_child_batches(state);
} else {
// Empty batch, still need to reset.
_curr_child_batch->reset();
}
}
if (_prev_input_row == NULL) {
DCHECK(_input_eos);
// Delete _curr_child_batch to indicate there is no batch to process in GetNext()
_curr_child_batch.reset();
}
return Status::OK;
}
string debug_window_bound_string(const TAnalyticWindowBoundary& b) {
if (b.type == TAnalyticWindowBoundaryType::CURRENT_ROW) {
return "CURRENT_ROW";
}
stringstream ss;
if (b.__isset.rows_offset_value) {
ss << b.rows_offset_value;
} else {
// TODO: Return debug string when range offsets are supported
DCHECK(false) << "Range offsets not yet implemented";
}
if (b.type == TAnalyticWindowBoundaryType::PRECEDING) {
ss << " PRECEDING";
} else {
DCHECK_EQ(b.type, TAnalyticWindowBoundaryType::FOLLOWING);
ss << " FOLLOWING";
}
return ss.str();
}
std::string AnalyticEvalNode::debug_window_string() const {
std::stringstream ss;
if (_fn_scope == PARTITION) {
ss << "NO WINDOW";
return ss.str();
}
ss << "{type=";
if (_fn_scope == RANGE) {
ss << "RANGE";
} else {
ss << "ROWS";
}
ss << ", start=";
if (_window.__isset.window_start) {
ss << debug_window_bound_string(_window.window_start);
} else {
ss << "UNBOUNDED_PRECEDING";
}
ss << ", end=";
if (_window.__isset.window_end) {
ss << debug_window_bound_string(_window.window_end) << "}";
} else {
ss << "UNBOUNDED_FOLLOWING}";
}
return ss.str();
}
std::string AnalyticEvalNode::debug_state_string(bool detailed) const {
stringstream ss;
ss << "num_returned=" << _input_stream->rows_returned()
<< " num_rows=" << _input_stream->num_rows()
<< " _curr_partition_idx=" << _curr_partition_idx
<< " last_result_idx=" << _last_result_idx;
if (detailed) {
ss << " result_tuples idx: [";
for (std::list<std::pair<int64_t, Tuple*> >::const_iterator it = _result_tuples.begin();
it != _result_tuples.end(); ++it) {
ss << it->first;
if (*it != _result_tuples.back()) {
ss << ", ";
}
}
ss << "]";
if (_fn_scope == ROWS && _window.__isset.window_start) {
ss << " window_tuples idx: [";
for (std::list<std::pair<int64_t, Tuple*> >::const_iterator it = _window_tuples.begin();
it != _window_tuples.end(); ++it) {
ss << it->first;
if (*it != _window_tuples.back()) {
ss << ", ";
}
}
ss << "]";
}
} else {
if (_fn_scope == ROWS && _window.__isset.window_start) {
if (_window_tuples.empty()) {
ss << " window_tuples empty";
} else {
ss << " window_tuples idx range: (" << _window_tuples.front().first << ","
<< _window_tuples.back().first << ")";
}
}
if (_result_tuples.empty()) {
ss << " result_tuples empty";
} else {
ss << " result_tuples idx range: (" << _result_tuples.front().first << ","
<< _result_tuples.back().first << ")";
}
}
return ss.str();
}
void AnalyticEvalNode::add_result_tuple(int64_t stream_idx) {
VLOG_ROW << id() << " add_result_tuple idx=" << stream_idx;
DCHECK(_curr_tuple != NULL);
Tuple* result_tuple = Tuple::create(_result_tuple_desc->byte_size(),
_curr_tuple_pool.get());
AggFnEvaluator::get_value(_evaluators, _fn_ctxs, _curr_tuple, result_tuple);
DCHECK_GT(stream_idx, _last_result_idx);
_result_tuples.push_back(std::pair<int64_t, Tuple*>(stream_idx, result_tuple));
_last_result_idx = stream_idx;
VLOG_ROW << id() << " Added result tuple, final state: " << debug_state_string(true);
}
inline void AnalyticEvalNode::try_add_result_tuple_for_prev_row(bool next_partition,
int64_t stream_idx, TupleRow* row) {
// The analytic fns are finalized after the previous row if we found a new partition
// or the window is a RANGE and the order by exprs changed. For ROWS windows we do not
// need to compare the current row to the previous row.
VLOG_ROW << id() << " try_add_result_tuple_for_prev_row partition=" << next_partition
<< " idx=" << stream_idx;
if (_fn_scope == ROWS) {
return;
}
if (next_partition || (_fn_scope == RANGE && _window.__isset.window_end &&
!prev_row_compare(_order_by_eq_expr_ctx))) {
add_result_tuple(stream_idx - 1);
}
}
inline void AnalyticEvalNode::try_add_result_tuple_for_curr_row(int64_t stream_idx,
TupleRow* row) {
VLOG_ROW << id() << " try_add_result_tuple_for_curr_row idx=" << stream_idx;
// We only add results at this point for ROWS windows (unless unbounded following)
if (_fn_scope != ROWS || !_window.__isset.window_end) {
return;
}
// Nothing to add if the end offset is before the start of the partition.
if (stream_idx - _rows_end_offset < _curr_partition_idx) {
return;
}
add_result_tuple(stream_idx - _rows_end_offset);
}
inline void AnalyticEvalNode::try_remove_rows_before_window(int64_t stream_idx) {
if (_fn_scope != ROWS || !_window.__isset.window_start) {
return;
}
// The start of the window may have been before the current partition, in which case
// there is no tuple to remove in _window_tuples. Check the index of the row at which
// tuples from _window_tuples should begin to be removed.
int64_t remove_idx = stream_idx - _rows_end_offset + std::min(_rows_start_offset, 0L) - 1;
if (remove_idx < _curr_partition_idx) {
return;
}
VLOG_ROW << id() << " Remove idx=" << remove_idx << " stream_idx=" << stream_idx;
DCHECK(!_window_tuples.empty()) << debug_state_string(true);
DCHECK_EQ(remove_idx + std::max(_rows_start_offset, 0L), _window_tuples.front().first)
<< debug_state_string(true);
TupleRow* remove_row = reinterpret_cast<TupleRow*>(&_window_tuples.front().second);
AggFnEvaluator::remove(_evaluators, _fn_ctxs, remove_row, _curr_tuple);
_window_tuples.pop_front();
}
inline void AnalyticEvalNode::try_add_remaining_results(int64_t partition_idx,
int64_t prev_partition_idx) {
DCHECK_LT(prev_partition_idx, partition_idx);
// For PARTITION, RANGE, or ROWS with UNBOUNDED PRECEDING: add a result tuple for the
// remaining rows in the partition that do not have an associated result tuple yet.
if (_fn_scope != ROWS || !_window.__isset.window_end) {
if (_last_result_idx < partition_idx - 1) {
add_result_tuple(partition_idx - 1);
}
return;
}
// lead() is re-written to a ROWS window with an end bound FOLLOWING. Any remaining
// results need the default value (set by Init()). If this is the case, the start bound
// is UNBOUNDED PRECEDING (DCHECK in Init()).
for (int i = 0; i < _evaluators.size(); ++i) {
if (_is_lead_fn[i]) {
_evaluators[i]->init(_fn_ctxs[i], _curr_tuple);
}
}
// If the start bound is not UNBOUNDED PRECEDING and there are still rows in the
// partition for which we need to produce result tuples, we need to continue removing
// input tuples at the start of the window from each row that we're adding results for.
VLOG_ROW << id() << " try_add_remaining_results prev_partition_idx=" << prev_partition_idx
<< " " << debug_state_string(true);
for (int64_t next_result_idx = _last_result_idx + 1; next_result_idx < partition_idx;
++next_result_idx) {
if (_window_tuples.empty()) {
break;
}
if (next_result_idx + _rows_start_offset > _window_tuples.front().first) {
DCHECK_EQ(next_result_idx + _rows_start_offset - 1, _window_tuples.front().first);
// For every tuple that is removed from the window: Remove() from the evaluators
// and add the result tuple at the next index.
VLOG_ROW << id() << " Remove window_row_idx=" << _window_tuples.front().first
<< " for result row at idx=" << next_result_idx;
TupleRow* remove_row = reinterpret_cast<TupleRow*>(&_window_tuples.front().second);
AggFnEvaluator::remove(_evaluators, _fn_ctxs, remove_row, _curr_tuple);
_window_tuples.pop_front();
}
add_result_tuple(_last_result_idx + 1);
}
// If there are still rows between the row with the last result (add_result_tuple() may
// have updated _last_result_idx) and the partition boundary, add the current results
// for the remaining rows with the same result tuple (_curr_tuple is not modified).
if (_last_result_idx < partition_idx - 1) {
add_result_tuple(partition_idx - 1);
}
}
inline void AnalyticEvalNode::init_next_partition(int64_t stream_idx) {
VLOG_FILE << id() << " init_next_partition idx=" << stream_idx;
DCHECK_LT(_curr_partition_idx, stream_idx);
int64_t prev_partition_stream_idx = _curr_partition_idx;
_curr_partition_idx = stream_idx;
// If the window has an end bound preceding the current row, we will have output
// tuples for rows beyond the partition so they should be removed. If there was only
// one result tuple left in the partition it will remain in _result_tuples because it
// is the empty result tuple (i.e. called Init() and never Update()) that was added
// when initializing the previous partition so that the first rows have the default
// values (where there are no preceding rows in the window).
bool removed_results_past_partition = false;
while (!_result_tuples.empty() && _last_result_idx >= _curr_partition_idx) {
removed_results_past_partition = true;
DCHECK(_window.__isset.window_end &&
_window.window_end.type == TAnalyticWindowBoundaryType::PRECEDING);
VLOG_ROW << id() << " Removing result past partition idx: "
<< _result_tuples.back().first;
Tuple* prev_result_tuple = _result_tuples.back().second;
_result_tuples.pop_back();
if (_result_tuples.empty() ||
_result_tuples.back().first < prev_partition_stream_idx) {
// prev_result_tuple was the last result tuple in the partition, add it back with
// the index of the last row in the partition so that all output rows in this
// partition get the default result tuple.
_result_tuples.push_back(
std::pair<int64_t, Tuple*>(_curr_partition_idx - 1, prev_result_tuple));
}
_last_result_idx = _result_tuples.back().first;
}
if (removed_results_past_partition) {
VLOG_ROW << id() << " After removing results past partition: "
<< debug_state_string(true);
DCHECK_EQ(_last_result_idx, _curr_partition_idx - 1);
DCHECK_LE(_input_stream->rows_returned(), _last_result_idx);
}
if (_fn_scope == ROWS && stream_idx > 0 && (!_window.__isset.window_end ||
_window.window_end.type == TAnalyticWindowBoundaryType::FOLLOWING)) {
try_add_remaining_results(stream_idx, prev_partition_stream_idx);
}
_window_tuples.clear();
// Re-initialize _curr_tuple.
VLOG_ROW << id() << " Reset curr_tuple";
// Call finalize to release resources; result is not needed but the dst tuple must be
// a tuple described by _result_tuple_desc.
AggFnEvaluator::finalize(_evaluators, _fn_ctxs, _curr_tuple, _dummy_result_tuple);
_curr_tuple->init(_intermediate_tuple_desc->byte_size());
AggFnEvaluator::init(_evaluators, _fn_ctxs, _curr_tuple);
// Add a result tuple containing values set by Init() (e.g. NULL for sum(), 0 for
// count()) for output rows that have no input rows in the window. We need to add this
// result tuple before any input rows are consumed and the evaluators are updated.
if (_fn_scope == ROWS && _window.__isset.window_end &&
_window.window_end.type == TAnalyticWindowBoundaryType::PRECEDING) {
if (_has_first_val_null_offset) {
// Special handling for FIRST_VALUE which has the window rewritten in the FE
// in order to evaluate the fn efficiently with a trivial agg fn implementation.
// This occurs when the original analytic window has a start bound X PRECEDING. In
// that case, the window is rewritten to have an end bound X PRECEDING which would
// normally mean we add the newly Init()'d result tuple X rows down (so that those
// first rows have the initial value because they have no rows in their windows).
// However, the original query did not actually have X PRECEDING so we need to do
// one of the following:
// 1) Do not insert the initial result tuple at all, indicated by
// _first_val_null_offset == -1. This happens when the original end bound was
// actually CURRENT ROW or Y FOLLOWING.
// 2) Insert the initial result tuple at _first_val_null_offset. This happens when
// the end bound was actually Y PRECEDING.
if (_first_val_null_offset != -1) {
add_result_tuple(_curr_partition_idx + _first_val_null_offset - 1);
}
} else {
add_result_tuple(_curr_partition_idx - _rows_end_offset - 1);
}
}
}
inline bool AnalyticEvalNode::prev_row_compare(ExprContext* pred_ctx) {
DCHECK(pred_ctx != NULL);
palo_udf::BooleanVal result = pred_ctx->get_boolean_val(_child_tuple_cmp_row);
DCHECK(!result.is_null);
return result.val;
}
Status AnalyticEvalNode::process_child_batches(RuntimeState* state) {
// Consume child batches until eos or there are enough rows to return more than an
// output batch. Ensuring there is at least one more row left after returning results
// allows us to simplify the logic dealing with _last_result_idx and _result_tuples.
while (_curr_child_batch.get() != NULL &&
num_output_rows_ready() < state->batch_size() + 1) {
RETURN_IF_CANCELLED(state);
//RETURN_IF_ERROR(QueryMaintenance(state));
RETURN_IF_ERROR(process_child_batch(state));
// TODO: DCHECK that the size of _result_tuples is bounded. It shouldn't be larger
// than 2x the batch size unless the end bound has an offset preceding, in which
// case it may be slightly larger (proportional to the offset but still bounded).
if (_input_eos) {
// Already processed the last child batch. Clean up and break.
_curr_child_batch.reset();
_prev_child_batch.reset();
break;
}
_prev_child_batch->reset();
_prev_child_batch.swap(_curr_child_batch);
RETURN_IF_ERROR(child(0)->get_next(state, _curr_child_batch.get(), &_input_eos));
}
return Status::OK;
}
Status AnalyticEvalNode::process_child_batch(RuntimeState* state) {
// TODO: DCHECK input is sorted (even just first row vs _prev_input_row)
VLOG_FILE << id() << " process_child_batch: " << debug_state_string(false)
<< " input batch size:" << _curr_child_batch->num_rows()
<< " tuple pool size:" << _curr_tuple_pool->total_allocated_bytes();
SCOPED_TIMER(_evaluation_timer);
// BufferedTupleStream::num_rows() returns the total number of rows that have been
// inserted into the stream (it does not decrease when we read rows), so the index of
// the next input row that will be inserted will be the current size of the stream.
int64_t stream_idx = _input_stream->num_rows();
// Stores the stream_idx of the row that was last inserted into _window_tuples.
int64_t last_window_tuple_idx = -1;
for (int i = 0; i < _curr_child_batch->num_rows(); ++i, ++stream_idx) {
TupleRow* row = _curr_child_batch->get_row(i);
_child_tuple_cmp_row->set_tuple(0, _prev_input_row->get_tuple(0));
_child_tuple_cmp_row->set_tuple(1, row->get_tuple(0));
try_remove_rows_before_window(stream_idx);
// Every row is compared against the previous row to determine if (a) the row
// starts a new partition or (b) the row does not share the same values for the
// ordering exprs. When either of these occurs, the _evaluators are finalized and
// the result tuple is added to _result_tuples so that it may be added to output
// rows in get_next_output_batch(). When a new partition is found (a), a new, empty
// result tuple is created and initialized over the _evaluators. If the row has
// different values for the ordering exprs (b), then a new tuple is created but
// copied from _curr_tuple because the original is used for one or more previous
// row(s) but the incremental state still applies to the current row.
bool next_partition = false;
if (_partition_by_eq_expr_ctx != NULL) {
// _partition_by_eq_expr_ctx checks equality over the predicate exprs
next_partition = !prev_row_compare(_partition_by_eq_expr_ctx);
}
try_add_result_tuple_for_prev_row(next_partition, stream_idx, row);
if (next_partition) {
init_next_partition(stream_idx);
}
// The _evaluators are updated with the current row.
if (_fn_scope != ROWS || !_window.__isset.window_start ||
stream_idx - _rows_start_offset >= _curr_partition_idx) {
VLOG_ROW << id() << " Update idx=" << stream_idx;
AggFnEvaluator::add(_evaluators, _fn_ctxs, row, _curr_tuple);
if (_window.__isset.window_start) {
VLOG_ROW << id() << " Adding tuple to window at idx=" << stream_idx;
Tuple* tuple = row->get_tuple(0)->deep_copy(*_child_tuple_desc,
_curr_tuple_pool.get());
_window_tuples.push_back(std::pair<int64_t, Tuple*>(stream_idx, tuple));
last_window_tuple_idx = stream_idx;
}
}
try_add_result_tuple_for_curr_row(stream_idx, row);
// Buffer the entire input row to be returned later with the analytic eval results.
if (UNLIKELY(!_input_stream->add_row(row))) {
// AddRow returns false if an error occurs (available via status()) or there is
// not enough memory (status() is OK). If there isn't enough memory, we unpin
// the stream and continue writing/reading in unpinned mode.
// TODO: Consider re-pinning later if the output stream is fully consumed.
RETURN_IF_ERROR(_input_stream->status());
// RETURN_IF_ERROR(_input_stream->UnpinStream());
VLOG_FILE << id() << " Unpin input stream while adding row idx=" << stream_idx;
if (!_input_stream->add_row(row)) {
// Rows should be added in unpinned mode unless an error occurs.
RETURN_IF_ERROR(_input_stream->status());
DCHECK(false);
}
}
_prev_input_row = row;
}
// We need to add the results for the last row(s).
if (_input_eos) {
try_add_remaining_results(stream_idx, _curr_partition_idx);
}
// Transfer resources to _prev_tuple_pool when enough resources have accumulated
// and the _prev_tuple_pool has already been transferred to an output batch.
if (_curr_tuple_pool->total_allocated_bytes() > MAX_TUPLE_POOL_SIZE &&
(_prev_pool_last_result_idx == -1 || _prev_pool_last_window_idx == -1)) {
_prev_tuple_pool->acquire_data(_curr_tuple_pool.get(), false);
_prev_pool_last_result_idx = _last_result_idx;
_prev_pool_last_window_idx = last_window_tuple_idx;
VLOG_FILE << id() << " Transfer resources from curr to prev pool at idx: "
<< stream_idx << ", stores tuples with last result idx: "
<< _prev_pool_last_result_idx << " last window idx: "
<< _prev_pool_last_window_idx;
}
return Status::OK;
}
Status AnalyticEvalNode::get_next_output_batch(RuntimeState* state, RowBatch* output_batch,
bool* eos) {
SCOPED_TIMER(_evaluation_timer);
VLOG_FILE << id() << " get_next_output_batch: " << debug_state_string(false)
<< " tuple pool size:" << _curr_tuple_pool->total_allocated_bytes();
if (_input_stream->rows_returned() == _input_stream->num_rows()) {
*eos = true;
return Status::OK;
}
const int num_child_tuples = child(0)->row_desc().tuple_descriptors().size();
ExprContext** ctxs = &_conjunct_ctxs[0];
int num_ctxs = _conjunct_ctxs.size();
RowBatch input_batch(child(0)->row_desc(), output_batch->capacity(), mem_tracker());
int64_t stream_idx = _input_stream->rows_returned();
RETURN_IF_ERROR(_input_stream->get_next(&input_batch, eos));
for (int i = 0; i < input_batch.num_rows(); ++i) {
if (reached_limit()) {
break;
}
DCHECK(!output_batch->is_full());
DCHECK(!_result_tuples.empty());
VLOG_ROW << id() << " Output row idx=" << stream_idx << " " << debug_state_string(true);
// CopyRow works as expected: input_batch tuples form a prefix of output_batch
// tuples.
TupleRow* dest = output_batch->get_row(output_batch->add_row());
input_batch.copy_row(input_batch.get_row(i), dest);
dest->set_tuple(num_child_tuples, _result_tuples.front().second);
if (ExecNode::eval_conjuncts(ctxs, num_ctxs, dest)) {
output_batch->commit_last_row();
++_num_rows_returned;
}
// Remove the head of _result_tuples if all rows using that evaluated tuple
// have been returned.
DCHECK_LE(stream_idx, _result_tuples.front().first);
if (stream_idx >= _result_tuples.front().first) {
_result_tuples.pop_front();
}
++stream_idx;
}
input_batch.transfer_resource_ownership(output_batch);
if (reached_limit()) {
*eos = true;
}
return Status::OK;
}
inline int64_t AnalyticEvalNode::num_output_rows_ready() const {
if (_result_tuples.empty()) {
return 0;
}
int64_t rows_to_return = _last_result_idx - _input_stream->rows_returned();
if (_last_result_idx > _input_stream->num_rows()) {
// This happens when we were able to add a result tuple before consuming child rows,
// e.g. initializing a new partition with an end bound that is X preceding. The first
// X rows get the default value and we add that tuple to _result_tuples before
// consuming child rows. It's possible the result is negative, and that's fine
// because this result is only used to determine if the number of rows to return
// is at least as big as the batch size.
rows_to_return -= _last_result_idx - _input_stream->num_rows();
} else {
DCHECK_GE(rows_to_return, 0);
}
return rows_to_return;
}
Status AnalyticEvalNode::get_next(RuntimeState* state, RowBatch* row_batch, bool* eos) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::GETNEXT));
RETURN_IF_CANCELLED(state);
//RETURN_IF_ERROR(QueryMaintenance(state));
VLOG_FILE << id() << " GetNext: " << debug_state_string(false);
if (reached_limit()) {
*eos = true;
return Status::OK;
} else {
*eos = false;
}
RETURN_IF_ERROR(process_child_batches(state));
bool output_eos = false;
RETURN_IF_ERROR(get_next_output_batch(state, row_batch, &output_eos));
if (_curr_child_batch.get() == NULL && output_eos) {
*eos = true;
}
// Transfer resources to the output row batch if enough have accumulated and they're
// no longer needed by output rows to be returned later.
if (_prev_pool_last_result_idx != -1 &&
_prev_pool_last_result_idx < _input_stream->rows_returned() &&
_prev_pool_last_window_idx < _window_tuples.front().first) {
VLOG_FILE << id() << " Transfer prev pool to output batch, "
<< " pool size: " << _prev_tuple_pool->total_allocated_bytes()
<< " last result idx: " << _prev_pool_last_result_idx
<< " last window idx: " << _prev_pool_last_window_idx;
row_batch->tuple_data_pool()->acquire_data(_prev_tuple_pool.get(), !*eos);
_prev_pool_last_result_idx = -1;
_prev_pool_last_window_idx = -1;
}
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
return Status::OK;
}
Status AnalyticEvalNode::close(RuntimeState* state) {
if (is_closed()) {
return Status::OK;
}
if (_input_stream.get() != NULL) {
_input_stream->close();
}
// Close all evaluators and fn ctxs. If an error occurred in Init() or Prepare()
// there may be fewer ctxs than evaluators. We also need to Finalize if
// _curr_tuple was created in Open().
DCHECK_LE(_fn_ctxs.size(), _evaluators.size());
DCHECK(_curr_tuple == NULL || _fn_ctxs.size() == _evaluators.size());
for (int i = 0; i < _evaluators.size(); ++i) {
// Need to make sure finalize is called in case there is any state to clean up.
if (_curr_tuple != NULL) {
_evaluators[i]->finalize(_fn_ctxs[i], _curr_tuple, _dummy_result_tuple);
}
_evaluators[i]->close(state);
}
for (int i = 0; i < _fn_ctxs.size(); ++i) {
_fn_ctxs[i]->impl()->close();
}
if (_partition_by_eq_expr_ctx != NULL) {
_partition_by_eq_expr_ctx->close(state);
}
if (_order_by_eq_expr_ctx != NULL) {
_order_by_eq_expr_ctx->close(state);
}
if (_prev_child_batch.get() != NULL) {
_prev_child_batch.reset();
}
if (_curr_child_batch.get() != NULL) {
_curr_child_batch.reset();
}
if (_curr_tuple_pool.get() != NULL) {
_curr_tuple_pool->free_all();
}
if (_prev_tuple_pool.get() != NULL) {
_prev_tuple_pool->free_all();
}
if (_mem_pool.get() != NULL) {
_mem_pool->free_all();
}
ExecNode::close(state);
return Status::OK;
}
void AnalyticEvalNode::debug_string(int indentation_level, stringstream* out) const {
*out << string(indentation_level * 2, ' ');
*out << "AnalyticEvalNode("
<< " window=" << debug_window_string();
if (_partition_by_eq_expr_ctx != NULL) {
// *out << " partition_exprs=" << _partition_by_eq_expr_ctx->debug_string();
}
if (_order_by_eq_expr_ctx != NULL) {
// *out << " order_by_exprs=" << _order_by_eq_expr_ctx->debug_string();
}
*out << AggFnEvaluator::debug_string(_evaluators);
ExecNode::debug_string(indentation_level, out);
*out << ")";
}
//Status AnalyticEvalNode::QueryMaintenance(RuntimeState* state) {
// for (int i = 0; i < evaluators_.size(); ++i) {
// Expr::FreeLocalAllocations(evaluators_[i]->input_expr_ctxs());
// }
// return ExecNode::QueryMaintenance(state);
//}
}
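A small sketch of the (stream index, result tuple) convention used by _result_tuples above: each entry's first member is the index of the last input row it applies to, so a given row's result is the first entry whose index is >= that row (Tuple* replaced by const char* purely for illustration).

#include <cstdint>
#include <list>
#include <utility>

static const char* result_for_row(
        const std::list<std::pair<int64_t, const char*>>& result_tuples,
        int64_t stream_idx) {
    for (const auto& entry : result_tuples) {
        if (stream_idx <= entry.first) {
            return entry.second;  // first entry covering this row
        }
    }
    return nullptr;  // no result has been produced for this row yet
}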

337
be/src/exec/analytic_eval_node.h Normal file
@@ -0,0 +1,337 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef INF_PALO_BE_SRC_EXEC_ANALYTIC_EVAL_NODE_H
#define INF_PALO_BE_SRC_EXEC_ANALYTIC_EVAL_NODE_H
#include "exec/exec_node.h"
#include "exprs/expr.h"
//#include "exprs/expr_context.h"
#include "runtime/buffered_block_mgr.h"
#include "runtime/buffered_tuple_stream.h"
#include "runtime/tuple.h"
#include "thrift/protocol/TDebugProtocol.h"
namespace palo {
class AggFnEvaluator;
// Evaluates analytic functions with a single pass over input rows. It is assumed
// that the input has already been sorted on all of the partition exprs and then the
// order by exprs. If there is no order by clause or partition clause, the input is
// unsorted. Uses a BufferedTupleStream to buffer input rows which are returned in a
// streaming fashion as entire row batches of output are ready to be returned, though in
// some cases the entire input must actually be consumed to produce any output rows.
//
// The output row is composed of the tuples from the child node followed by a single
// result tuple that holds the values of the evaluated analytic functions (one slot per
// analytic function).
//
// When enough input rows have been consumed to produce the results of all analytic
// functions for one or more rows (e.g. because the order by values are different for a
// RANGE window), the results of all the analytic functions for those rows are produced
// in a result tuple by calling GetValue()/Finalize() on the evaluators and storing the
// tuple in _result_tuples. Input row batches are fetched from the BufferedTupleStream,
// copied into output row batches, and the associated result tuple is set in each
// corresponding row. Result tuples may apply to many rows (e.g. an arbitrary number or
// an entire partition) so _result_tuples stores a pair of the stream index (the last
// row in the stream it applies to) and the tuple.
//
// Input rows are consumed in a streaming fashion until enough input has been consumed
// in order to produce enough output rows. In some cases, this may mean that only a
// single input batch is needed to produce the results for an output batch, e.g.
// "SELECT RANK OVER (ORDER BY unique_col) ... ", but in other cases, an arbitrary
// number of rows may need to be buffered before result rows can be produced, e.g. if
// multiple rows have the same values for the order by exprs. The number of buffered
// rows may be an entire partition or even the entire input. Therefore, the output
// rows are buffered and may spill to disk via the BufferedTupleStream.
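//
// Illustrative example (a sketch, not part of the original source): for
//   SELECT p, x, SUM(y) OVER (PARTITION BY p ORDER BY x) FROM t;
// with input sorted on (p, x), this is a RANGE window. One result tuple is
// produced per distinct x value within a partition, so if stream rows 0-2
// share one x value and rows 3-4 the next, _result_tuples would hold pairs
// like { (2, tuple_a), (4, tuple_b) }: tuple_a is set on output rows 0-2 and
// tuple_b on output rows 3-4.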
class AnalyticEvalNode : public ExecNode {
public:
~AnalyticEvalNode() {}
AnalyticEvalNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual Status init(const TPlanNode& tnode);
virtual Status prepare(RuntimeState* state);
virtual Status open(RuntimeState* state);
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos);
virtual Status close(RuntimeState* state);
protected:
// Frees local allocations from _evaluators
// virtual Status QueryMaintenance(RuntimeState* state);
virtual void debug_string(int indentation_level, std::stringstream* out) const;
private:
// The scope over which analytic functions are evaluated. Functions are either
// evaluated over a window (specified by a TAnalyticWindow) or an entire partition.
// This is used to avoid more complex logic where we often branch based on these
// cases, e.g. whether or not there is a window (i.e. no window = PARTITION) is stored
// separately from the window type (assuming there is a window).
enum AnalyticFnScope {
// Analytic functions are evaluated over an entire partition (or the entire data set
// if no partition clause was specified). Every row within a partition is added to
// _curr_tuple and buffered in the _input_stream. Once all rows in a partition have
// been consumed, a single result tuple is added to _result_tuples for all rows in
// that partition.
PARTITION,
// Functions are evaluated over windows specified with range boundaries. Currently
// only supports the 'default window', i.e. UNBOUNDED PRECEDING to CURRENT ROW. In
// this case, when the values of the order by expressions change between rows a
// result tuple is added to _result_tuples for the previous rows with the same values
// for the order by expressions. This happens in try_add_result_tuple_for_prev_row()
// because we determine if the order by expression values changed between the
// previous and current row.
RANGE,
// Functions are evaluated over windows specified with rows boundaries. A result
// tuple is added for every input row (except for some cases where the window extends
// before or after the partition). When the end boundary is offset from the current
// row, input rows are consumed and result tuples are produced for the associated
// preceding or following row. When the start boundary is offset from the current
// row, the first tuple (i.e. the input to the analytic functions) from each input
// row is buffered in _window_tuples because it must later be removed from the
// window (by calling AggFnEvaluator::Remove() with the expired tuple to remove it
// from the current row). When either the start or end boundaries are offset from the
// current row, there is special casing around partition boundaries.
ROWS
};
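// Illustrative mapping from window specifications to scopes (a sketch of
// common cases, not exhaustive):
//   SUM(x) OVER (PARTITION BY p)               -> PARTITION
//   RANK() OVER (ORDER BY x)                   -> RANGE (the default window,
//                                                 UNBOUNDED PRECEDING..CURRENT ROW)
//   SUM(x) OVER (ORDER BY x ROWS 1 PRECEDING)  -> ROWS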
// Evaluates analytic functions over _curr_child_batch. Each input row is passed
// to the evaluators and added to _input_stream where they are stored until a tuple
// containing the results of the analytic functions for that row is ready to be
// returned. When enough rows have been processed so that results can be produced for
// one or more rows, a tuple containing those results is stored in _result_tuples.
// That tuple gets set in the associated output row(s) later in get_next_output_batch().
Status process_child_batch(RuntimeState* state);
// Processes child batches (calling process_child_batch()) until enough output rows
// are ready to return an output batch.
Status process_child_batches(RuntimeState* state);
// Returns a batch of output rows from _input_stream with the analytic function
// results (from _result_tuples) set as the last tuple.
Status get_next_output_batch(RuntimeState* state, RowBatch* row_batch, bool* eos);
// Determines if there is a window ending at the previous row, and if so, calls
// add_result_tuple() with the index of the previous row in _input_stream. next_partition
// indicates if the current row is the start of a new partition. stream_idx is the
// index of the current input row from _input_stream.
void try_add_result_tuple_for_prev_row(bool next_partition, int64_t stream_idx,
TupleRow* row);
// Determines if there is a window ending at the current row, and if so, calls
// add_result_tuple() with the index of the current row in _input_stream. stream_idx is
// the index of the current input row from _input_stream.
void try_add_result_tuple_for_curr_row(int64_t stream_idx, TupleRow* row);
// Adds additional result tuples at the end of a partition, e.g. if the end bound is
// FOLLOWING. partition_idx is the index into _input_stream of the new partition,
// prev_partition_idx is the index of the previous partition.
void try_add_remaining_results(int64_t partition_idx, int64_t prev_partition_idx);
// Removes rows from _curr_tuple (by calling AggFnEvaluator::Remove()) that are no
// longer in the window (i.e. they are before the window start boundary). stream_idx
// is the index of the row in _input_stream that is currently being processed in
// process_child_batch().
void try_remove_rows_before_window(int64_t stream_idx);
// Initializes state at the start of a new partition. stream_idx is the index of the
// current input row from _input_stream.
void init_next_partition(int64_t stream_idx);
// Produces a result tuple with analytic function results by calling GetValue() or
// Finalize() for _curr_tuple on the _evaluators. The result tuple is stored in
// _result_tuples with the index into _input_stream specified by stream_idx.
void add_result_tuple(int64_t stream_idx);
// Gets the number of rows that are ready to be returned by subsequent calls to
// get_next_output_batch().
int64_t num_output_rows_ready() const;
// Resets the slots in _curr_tuple that store the intermediate results for lead().
// This is necessary to produce the default value (set by Init()).
void reset_lead_fn_slots();
// Evaluates the predicate pred_ctx over _child_tuple_cmp_row, which is a TupleRow*
// containing the previous row and the current row set during process_child_batch().
bool prev_row_compare(ExprContext* pred_ctx);
// Debug string containing current state. If 'detailed', per-row state is included.
std::string debug_state_string(bool detailed) const;
std::string debug_evaluated_rows_string() const;
// Debug string containing the window definition.
std::string debug_window_string() const;
// Window over which the analytic functions are evaluated. Only used if _fn_scope
// is ROWS or RANGE.
// TODO: _fn_scope and _window are candidates to be removed during codegen
const TAnalyticWindow _window;
// Tuple descriptor for storing intermediate values of analytic fn evaluation.
const TupleDescriptor* _intermediate_tuple_desc;
// Tuple descriptor for storing results of analytic fn evaluation.
const TupleDescriptor* _result_tuple_desc;
// Tuple descriptor of the buffered tuple (identical to the input child tuple, which is
// assumed to come from a single SortNode). NULL if both partition_exprs and
// order_by_exprs are empty.
TupleDescriptor* _buffered_tuple_desc;
// TupleRow* composed of the first child tuple and the buffered tuple, used by
// _partition_by_eq_expr_ctx and _order_by_eq_expr_ctx. Set in prepare() if
// _buffered_tuple_desc is not NULL, allocated from _mem_pool.
TupleRow* _child_tuple_cmp_row;
// Expr context for a predicate that checks if child tuple '<' buffered tuple for
// partitioning exprs.
ExprContext* _partition_by_eq_expr_ctx;
// Expr context for a predicate that checks if child tuple '<' buffered tuple for
// order by exprs.
ExprContext* _order_by_eq_expr_ctx;
// The scope over which analytic functions are evaluated.
// TODO: Consider adding additional state to capture whether different kinds of window
// bounds need to be maintained, e.g. (_fn_scope == ROWS && _window.__isset.end_bound).
AnalyticFnScope _fn_scope;
// Offset from the current row for ROWS windows with start or end bounds specified
// with offsets. Is positive if the offset is FOLLOWING, negative if PRECEDING, and 0
// if type is CURRENT ROW or UNBOUNDED PRECEDING/FOLLOWING.
int64_t _rows_start_offset;
int64_t _rows_end_offset;
// Analytic function evaluators.
std::vector<AggFnEvaluator*> _evaluators;
// Indicates if each evaluator is the lead() fn. Used by reset_lead_fn_slots() to
// determine which slots need to be reset.
std::vector<bool> _is_lead_fn;
// If true, evaluating FIRST_VALUE requires special null handling when initializing new
// partitions determined by the offset. Set in Open() by inspecting the agg fns.
bool _has_first_val_null_offset;
long _first_val_null_offset;
// FunctionContext for each analytic function. String data returned by the analytic
// functions is allocated via these contexts.
std::vector<palo_udf::FunctionContext*> _fn_ctxs;
// Queue of tuples which are ready to be set in output rows, with the index into
// the _input_stream stream of the last TupleRow that gets the Tuple. Pairs are
// pushed onto the queue in process_child_batch() and dequeued in order in
// get_next_output_batch(). The size of _result_tuples is limited by 2 times the
// row batch size because we only process input batches if there are not enough
// result tuples to produce a single batch of output rows. In the worst case there
// may be a single result tuple per output row and _result_tuples.size() may be one
// less than the row batch size, in which case we will process another input row batch
// (inserting one result tuple per input row) before returning a row batch.
std::list<std::pair<int64_t, Tuple*> > _result_tuples;
// Index in _input_stream of the most recently added result tuple.
int64_t _last_result_idx;
// Child tuples (described by _child_tuple_desc) that are currently within the window
// and the index into _input_stream of the row they're associated with. Only used when
// window start bound is PRECEDING or FOLLOWING. Tuples in this list are deep copied
// and owned by _curr_tuple_pool.
// TODO: Remove and use BufferedTupleStream (needs support for multiple readers).
std::list<std::pair<int64_t, Tuple*> > _window_tuples;
TupleDescriptor* _child_tuple_desc;
// Pools used to allocate result tuples (added to _result_tuples and later returned)
// and window tuples (added to _window_tuples to buffer the current window). Resources
// are transferred from _curr_tuple_pool to _prev_tuple_pool once it is at least
// MAX_TUPLE_POOL_SIZE bytes. Resources from _prev_tuple_pool are transferred to an
// output row batch when all result tuples it contains have been returned and all
// window tuples it contains are no longer needed.
boost::scoped_ptr<MemPool> _curr_tuple_pool;
boost::scoped_ptr<MemPool> _prev_tuple_pool;
// The index of the last row from _input_stream associated with output row containing
// resources in _prev_tuple_pool. -1 when the pool is empty. Resources from
// _prev_tuple_pool can only be transferred to an output batch once all rows containing
// these tuples have been returned.
int64_t _prev_pool_last_result_idx;
// The index of the last row from _input_stream associated with window tuples
// containing resources in _prev_tuple_pool. -1 when the pool is empty. Resources from
// _prev_tuple_pool can only be transferred to an output batch once all rows containing
// these tuples are no longer needed (removed from the _window_tuples).
int64_t _prev_pool_last_window_idx;
// The tuple described by _intermediate_tuple_desc storing intermediate state for the
// _evaluators. When enough input rows have been consumed to produce the analytic
// function results, a result tuple (described by _result_tuple_desc) is created and
// the agg fn results are written to that tuple by calling Finalize()/GetValue()
// on the evaluators with _curr_tuple as the source tuple.
Tuple* _curr_tuple;
// A tuple described by _result_tuple_desc used when calling Finalize() on the
// _evaluators to release resources between partitions; the value is never used.
// TODO: Remove when agg fns implement a separate Close() method to release resources.
Tuple* _dummy_result_tuple;
// Index of the row in _input_stream at which the current partition started.
int64_t _curr_partition_idx;
// Previous input row used to compare partition boundaries and to determine when the
// order-by expressions change.
TupleRow* _prev_input_row;
// Current and previous input row batches from the child. RowBatches are allocated
// once and reused. Previous input row batch owns _prev_input_row between calls to
// process_child_batch(). The prev batch is Reset() after calling process_child_batch()
// and then swapped with the curr batch so the RowBatch owning _prev_input_row is
// stored in _prev_child_batch for the next call to process_child_batch().
boost::scoped_ptr<RowBatch> _prev_child_batch;
boost::scoped_ptr<RowBatch> _curr_child_batch;
// Block manager client used by _input_stream. Not owned.
// BufferedBlockMgr::Client* client_;
// Buffers input rows added in process_child_batch() until enough rows are able to
// be returned by get_next_output_batch(), in which case row batches are returned from
// the front of the stream and the underlying buffered blocks are deleted once read.
// The number of rows that must be buffered may vary from an entire partition (e.g.
// no order by clause) to a single row (e.g. ROWS windows). When the amount of
// buffered data exceeds the available memory in the underlying BufferedBlockMgr,
// _input_stream is unpinned (i.e., possibly spilled to disk if necessary).
// TODO: Consider re-pinning unpinned streams when possible.
boost::scoped_ptr<BufferedTupleStream> _input_stream;
// Pool used for O(1) allocations that live until close.
boost::scoped_ptr<MemPool> _mem_pool;
// True when there are no more input rows to consume from our child.
bool _input_eos;
// Time spent processing the child rows.
RuntimeProfile::Counter* _evaluation_timer;
};
}
#endif

View File

@ -0,0 +1,223 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/blocking_join_node.h"
#include <sstream>
#include "exprs/expr.h"
#include "runtime/row_batch.h"
#include "runtime/runtime_state.h"
#include "util/debug_util.h"
#include "util/runtime_profile.h"
#include "gen_cpp/PlanNodes_types.h"
namespace palo {
const char* BlockingJoinNode::LLVM_CLASS_NAME = "class.palo::BlockingJoinNode";
BlockingJoinNode::BlockingJoinNode(const std::string& node_name,
const TJoinOp::type join_op,
ObjectPool* pool,
const TPlanNode& tnode,
const DescriptorTbl& descs)
: ExecNode(pool, tnode, descs),
_node_name(node_name),
_join_op(join_op) {
}
Status BlockingJoinNode::init(const TPlanNode& tnode) {
return ExecNode::init(tnode);
}
BlockingJoinNode::~BlockingJoinNode() {
// _left_batch must be cleaned up in close() to ensure proper resource freeing.
DCHECK(_left_batch == NULL);
}
Status BlockingJoinNode::prepare(RuntimeState* state) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_ERROR(ExecNode::prepare(state));
_build_pool.reset(new MemPool(mem_tracker()));
_build_timer = ADD_TIMER(runtime_profile(), "BuildTime");
_left_child_timer = ADD_TIMER(runtime_profile(), "LeftChildTime");
_build_row_counter = ADD_COUNTER(runtime_profile(), "BuildRows", TUnit::UNIT);
_left_child_row_counter = ADD_COUNTER(runtime_profile(), "LeftChildRows",
TUnit::UNIT);
_result_tuple_row_size = _row_descriptor.tuple_descriptors().size() * sizeof(Tuple*);
// pre-compute the tuple index of build tuples in the output row
int num_left_tuples = child(0)->row_desc().tuple_descriptors().size();
int num_build_tuples = child(1)->row_desc().tuple_descriptors().size();
_build_tuple_size = num_build_tuples;
_build_tuple_idx.reserve(_build_tuple_size);
for (int i = 0; i < _build_tuple_size; ++i) {
TupleDescriptor* build_tuple_desc = child(1)->row_desc().tuple_descriptors()[i];
_build_tuple_idx.push_back(_row_descriptor.get_tuple_idx(build_tuple_desc->id()));
}
_probe_tuple_row_size = num_left_tuples * sizeof(Tuple*);
_build_tuple_row_size = num_build_tuples * sizeof(Tuple*);
_left_batch.reset(new RowBatch(child(0)->row_desc(), state->batch_size(), mem_tracker()));
return Status::OK;
}
Status BlockingJoinNode::close(RuntimeState* state) {
// TODO(zhaochun): avoid double close
// if (is_closed()) return Status::OK;
_left_batch.reset();
ExecNode::close(state);
return Status::OK;
}
void BlockingJoinNode::build_side_thread(RuntimeState* state, boost::promise<Status>* status) {
status->set_value(construct_build_side(state));
// Release the thread token as soon as possible (before the main thread joins
// on it). This way, if we had a chain of 10 joins using 1 additional thread,
// we'd keep the additional thread busy the whole time.
state->resource_pool()->release_thread_token(false);
}
Status BlockingJoinNode::open(RuntimeState* state) {
RETURN_IF_ERROR(ExecNode::open(state));
SCOPED_TIMER(_runtime_profile->total_time_counter());
// RETURN_IF_ERROR(Expr::open(_conjuncts, state));
RETURN_IF_CANCELLED(state);
// TODO(zhaochun)
// RETURN_IF_ERROR(state->check_query_state());
_eos = false;
// Kick-off the construction of the build-side table in a separate
// thread, so that the left child can do any initialisation in parallel.
// Only do this if we can get a thread token. Otherwise, do this in the
// main thread
boost::promise<Status> build_side_status;
if (state->resource_pool()->try_acquire_thread_token()) {
add_runtime_exec_option("Join Build-Side Prepared Asynchronously");
// Thread build_thread(_node_name, "build thread",
// bind(&BlockingJoinNode::BuildSideThread, this, state, &build_side_status));
// if (!state->cgroup().empty()) {
// RETURN_IF_ERROR(
// state->exec_env()->cgroups_mgr()->assign_thread_to_cgroup(
// build_thread, state->cgroup()));
// }
boost::thread(bind(&BlockingJoinNode::build_side_thread, this, state, &build_side_status));
} else {
build_side_status.set_value(construct_build_side(state));
}
// Open the left child so that it may perform any initialisation in parallel.
// Don't exit even if we see an error; we still need to wait for the build thread
// to finish.
Status open_status = child(0)->open(state);
// Blocks until construct_build_side() has returned, after which the build side
// structures are fully constructed.
RETURN_IF_ERROR(build_side_status.get_future().get());
// We can close the right child to release its resources because its input has been
// fully consumed.
child(1)->close(state);
RETURN_IF_ERROR(open_status);
// Seed left child in preparation for get_next().
while (true) {
RETURN_IF_ERROR(child(0)->get_next(state, _left_batch.get(), &_left_side_eos));
COUNTER_UPDATE(_left_child_row_counter, _left_batch->num_rows());
_left_batch_pos = 0;
if (_left_batch->num_rows() == 0) {
if (_left_side_eos) {
init_get_next(NULL /* eos */);
_eos = true;
break;
}
_left_batch->reset();
continue;
} else {
_current_left_child_row = _left_batch->get_row(_left_batch_pos++);
init_get_next(_current_left_child_row);
break;
}
}
return Status::OK;
}
void BlockingJoinNode::debug_string(int indentation_level, std::stringstream* out) const {
*out << std::string(indentation_level * 2, ' ');
*out << _node_name;
*out << "(eos=" << (_eos ? "true" : "false")
<< " left_batch_pos=" << _left_batch_pos;
add_to_debug_string(indentation_level, out);
ExecNode::debug_string(indentation_level, out);
*out << ")";
}
std::string BlockingJoinNode::get_left_child_row_string(TupleRow* row) {
std::stringstream out;
out << "[";
int* build_tuple_idx_ptr = &_build_tuple_idx[0];
for (int i = 0; i < row_desc().tuple_descriptors().size(); ++i) {
if (i != 0) {
out << " ";
}
int* is_build_tuple =
std::find(build_tuple_idx_ptr, build_tuple_idx_ptr + _build_tuple_size, i);
if (is_build_tuple != build_tuple_idx_ptr + _build_tuple_size) {
out << print_tuple(NULL, *row_desc().tuple_descriptors()[i]);
} else {
out << print_tuple(row->get_tuple(i), *row_desc().tuple_descriptors()[i]);
}
}
out << "]";
return out.str();
}
// This function is replaced by codegen
void BlockingJoinNode::create_output_row(TupleRow* out, TupleRow* left, TupleRow* build) {
uint8_t* out_ptr = reinterpret_cast<uint8_t*>(out);
if (left == NULL) {
memset(out_ptr, 0, _probe_tuple_row_size);
} else {
memcpy(out_ptr, left, _probe_tuple_row_size);
}
if (build == NULL) {
memset(out_ptr + _probe_tuple_row_size, 0, _build_tuple_row_size);
} else {
memcpy(out_ptr + _probe_tuple_row_size, build, _build_tuple_row_size);
}
}
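// Illustrative layout (a sketch, assuming two left tuples and one build
// tuple): a TupleRow is just an array of Tuple*, so
//   _probe_tuple_row_size == 2 * sizeof(Tuple*)
//   _build_tuple_row_size == 1 * sizeof(Tuple*)
// and create_output_row() produces
//   out: [ left[0] | left[1] | build[0] ]
// where a NULL side leaves its Tuple* slots zeroed.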
}

View File

@ -0,0 +1,138 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_BLOCKING_JOIN_NODE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_BLOCKING_JOIN_NODE_H
#include <boost/scoped_ptr.hpp>
#include <boost/thread.hpp>
#include <string>
#include "exec/exec_node.h"
#include "gen_cpp/PlanNodes_types.h"
namespace palo {
class MemPool;
class RowBatch;
class TupleRow;
// Abstract base class for join nodes that block while consuming all rows from their
// right child in open().
class BlockingJoinNode : public ExecNode {
public:
BlockingJoinNode(const std::string& node_name, const TJoinOp::type join_op,
ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual ~BlockingJoinNode();
// Subclasses should call BlockingJoinNode::init() and then perform any other init()
// work, e.g. creating expr trees.
virtual Status init(const TPlanNode& tnode);
// Subclasses should call BlockingJoinNode::prepare() and then perform any other
// prepare() work, e.g. codegen.
virtual Status prepare(RuntimeState* state);
// Open prepares the build side structures (subclasses should implement
// construct_build_side()) and then prepares for GetNext with the first left child row
// (subclasses should implement init_get_next()).
virtual Status open(RuntimeState* state);
// Subclasses should close any other structures and then call
// BlockingJoinNode::close().
virtual Status close(RuntimeState* state);
static const char* LLVM_CLASS_NAME;
private:
const std::string _node_name;
TJoinOp::type _join_op;
bool _eos; // if true, nothing left to return in get_next()
boost::scoped_ptr<MemPool> _build_pool; // holds everything referenced from build side
// _left_batch must be cleared before calling get_next(). The child node
// does not initialize all tuple ptrs in the row, only the ones that it
// is responsible for.
boost::scoped_ptr<RowBatch> _left_batch;
int _left_batch_pos; // current scan pos in _left_batch
bool _left_side_eos; // if true, left child has no more rows to process
TupleRow* _current_left_child_row;
// _build_tuple_idx[i] is the tuple index of child(1)'s tuple[i] in the output row
std::vector<int> _build_tuple_idx;
int _build_tuple_size;
// Size of the TupleRow (just the Tuple ptrs) from the build (right) and probe (left)
// sides.
int _probe_tuple_row_size;
int _build_tuple_row_size;
// byte size of result tuple row (sum of the tuple ptrs, not the tuple data).
// This should be the same size as the left child tuple row.
int _result_tuple_row_size;
RuntimeProfile::Counter* _build_timer; // time to prepare build side
RuntimeProfile::Counter* _left_child_timer; // time to process left child batch
RuntimeProfile::Counter* _build_row_counter; // num build rows
RuntimeProfile::Counter* _left_child_row_counter; // num left child rows
// Init the build-side state for a new left child row (e.g. hash table iterator or list
// iterator) given the first row. Used in open() to prepare for get_next().
// A NULL ptr for first_left_child_row indicates the left child eos.
virtual void init_get_next(TupleRow* first_left_child_row) = 0;
// We parallelize construction of the build side with opening the
// left child. If, for example, the left child is another
// join node, it can start to build its own build side at the
// same time.
virtual Status construct_build_side(RuntimeState* state) = 0;
// Gives subclasses an opportunity to add debug output to the debug string printed by
// debug_string().
virtual void add_to_debug_string(int indentation_level, std::stringstream* out) const {
}
// Subclasses should not override, use add_to_debug_string() to add to the result.
virtual void debug_string(int indentation_level, std::stringstream* out) const;
// Returns a debug string for the left child's 'row'. Such rows have tuple ptrs that
// may be uninitialized; the left child only populates the tuple ptrs it is responsible
// for. This function outputs just the row values and leaves the build
// side values as NULL.
// This is only used for debugging and outputting the left child rows before
// doing the join.
std::string get_left_child_row_string(TupleRow* row);
// Write combined row, consisting of the left child's 'left_row' and right child's
// 'build_row' to 'out_row'.
// This is replaced by codegen.
void create_output_row(TupleRow* out_row, TupleRow* left_row, TupleRow* build_row);
friend class CrossJoinNode;
private:
// Supervises ConstructBuildSide in a separate thread, and returns its status in the
// promise parameter.
void build_side_thread(RuntimeState* state, boost::promise<Status>* status);
};
}
#endif

View File

@ -0,0 +1,227 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/broker_reader.h"
#include <sstream>
#include "common/logging.h"
#include "gen_cpp/PaloBrokerService_types.h"
#include "gen_cpp/TPaloBrokerService.h"
#include "runtime/broker_mgr.h"
#include "runtime/client_cache.h"
#include "runtime/exec_env.h"
#include "runtime/runtime_state.h"
#include "util/thrift_util.h"
namespace palo {
// Broker
BrokerReader::BrokerReader(
RuntimeState* state,
const std::vector<TNetworkAddress>& broker_addresses,
const std::map<std::string, std::string>& properties,
const std::string& path,
int64_t start_offset) :
_state(state),
_addresses(broker_addresses),
_properties(properties),
_path(path),
_cur_offset(start_offset),
_is_fd_valid(false),
_eof(false),
_addr_idx(0) {
}
BrokerReader::~BrokerReader() {
close();
}
#ifdef BE_TEST
inline BrokerServiceClientCache* client_cache(RuntimeState* state) {
static BrokerServiceClientCache s_client_cache;
return &s_client_cache;
}
inline const std::string& client_id(RuntimeState* state, const TNetworkAddress& addr) {
static std::string s_client_id = "palo_unit_test";
return s_client_id;
}
#else
inline BrokerServiceClientCache* client_cache(RuntimeState* state) {
return state->exec_env()->broker_client_cache();
}
inline const std::string& client_id(RuntimeState* state, const TNetworkAddress& addr) {
return state->exec_env()->broker_mgr()->get_client_id(addr);
}
#endif
Status BrokerReader::open() {
TBrokerOpenReaderRequest request;
const TNetworkAddress& broker_addr = _addresses[_addr_idx];
request.__set_version(TBrokerVersion::VERSION_ONE);
request.__set_path(_path);
request.__set_startOffset(_cur_offset);
request.__set_clientId(client_id(_state, broker_addr));
request.__set_properties(_properties);
TBrokerOpenReaderResponse response;
try {
Status status;
// 500ms is enough
BrokerServiceConnection client(client_cache(_state), broker_addr, 500, &status);
if (!status.ok()) {
LOG(WARNING) << "Create broker client failed. broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return status;
}
try {
client->openReader(response, request);
} catch (apache::thrift::transport::TTransportException& e) {
RETURN_IF_ERROR(client.reopen());
client->openReader(response, request);
}
} catch (apache::thrift::TException& e) {
std::stringstream ss;
ss << "Open broker reader failed, broker:" << broker_addr << " failed:" << e.what();
LOG(WARNING) << ss.str();
return Status(TStatusCode::THRIFT_RPC_ERROR, ss.str(), false);
}
if (response.opStatus.statusCode != TBrokerOperationStatusCode::OK) {
std::stringstream ss;
ss << "Open broker reader failed, broker:" << broker_addr
<< " failed:" << response.opStatus.message;
LOG(WARNING) << ss.str();
return Status(ss.str());
}
_fd = response.fd;
_is_fd_valid = true;
return Status::OK;
}
Status BrokerReader::read(uint8_t* buf, size_t* buf_len, bool* eof) {
if (_eof) {
*eof = true;
return Status::OK;
}
const TNetworkAddress& broker_addr = _addresses[_addr_idx];
TBrokerPReadRequest request;
request.__set_version(TBrokerVersion::VERSION_ONE);
request.__set_fd(_fd);
request.__set_offset(_cur_offset);
request.__set_length(*buf_len);
TBrokerReadResponse response;
try {
Status status;
// 500ms is enough
BrokerServiceConnection client(client_cache(_state), broker_addr, 500, &status);
if (!status.ok()) {
LOG(WARNING) << "Create broker client failed. broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return status;
}
try {
client->pread(response, request);
} catch (apache::thrift::transport::TTransportException& e) {
RETURN_IF_ERROR(client.reopen());
client->pread(response, request);
}
} catch (apache::thrift::TException& e) {
std::stringstream ss;
ss << "Read from broker failed, broker:" << broker_addr << " failed:" << e.what();
LOG(WARNING) << ss.str();
return Status(TStatusCode::THRIFT_RPC_ERROR, ss.str(), false);
}
if (response.opStatus.statusCode == TBrokerOperationStatusCode::END_OF_FILE) {
// read the end of broker's file
*eof = _eof = true;
return Status::OK;
} else if (response.opStatus.statusCode != TBrokerOperationStatusCode::OK) {
std::stringstream ss;
ss << "Read from broker failed, broker:" << broker_addr
<< " failed:" << response.opStatus.message;
LOG(WARNING) << ss.str();
return Status(ss.str());
}
*buf_len = response.data.size();
memcpy(buf, response.data.data(), *buf_len);
_cur_offset += *buf_len;
*eof = false;
return Status::OK;
}
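// Usage sketch (hypothetical caller; names are illustrative only):
//   BrokerReader reader(state, addresses, properties, path, 0);
//   RETURN_IF_ERROR(reader.open());
//   uint8_t buf[4096];
//   bool eof = false;
//   while (!eof) {
//       size_t len = sizeof(buf);  // in: buffer capacity, out: bytes read
//       RETURN_IF_ERROR(reader.read(buf, &len, &eof));
//       // consume 'len' bytes of 'buf' here
//   }
//   reader.close();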
void BrokerReader::close() {
if (!_is_fd_valid) {
return;
}
TBrokerCloseReaderRequest request;
request.__set_version(TBrokerVersion::VERSION_ONE);
request.__set_fd(_fd);
const TNetworkAddress& broker_addr = _addresses[_addr_idx];
TBrokerOperationStatus response;
try {
Status status;
// 500ms is enough
BrokerServiceConnection client(client_cache(_state), broker_addr, 500, &status);
if (!status.ok()) {
LOG(WARNING) << "Create broker client failed. broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return;
}
try {
client->closeReader(response, request);
} catch (apache::thrift::transport::TTransportException& e) {
status = client.reopen();
if (!status.ok()) {
LOG(WARNING) << "Close broker reader failed. broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return;
}
client->closeReader(response, request);
}
} catch (apache::thrift::TException& e) {
LOG(WARNING) << "Close broker reader failed, broker:" << broker_addr
<< " failed:" << e.what();
return;
}
if (response.statusCode != TBrokerOperationStatusCode::OK) {
LOG(WARNING) << "Open broker reader failed, broker:" << broker_addr
<< " failed:" << response.message;
return;
}
_is_fd_valid = false;
}
}

View File

@ -0,0 +1,71 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#pragma once
#include <stdint.h>
#include <string>
#include <map>
#include "common/status.h"
#include "exec/file_reader.h"
#include "gen_cpp/Types_types.h"
#include "gen_cpp/PaloBrokerService_types.h"
namespace palo {
class RuntimeState;
class TBrokerRangeDesc;
class TNetworkAddress;
// Reader of broker file
class BrokerReader : public FileReader {
public:
BrokerReader(RuntimeState* state,
const std::vector<TNetworkAddress>& broker_addresses,
const std::map<std::string, std::string>& properties,
const std::string& path,
int64_t start_offset);
virtual ~BrokerReader();
Status open();
// Reads up to *buf_len bytes into buf; on return, *buf_len is the number of
// bytes actually read and *eof indicates end of file.
virtual Status read(uint8_t* buf, size_t* buf_len, bool* eof) override;
virtual void close() override;
private:
RuntimeState* _state;
const std::vector<TNetworkAddress>& _addresses;
const std::map<std::string, std::string>& _properties;
const std::string& _path;
int64_t _cur_offset;
bool _is_fd_valid;
TBrokerFD _fd;
bool _eof;
int _addr_idx;
};
}

View File

@ -0,0 +1,472 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/broker_scan_node.h"
#include <chrono>
#include <sstream>
#include "common/object_pool.h"
#include "runtime/runtime_state.h"
#include "runtime/row_batch.h"
#include "runtime/dpp_sink_internal.h"
#include "exec/broker_scanner.h"
#include "exprs/expr.h"
#include "util/debug_util.h"
#include "util/runtime_profile.h"
namespace palo {
BrokerScanNode::BrokerScanNode(
ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs) :
ScanNode(pool, tnode, descs),
_tuple_id(tnode.broker_scan_node.tuple_id),
_runtime_state(nullptr),
_tuple_desc(nullptr),
_num_running_scanners(0),
_scan_finished(false),
_max_buffered_batches(1024),
_wait_scanner_timer(nullptr) {
}
BrokerScanNode::~BrokerScanNode() {
}
// We compare by partition range here. This should not be a member function of the
// PartitionInfo class because that class contains other members as well.
static bool compare_part_use_range(const PartitionInfo* v1, const PartitionInfo* v2) {
return v1->range() < v2->range();
}
Status BrokerScanNode::init(const TPlanNode& tnode) {
RETURN_IF_ERROR(ScanNode::init(tnode));
auto& broker_scan_node = tnode.broker_scan_node;
if (broker_scan_node.__isset.partition_exprs) {
// ASSERT broker_scan_node.__isset.partition_infos == true
RETURN_IF_ERROR(Expr::create_expr_trees(
_pool, broker_scan_node.partition_exprs, &_partition_expr_ctxs));
for (auto& t_partition_info : broker_scan_node.partition_infos) {
PartitionInfo* info = _pool->add(new PartitionInfo());
RETURN_IF_ERROR(PartitionInfo::from_thrift(_pool, t_partition_info, info));
_partition_infos.emplace_back(info);
}
// partitions should be in ascending order
std::sort(_partition_infos.begin(),
_partition_infos.end(),
compare_part_use_range);
}
return Status::OK;
}
Status BrokerScanNode::prepare(RuntimeState* state) {
VLOG_QUERY << "BrokerScanNode prepare";
RETURN_IF_ERROR(ScanNode::prepare(state));
// get tuple desc
_runtime_state = state;
_tuple_desc = state->desc_tbl().get_tuple_descriptor(_tuple_id);
if (_tuple_desc == nullptr) {
std::stringstream ss;
ss << "Failed to get tuple descriptor, _tuple_id=" << _tuple_id;
return Status(ss.str());
}
// Initialize slots map
for (auto slot : _tuple_desc->slots()) {
auto pair = _slots_map.emplace(slot->col_name(), slot);
if (!pair.second) {
std::stringstream ss;
ss << "Failed to insert slot, col_name=" << slot->col_name();
return Status(ss.str());
}
}
// prepare partition
if (_partition_expr_ctxs.size() > 0) {
RETURN_IF_ERROR(Expr::prepare(
_partition_expr_ctxs, state, row_desc(), expr_mem_tracker()));
for (auto iter : _partition_infos) {
RETURN_IF_ERROR(iter->prepare(state, row_desc(), expr_mem_tracker()));
}
}
// Profile
_wait_scanner_timer = ADD_TIMER(runtime_profile(), "WaitScannerTime");
return Status::OK;
}
Status BrokerScanNode::open(RuntimeState* state) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_ERROR(ExecNode::open(state));
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::OPEN));
RETURN_IF_CANCELLED(state);
// Open partition
if (_partition_expr_ctxs.size() > 0) {
RETURN_IF_ERROR(Expr::open(_partition_expr_ctxs, state));
for (auto iter : _partition_infos) {
RETURN_IF_ERROR(iter->open(state));
}
}
RETURN_IF_ERROR(start_scanners());
return Status::OK;
}
Status BrokerScanNode::start_scanners() {
{
std::unique_lock<std::mutex> l(_batch_queue_lock);
_num_running_scanners = 1;
}
_scanner_threads.emplace_back(&BrokerScanNode::scanner_worker, this, 0, _scan_ranges.size());
return Status::OK;
}
Status BrokerScanNode::get_next(RuntimeState* state, RowBatch* row_batch, bool* eos) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
// check if CANCELLED.
if (state->is_cancelled()) {
std::unique_lock<std::mutex> l(_batch_queue_lock);
if (update_status(Status::CANCELLED)) {
// Notify all scanners
_queue_writer_cond.notify_all();
}
}
if (_scan_finished.load()) {
*eos = true;
return Status::OK;
}
std::shared_ptr<RowBatch> scanner_batch;
{
std::unique_lock<std::mutex> l(_batch_queue_lock);
while (_process_status.ok() &&
!_runtime_state->is_cancelled() &&
_num_running_scanners > 0 &&
_batch_queue.empty()) {
SCOPED_TIMER(_wait_scanner_timer);
_queue_reader_cond.wait_for(l, std::chrono::seconds(1));
}
if (!_process_status.ok()) {
// Some scanner process failed.
return _process_status;
}
if (_runtime_state->is_cancelled()) {
if (update_status(Status::CANCELLED)) {
_queue_writer_cond.notify_all();
}
return _process_status;
}
if (!_batch_queue.empty()) {
scanner_batch = _batch_queue.front();
_batch_queue.pop_front();
}
}
// All scanners have finished and all cached batches have been read
if (scanner_batch == nullptr) {
_scan_finished.store(true);
*eos = true;
return Status::OK;
}
// notify one scanner
_queue_writer_cond.notify_one();
// get scanner's batch memory
row_batch->acquire_state(scanner_batch.get());
_num_rows_returned += row_batch->num_rows();
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
// This is the first time we reach the limit.
// Only relevant for queries like 'SELECT * FROM table1 LIMIT 20'.
if (reached_limit()) {
int num_rows_over = _num_rows_returned - _limit;
row_batch->set_num_rows(row_batch->num_rows() - num_rows_over);
_num_rows_returned -= num_rows_over;
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
_scan_finished.store(true);
_queue_writer_cond.notify_all();
*eos = true;
} else {
*eos = false;
}
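// Worked example (illustrative): with _limit = 20, 15 rows already returned
// and a 10-row batch, num_rows_over = (15 + 10) - 20 = 5, so the batch is
// trimmed to 5 rows and _num_rows_returned ends at exactly 20.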
if (VLOG_ROW_IS_ON) {
for (int i = 0; i < row_batch->num_rows(); ++i) {
TupleRow* row = row_batch->get_row(i);
VLOG_ROW << "BrokerScanNode output row: "
<< print_tuple(row->get_tuple(0), *_tuple_desc);
}
}
return Status::OK;
}
Status BrokerScanNode::close(RuntimeState* state) {
if (is_closed()) {
return Status::OK;
}
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::CLOSE));
SCOPED_TIMER(_runtime_profile->total_time_counter());
_scan_finished.store(true);
_queue_writer_cond.notify_all();
_queue_reader_cond.notify_all();
for (int i = 0; i < _scanner_threads.size(); ++i) {
_scanner_threads[i].join();
}
// Close partition exprs and partition infos
if (_partition_expr_ctxs.size() > 0) {
Expr::close(_partition_expr_ctxs, state);
for (auto iter : _partition_infos) {
iter->close(state);
}
}
// Release any batches left in the queue
_batch_queue.clear();
return ExecNode::close(state);
}
// This function is called after the plan node has been prepared.
Status BrokerScanNode::set_scan_ranges(const std::vector<TScanRangeParams>& scan_ranges) {
_scan_ranges = scan_ranges;
// Now we initialize partition information
if (_partition_expr_ctxs.size() > 0) {
for (auto& range : _scan_ranges) {
auto& params = range.scan_range.broker_scan_range.params;
if (params.__isset.partition_ids) {
std::sort(params.partition_ids.begin(), params.partition_ids.end());
}
}
}
return Status::OK;
}
void BrokerScanNode::debug_string(int ident_level, std::stringstream* out) const {
(*out) << "BrokerScanNode";
}
Status BrokerScanNode::scanner_scan(
const TBrokerScanRange& scan_range,
const std::vector<ExprContext*>& conjunct_ctxs,
const std::vector<ExprContext*>& partition_expr_ctxs,
BrokerScanCounter* counter) {
std::unique_ptr<BrokerScanner> scanner(new BrokerScanner(
_runtime_state,
runtime_profile(),
scan_range.params,
scan_range.ranges,
scan_range.broker_addresses,
counter));
RETURN_IF_ERROR(scanner->open());
bool scanner_eof = false;
while (!scanner_eof) {
// Fill one row batch
std::shared_ptr<RowBatch> row_batch(
new RowBatch(row_desc(), _runtime_state->batch_size(), mem_tracker()));
// create new tuple buffer for row_batch
MemPool* tuple_pool = row_batch->tuple_data_pool();
int tuple_buffer_size = row_batch->capacity() * _tuple_desc->byte_size();
void* tuple_buffer = tuple_pool->allocate(tuple_buffer_size);
if (tuple_buffer == nullptr) {
return Status("Allocate memory for row batch failed.");
}
Tuple* tuple = reinterpret_cast<Tuple*>(tuple_buffer);
while (!scanner_eof) {
RETURN_IF_CANCELLED(_runtime_state);
// If all work has finished
if (_scan_finished.load()) {
return Status::OK;
}
// This row batch is full; break out to deliver it
if (row_batch->is_full()) {
break;
}
int row_idx = row_batch->add_row();
TupleRow* row = row_batch->get_row(row_idx);
// The tuple produced by this scan node is the first tuple of the tuple row
row->set_tuple(0, tuple);
memset(tuple, 0, _tuple_desc->num_null_bytes());
// Get from scanner
RETURN_IF_ERROR(scanner->get_next(tuple, tuple_pool, &scanner_eof));
if (scanner_eof) {
continue;
}
if (scan_range.params.__isset.partition_ids) {
int64_t partition_id = get_partition_id(partition_expr_ctxs, row);
if (partition_id == -1 ||
!std::binary_search(scan_range.params.partition_ids.begin(),
scan_range.params.partition_ids.end(),
partition_id)) {
counter->num_rows_filtered++;
std::stringstream error_msg;
error_msg << "No corresponding partition, partition id: " << partition_id;
_runtime_state->append_error_msg_to_file(print_tuple(tuple, *_tuple_desc),
error_msg.str());
continue;
}
}
// eval conjuncts of this row.
if (eval_conjuncts(&conjunct_ctxs[0], conjunct_ctxs.size(), row)) {
row_batch->commit_last_row();
char* new_tuple = reinterpret_cast<char*>(tuple);
new_tuple += _tuple_desc->byte_size();
tuple = reinterpret_cast<Tuple*>(new_tuple);
counter->num_rows_returned++;
} else {
counter->num_rows_filtered++;
}
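// Buffer-layout note (illustrative): tuple_buffer holds capacity() fixed-size
// tuples back to back, e.g. capacity 1024 and byte_size() 24 give a 24576-byte
// buffer. Committing a row advances 'tuple' by byte_size() to the next slot;
// a filtered row reuses the same slot.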
}
// This row batch has rows; push it to the queue
if (row_batch->num_rows() > 0) {
std::unique_lock<std::mutex> l(_batch_queue_lock);
while (_process_status.ok() &&
!_scan_finished.load() &&
!_runtime_state->is_cancelled() &&
_batch_queue.size() >= _max_buffered_batches) {
_queue_writer_cond.wait_for(l, std::chrono::seconds(1));
}
// The process status has already been set to failed, so just return OK
if (!_process_status.ok()) {
return Status::OK;
}
// Scan already finished, just return
if (_scan_finished.load()) {
return Status::OK;
}
// Runtime state is canceled, just return cancel
if (_runtime_state->is_cancelled()) {
return Status::CANCELLED;
}
// Queue size must be smaller than _max_buffered_batches
_batch_queue.push_back(row_batch);
// Notify the reader that a new batch is available
_queue_reader_cond.notify_one();
}
}
return Status::OK;
}
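// Producer/consumer sketch (illustrative summary of the code above and of
// get_next()): scanner_scan() produces into _batch_queue, bounded by
// _max_buffered_batches, while get_next() consumes from it:
//   producer: lock _batch_queue_lock; while (queue full) wait on
//             _queue_writer_cond; push_back(batch); notify _queue_reader_cond
//   consumer: lock _batch_queue_lock; while (queue empty && scanners running)
//             wait on _queue_reader_cond; pop_front(); notify _queue_writer_cond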
void BrokerScanNode::scanner_worker(int start_idx, int length) {
// Clone expr context
std::vector<ExprContext*> scanner_expr_ctxs;
auto status = Expr::clone_if_not_exists(_conjunct_ctxs, _runtime_state, &scanner_expr_ctxs);
if (!status.ok()) {
LOG(WARNING) << "Clone conjuncts failed.";
}
std::vector<ExprContext*> partition_expr_ctxs;
if (status.ok()) {
status = Expr::clone_if_not_exists(
_partition_expr_ctxs, _runtime_state, &partition_expr_ctxs);
if (!status.ok()) {
LOG(WARNING) << "Clone conjuncts failed.";
}
}
BrokerScanCounter counter;
for (int i = 0; i < length && status.ok(); ++i) {
const TBrokerScanRange& scan_range =
_scan_ranges[start_idx + i].scan_range.broker_scan_range;
status = scanner_scan(scan_range, scanner_expr_ctxs, partition_expr_ctxs, &counter);
if (!status.ok()) {
LOG(WARNING) << "Scanner[" << start_idx + i << "] prcess failed. status="
<< status.get_error_msg();
}
}
// Update stats
_runtime_state->update_num_rows_load_success(counter.num_rows_returned);
_runtime_state->update_num_rows_load_filtered(counter.num_rows_filtered);
// scanner is going to finish
{
std::lock_guard<std::mutex> l(_batch_queue_lock);
if (!status.ok()) {
update_status(status);
}
// This scanner will finish
_num_running_scanners--;
}
_queue_reader_cond.notify_all();
// If one scanner failed, others don't need scan any more
if (!status.ok()) {
_queue_writer_cond.notify_all();
}
Expr::close(scanner_expr_ctxs, _runtime_state);
Expr::close(partition_expr_ctxs, _runtime_state);
}
int64_t BrokerScanNode::binary_find_partition_id(const PartRangeKey& key) const {
int low = 0;
int high = _partition_infos.size() - 1;
while (low <= high) {
int mid = low + (high - low) / 2;
int cmp = _partition_infos[mid]->range().compare_key(key);
if (cmp == 0) {
return _partition_infos[mid]->id();
} else if (cmp < 0) { // partition[mid] < key, search the upper half
low = mid + 1;
} else {
high = mid - 1;
}
}
return -1;
}
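// Illustrative example (a sketch, assuming compare_key() returns 0 when the
// key falls inside the range and < 0 when the range lies below the key): with
// partitions sorted as [MIN, 10) -> id 1, [10, 20) -> id 2, [20, MAX) -> id 3,
// a key of 15 first probes mid == 1, compare_key() returns 0, and id 2 is
// returned; a key not covered by any range returns -1.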
int64_t BrokerScanNode::get_partition_id(
const std::vector<ExprContext*>& partition_expr_ctxs, TupleRow* row) const {
if (_partition_infos.size() == 0) {
return -1;
}
// construct a PartRangeKey
PartRangeKey part_key;
// use binary search to get the right partition.
ExprContext* ctx = partition_expr_ctxs[0];
void* partition_val = ctx->get_value(row);
if (partition_val != nullptr) {
PartRangeKey::from_value(ctx->root()->type().type, partition_val, &part_key);
} else {
part_key = PartRangeKey::neg_infinite();
}
return binary_find_partition_id(part_key);
}
}

View File

@ -0,0 +1,135 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#pragma once
#include <atomic>
#include <condition_variable>
#include <map>
#include <string>
#include <vector>
#include <mutex>
#include <thread>
#include "common/status.h"
#include "exec/scan_node.h"
#include "gen_cpp/PaloInternalService_types.h"
namespace palo {
class RuntimeState;
class PartRangeKey;
class PartitionInfo;
class BrokerScanCounter;
class BrokerScanNode : public ScanNode {
public:
BrokerScanNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual ~BrokerScanNode();
// Called after create this scan node
virtual Status init(const TPlanNode& tnode) override;
// Resolve the tuple descriptor and prepare partition information.
virtual Status prepare(RuntimeState* state) override;
// Start the broker scan by launching the scanner worker threads.
virtual Status open(RuntimeState* state) override;
// Fill the next row batch with rows taken from the scanners' batch queue.
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos) override;
// Stop the scanner threads, release partition resources, and report errors.
virtual Status close(RuntimeState* state) override;
// Sets the scan ranges to process; called after the node has been prepared.
virtual Status set_scan_ranges(const std::vector<TScanRangeParams>& scan_ranges) override;
// Called by broker scanners to get_partition_id
// If there is no partition information, return -1
// Return partition id if we find the partition match this row,
// return -1, if there is no such partition.
int64_t get_partition_id(
const std::vector<ExprContext*>& partition_exprs, TupleRow* row) const;
protected:
// Write debug string of this into out.
virtual void debug_string(int indentation_level, std::stringstream* out) const override;
private:
// Update the process status to a failed status.
// NOTE: The caller must hold the mutex of this scan node.
bool update_status(const Status& new_status) {
if (_process_status.ok()) {
_process_status = new_status;
return true;
}
return false;
}
// Create scanners to do scan job
Status start_scanners();
// One scanner worker. This scanner handles 'length' ranges starting from start_idx
void scanner_worker(int start_idx, int length);
// Scan one range
Status scanner_scan(const TBrokerScanRange& scan_range,
const std::vector<ExprContext*>& conjunct_ctxs,
const std::vector<ExprContext*>& partition_expr_ctxs,
BrokerScanCounter* counter);
// Find partition id with PartRangeKey
int64_t binary_find_partition_id(const PartRangeKey& key) const;
private:
TupleId _tuple_id;
RuntimeState* _runtime_state;
TupleDescriptor* _tuple_desc;
std::map<std::string, SlotDescriptor*> _slots_map;
std::vector<TScanRangeParams> _scan_ranges;
std::mutex _batch_queue_lock;
std::condition_variable _queue_reader_cond;
std::condition_variable _queue_writer_cond;
std::deque<std::shared_ptr<RowBatch>> _batch_queue;
int _num_running_scanners;
// Indicates whether all scanner workers have finished
bool _all_scanners_finished;
std::atomic<bool> _scan_finished;
Status _process_status;
std::vector<std::thread> _scanner_threads;
int _max_buffered_batches;
// Partition information
std::vector<ExprContext*> _partition_expr_ctxs;
std::vector<PartitionInfo*> _partition_infos;
// Profile information
RuntimeProfile::Counter* _wait_scanner_timer;
};
}

View File

@ -0,0 +1,615 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/broker_scanner.h"
#include <sstream>
#include <iostream>
#include "runtime/descriptors.h"
#include "runtime/mem_tracker.h"
#include "runtime/raw_value.h"
#include "runtime/tuple.h"
#include "exprs/expr.h"
#include "exec/text_converter.h"
#include "exec/text_converter.hpp"
#include "exec/plain_text_line_reader.h"
#include "exec/local_file_reader.h"
#include "exec/broker_reader.h"
#include "exec/decompressor.h"
namespace palo {
class Slice {
public:
Slice(const uint8_t* data, int size) : _data(data), _size(size) { }
Slice(const std::string& str) : _data((const uint8_t*)str.data()), _size(str.size()) { }
~Slice() {
// No need to delete _data: it only points into memory owned elsewhere
// (e.g. a std::string), which is released along with its owner.
}
int size() const {
return _size;
}
const uint8_t* data() const {
return _data;
}
const uint8_t* end() const {
return _data + _size;
}
private:
friend std::ostream& operator<<(std::ostream& os, const Slice& slice);
const uint8_t* _data;
int _size;
};
std::ostream& operator<<(std::ostream& os, const Slice& slice) {
os << std::string((const char*)slice._data, slice._size);
return os;
}
BrokerScanner::BrokerScanner(RuntimeState* state,
RuntimeProfile* profile,
const TBrokerScanRangeParams& params,
const std::vector<TBrokerRangeDesc>& ranges,
const std::vector<TNetworkAddress>& broker_addresses,
BrokerScanCounter* counter) :
_state(state),
_profile(profile),
_params(params),
_ranges(ranges),
_broker_addresses(broker_addresses),
// _splittable(params.splittable),
_value_separator(params.column_separator),
_line_delimiter(params.line_delimiter),
_cur_file_reader(nullptr),
_cur_line_reader(nullptr),
_cur_decompressor(nullptr),
_next_range(0),
_cur_line_reader_eof(false),
_scanner_eof(false),
_skip_next_line(false),
_src_tuple(nullptr),
_src_tuple_row(nullptr),
_mem_pool(_state->instance_mem_tracker()),
_dest_tuple_desc(nullptr),
_mem_tracker(new MemTracker(-1, "Broker Scanner", state->instance_mem_tracker())),
_counter(counter),
_rows_read_counter(nullptr),
_read_timer(nullptr),
_materialize_timer(nullptr) {
}
BrokerScanner::~BrokerScanner() {
close();
}
Status BrokerScanner::init_expr_ctxes() {
// Construct _src_slot_descs
const TupleDescriptor* src_tuple_desc =
_state->desc_tbl().get_tuple_descriptor(_params.src_tuple_id);
if (src_tuple_desc == nullptr) {
std::stringstream ss;
ss << "Unknown source tuple descriptor, tuple_id=" << _params.src_tuple_id;
return Status(ss.str());
}
std::map<SlotId, SlotDescriptor*> src_slot_desc_map;
for (auto slot_desc : src_tuple_desc->slots()) {
src_slot_desc_map.emplace(slot_desc->id(), slot_desc);
}
for (auto slot_id : _params.src_slot_ids) {
auto it = src_slot_desc_map.find(slot_id);
if (it == std::end(src_slot_desc_map)) {
std::stringstream ss;
ss << "Unknown source slot descriptor, slot_id=" << slot_id;
return Status(ss.str());
}
_src_slot_descs.emplace_back(it->second);
}
// Construct source tuple and tuple row
_src_tuple = (Tuple*) _mem_pool.allocate(src_tuple_desc->byte_size());
_src_tuple_row = (TupleRow*) _mem_pool.allocate(sizeof(Tuple*));
_src_tuple_row->set_tuple(0, _src_tuple);
_row_desc.reset(new RowDescriptor(_state->desc_tbl(),
std::vector<TupleId>({_params.src_tuple_id}),
std::vector<bool>({false})));
// Construct dest slots information
_dest_tuple_desc = _state->desc_tbl().get_tuple_descriptor(_params.dest_tuple_id);
if (_dest_tuple_desc == nullptr) {
std::stringstream ss;
ss << "Unknown dest tuple descriptor, tuple_id=" << _params.dest_tuple_id;
return Status(ss.str());
}
for (auto slot_desc : _dest_tuple_desc->slots()) {
if (!slot_desc->is_materialized()) {
continue;
}
auto it = _params.expr_of_dest_slot.find(slot_desc->id());
if (it == std::end(_params.expr_of_dest_slot)) {
std::stringstream ss;
ss << "No expr for dest slot, id=" << slot_desc->id()
<< ", name=" << slot_desc->col_name();
return Status(ss.str());
}
ExprContext* ctx = nullptr;
RETURN_IF_ERROR(Expr::create_expr_tree(_state->obj_pool(), it->second, &ctx));
RETURN_IF_ERROR(ctx->prepare(_state, *_row_desc.get(), _mem_tracker.get()));
RETURN_IF_ERROR(ctx->open(_state));
_dest_expr_ctx.emplace_back(ctx);
}
return Status::OK;
}
Status BrokerScanner::open() {
RETURN_IF_ERROR(init_expr_ctxes());
_text_converter.reset(new(std::nothrow) TextConverter('\\'));
if (_text_converter == nullptr) {
return Status("Failed to allocate TextConverter.");
}
_rows_read_counter = ADD_COUNTER(_profile, "RowsRead", TUnit::UNIT);
_read_timer = ADD_TIMER(_profile, "TotalRawReadTime(*)");
_materialize_timer = ADD_TIMER(_profile, "MaterializeTupleTime(*)");
return Status::OK;
}
Status BrokerScanner::get_next(Tuple* tuple, MemPool* tuple_pool, bool* eof) {
SCOPED_TIMER(_read_timer);
// Get one line
while (!_scanner_eof) {
if (_cur_line_reader == nullptr || _cur_line_reader_eof) {
RETURN_IF_ERROR(open_next_reader());
// If there are no more ranges to read, _scanner_eof is set and the loop exits
if (_scanner_eof) {
continue;
}
}
const uint8_t* ptr = nullptr;
size_t size = 0;
RETURN_IF_ERROR(_cur_line_reader->read_line(
&ptr, &size, &_cur_line_reader_eof));
if (_skip_next_line) {
_skip_next_line = false;
continue;
}
if (size == 0) {
// Empty line, just skip it
continue;
}
{
COUNTER_UPDATE(_rows_read_counter, 1);
SCOPED_TIMER(_materialize_timer);
if (convert_one_row(Slice(ptr, size), tuple, tuple_pool)) {
break;
}
}
}
*eof = _scanner_eof;
return Status::OK;
}
Status BrokerScanner::open_next_reader() {
if (_next_range >= _ranges.size()) {
_scanner_eof = true;
return Status::OK;
}
RETURN_IF_ERROR(open_file_reader());
RETURN_IF_ERROR(open_line_reader());
_next_range++;
return Status::OK;
}
Status BrokerScanner::open_file_reader() {
if (_cur_file_reader != nullptr) {
delete _cur_file_reader;
_cur_file_reader = nullptr;
}
const TBrokerRangeDesc& range = _ranges[_next_range];
int64_t start_offset = range.start_offset;
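// If the range does not start at the head of the file, open the reader one
// byte earlier: that byte still belongs to the previous range's last line, so
// the (possibly partial) first line read here is skipped via _skip_next_line.
// E.g. a range starting at offset 100 opens at 99, and everything up to the
// first line delimiter is discarded.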
if (start_offset != 0) {
start_offset -= 1;
}
switch (range.file_type) {
case TFileType::FILE_LOCAL: {
LocalFileReader* file_reader = new LocalFileReader(range.path, start_offset);
RETURN_IF_ERROR(file_reader->open());
_cur_file_reader = file_reader;
break;
}
case TFileType::FILE_BROKER: {
BrokerReader* broker_reader = new BrokerReader(
_state, _broker_addresses, _params.properties, range.path, start_offset);
RETURN_IF_ERROR(broker_reader->open());
_cur_file_reader = broker_reader;
break;
}
default: {
std::stringstream ss;
ss << "Unknown file type, type=" << range.file_type;
return Status(ss.str());
}
}
return Status::OK;
}
Status BrokerScanner::create_decompressor(TFileFormatType::type type) {
if (_cur_decompressor != nullptr) {
delete _cur_decompressor;
_cur_decompressor = nullptr;
}
CompressType compress_type;
switch (type) {
case TFileFormatType::FORMAT_CSV_PLAIN:
compress_type = CompressType::UNCOMPRESSED;
break;
case TFileFormatType::FORMAT_CSV_GZ:
compress_type = CompressType::GZIP;
break;
case TFileFormatType::FORMAT_CSV_BZ2:
compress_type = CompressType::BZIP2;
break;
case TFileFormatType::FORMAT_CSV_LZ4FRAME:
compress_type = CompressType::LZ4FRAME;
break;
case TFileFormatType::FORMAT_CSV_LZOP:
compress_type = CompressType::LZOP;
break;
default: {
std::stringstream ss;
ss << "Unknown format type, type=" << type;
return Status(ss.str());
}
}
RETURN_IF_ERROR(Decompressor::create_decompressor(
compress_type, &_cur_decompressor));
return Status::OK;
}
Status BrokerScanner::open_line_reader() {
if (_cur_decompressor != nullptr) {
delete _cur_decompressor;
_cur_decompressor = nullptr;
}
if (_cur_line_reader != nullptr) {
delete _cur_line_reader;
_cur_line_reader = nullptr;
}
const TBrokerRangeDesc& range = _ranges[_next_range];
int64_t size = range.size;
if (range.start_offset != 0) {
if (range.format_type != TFileFormatType::FORMAT_CSV_PLAIN) {
std::stringstream ss;
ss << "For now we do not support split compressed file";
return Status(ss.str());
}
size += 1;
_skip_next_line = true;
} else {
_skip_next_line = false;
}
// Create the decompressor.
// _cur_decompressor may stay nullptr if this is not a compressed file.
RETURN_IF_ERROR(create_decompressor(range.format_type));
// open line reader
switch (range.format_type) {
case TFileFormatType::FORMAT_CSV_PLAIN:
case TFileFormatType::FORMAT_CSV_GZ:
case TFileFormatType::FORMAT_CSV_BZ2:
case TFileFormatType::FORMAT_CSV_LZ4FRAME:
case TFileFormatType::FORMAT_CSV_LZOP:
_cur_line_reader = new PlainTextLineReader(
_profile,
_cur_file_reader, _cur_decompressor,
size, _line_delimiter);
break;
default: {
std::stringstream ss;
ss << "Unknown format type, type=" << range.format_type;
return Status(ss.str());
}
}
_cur_line_reader_eof = false;
return Status::OK;
}
void BrokerScanner::close() {
if (_cur_decompressor != nullptr) {
delete _cur_decompressor;
_cur_decompressor = nullptr;
}
if (_cur_line_reader != nullptr) {
delete _cur_line_reader;
_cur_line_reader = nullptr;
}
if (_cur_file_reader != nullptr) {
delete _cur_file_reader;
_cur_file_reader = nullptr;
}
Expr::close(_dest_expr_ctx, _state);
}
void BrokerScanner::split_line(
const Slice& line, std::vector<Slice>* values) {
// line-begin char and line-end char are considered to be 'delimiter'
const uint8_t* value = line.data();
const uint8_t* ptr = line.data();
for (size_t i = 0; i < line.size(); ++i, ++ptr) {
if (*ptr == _value_separator) {
values->emplace_back(value, ptr - value);
value = ptr + 1;
}
}
values->emplace_back(value, ptr - value);
}
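// Illustrative behavior, assuming _value_separator == ',':
//   "a,,b"  -> {"a", "", "b"}
//   "a,"    -> {"a", ""}   (the line end acts as a delimiter)
//   ""      -> {""}        (an empty line yields one empty value)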
void BrokerScanner::fill_fix_length_string(
const Slice& value, MemPool* pool,
char** new_value_p, const int new_value_length) {
if (new_value_length != 0 && value.size() < new_value_length) {
*new_value_p = reinterpret_cast<char*>(pool->allocate(new_value_length));
// 'value' is guaranteed not to be nullptr
memcpy(*new_value_p, value.data(), value.size());
for (int i = value.size(); i < new_value_length; ++i) {
(*new_value_p)[i] = '\0';
}
}
}
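// Worked example (illustrative): for a CHAR(5) column and input "ab", five
// bytes are allocated from 'pool' and filled as "ab\0\0\0", so the
// fixed-length slot is fully initialized. Inputs already at or above the
// schema length are left untouched (*new_value_p is not set).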
// The following formats are accepted:
// .123 1.23 123. -1.23
// ATTN: the decimal point and (for negative numbers) the '-' sign are not counted
// in the length checks. E.g. '.123' is regarded as '0.123' and matches decimal(3, 3).
bool BrokerScanner::check_decimal_input(
const Slice& slice,
int precision, int scale,
std::stringstream* error_msg) {
const uint8_t* value = slice.data();
int value_length = slice.size();
if (value_length > (precision + 2)) {
(*error_msg) << "the length of decimal value is overflow. "
<< "precision in schema: (" << precision << ", " << scale << "); "
<< "value: [" << slice << "]; "
<< "str actual length: " << value_length << ";";
return false;
}
// ignore leading spaces and trailing spaces
int begin_index = 0;
while (begin_index < value_length && std::isspace(value[begin_index])) {
++begin_index;
}
int end_index = value_length - 1;
while (end_index >= begin_index && std::isspace(value[end_index])) {
--end_index;
}
if (value[begin_index] == '+' || value[begin_index] == '-') {
++begin_index;
}
int point_index = -1;
for (int i = begin_index; i <= end_index; ++i) {
if (value[i] == '.') {
point_index = i;
}
}
int value_int_len = 0;
int value_frac_len = 0;
if (point_index == -1) {
// an integer value, like 123
value_int_len = end_index - begin_index + 1;
value_frac_len = 0;
} else {
value_int_len = point_index - begin_index;
value_frac_len = end_index - point_index;
}
if (value_int_len > (precision - scale)) {
(*error_msg) << "the length of the integer part is longer than schema precision ["
<< precision << "]. "
<< "value [" << slice << "]. ";
return false;
} else if (value_frac_len > scale) {
(*error_msg) << "the length of the fraction part is longer than schema scale ["
<< scale << "]. "
<< "value [" << slice << "]. ";
return false;
}
return true;
}
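// Worked example (illustrative) for decimal(5, 2), i.e. precision 5, scale 2:
//   "123.45"  passes: integer part 3 <= 5 - 2, fraction part 2 <= 2
//   "1234.5"  fails:  integer part 4 >  5 - 2
//   ".123"    fails:  fraction part 3 >  2 (treated as 0.123)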
bool is_null(const Slice& slice) {
return slice.size() == 2 &&
slice.data()[0] == '\\' &&
slice.data()[1] == 'N';
}
// Writes a slot in 'tuple' from a value containing text data.
bool BrokerScanner::write_slot(
const std::string& column_name, const TColumnType& column_type,
const Slice& value, const SlotDescriptor* slot,
Tuple* tuple, MemPool* tuple_pool,
std::stringstream* error_msg) {
if (value.size() == 0 && !slot->type().is_string_type()) {
(*error_msg) << "the length of input should not be 0. "
<< "column_name: " << column_name << "; "
<< "type: " << slot->type();
return false;
}
char* value_to_convert = (char*)value.data();
int value_to_convert_length = value.size();
// Fill all the spaces if it is 'TYPE_CHAR' type
if (slot->type().is_string_type()) {
int char_len = column_type.len;
if (value.size() > char_len) {
(*error_msg) << "the length of input is too long than schema. "
<< "column_name: " << column_name << "; "
<< "input_str: [" << value << "] "
<< "type: " << slot->type() << "; "
<< "schema length: " << char_len << "; "
<< "actual length: " << value.size() << "; ";
return false;
}
if (slot->type().type == TYPE_CHAR && value.size() < char_len) {
if (!is_null(value)) {
fill_fix_length_string(
value, tuple_pool,
&value_to_convert, char_len);
value_to_convert_length = char_len;
}
}
} else if (slot->type().is_decimal_type()) {
bool is_success = check_decimal_input(
value, column_type.precision, column_type.scale, error_msg);
if (is_success == false) {
return false;
}
}
if (!_text_converter->write_slot(
slot, tuple, value_to_convert, value_to_convert_length,
true, false, tuple_pool)) {
(*error_msg) << "convert csv string to "
<< slot->type() << " failed. "
<< "column_name: " << column_name << "; "
<< "input_str: [" << value << "]; ";
return false;
}
return true;
}
// Convert one row to this tuple
bool BrokerScanner::convert_one_row(
const Slice& line,
Tuple* tuple, MemPool* tuple_pool) {
if (!line_to_src_tuple(line)) {
return false;
}
return fill_dest_tuple(line, tuple, tuple_pool);
}
// Split one text line into the source tuple's string slots
bool BrokerScanner::line_to_src_tuple(const Slice& line) {
std::vector<Slice> values;
{
split_line(line, &values);
}
if (values.size() < _src_slot_descs.size()) {
std::stringstream error_msg;
error_msg << "actual column number is less than schema column number. "
<< "actual number: " << values.size() << " sep: " << _value_separator << ", "
<< "schema number: " << _src_slot_descs.size() << "; ";
_state->append_error_msg_to_file(std::string((const char*)line.data(), line.size()),
error_msg.str());
_counter->num_rows_filtered++;
return false;
} else if (values.size() > _src_slot_descs.size()) {
std::stringstream error_msg;
error_msg << "actual column number is more than schema column number. "
<< "actual number: " << values.size() << " sep: " << _value_separator << ", "
<< "schema number: " << _src_slot_descs.size() << "; ";
_state->append_error_msg_to_file(std::string((const char*)line.data(), line.size()),
error_msg.str());
_counter->num_rows_filtered++;
return false;
}
for (int i = 0; i < values.size(); ++i) {
auto slot_desc = _src_slot_descs[i];
const Slice& value = values[i];
if (slot_desc->is_nullable() && is_null(value)) {
_src_tuple->set_null(slot_desc->null_indicator_offset());
continue;
}
_src_tuple->set_not_null(slot_desc->null_indicator_offset());
void* slot = _src_tuple->get_slot(slot_desc->tuple_offset());
StringValue* str_slot = reinterpret_cast<StringValue*>(slot);
str_slot->ptr = (char*)value.data();
str_slot->len = value.size();
}
return true;
}
bool BrokerScanner::fill_dest_tuple(const Slice& line, Tuple* dest_tuple, MemPool* mem_pool) {
int ctx_idx = 0;
for (auto slot_desc : _dest_tuple_desc->slots()) {
if (!slot_desc->is_materialized()) {
continue;
}
ExprContext* ctx = _dest_expr_ctx[ctx_idx++];
void* value = ctx->get_value(_src_tuple_row);
if (value == nullptr) {
if (slot_desc->is_nullable()) {
dest_tuple->set_null(slot_desc->null_indicator_offset());
continue;
} else {
std::stringstream error_msg;
error_msg << "column(" << slot_desc->col_name() << ") value is null";
_state->append_error_msg_to_file(
std::string((const char*)line.data(), line.size()), error_msg.str());
_counter->num_rows_filtered++;
return false;
}
}
dest_tuple->set_not_null(slot_desc->null_indicator_offset());
void* slot = dest_tuple->get_slot(slot_desc->tuple_offset());
RawValue::write(value, slot, slot_desc->type(), mem_pool);
}
return true;
}
}

View File

@ -0,0 +1,166 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#pragma once
#include <memory>
#include <vector>
#include <string>
#include <map>
#include <sstream>
#include "common/status.h"
#include "gen_cpp/PlanNodes_types.h"
#include "gen_cpp/Types_types.h"
#include "runtime/mem_pool.h"
#include "util/runtime_profile.h"
namespace palo {
class Tuple;
class SlotDescriptor;
class Slice;
class TextConverter;
class FileReader;
class LineReader;
class Decompressor;
class RuntimeState;
class ExprContext;
class TupleDescriptor;
class TupleRow;
class RowDescriptor;
class MemTracker;
class RuntimeProfile;
struct BrokerScanCounter {
BrokerScanCounter() : num_rows_returned(0), num_rows_filtered(0) {
}
int64_t num_rows_returned;
int64_t num_rows_filtered;
};
// The broker scanner converts data read from the broker into palo tuples.
class BrokerScanner {
public:
BrokerScanner(
RuntimeState* state,
RuntimeProfile* profile,
const TBrokerScanRangeParams& params,
const std::vector<TBrokerRangeDesc>& ranges,
const std::vector<TNetworkAddress>& broker_addresses,
BrokerScanCounter* counter);
~BrokerScanner();
// Open this scanner; initializes the information needed before reading
Status open();
// Get next tuple
Status get_next(Tuple* tuple, MemPool* tuple_pool, bool* eof);
// Close this scanner
void close();
private:
Status open_file_reader();
Status create_decompressor(TFileFormatType::type type);
Status open_line_reader();
// Open file/line readers for the next range
Status open_next_reader();
// Split one text line to values
void split_line(
const Slice& line, std::vector<Slice>* values);
// Writes a slot in 'tuple' from a value containing text data.
bool write_slot(
const std::string& column_name, const TColumnType& column_type,
const Slice& value, const SlotDescriptor* slot,
Tuple* tuple, MemPool* tuple_pool, std::stringstream* error_msg);
void fill_fix_length_string(
const Slice& value, MemPool* pool,
char** new_value_p, int new_value_length);
bool check_decimal_input(
const Slice& value,
int precision, int scale,
std::stringstream* error_msg);
// Convert one text line to one tuple
// 'line' is the csv text line; the converted result is written into 'tuple'
bool convert_one_row(const Slice& line, Tuple* tuple, MemPool* tuple_pool);
Status init_expr_ctxes();
bool line_to_src_tuple(const Slice& line);
bool fill_dest_tuple(const Slice& line, Tuple* dest_tuple, MemPool* mem_pool);
private:
RuntimeState* _state;
RuntimeProfile* _profile;
const TBrokerScanRangeParams& _params;
const std::vector<TBrokerRangeDesc>& _ranges;
const std::vector<TNetworkAddress>& _broker_addresses;
std::unique_ptr<TextConverter> _text_converter;
uint8_t _value_separator;
uint8_t _line_delimiter;
// Reader
FileReader* _cur_file_reader;
LineReader* _cur_line_reader;
Decompressor* _cur_decompressor;
int _next_range;
bool _cur_line_reader_eof;
bool _scanner_eof;
// When the fetched range does not start at offset 0,
// we read from one byte ahead and skip the first (possibly partial) line
bool _skip_next_line;
// Used for constructing tuple
// slots for value read from broker file
std::vector<SlotDescriptor*> _src_slot_descs;
std::unique_ptr<RowDescriptor> _row_desc;
Tuple* _src_tuple;
TupleRow* _src_tuple_row;
// Mem pool used to allocate _src_tuple and _src_tuple_row
MemPool _mem_pool;
// Dest tuple descriptor and dest expr context
const TupleDescriptor* _dest_tuple_desc;
std::vector<ExprContext*> _dest_expr_ctx;
std::unique_ptr<MemTracker> _mem_tracker;
// Used for processing statistics
BrokerScanCounter* _counter;
// Profile
RuntimeProfile::Counter* _rows_read_counter;
RuntimeProfile::Counter* _read_timer;
RuntimeProfile::Counter* _materialize_timer;
};
}

View File

@ -0,0 +1,238 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/broker_writer.h"
#include <sstream>
#include "common/logging.h"
#include "gen_cpp/PaloBrokerService_types.h"
#include "gen_cpp/TPaloBrokerService.h"
#include "runtime/broker_mgr.h"
#include "runtime/client_cache.h"
#include "runtime/exec_env.h"
#include "runtime/runtime_state.h"
#include "util/thrift_util.h"
namespace palo {
BrokerWriter::BrokerWriter(
RuntimeState* state,
const std::vector<TNetworkAddress>& broker_addresses,
const std::map<std::string, std::string>& properties,
const std::string& path,
int64_t start_offset) :
_state(state),
_addresses(broker_addresses),
_properties(properties),
_path(path),
_cur_offset(start_offset),
_is_closed(false),
_addr_idx(0) {
}
BrokerWriter::~BrokerWriter() {
close();
}
#ifdef BE_TEST
inline BrokerServiceClientCache* client_cache(RuntimeState* state) {
static BrokerServiceClientCache s_client_cache;
return &s_client_cache;
}
inline const std::string& client_id(RuntimeState* state, const TNetworkAddress& addr) {
static std::string s_client_id = "palo_unit_test";
return s_client_id;
}
#else
inline BrokerServiceClientCache* client_cache(RuntimeState* state) {
return state->exec_env()->broker_client_cache();
}
inline const std::string& client_id(RuntimeState* state, const TNetworkAddress& addr) {
return state->exec_env()->broker_mgr()->get_client_id(addr);
}
#endif
Status BrokerWriter::open() {
TBrokerOpenWriterRequest request;
const TNetworkAddress& broker_addr = _addresses[_addr_idx];
request.__set_version(TBrokerVersion::VERSION_ONE);
request.__set_path(_path);
request.__set_openMode(TBrokerOpenMode::APPEND);
request.__set_clientId(client_id(_state, broker_addr));
request.__set_properties(_properties);
VLOG_ROW << "debug: send broker open writer request: "
<< apache::thrift::ThriftDebugString(request).c_str();
TBrokerOpenWriterResponse response;
try {
Status status;
// 500ms is enough
BrokerServiceConnection client(client_cache(_state), broker_addr, 500, &status);
if (!status.ok()) {
LOG(WARNING) << "Create broker writer client failed. "
<< "broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return status;
}
try {
client->openWriter(response, request);
} catch (apache::thrift::transport::TTransportException& e) {
RETURN_IF_ERROR(client.reopen());
client->openWriter(response, request);
}
} catch (apache::thrift::TException& e) {
std::stringstream ss;
ss << "Open broker writer failed, broker:" << broker_addr << " failed:" << e.what();
LOG(WARNING) << ss.str();
return Status(TStatusCode::THRIFT_RPC_ERROR, ss.str(), false);
}
VLOG_ROW << "debug: send broker open writer response: "
<< apache::thrift::ThriftDebugString(response).c_str();
if (response.opStatus.statusCode != TBrokerOperationStatusCode::OK) {
std::stringstream ss;
ss << "Open broker writer failed, broker:" << broker_addr
<< " failed:" << response.opStatus.message;
LOG(WARNING) << ss.str();
return Status(ss.str());
}
_fd = response.fd;
return Status::OK;
}
Status BrokerWriter::write(const uint8_t* buf, size_t buf_len, size_t* written_len) {
if (buf_len == 0) {
*written_len = 0;
return Status::OK;
}
const TNetworkAddress& broker_addr = _addresses[_addr_idx];
TBrokerPWriteRequest request;
request.__set_version(TBrokerVersion::VERSION_ONE);
request.__set_fd(_fd);
request.__set_offset(_cur_offset);
request.__set_data(std::string(reinterpret_cast<const char*>(buf), buf_len));
VLOG_ROW << "debug: send broker pwrite request: "
<< apache::thrift::ThriftDebugString(request).c_str();
TBrokerOperationStatus response;
try {
Status status;
// 500ms is enough
BrokerServiceConnection client(client_cache(_state), broker_addr, 500, &status);
if (!status.ok()) {
LOG(WARNING) << "Create broker write client failed. "
<< "broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return status;
}
try {
client->pwrite(response, request);
} catch (apache::thrift::transport::TTransportException& e) {
RETURN_IF_ERROR(client.reopen());
client->pwrite(response, request);
}
} catch (apache::thrift::TException& e) {
std::stringstream ss;
ss << "Fail to write to broker, broker:" << broker_addr << " failed:" << e.what();
LOG(WARNING) << ss.str();
return Status(TStatusCode::THRIFT_RPC_ERROR, ss.str(), false);
}
VLOG_ROW << "debug: send broker pwrite response: "
<< apache::thrift::ThriftDebugString(response).c_str();
if (response.statusCode != TBrokerOperationStatusCode::OK) {
std::stringstream ss;
ss << "Fail to write to broker, broker:" << broker_addr
<< " msg:" << response.message;
LOG(WARNING) << ss.str();
return Status(ss.str());
}
*written_len = buf_len;
_cur_offset += buf_len;
return Status::OK;
}
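// Typical call pattern (illustrative sketch; 'state', 'addrs', 'props', the
// destination path and 'buf'/'buf_len' are assumed to be provided by the caller):
//
//   BrokerWriter writer(state, addrs, props, "/tmp/output.dat", 0);
//   RETURN_IF_ERROR(writer.open());
//   size_t written = 0;
//   RETURN_IF_ERROR(writer.write(buf, buf_len, &written));
//   writer.close();  // also invoked by the destructor if omitted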
void BrokerWriter::close() {
if (_is_closed) {
return;
}
TBrokerCloseWriterRequest request;
request.__set_version(TBrokerVersion::VERSION_ONE);
request.__set_fd(_fd);
VLOG_ROW << "debug: send broker close writer request: "
<< apache::thrift::ThriftDebugString(request).c_str();
const TNetworkAddress& broker_addr = _addresses[_addr_idx];
TBrokerOperationStatus response;
try {
Status status;
// 500ms is enough
BrokerServiceConnection client(client_cache(_state), broker_addr, 500, &status);
if (!status.ok()) {
LOG(WARNING) << "Create broker write client failed. broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return;
}
try {
client->closeWriter(response, request);
} catch (apache::thrift::transport::TTransportException& e) {
status = client.reopen();
if (!status.ok()) {
LOG(WARNING) << "Close broker writer failed. broker=" << broker_addr
<< ", status=" << status.get_error_msg();
return;
}
client->closeWriter(response, request);
}
} catch (apache::thrift::TException& e) {
LOG(WARNING) << "Close broker writer failed, broker:" << broker_addr
<< " msg:" << e.what();
return;
}
VLOG_ROW << "debug: send broker close writer response: "
<< apache::thrift::ThriftDebugString(response).c_str();
if (response.statusCode != TBrokerOperationStatusCode::OK) {
LOG(WARNING) << "Close broker writer failed, broker:" << broker_addr
<< " msg:" << response.message;
return;
}
_is_closed = true;
}
} // end namespace palo

View File

@ -0,0 +1,72 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_EXEC_BROKER_WRITER_H
#define BDG_PALO_BE_SRC_EXEC_BROKER_WRITER_H
#include <stdint.h>
#include <string>
#include <map>
#include "common/status.h"
#include "exec/file_writer.h"
#include "gen_cpp/Types_types.h"
#include "gen_cpp/PaloBrokerService_types.h"
namespace palo {
class RuntimeState;
class TBrokerRangeDesc;
class TNetworkAddress;
// Writer that writes data to a file through the broker
class BrokerWriter : public FileWriter {
public:
BrokerWriter(RuntimeState* state,
const std::vector<TNetworkAddress>& broker_addresses,
const std::map<std::string, std::string>& properties,
const std::string& path,
int64_t start_offset);
virtual ~BrokerWriter();
virtual Status open() override;
virtual Status write(const uint8_t* buf, size_t buf_len, size_t* written_len) override;
virtual void close() override;
private:
RuntimeState* _state;
const std::vector<TNetworkAddress>& _addresses;
const std::map<std::string, std::string>& _properties;
std::string _path;
int64_t _cur_offset;
bool _is_closed;
TBrokerFD _fd;
// TODO: use for retry if one broker down
int _addr_idx;
};
} // end namespace palo
#endif // BDG_PALO_BE_SRC_EXEC_BROKER_WRITER_H

View File

@ -0,0 +1,205 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/cross_join_node.h"
#include <sstream>
#include "exprs/expr.h"
#include "gen_cpp/PlanNodes_types.h"
#include "runtime/row_batch.h"
#include "runtime/runtime_state.h"
#include "util/debug_util.h"
#include "util/runtime_profile.h"
namespace palo {
CrossJoinNode::CrossJoinNode(
ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs)
: BlockingJoinNode("CrossJoinNode", TJoinOp::CROSS_JOIN, pool, tnode, descs) {
}
Status CrossJoinNode::prepare(RuntimeState* state) {
DCHECK(_join_op == TJoinOp::CROSS_JOIN);
RETURN_IF_ERROR(BlockingJoinNode::prepare(state));
_build_batch_pool.reset(new ObjectPool());
return Status::OK;
}
Status CrossJoinNode::close(RuntimeState* state) {
// avoid double close
if (is_closed()) {
return Status::OK;
}
_build_batches.reset();
_build_batch_pool.reset();
BlockingJoinNode::close(state);
return Status::OK;
}
Status CrossJoinNode::construct_build_side(RuntimeState* state) {
// Do a full scan of child(1) and store all build row batches.
RETURN_IF_ERROR(child(1)->open(state));
while (true) {
RowBatch* batch = _build_batch_pool->add(
new RowBatch(child(1)->row_desc(), state->batch_size(), mem_tracker()));
RETURN_IF_CANCELLED(state);
// TODO(zhaochun):
// RETURN_IF_ERROR(state->CheckQueryState());
bool eos = true;
RETURN_IF_ERROR(child(1)->get_next(state, batch, &eos));
// to prevent using too much memory
RETURN_IF_LIMIT_EXCEEDED(state);
SCOPED_TIMER(_build_timer);
_build_batches.add_row_batch(batch);
VLOG_ROW << build_list_debug_string();
COUNTER_SET(_build_row_counter,
static_cast<int64_t>(_build_batches.total_num_rows()));
if (eos) {
break;
}
}
return Status::OK;
}
void CrossJoinNode::init_get_next(TupleRow* first_left_row) {
_current_build_row = _build_batches.iterator();
}
Status CrossJoinNode::get_next(RuntimeState* state, RowBatch* output_batch, bool* eos) {
// RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::GETNEXT, state));
RETURN_IF_CANCELLED(state);
// TODO(zhaochun)
// RETURN_IF_ERROR(state->check_query_state());
SCOPED_TIMER(_runtime_profile->total_time_counter());
if (reached_limit() || _eos) {
*eos = true;
return Status::OK;
}
ScopedTimer<MonotonicStopWatch> timer(_left_child_timer);
while (!_eos) {
// Compute max rows that should be added to output_batch
int64_t max_added_rows = output_batch->capacity() - output_batch->num_rows();
if (limit() != -1) {
max_added_rows = std::min(max_added_rows, limit() - rows_returned());
}
// Continue processing this row batch
_num_rows_returned +=
process_left_child_batch(output_batch, _left_batch.get(), max_added_rows);
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
if (reached_limit() || output_batch->is_full()) {
*eos = reached_limit();
break;
}
// Check to see if we're done processing the current left child batch
if (_current_build_row.at_end() && _left_batch_pos == _left_batch->num_rows()) {
_left_batch->transfer_resource_ownership(output_batch);
_left_batch_pos = 0;
if (output_batch->is_full()) {
break;
}
if (_left_side_eos) {
*eos = _eos = true;
break;
} else {
timer.stop();
RETURN_IF_ERROR(child(0)->get_next(state, _left_batch.get(), &_left_side_eos));
timer.start();
COUNTER_UPDATE(_left_child_row_counter, _left_batch->num_rows());
}
}
}
return Status::OK;
}
std::string CrossJoinNode::build_list_debug_string() {
std::stringstream out;
out << "BuildList(";
out << _build_batches.debug_string(child(1)->row_desc());
out << ")";
return out.str();
}
// TODO: this can be replaced with a codegen'd function
int CrossJoinNode::process_left_child_batch(RowBatch* output_batch, RowBatch* batch,
int max_added_rows) {
int row_idx = output_batch->add_rows(max_added_rows);
DCHECK(row_idx != RowBatch::INVALID_ROW_INDEX);
uint8_t* output_row_mem = reinterpret_cast<uint8_t*>(output_batch->get_row(row_idx));
TupleRow* output_row = reinterpret_cast<TupleRow*>(output_row_mem);
int rows_returned = 0;
ExprContext* const* ctxs = &_conjunct_ctxs[0];
int ctx_size = _conjunct_ctxs.size();
while (true) {
while (!_current_build_row.at_end()) {
create_output_row(output_row, _current_left_child_row, _current_build_row.get_row());
_current_build_row.next();
if (!eval_conjuncts(ctxs, ctx_size, output_row)) {
continue;
}
++rows_returned;
// Filled up out batch or hit limit
if (UNLIKELY(rows_returned == max_added_rows)) {
output_batch->commit_rows(rows_returned);
return rows_returned;
}
// Advance to next out row
output_row_mem += output_batch->row_byte_size();
output_row = reinterpret_cast<TupleRow*>(output_row_mem);
}
DCHECK(_current_build_row.at_end());
// Advance to the next row in the left child batch
if (UNLIKELY(_left_batch_pos == batch->num_rows())) {
output_batch->commit_rows(rows_returned);
return rows_returned;
}
_current_left_child_row = batch->get_row(_left_batch_pos++);
_current_build_row = _build_batches.iterator();
}
output_batch->commit_rows(rows_returned);
return rows_returned;
}
}

View File

@ -0,0 +1,81 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_CROSS_JOIN_NODE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_CROSS_JOIN_NODE_H
#include <boost/scoped_ptr.hpp>
#include <boost/unordered_set.hpp>
#include <boost/thread.hpp>
#include <string>
#include "exec/exec_node.h"
#include "exec/blocking_join_node.h"
#include "exec/row_batch_list.h"
#include "runtime/descriptors.h"
#include "runtime/mem_pool.h"
#include "gen_cpp/PlanNodes_types.h"
namespace palo {
class RowBatch;
class TupleRow;
// Node for cross joins.
// Iterates over the left child rows and then the right child rows and, for
// each combination, writes the output row if the conjuncts are satisfied. The
// build batches are kept in a list that is fully constructed from the right child in
// construct_build_side() (called by BlockingJoinNode::open()) while rows are fetched from
// the left child as necessary in get_next().
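// In pseudocode, get_next() drives a nested loop:
//
//   for each left row L (fetched batch by batch from child(0)):
//       for each build row R in _build_batches:
//           row = (L, R)
//           if all conjuncts pass on row: emit row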
class CrossJoinNode : public BlockingJoinNode {
public:
CrossJoinNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual Status prepare(RuntimeState* state);
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos);
virtual Status close(RuntimeState* state);
protected:
virtual void init_get_next(TupleRow* first_left_row);
virtual Status construct_build_side(RuntimeState* state);
private:
// Object pool owning the build RowBatches referenced by _build_batches
boost::scoped_ptr<ObjectPool> _build_batch_pool;
// List of build batches, constructed in construct_build_side()
RowBatchList _build_batches;
RowBatchList::TupleRowIterator _current_build_row;
// Processes a batch from the left child.
// output_batch: the batch for resulting tuple rows
// batch: the batch from the left child to process. This function can be called to
// continue processing a batch in the middle
// max_added_rows: maximum rows that can be added to output_batch
// return the number of rows added to output_batch
int process_left_child_batch(RowBatch* output_batch, RowBatch* batch, int max_added_rows);
// Returns a debug string for _build_batches. This is used for debugging during the
// build list construction and before doing the join.
std::string build_list_debug_string();
};
}
#endif

View File

@ -0,0 +1,679 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "csv_scan_node.h"
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/filesystem.hpp>
#include <boost/foreach.hpp>
#include <thrift/protocol/TDebugProtocol.h>
#include "exec/text_converter.hpp"
#include "gen_cpp/PlanNodes_types.h"
#include "runtime/runtime_state.h"
#include "runtime/row_batch.h"
#include "runtime/string_value.h"
#include "runtime/tuple_row.h"
#include "util/file_utils.h"
#include "util/runtime_profile.h"
#include "util/debug_util.h"
#include "util/hash_util.hpp"
#include "olap/olap_common.h"
#include "olap/utils.h"
namespace palo {
class StringRef {
public:
StringRef(char const* const begin, int const size) :
_begin(begin), _size(size) {
}
~StringRef() {
// No need to delete _begin: it only points into a std::string,
// whose buffer is released along with the std::string object.
}
int size() const {
return _size;
}
int length() const {
return _size;
}
char const* c_str() const {
return _begin;
}
char const* begin() const {
return _begin;
}
char const* end() const {
return _begin + _size;
}
private:
char const* _begin;
int _size;
};
void split_line(const std::string& str, char delimiter, std::vector<StringRef>& result) {
enum State {
IN_DELIM = 1,
IN_TOKEN = 0
};
// line-begin char and line-end char are considered to be 'delimiter'
State state = IN_DELIM;
char const* p_begin = str.c_str(); // Begin of either a token or a delimiter
for (std::string::const_iterator it = str.begin(); it != str.end(); ++it) {
State const new_state = (*it == delimiter ? IN_DELIM : IN_TOKEN);
if (new_state != state) {
if (new_state == IN_DELIM) {
result.push_back(StringRef(p_begin, &*it - p_begin));
}
p_begin = &*it;
} else if (new_state == IN_DELIM) {
result.push_back(StringRef(&*p_begin, 0));
p_begin = &*it;
}
state = new_state;
}
result.push_back(StringRef(p_begin, (&*str.end() - p_begin) - state));
}
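// Illustrative behavior with delimiter ',':
//   "a,,b" -> {"a", "", "b"}
//   ",a"   -> {"", "a"}   (the line begin acts as a delimiter)
//   "a,"   -> {"a", ""}   (so does the line end)
// Callers are expected to filter out empty lines before calling this.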
CsvScanNode::CsvScanNode(
ObjectPool* pool,
const TPlanNode& tnode,
const DescriptorTbl& descs) :
ScanNode(pool, tnode, descs),
_tuple_id(tnode.csv_scan_node.tuple_id),
_file_paths(tnode.csv_scan_node.file_paths),
_column_separator(tnode.csv_scan_node.column_separator),
_column_type_map(tnode.csv_scan_node.column_type_mapping),
_column_function_map(tnode.csv_scan_node.column_function_mapping),
_columns(tnode.csv_scan_node.columns),
_unspecified_columns(tnode.csv_scan_node.unspecified_columns),
_default_values(tnode.csv_scan_node.default_values),
_is_init(false),
_tuple_desc(nullptr),
_tuple_pool(nullptr),
_text_converter(nullptr),
_hll_column_num(0) {
LOG(INFO) << "csv scan node: " << apache::thrift::ThriftDebugString(tnode).c_str();
}
CsvScanNode::~CsvScanNode() {
// do nothing
}
Status CsvScanNode::init(const TPlanNode& tnode) {
return ExecNode::init(tnode);
}
Status CsvScanNode::prepare(RuntimeState* state) {
VLOG(1) << "CsvScanNode::Prepare";
if (_is_init) {
return Status::OK;
}
if (nullptr == state) {
return Status("input runtime_state pointer is nullptr.");
}
RETURN_IF_ERROR(ScanNode::prepare(state));
// add timer
_split_check_timer = ADD_TIMER(_runtime_profile, "split check timer");
_split_line_timer = ADD_TIMER(_runtime_profile, "split line timer");
_tuple_desc = state->desc_tbl().get_tuple_descriptor(_tuple_id);
if (nullptr == _tuple_desc) {
return Status("Failed to get tuple descriptor.");
}
_slot_num = _tuple_desc->slots().size();
const OlapTableDescriptor* csv_table =
static_cast<const OlapTableDescriptor*>(_tuple_desc->table_desc());
if (nullptr == csv_table) {
return Status("csv table pointer is nullptr.");
}
// <column_name, slot_descriptor>
for (int i = 0; i < _slot_num; ++i) {
SlotDescriptor* slot = _tuple_desc->slots()[i];
const std::string& column_name = slot->col_name();
if (slot->type().type == TYPE_HLL) {
TMiniLoadEtlFunction& function = _column_function_map[column_name];
if (check_hll_function(function) == false) {
return Status("Function name or param error.");
}
_hll_column_num++;
}
// NOTE: not all the columns in '_columns' exist in the table schema
if (_columns.end() != std::find(_columns.begin(), _columns.end(), column_name)) {
_column_slot_map[column_name] = slot;
} else {
_column_slot_map[column_name] = nullptr;
}
// add 'unspecified_columns' which have default values
if (_unspecified_columns.end() != std::find(
_unspecified_columns.begin(),
_unspecified_columns.end(),
column_name)) {
_column_slot_map[column_name] = slot;
}
}
_column_type_vec.resize(_columns.size());
for (int i = 0; i < _columns.size(); ++i) {
const std::string& column_name = _columns[i];
SlotDescriptor* slot = _column_slot_map[column_name];
_column_slot_vec.push_back(slot);
if (slot != nullptr) {
_column_type_vec[i] = _column_type_map[column_name];
}
}
for (int i = 0; i < _default_values.size(); ++i) {
const std::string& column_name = _unspecified_columns[i];
SlotDescriptor* slot = _column_slot_map[column_name];
_unspecified_column_slot_vec.push_back(slot);
_unspecified_column_type_vec.push_back(_column_type_map[column_name]);
}
// new one scanner
_csv_scanner.reset(new(std::nothrow) CsvScanner(_file_paths));
if (_csv_scanner.get() == nullptr) {
return Status("new a csv scanner failed.");
}
_tuple_pool.reset(new(std::nothrow) MemPool(state->instance_mem_tracker()));
if (_tuple_pool.get() == nullptr) {
return Status("new a mem pool failed.");
}
_text_converter.reset(new(std::nothrow) TextConverter('\\'));
if (_text_converter.get() == nullptr) {
return Status("new a text convertor failed.");
}
_is_init = true;
return Status::OK;
}
Status CsvScanNode::open(RuntimeState* state) {
RETURN_IF_ERROR(ExecNode::open(state));
VLOG(1) << "CsvScanNode::Open";
if (nullptr == state) {
return Status("input pointer is nullptr.");
}
if (!_is_init) {
return Status("used before initialize.");
}
_runtime_state = state;
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::OPEN));
RETURN_IF_CANCELLED(state);
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_ERROR(_csv_scanner->open());
return Status::OK;
}
Status CsvScanNode::get_next(RuntimeState* state, RowBatch* row_batch, bool* eos) {
VLOG(1) << "CsvScanNode::GetNext";
if (nullptr == state || nullptr == row_batch || nullptr == eos) {
return Status("input is nullptr pointer");
}
if (!_is_init) {
return Status("used before initialize.");
}
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::GETNEXT));
RETURN_IF_CANCELLED(state);
SCOPED_TIMER(_runtime_profile->total_time_counter());
SCOPED_TIMER(materialize_tuple_timer());
if (reached_limit()) {
*eos = true;
return Status::OK;
}
// create new tuple buffer for row_batch
int tuple_buffer_size = row_batch->capacity() * _tuple_desc->byte_size();
void* tuple_buffer = _tuple_pool->allocate(tuple_buffer_size);
if (nullptr == tuple_buffer) {
return Status("Allocate memory failed.");
}
_tuple = reinterpret_cast<Tuple*>(tuple_buffer);
memset(_tuple, 0, _tuple_desc->num_null_bytes());
// Indicates whether there are more rows to process.
bool csv_eos = false;
// NOTE: unlike MySQL, we need to check correctness.
while (!csv_eos) {
RETURN_IF_CANCELLED(state);
if (reached_limit() || row_batch->is_full()) {
// hang on to last allocated chunk in pool, we'll keep writing into it in the
// next get_next() call
row_batch->tuple_data_pool()->acquire_data(_tuple_pool.get(), !reached_limit());
*eos = reached_limit();
return Status::OK;
}
// read csv
std::string line;
RETURN_IF_ERROR(_csv_scanner->get_next_row(&line, &csv_eos));
//VLOG_ROW << "line readed: [" << line << "]";
if (line.empty()) {
continue;
}
// split & check line & fill default value
bool is_success = split_check_fill(line, state);
if (is_success) {
++_normal_row_number;
state->set_normal_row_number(state->get_normal_row_number() + 1);
} else {
++_error_row_number;
state->set_error_row_number(state->get_error_row_number() + 1);
continue;
}
int row_idx = row_batch->add_row();
TupleRow* row = row_batch->get_row(row_idx);
// the scan node's tuple is the first tuple of the tuple row
row->set_tuple(0, _tuple);
{
row_batch->commit_last_row();
++_num_rows_returned;
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
char* new_tuple = reinterpret_cast<char*>(_tuple);
new_tuple += _tuple_desc->byte_size();
_tuple = reinterpret_cast<Tuple*>(new_tuple);
}
}
VLOG_ROW << "normal_row_number: " << state->get_normal_row_number()
<< "; error_row_number: " << state->get_error_row_number() << std::endl;
row_batch->tuple_data_pool()->acquire_data(_tuple_pool.get(), false);
*eos = csv_eos;
return Status::OK;
}
Status CsvScanNode::close(RuntimeState* state) {
if (is_closed()) {
return Status::OK;
}
VLOG(1) << "CsvScanNode::Close";
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::CLOSE));
SCOPED_TIMER(_runtime_profile->total_time_counter());
if (memory_used_counter() != nullptr) {
COUNTER_UPDATE(memory_used_counter(), _tuple_pool->peak_allocated_bytes());
}
RETURN_IF_ERROR(ExecNode::close(state));
if (state->get_normal_row_number() == 0) {
std::stringstream error_msg;
error_msg << "Read zero normal line file. ";
state->append_error_msg_to_file("", error_msg.str());
return Status(error_msg.str());
}
// Summary normal line and error line number info
std::stringstream summary_msg;
summary_msg << "error line: " << _error_row_number
<< "; normal line: " << _normal_row_number;
state->append_error_msg_to_file("", summary_msg.str());
return Status::OK;
}
void CsvScanNode::debug_string(int indentation_level, std::stringstream* out) const {
*out << std::string(indentation_level * 2, ' ');
*out << "csvScanNode(tupleid=" << _tuple_id;
*out << ")" << std::endl;
for (int i = 0; i < _children.size(); ++i) {
_children[i]->debug_string(indentation_level + 1, out);
}
}
Status CsvScanNode::set_scan_ranges(const std::vector<TScanRangeParams>& scan_ranges) {
return Status::OK;
}
void CsvScanNode::fill_fix_length_string(
const char* value, const int value_length, MemPool* pool,
char** new_value_p, const int new_value_length) {
if (new_value_length != 0 && value_length < new_value_length) {
DCHECK(pool != nullptr);
*new_value_p = reinterpret_cast<char*>(pool->allocate(new_value_length));
// 'value' is guaranteed not to be nullptr
memcpy(*new_value_p, value, value_length);
for (int i = value_length; i < new_value_length; ++i) {
(*new_value_p)[i] = '\0';
}
VLOG_ROW << "Fill fix length string. "
<< "value: [" << std::string(value, value_length) << "]; "
<< "value_length: " << value_length << "; "
<< "*new_value_p: [" << *new_value_p << "]; "
<< "new value length: " << new_value_length << std::endl;
}
}
// The following formats are accepted:
// .123 1.23 123. -1.23
// ATTN: the decimal point and (for negative numbers) the '-' sign are not counted
// in the length checks. E.g. '.123' is regarded as '0.123' and matches decimal(3, 3).
bool CsvScanNode::check_decimal_input(
const char* value, const int value_length,
const int precision, const int scale,
std::stringstream* error_msg) {
if (value_length > (precision + 2)) {
(*error_msg) << "the length of decimal value is overflow. "
<< "precision in schema: (" << precision << ", " << scale << "); "
<< "value: [" << std::string(value, value_length) << "]; "
<< "str actual length: " << value_length << ";";
return false;
}
// ignore leading spaces and trailing spaces
int begin_index = 0;
while (begin_index < value_length && std::isspace(value[begin_index])) {
++begin_index;
}
int end_index = value_length - 1;
while (end_index >= begin_index && std::isspace(value[end_index])) {
--end_index;
}
if (value[begin_index] == '+' || value[begin_index] == '-') {
++begin_index;
}
int point_index = -1;
for (int i = begin_index; i <= end_index; ++i) {
if (value[i] == '.') {
point_index = i;
}
}
int value_int_len = 0;
int value_frac_len = 0;
if (point_index == -1) {
// an integer value, like 123
value_int_len = end_index - begin_index + 1;
value_frac_len = 0;
} else {
value_int_len = point_index - begin_index;
value_frac_len = end_index - point_index;
}
if (value_int_len > (precision - scale)) {
(*error_msg) << "the length of the integer part is longer than schema precision ["
<< precision << "]. "
<< "value [" << std::string(value, value_length) << "]. ";
return false;
} else if (value_frac_len > scale) {
(*error_msg) << "the length of the fraction part is longer than schema scale ["
<< scale << "]. "
<< "value [" << std::string(value, value_length) << "]. ";
return false;
}
return true;
}
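// Note (illustrative): the sign and the decimal point are exempt from the
// overall length check. For decimal(5, 2) the raw string "-123.45" (7 chars)
// still fits, since value_length 7 <= precision + 2.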
static bool is_null(const char* value, int value_length) {
return value_length == 2 && value[0] == '\\' && value[1] == 'N';
}
// Writes a slot in _tuple from a value containing text data.
bool CsvScanNode::check_and_write_text_slot(
const std::string& column_name, const TColumnType& column_type,
const char* value, int value_length,
const SlotDescriptor* slot, RuntimeState* state,
std::stringstream* error_msg) {
if (value_length == 0 && !slot->type().is_string_type()) {
(*error_msg) << "the length of input should not be 0. "
<< "column_name: " << column_name << "; "
<< "type: " << slot->type() << "; "
<< "input_str: [" << std::string(value, value_length) << "].";
return false;
}
if (slot->is_nullable() && is_null(value, value_length)) {
_tuple->set_null(slot->null_indicator_offset());
return true;
}
char* value_to_convert = const_cast<char*>(value);
int value_to_convert_length = value_length;
// Fill all the spaces if it is 'TYPE_CHAR' type
if (slot->type().is_string_type()) {
int char_len = column_type.len;
if (slot->type().type != TYPE_HLL && value_length > char_len) {
(*error_msg) << "the length of input is too long than schema. "
<< "column_name: " << column_name << "; "
<< "input_str: [" << std::string(value, value_length) << "] "
<< "type: " << slot->type() << "; "
<< "schema length: " << char_len << "; "
<< "actual length: " << value_length << "; ";
return false;
}
if (slot->type().type == TYPE_CHAR && value_length < char_len) {
fill_fix_length_string(
value, value_length, _tuple_pool.get(),
&value_to_convert, char_len);
value_to_convert_length = char_len;
}
} else if (slot->type().is_decimal_type()) {
int precision = column_type.precision;
int scale = column_type.scale;
bool is_success = check_decimal_input(value, value_length, precision, scale, error_msg);
if (is_success == false) {
return false;
}
}
if (!_text_converter->write_slot(slot, _tuple, value_to_convert, value_to_convert_length,
true, false, _tuple_pool.get())) {
(*error_msg) << "convert csv string to "
<< slot->type() << " failed. "
<< "column_name: " << column_name << "; "
<< "input_str: [" << std::string(value, value_length) << "]; ";
return false;
}
return true;
}
bool CsvScanNode::split_check_fill(const std::string& line, RuntimeState* state) {
SCOPED_TIMER(_split_check_timer);
std::stringstream error_msg;
// std::vector<std::string> fields;
std::vector<StringRef> fields;
{
SCOPED_TIMER(_split_line_timer);
// boost::split(fields, line, boost::is_any_of(_column_separator));
split_line(line, _column_separator[0], fields);
}
if (_hll_column_num == 0 && fields.size() < _columns.size()) {
error_msg << "actual column number is less than schema column number. "
<< "actual number: " << fields.size() << " ,"
<< "schema number: " << _columns.size() << "; ";
_runtime_state->append_error_msg_to_file(line, error_msg.str());
return false;
} else if (_hll_column_num == 0 && fields.size() > _columns.size()) {
error_msg << "actual column number is more than schema column number. "
<< "actual number: " << fields.size() << " ,"
<< "schema number: " << _columns.size() << "; ";
_runtime_state->append_error_msg_to_file(line, error_msg.str());
return false;
}
for (int i = 0; i < _columns.size(); ++i) {
const std::string& column_name = _columns[i];
const SlotDescriptor* slot = _column_slot_vec[i];
// ignore unspecified columns
if (slot == nullptr) {
continue;
}
if (!slot->is_materialized()) {
continue;
}
if (slot->type().type == TYPE_HLL) {
continue;
}
const TColumnType& column_type = _column_type_vec[i];
bool flag = check_and_write_text_slot(
column_name, column_type,
fields[i].c_str(),
fields[i].length(),
slot, state, &error_msg);
if (flag == false) {
_runtime_state->append_error_msg_to_file(line, error_msg.str());
return false;
}
}
for (int i = 0; i < _unspecified_columns.size(); ++i) {
const std::string& column_name = _unspecified_columns[i];
const SlotDescriptor* slot = _unspecified_column_slot_vec[i];
if (slot == nullptr) {
continue;
}
if (!slot->is_materialized()) {
continue;
}
if (slot->type().type == TYPE_HLL) {
continue;
}
const TColumnType& column_type = _unspecified_column_type_vec[i];
bool flag = check_and_write_text_slot(
column_name, column_type,
_default_values[i].c_str(),
_default_values[i].length(),
slot, state, &error_msg);
if (flag == false) {
_runtime_state->append_error_msg_to_file(line, error_msg.str());
return false;
}
}
for (std::map<std::string, TMiniLoadEtlFunction>::iterator iter = _column_function_map.begin();
iter != _column_function_map.end();
iter++) {
TMiniLoadEtlFunction& function = iter->second;
const std::string& column_name = iter->first;
const SlotDescriptor* slot = _column_slot_map[column_name];
const TColumnType& column_type = _column_type_map[column_name];
std::string column_string = "";
const char* src = fields[function.param_column_index].c_str();
int src_column_len = fields[function.param_column_index].length();
hll_hash(src, src_column_len, &column_string);
bool flag = check_and_write_text_slot(
column_name, column_type,
column_string.c_str(),
column_string.length(),
slot, state, &error_msg);
if (flag == false) {
_runtime_state->append_error_msg_to_file(line, error_msg.str());
return false;
}
}
return true;
}
bool CsvScanNode::check_hll_function(TMiniLoadEtlFunction& function) {
if (function.function_name.empty()
|| function.function_name != "hll_hash"
|| function.param_column_index < 0) {
return false;
}
return true;
}
void CsvScanNode::hll_hash(const char* src, int len, std::string* result) {
std::string str(src, len);
if (str != "\\N") {
uint64_t hash = HashUtil::murmur_hash64A(src, len, HashUtil::MURMUR_SEED);
char buf[11];
memset(buf, 0, 11);
// explicit set
buf[0] = HLL_DATA_EXPLICIT;
buf[1] = 1;
*((uint64_t*)(buf + 2)) = hash;
*result = std::string(buf, 11);
} else {
char buf[2];
memset(buf, 0, 2);
// empty set
buf[0] = HLL_DATA_EMPTY;
*result = std::string(buf, 2);
}
}
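// Resulting buffer layout (illustrative):
//   non-null input: [HLL_DATA_EXPLICIT][count=1][8-byte murmur hash][0]  (11 bytes)
//   "\N" input:     [HLL_DATA_EMPTY][0]                                  (2 bytes)
// The string is then written into the TYPE_HLL slot via check_and_write_text_slot
// as a serialized explicit/empty HLL set.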
} // end namespace palo

144
be/src/exec/csv_scan_node.h Normal file
View File

@ -0,0 +1,144 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_CSV_SCAN_NODE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_CSV_SCAN_NODE_H
#include <fstream>
#include <sstream>
#include <boost/scoped_ptr.hpp>
#include "common/config.h"
#include "exec/csv_scanner.h"
#include "exec/scan_node.h"
#include "runtime/descriptors.h"
namespace palo {
class TextConverter;
class Tuple;
class TupleDescriptor;
class RuntimeState;
class MemPool;
class Status;
class CsvScanNode : public ScanNode {
public:
CsvScanNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
~CsvScanNode();
virtual Status init(const TPlanNode& tnode);
// initialize _csv_scanner, and create _text_converter.
virtual Status prepare(RuntimeState* state);
// Start CSV scan using _csv_scanner.
virtual Status open(RuntimeState* state);
// Fill the next row batch by calling next() on the _csv_scanner,
// converting text data in CSV cells to binary data.
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos);
// Release memory, report 'Counter', and report errors.
virtual Status close(RuntimeState* state);
// No use in csv scan process
virtual Status set_scan_ranges(const std::vector<TScanRangeParams>& scan_ranges);
// Write debug string of this into out.
virtual void debug_string(int indentation_level, std::stringstream* out) const;
private:
bool check_and_write_text_slot(
const std::string& column_name, const TColumnType& column_type,
const char* value, int value_length,
const SlotDescriptor* slot, RuntimeState* state,
std::stringstream* error_msg);
// split one line into fields, check every field, and fill each field into the tuple
bool split_check_fill(const std::string& line, RuntimeState* state);
void fill_fix_length_string(
const char* value, int value_length, MemPool* pool,
char** new_value, int new_value_length);
bool check_decimal_input(
const char* value, int value_length,
int precision, int scale,
std::stringstream* error_msg);
void hll_hash(const char* src, int len, std::string* result);
bool check_hll_function(TMiniLoadEtlFunction& function);
// Tuple id resolved in prepare() to set _tuple_desc;
TupleId _tuple_id;
std::vector<std::string> _file_paths;
std::string _column_separator;
std::map<std::string, TColumnType> _column_type_map;
// mapping function
std::map<std::string, TMiniLoadEtlFunction> _column_function_map;
std::vector<std::string> _columns;
// '_unspecified_columns' maps one-for-one to '_default_values' in the same order
std::vector<std::string> _unspecified_columns;
std::vector<std::string> _default_values;
// Map one-for-one to '_columns' in the same order
std::vector<SlotDescriptor*> _column_slot_vec;
std::vector<TColumnType> _column_type_vec;
// Map one-for-one to '_unspecified_columns' in the same order
std::vector<SlotDescriptor*> _unspecified_colomn_slot_vec;
std::vector<TColumnType> _unspecified_colomn_type_vec;
bool _is_init;
// Descriptor of tuples read from CSV file.
const TupleDescriptor* _tuple_desc;
// Number of slots in the tuple.
int _slot_num;
// Pool for allocating tuple data, including all varying-length slots.
boost::scoped_ptr<MemPool> _tuple_pool;
// Util class for doing real file reading
boost::scoped_ptr<CsvScanner> _csv_scanner;
// Helper class for converting text to other types;
boost::scoped_ptr<TextConverter> _text_converter;
// Current tuple.
Tuple* _tuple;
// Current RuntimeState
RuntimeState* _runtime_state;
int64_t _error_row_number = 0L;
int64_t _normal_row_number = 0L;
RuntimeProfile::Counter* _split_check_timer;
RuntimeProfile::Counter* _split_line_timer;
// count hll value num
int _hll_column_num;
std::map<std::string, SlotDescriptor*> _column_slot_map;
};
} // end namespace palo
#endif // BDG_PALO_BE_SRC_QUERY_EXEC_CSV_SCAN_NODE_H
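As a usage orientation, here is a minimal sketch of driving this node through the lifecycle documented above (prepare, open, get_next until eos, close). run_csv_scan is a hypothetical helper and RowBatch::reset() is assumed to clear a batch for reuse; neither is part of this commit.

#include "exec/csv_scan_node.h"
#include "runtime/row_batch.h"
#include "runtime/runtime_state.h"

namespace palo {

// Hypothetical driver: pull converted row batches until the scan is exhausted.
Status run_csv_scan(CsvScanNode* node, RuntimeState* state, RowBatch* batch) {
    RETURN_IF_ERROR(node->prepare(state)); // initializes _csv_scanner and _text_converter
    RETURN_IF_ERROR(node->open(state));    // starts the CSV scan
    bool eos = false;
    while (!eos) {
        batch->reset();                    // assumed to clear the batch for reuse
        RETURN_IF_ERROR(node->get_next(state, batch, &eos));
        // ... hand the rows in 'batch' to the consumer here ...
    }
    return node->close(state);             // releases memory and reports counters
}

} // end namespace palo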

be/src/exec/csv_scanner.cpp Normal file
@ -0,0 +1,94 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/csv_scanner.h"
#include <boost/algorithm/string.hpp>
namespace palo {
CsvScanner::CsvScanner(const std::vector<std::string>& csv_file_paths) :
_is_open(false),
_file_paths(csv_file_paths),
_current_file(nullptr),
_current_file_idx(0) {
// do nothing
}
CsvScanner::~CsvScanner() {
// close file
if (_current_file != nullptr) {
if (_current_file->is_open()) {
_current_file->close();
}
delete _current_file;
_current_file = nullptr;
}
}
Status CsvScanner::open() {
VLOG(1) << "CsvScanner::Connect";
if (_is_open) {
LOG(INFO) << "this scanner already opened";
return Status::OK;
}
if (_file_paths.empty()) {
return Status("no file specified.");
}
_is_open = true;
return Status::OK;
}
// TODO(lingbin): read more than one line at a time to reduce IO consumption
Status CsvScanner::get_next_row(std::string* line_str, bool* eos) {
if (_current_file == nullptr && _current_file_idx == _file_paths.size()) {
*eos = true;
return Status::OK;
}
if (_current_file == nullptr && _current_file_idx < _file_paths.size()) {
std::string& file_path = _file_paths[_current_file_idx];
LOG(INFO) << "open csv file: [" << _current_file_idx << "] " << file_path;
_current_file = new std::ifstream(file_path, std::ifstream::in);
if (!_current_file->is_open()) {
return Status("Fail to read csv file: " + file_path);
}
++_current_file_idx;
}
getline(*_current_file, *line_str);
if (_current_file->eof()) {
_current_file->close();
delete _current_file;
_current_file = nullptr;
if (_current_file_idx == _file_paths.size()) {
*eos = true;
return Status::OK;
}
}
*eos = false;
return Status::OK;
}
} // end namespace palo

be/src/exec/csv_scanner.h Normal file
@ -0,0 +1,48 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_CSV_SCANNER_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_CSV_SCANNER_H
#include <fstream>
#include <string>
#include <vector>
#include "common/status.h"
namespace palo {
class CsvScanner {
public:
CsvScanner(const std::vector<std::string>& csv_file_paths);
~CsvScanner();
Status open();
Status get_next_row(std::string* line_str, bool* eos);
private:
bool _is_open;
std::vector<std::string> _file_paths;
// the current opened file
std::ifstream* _current_file;
int32_t _current_file_idx;
};
} // end namespace palo
#endif // BDG_PALO_BE_SRC_QUERY_EXEC_CSV_SCANNER_H
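A minimal standalone sketch of the CsvScanner API above; dump_csv_lines is a hypothetical helper, and the loop follows the convention visible in get_next_row() that the call which sets *eos carries no line data.

#include <iostream>
#include "exec/csv_scanner.h"

namespace palo {

// Hypothetical helper: print every line of a list of CSV files.
Status dump_csv_lines(const std::vector<std::string>& paths) {
    CsvScanner scanner(paths);
    RETURN_IF_ERROR(scanner.open());
    std::string line;
    bool eos = false;
    while (true) {
        RETURN_IF_ERROR(scanner.get_next_row(&line, &eos));
        if (eos) {
            break; // all files consumed
        }
        std::cout << line << std::endl;
    }
    return Status::OK;
}

} // end namespace palo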

be/src/exec/data_sink.cpp Normal file
@ -0,0 +1,139 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/data_sink.h"
#include <string>
#include <map>
#include <memory>
#include "exec/exec_node.h"
#include "exprs/expr.h"
#include "gen_cpp/PaloInternalService_types.h"
#include "runtime/data_stream_sender.h"
#include "runtime/result_sink.h"
#include "runtime/mysql_table_sink.h"
#include "runtime/data_spliter.h"
#include "runtime/export_sink.h"
#include "runtime/runtime_state.h"
#include "util/logging.h"
namespace palo {
Status DataSink::create_data_sink(
ObjectPool* pool,
const TDataSink& thrift_sink,
const std::vector<TExpr>& output_exprs,
const TPlanFragmentExecParams& params,
const RowDescriptor& row_desc,
boost::scoped_ptr<DataSink>* sink) {
DataSink* tmp_sink = NULL;
switch (thrift_sink.type) {
case TDataSinkType::DATA_STREAM_SINK: {
if (!thrift_sink.__isset.stream_sink) {
return Status("Missing data stream sink.");
}
// TODO: figure out good buffer size based on size of output row
tmp_sink = new DataStreamSender(
pool, params.sender_id, row_desc,
thrift_sink.stream_sink, params.destinations, 16 * 1024);
// RETURN_IF_ERROR(sender->prepare(state->obj_pool(), thrift_sink.stream_sink));
sink->reset(tmp_sink);
break;
}
case TDataSinkType::RESULT_SINK:
if (!thrift_sink.__isset.result_sink) {
return Status("Missing data buffer sink.");
}
// TODO: figure out good buffer size based on size of output row
tmp_sink = new ResultSink(row_desc, output_exprs, thrift_sink.result_sink, 1024);
sink->reset(tmp_sink);
break;
case TDataSinkType::MYSQL_TABLE_SINK: {
if (!thrift_sink.__isset.mysql_table_sink) {
return Status("Missing data buffer sink.");
}
// TODO: figure out good buffer size based on size of output row
MysqlTableSink* mysql_tbl_sink = new MysqlTableSink(
pool, row_desc, output_exprs);
sink->reset(mysql_tbl_sink);
break;
}
case TDataSinkType::DATA_SPLIT_SINK: {
if (!thrift_sink.__isset.split_sink) {
return Status("Missing data split buffer sink.");
}
// TODO: figure out good buffer size based on size of output row
std::unique_ptr<DataSpliter> data_spliter(new DataSpliter(row_desc));
RETURN_IF_ERROR(DataSpliter::from_thrift(pool,
thrift_sink.split_sink,
data_spliter.get()));
sink->reset(data_spliter.release());
break;
}
case TDataSinkType::EXPORT_SINK: {
if (!thrift_sink.__isset.export_sink) {
return Status("Missing export sink sink.");
}
std::unique_ptr<ExportSink> export_sink(new ExportSink(pool, row_desc, output_exprs));
sink->reset(export_sink.release());
break;
}
default: {
std::stringstream error_msg;
std::map<int, const char*>::const_iterator i =
_TDataSinkType_VALUES_TO_NAMES.find(thrift_sink.type);
if (i != _TDataSinkType_VALUES_TO_NAMES.end()) {
error_msg << i->second << " not implemented.";
} else {
error_msg << "unknown data sink type (" << thrift_sink.type << ") not implemented.";
}
return Status(error_msg.str());
}
}
if (sink->get() != NULL) {
RETURN_IF_ERROR((*sink)->init(thrift_sink));
}
return Status::OK;
}
Status DataSink::init(const TDataSink& thrift_sink) {
return Status::OK;
}
Status DataSink::prepare(RuntimeState* state) {
_expr_mem_tracker.reset(new MemTracker(-1, "Data sink", state->instance_mem_tracker()));
return Status::OK;
}
} // namespace palo

be/src/exec/data_sink.h Normal file
@ -0,0 +1,88 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_DATA_SINK_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_DATA_SINK_H
#include <boost/scoped_ptr.hpp>
#include <vector>
#include "common/status.h"
#include "gen_cpp/DataSinks_types.h"
#include "gen_cpp/Exprs_types.h"
#include "runtime/mem_tracker.h"
namespace palo {
class ObjectPool;
class RowBatch;
class RuntimeProfile;
class RuntimeState;
class TPlanExecRequest;
class TPlanExecParams;
class TPlanFragmentExecParams;
class RowDescriptor;
// Superclass of all data sinks.
class DataSink {
public:
DataSink() : _closed(false) {}
virtual ~DataSink() {}
virtual Status init(const TDataSink& thrift_sink);
// Setup. Call before send(), Open(), or Close().
// Subclasses must call DataSink::Prepare().
virtual Status prepare(RuntimeState* state);
// Setup. Call before send() or close().
virtual Status open(RuntimeState* state) = 0;
// Send a row batch into this sink.
// eos should be true when the last batch is passed to send()
virtual Status send(RuntimeState* state, RowBatch* batch) = 0;
// virtual Status send(RuntimeState* state, RowBatch* batch, bool eos) = 0;
// Releases all resources that were allocated in prepare()/send().
// Further send() calls are illegal after calling close().
// It must be okay to call this multiple times. Subsequent calls should
// be ignored.
virtual Status close(RuntimeState* state, Status exec_status) = 0;
// Creates a new data sink from thrift_sink. A pointer to the
// new sink is written to *sink, and is owned by the caller.
static Status create_data_sink(
ObjectPool* pool,
const TDataSink& thrift_sink, const std::vector<TExpr>& output_exprs,
const TPlanFragmentExecParams& params,
const RowDescriptor& row_desc, boost::scoped_ptr<DataSink>* sink);
// Returns the runtime profile for the sink.
virtual RuntimeProfile* profile() = 0;
protected:
// Set to true after close() has been called. subclasses should check and set this in
// close().
bool _closed;
std::unique_ptr<MemTracker> _expr_mem_tracker;
};
} // namespace palo
#endif
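To make the factory contract above concrete, a minimal sketch of the create/prepare/open/send/close sequence, assuming a single batch is pushed; run_sink is a hypothetical helper and error handling is abbreviated.

#include "exec/data_sink.h"

namespace palo {

// Hypothetical driver: build a sink from its thrift description and push one batch.
Status run_sink(ObjectPool* pool, const TDataSink& thrift_sink,
                const std::vector<TExpr>& output_exprs,
                const TPlanFragmentExecParams& params,
                const RowDescriptor& row_desc,
                RuntimeState* state, RowBatch* batch) {
    boost::scoped_ptr<DataSink> sink;
    RETURN_IF_ERROR(DataSink::create_data_sink(
            pool, thrift_sink, output_exprs, params, row_desc, &sink));
    RETURN_IF_ERROR(sink->prepare(state)); // must precede open()/send()/close()
    RETURN_IF_ERROR(sink->open(state));
    Status send_status = sink->send(state, batch);
    // close() runs regardless of the send status; subsequent calls are ignored.
    return sink->close(state, send_status);
}

} // namespace palo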

be/src/exec/decompressor.cpp Normal file
@ -0,0 +1,726 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/decompressor.h"
namespace palo {
Status Decompressor::create_decompressor(CompressType type,
Decompressor** decompressor) {
switch(type) {
case CompressType::UNCOMPRESSED:
*decompressor = nullptr;
break;
case CompressType::GZIP:
*decompressor = new GzipDecompressor(false);
break;
case CompressType::DEFLATE:
*decompressor = new GzipDecompressor(true);
break;
case CompressType::BZIP2:
*decompressor = new Bzip2Decompressor();
break;
case CompressType::LZ4FRAME:
*decompressor = new Lz4FrameDecompressor();
break;
case CompressType::LZOP:
*decompressor = new LzopDecompressor();
break;
default:
std::stringstream ss;
ss << "Unknown compress type: " << type;
return Status(ss.str());
}
Status st = Status::OK;
if (*decompressor != nullptr) {
st = (*decompressor)->init();
}
return st;
}
Decompressor::~Decompressor() {
}
std::string Decompressor::debug_info() {
return "Decompressor";
}
// Gzip
GzipDecompressor::GzipDecompressor(bool is_deflate):
Decompressor(is_deflate ? CompressType::DEFLATE : CompressType::GZIP),
_is_deflate(is_deflate) {
}
GzipDecompressor::~GzipDecompressor() {
(void) inflateEnd(&_z_strm);
}
Status GzipDecompressor::init() {
_z_strm = {0};
_z_strm.zalloc = Z_NULL;
_z_strm.zfree = Z_NULL;
_z_strm.opaque = Z_NULL;
int window_bits = _is_deflate ? WINDOW_BITS : (WINDOW_BITS | DETECT_CODEC);
int ret = inflateInit2(&_z_strm, window_bits);
if (ret < 0) {
std::stringstream ss;
ss << "Failed to init inflate. status code: " << ret;
return Status(ss.str());
}
return Status::OK;
}
Status GzipDecompressor::decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) {
// 1. set input and output
_z_strm.next_in = input;
_z_strm.avail_in = input_len;
_z_strm.next_out = output;
_z_strm.avail_out = output_max_len;
while (_z_strm.avail_out > 0 && _z_strm.avail_in > 0) {
*stream_end = false;
// inflate() performs one or both of the following actions:
// Decompress more input starting at next_in and update next_in and avail_in
// accordingly.
// Provide more output starting at next_out and update next_out and avail_out
// accordingly.
// inflate() returns Z_OK if some progress has been made (more input processed
// or more output produced)
int ret = inflate(&_z_strm, Z_NO_FLUSH);
*input_bytes_read = input_len - _z_strm.avail_in;
*decompressed_len = output_max_len - _z_strm.avail_out;
LOG(INFO) << "gzip dec ret: " << ret
<< " input_bytes_read: " << *input_bytes_read
<< " decompressed_len: " << *decompressed_len;
if (ret == Z_BUF_ERROR) {
// Z_BUF_ERROR indicates that inflate() could not consume more input or
// produce more output. inflate() can be called again with more output space
// or more available input
// ATTN: even if ret == Z_OK, decompressed_len may also be zero
return Status::OK;
} else if (ret == Z_STREAM_END) {
*stream_end = true;
// reset _z_strm to continue decoding a subsequent gzip stream
ret = inflateReset(&_z_strm);
if (ret != Z_OK) {
std::stringstream ss;
ss << "Failed to inflateRset. return code: " << ret;
return Status(ss.str());
}
} else if (ret != Z_OK) {
std::stringstream ss;
ss << "Failed to inflate. return code: " << ret;
return Status(ss.str());
} else {
// here ret must be Z_OK.
// we continue if avail_out and avail_in > 0.
// this means 'inflate' is not done yet.
}
}
return Status::OK;
}
std::string GzipDecompressor::debug_info() {
std::stringstream ss;
ss << "GzipDecompressor." << " is_deflate: " << _is_deflate;
return ss.str();
}
// Bzip2
Bzip2Decompressor::~Bzip2Decompressor() {
BZ2_bzDecompressEnd(&_bz_strm);
}
Status Bzip2Decompressor::init() {
bzero(&_bz_strm, sizeof(_bz_strm));
int ret = BZ2_bzDecompressInit(&_bz_strm, 0, 0);
if (ret != BZ_OK) {
std::stringstream ss;
ss << "Failed to init bz2. status code: " << ret;
return Status(ss.str());
}
return Status::OK;
}
Status Bzip2Decompressor::decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) {
// 1. set input and output
_bz_strm.next_in = const_cast<char*>(reinterpret_cast<const char*>(input));
_bz_strm.avail_in = input_len;
_bz_strm.next_out = reinterpret_cast<char*>(output);
_bz_strm.avail_out = output_max_len;
while (_bz_strm.avail_out > 0 && _bz_strm.avail_in > 0) {
*stream_end = false;
// decompress
int ret = BZ2_bzDecompress(&_bz_strm);
*input_bytes_read = input_len - _bz_strm.avail_in;
*decompressed_len = output_max_len - _bz_strm.avail_out;
if (ret == BZ_DATA_ERROR || ret == BZ_DATA_ERROR_MAGIC) {
LOG(INFO) << "input_bytes_read: " << *input_bytes_read
<< " decompressed_len: " << *decompressed_len;
std::stringstream ss;
ss << "Failed to bz2 decompress. status code: " << ret;
return Status(ss.str());
} else if (ret == BZ_STREAM_END) {
*stream_end = true;
ret = BZ2_bzDecompressEnd(&_bz_strm);
if (ret != BZ_OK) {
std::stringstream ss;
ss << "Failed to end bz2 after meet BZ_STREAM_END. status code: " << ret;
return Status(ss.str());
}
ret = BZ2_bzDecompressInit(&_bz_strm, 0, 0);
if (ret != BZ_OK) {
std::stringstream ss;
ss << "Failed to init bz2 after meet BZ_STREAM_END. status code: " << ret;
return Status(ss.str());
}
} else if (ret != BZ_OK) {
std::stringstream ss;
ss << "Failed to bz2 decompress. status code: " << ret;
return Status(ss.str());
} else {
// continue
}
}
return Status::OK;
}
std::string Bzip2Decompressor::debug_info() {
std::stringstream ss;
ss << "Bzip2Decompressor.";
return ss.str();
}
// Lz4Frame
// Lz4 version: 1.7.5
// define LZ4F_VERSION = 100
const unsigned Lz4FrameDecompressor::PALO_LZ4F_VERSION = 100;
Lz4FrameDecompressor::~Lz4FrameDecompressor() {
LZ4F_freeDecompressionContext(_dctx);
}
Status Lz4FrameDecompressor::init() {
size_t ret = LZ4F_createDecompressionContext(&_dctx, PALO_LZ4F_VERSION);
if (LZ4F_isError(ret)) {
std::stringstream ss;
ss << "LZ4F_dctx creation error: " << std::string(LZ4F_getErrorName(ret));
return Status(ss.str());
}
// init as -1
_expect_dec_buf_size = -1;
return Status::OK;
}
Status Lz4FrameDecompressor::decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) {
uint8_t* src = input;
size_t src_size = input_len;
size_t ret = 1;
*input_bytes_read = 0;
if (_expect_dec_buf_size == -1) {
// init expected decompress buf size, and check if output_max_len is large enough
// ATTN: _expect_dec_buf_size is uninit, which means this is the first time to call
// decompress(), so *input* should point to the head of the compressed file,
// where lz4 header section is there.
if (input_len < 15) {
std::stringstream ss;
ss << "Lz4 header size is between 7 and 15 bytes. "
<< "but input size is only: " << input_len;
return Status(ss.str());
}
LZ4F_frameInfo_t info;
ret = LZ4F_getFrameInfo(_dctx, &info, (void*) src, &src_size);
if (LZ4F_isError(ret)) {
std::stringstream ss;
ss << "LZ4F_getFrameInfo error: " << std::string(LZ4F_getErrorName(ret));
return Status(ss.str());
}
_expect_dec_buf_size = get_block_size(&info);
if (_expect_dec_buf_size == -1) {
std::stringstream ss;
ss << "Impossible lz4 block size unless more block sizes are allowed"
<< std::string(LZ4F_getErrorName(ret));
return Status(ss.str());
}
*input_bytes_read = src_size;
src += src_size;
src_size = input_len - src_size;
LOG(INFO) << "lz4 block size: " << _expect_dec_buf_size;
}
// decompress
size_t output_len = output_max_len;
ret = LZ4F_decompress(_dctx, (void*) output, &output_len, (void*) src, &src_size,
/* LZ4F_decompressOptions_t */ NULL);
if (LZ4F_isError(ret)) {
std::stringstream ss;
ss << "Decompression error: " << std::string(LZ4F_getErrorName(ret));
return Status(ss.str());
}
// update
*input_bytes_read += src_size;
*decompressed_len = output_len;
if (ret == 0) {
*stream_end = true;
} else {
*stream_end = false;
}
return Status::OK;
}
std::string Lz4FrameDecompressor::debug_info() {
std::stringstream ss;
ss << "Lz4FrameDecompressor."
<< " expect dec buf size: " << _expect_dec_buf_size
<< " Lz4 Frame Version: " << PALO_LZ4F_VERSION;
return ss.str();
}
size_t Lz4FrameDecompressor::get_block_size(const LZ4F_frameInfo_t* info) {
switch (info->blockSizeID) {
case LZ4F_default:
case LZ4F_max64KB: return 1 << 16;
case LZ4F_max256KB: return 1 << 18;
case LZ4F_max1MB: return 1 << 20;
case LZ4F_max4MB: return 1 << 22;
default:
// error
return -1;
}
}
// Lzop
const uint8_t LzopDecompressor::LZOP_MAGIC[9] =
{ 0x89, 0x4c, 0x5a, 0x4f, 0x00, 0x0d, 0x0a, 0x1a, 0x0a };
const uint64_t LzopDecompressor::LZOP_VERSION = 0x1030;
const uint64_t LzopDecompressor::MIN_LZO_VERSION = 0x0100;
// magic(9) + ver(2) + lib_ver(2) + ver_needed(2) + method(1)
// + lvl(1) + flags(4) + mode/mtime(12) + filename_len(1)
// without the real file name, extra field and checksum
const uint32_t LzopDecompressor::MIN_HEADER_SIZE = 34;
const uint32_t LzopDecompressor::LZO_MAX_BLOCK_SIZE = (64*1024l*1024l);
const uint32_t LzopDecompressor::CRC32_INIT_VALUE = 0;
const uint32_t LzopDecompressor::ADLER32_INIT_VALUE = 1;
const uint64_t LzopDecompressor::F_H_CRC32 = 0x00001000L;
const uint64_t LzopDecompressor::F_MASK = 0x00003FFFL;
const uint64_t LzopDecompressor::F_OS_MASK = 0xff000000L;
const uint64_t LzopDecompressor::F_CS_MASK = 0x00f00000L;
const uint64_t LzopDecompressor::F_RESERVED = ((F_MASK | F_OS_MASK | F_CS_MASK) ^ 0xffffffffL);
const uint64_t LzopDecompressor::F_MULTIPART = 0x00000400L;
const uint64_t LzopDecompressor::F_H_FILTER = 0x00000800L;
const uint64_t LzopDecompressor::F_H_EXTRA_FIELD = 0x00000040L;
const uint64_t LzopDecompressor::F_CRC32_C = 0x00000200L;
const uint64_t LzopDecompressor::F_ADLER32_C = 0x00000002L;
const uint64_t LzopDecompressor::F_CRC32_D = 0x00000100L;
const uint64_t LzopDecompressor::F_ADLER32_D = 0x00000001L;
LzopDecompressor::~LzopDecompressor() {
}
Status LzopDecompressor::init() {
return Status::OK;
}
Status LzopDecompressor::decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) {
if (!_is_header_loaded) {
// this is the first time to call lzo decompress, parse the header info first
RETURN_IF_ERROR(parse_header_info(input, input_len, input_bytes_read, more_input_bytes));
if (*more_input_bytes > 0) {
return Status::OK;
}
}
// LOG(INFO) << "after load header: " << *input_bytes_read;
// read compressed block
// compressed-block ::=
// <uncompressed-size>
// <compressed-size>
// <uncompressed-checksums>
// <compressed-checksums>
// <compressed-data>
int left_input_len = input_len - *input_bytes_read;
if (left_input_len < sizeof(uint32_t)) {
// a block must at least contain the uncompressed_size field
*more_input_bytes = sizeof(uint32_t) - left_input_len;
return Status::OK;
}
uint8_t* block_start = input + *input_bytes_read;
uint8_t* ptr = block_start;
// 1. uncompressed size
uint32_t uncompressed_size;
ptr = get_uint32(ptr, &uncompressed_size);
left_input_len -= sizeof(uint32_t);
if (uncompressed_size == 0) {
*stream_end = true;
return Status::OK;
}
// 2. compressed size
if (left_input_len < sizeof(uint32_t)) {
*more_input_bytes = sizeof(uint32_t) - left_input_len;
return Status::OK;
}
uint32_t compressed_size;
ptr = get_uint32(ptr, &compressed_size);
left_input_len -= sizeof(uint32_t);
if (compressed_size > LZO_MAX_BLOCK_SIZE) {
std::stringstream ss;
ss << "lzo block size: " << compressed_size << " is greater than LZO_MAX_BLOCK_SIZE: "
<< LZO_MAX_BLOCK_SIZE;
return Status(ss.str());
}
// 3. out checksum
uint32_t out_checksum = 0;
if (_header_info.output_checksum_type != CHECK_NONE) {
if (left_input_len < sizeof(uint32_t)) {
*more_input_bytes = sizeof(uint32_t) - left_input_len;
return Status::OK;
}
ptr = get_uint32(ptr, &out_checksum);
left_input_len -= sizeof(uint32_t);
}
// 4. in checksum
uint32_t in_checksum = 0;
if (compressed_size < uncompressed_size && _header_info.input_checksum_type != CHECK_NONE) {
if (left_input_len < sizeof(uint32_t)) {
*more_input_bytes = sizeof(uint32_t) - left_input_len;
return Status::OK;
}
ptr = get_uint32(ptr, &in_checksum);
left_input_len -= sizeof(uint32_t);
} else {
// If the compressed data size is equal to the uncompressed data size, then
// the uncompressed data is stored and there is no compressed checksum.
in_checksum = out_checksum;
}
// 5. checksum compressed data
if (left_input_len < compressed_size) {
*more_input_bytes = compressed_size - left_input_len;
return Status::OK;
}
RETURN_IF_ERROR(checksum(_header_info.input_checksum_type,
"compressed", in_checksum, ptr, compressed_size));
// 6. decompress
if (output_max_len < uncompressed_size) {
*more_output_bytes = uncompressed_size - output_max_len;
return Status::OK;
}
if (compressed_size == uncompressed_size) {
// the data is uncompressed, just copy to the output buf
memmove(output, ptr, compressed_size);
ptr += compressed_size;
} else {
// decompress
*decompressed_len = uncompressed_size;
int ret = lzo1x_decompress_safe(ptr, compressed_size,
output, reinterpret_cast<lzo_uint*>(&uncompressed_size), nullptr);
if (ret != LZO_E_OK || uncompressed_size != *decompressed_len) {
std::stringstream ss;
ss << "Lzo decompression failed with ret: " << ret
<< " decompressed len: " << uncompressed_size
<< " expected: " << *decompressed_len;
return Status(ss.str());
}
RETURN_IF_ERROR(checksum(_header_info.output_checksum_type, "decompressed",
out_checksum, output, uncompressed_size));
ptr += compressed_size;
}
// 7. peek next block's uncompressed size
uint32_t next_uncompressed_size;
get_uint32(ptr, &next_uncompressed_size);
if (next_uncompressed_size == 0) {
// 0 means current block is the last block.
// consume this uncompressed_size to finish reading.
ptr += sizeof(uint32_t);
}
// 8. done
*stream_end = true;
*decompressed_len = uncompressed_size;
*input_bytes_read += ptr - block_start;
LOG(INFO) << "finished decompress lzo block."
<< " compressed_size: " << compressed_size
<< " decompressed_len: " << *decompressed_len
<< " input_bytes_read: " << *input_bytes_read
<< " next_uncompressed_size: " << next_uncompressed_size;
return Status::OK;
}
// file-header ::= -- most of this information is not used.
// <magic>
// <version>
// <lib-version>
// [<version-needed>] -- present for all modern files.
// <method>
// <level>
// <flags>
// <mode>
// <mtime>
// <file-name>
// <header-checksum>
// <extra-field> -- presence indicated in flags, not currently used.
Status LzopDecompressor::parse_header_info(uint8_t* input, size_t input_len,
size_t* input_bytes_read,
size_t* more_input_bytes) {
if (input_len < MIN_HEADER_SIZE) {
LOG(INFO) << "highly recommanded that Lzo header size is larger than " << MIN_HEADER_SIZE
<< ", or parsing header info may failed."
<< " only given: " << input_len;
*more_input_bytes = MIN_HEADER_SIZE - input_len;
return Status::OK;
}
uint8_t* ptr = input;
// 1. magic
if (memcmp(ptr, LZOP_MAGIC, sizeof(LZOP_MAGIC))) {
std::stringstream ss;
ss << "invalid lzo magic number";
return Status(ss.str());
}
ptr += sizeof(LZOP_MAGIC);
uint8_t* header = ptr;
// 2. version
ptr = get_uint16(ptr, &_header_info.version);
if (_header_info.version > LZOP_VERSION) {
std::stringstream ss;
ss << "compressed with later version of lzop: " << &_header_info.version
<< " must be less than: " << LZOP_VERSION;
return Status(ss.str());
}
// 3. lib version
ptr = get_uint16(ptr, &_header_info.lib_version);
if (_header_info.lib_version < MIN_LZO_VERSION) {
std::stringstream ss;
ss << "compressed with incompatible lzo version: " << &_header_info.lib_version
<< "must be at least: " << MIN_LZO_VERSION;
return Status(ss.str());
}
// 4. version needed
ptr = get_uint16(ptr, &_header_info.version_needed);
if (_header_info.version_needed > LZOP_VERSION) {
std::stringstream ss;
ss << "compressed with imp incompatible lzo version: " << &_header_info.version
<< " must be at no more than: " << LZOP_VERSION;
return Status(ss.str());
}
// 5. method
ptr = get_uint8(ptr, &_header_info.method);
if (_header_info.method < 1 || _header_info.method > 3) {
std::stringstream ss;
ss << "invalid compression method: " << _header_info.method;
return Status(ss.str());
}
// 6. skip level
++ptr;
// 7. flags
uint32_t flags;
ptr = get_uint32(ptr, &flags);
if (flags & (F_RESERVED | F_MULTIPART | F_H_FILTER)) {
std::stringstream ss;
ss << "unsupported lzo flags: " << flags;
return Status(ss.str());
}
_header_info.header_checksum_type = header_type(flags);
_header_info.input_checksum_type = input_type(flags);
_header_info.output_checksum_type = output_type(flags);
// 8. skip mode and mtime
ptr += 3 * sizeof(int32_t);
// 9. filename
uint8_t filename_len;
ptr = get_uint8(ptr, &filename_len);
// here we have already consumed MIN_HEADER_SIZE bytes.
// from now on we have to check that the remaining input is enough for each step
size_t left = input_len - (ptr - input);
if (left < filename_len) {
*more_input_bytes = filename_len - left;
return Status::OK;
}
_header_info.filename = std::string((char*) ptr, (size_t) filename_len);
ptr += filename_len;
left -= filename_len;
// 10. checksum
if (left < sizeof(uint32_t)) {
*more_input_bytes = sizeof(uint32_t) - left;
return Status::OK;
}
uint32_t expected_checksum;
uint8_t* cur = ptr;
ptr = get_uint32(ptr, &expected_checksum);
uint32_t computed_checksum;
if (_header_info.header_checksum_type == CHECK_CRC32) {
computed_checksum = CRC32_INIT_VALUE;
computed_checksum = lzo_crc32(computed_checksum, header, cur - header);
} else {
computed_checksum = ADLER32_INIT_VALUE;
computed_checksum = lzo_adler32(computed_checksum, header, cur - header);
}
if (computed_checksum != expected_checksum) {
std::stringstream ss;
ss << "invalid header checksum: " << computed_checksum
<< " expected: " << expected_checksum;
return Status(ss.str());
}
left -= sizeof(uint32_t);
// 11. skip extra
if (flags & F_H_EXTRA_FIELD) {
if (left < sizeof(uint32_t)) {
*more_input_bytes = sizeof(uint32_t) - left;
return Status::OK;
}
uint32_t extra_len;
ptr = get_uint32(ptr, &extra_len);
left -= sizeof(uint32_t);
// add the checksum and the len to the total ptr size.
if (left < sizeof(int32_t) + extra_len) {
*more_input_bytes = sizeof(int32_t) + extra_len - left;
return Status::OK;
}
left -= sizeof(int32_t) + extra_len;
ptr += sizeof(int32_t) + extra_len;
}
_header_info.header_size = ptr - input;
*input_bytes_read = _header_info.header_size;
_is_header_loaded = true;
LOG(INFO) << debug_info();
return Status::OK;
}
Status LzopDecompressor::checksum(LzoChecksum type, const std::string& source,
uint32_t expected,
uint8_t* ptr, size_t len) {
uint32_t computed_checksum;
switch (type) {
case CHECK_NONE:
return Status::OK;
case CHECK_CRC32:
computed_checksum = lzo_crc32(CRC32_INIT_VALUE, ptr, len);
break;
case CHECK_ADLER:
computed_checksum = lzo_adler32(ADLER32_INIT_VALUE, ptr, len);
break;
default:
std::stringstream ss;
ss << "Invalid checksum type: " << type;
return Status(ss.str());
}
if (computed_checksum != expected) {
std::stringstream ss;
ss << "checksum of " << source << " block failed."
<< " computed checksum: " << computed_checksum
<< " expected: " << expected;
return Status(ss.str());
}
return Status::OK;
}
std::string LzopDecompressor::debug_info() {
std::stringstream ss;
ss << "LzopDecompressor."
<< " version: " << _header_info.version
<< " lib version: " << _header_info.lib_version
<< " version needed: " << _header_info.version_needed
<< " method: " << (uint16_t) _header_info.method
<< " filename: " << _header_info.filename
<< " header size: " << _header_info.header_size
<< " header checksum type: " << _header_info.header_checksum_type
<< " input checksum type: " << _header_info.input_checksum_type
<< " ouput checksum type: " << _header_info.output_checksum_type;
return ss.str();
}
} // namespace

be/src/exec/decompressor.h Normal file
@ -0,0 +1,262 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#pragma once
#include <zlib.h>
#include <bzlib.h>
#include <lz4/lz4frame.h>
#include <lzo/lzoconf.h>
#include <lzo/lzo1x.h>
#include "common/status.h"
namespace palo {
enum CompressType {
UNCOMPRESSED,
GZIP,
DEFLATE,
BZIP2,
LZ4FRAME,
LZOP
};
class Decompressor {
public:
virtual ~Decompressor();
// implement in derived class
// input(in): buf where decompress begin
// input_len(in): max length of input buf
// input_bytes_read(out): bytes which is consumed by decompressor
// output(out): buf where to save decompressed data
// output_max_len(in): max length of output buf
// decompressed_len(out): decompressed data size in output buf
// stream_end(out): true if we reached the end of the stream,
// or normally finished decompressing the entire block
// more_input_bytes(out): the decompressor needs this many more input bytes to continue
// more_output_bytes(out): the decompressor needs this much more output space to continue
//
// input and output buf should be allocated and released outside
virtual Status decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) = 0;
public:
static Status create_decompressor(CompressType type,
Decompressor** decompressor);
virtual std::string debug_info();
CompressType get_type() { return _ctype; }
protected:
virtual Status init() = 0;
Decompressor(CompressType ctype):_ctype(ctype) {}
CompressType _ctype;
};
class GzipDecompressor : public Decompressor {
public:
virtual ~GzipDecompressor();
virtual Status decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) override;
virtual std::string debug_info() override;
private:
friend class Decompressor;
GzipDecompressor(bool is_deflate);
virtual Status init() override;
private:
bool _is_deflate;
z_stream _z_strm;
// These are magic numbers from zlib.h. Not clear why they are not defined there.
const static int WINDOW_BITS = 15; // Maximum window size
const static int DETECT_CODEC = 32; // Determine if this is libz or gzip from header.
};
class Bzip2Decompressor : public Decompressor {
public:
virtual ~Bzip2Decompressor();
virtual Status decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) override;
virtual std::string debug_info() override;
private:
friend class Decompressor;
Bzip2Decompressor() : Decompressor(CompressType::BZIP2) {}
virtual Status init() override;
private:
bz_stream _bz_strm;
};
class Lz4FrameDecompressor : public Decompressor {
public:
virtual ~Lz4FrameDecompressor();
virtual Status decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) override;
virtual std::string debug_info() override;
private:
friend class Decompressor;
Lz4FrameDecompressor() : Decompressor(CompressType::LZ4FRAME) {}
virtual Status init() override;
size_t get_block_size(const LZ4F_frameInfo_t* info);
private:
LZ4F_dctx* _dctx;
size_t _expect_dec_buf_size;
const static unsigned PALO_LZ4F_VERSION;
};
class LzopDecompressor : public Decompressor {
public:
virtual ~LzopDecompressor();
virtual Status decompress(
uint8_t* input, size_t input_len, size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool* stream_end,
size_t* more_input_bytes, size_t* more_output_bytes) override;
virtual std::string debug_info() override;
private:
friend class Decompressor;
LzopDecompressor() :
Decompressor(CompressType::LZOP),
_header_info({0}),
_is_header_loaded(false) {}
virtual Status init() override;
private:
enum LzoChecksum {
CHECK_NONE,
CHECK_CRC32,
CHECK_ADLER
};
private:
inline uint8_t* get_uint8(uint8_t* ptr, uint8_t* value) {
*value = *ptr;
return ptr + sizeof(uint8_t);
}
inline uint8_t* get_uint16(uint8_t* ptr, uint16_t* value) {
*value = *ptr << 8 | *(ptr + 1);
return ptr + sizeof(uint16_t);
}
inline uint8_t* get_uint32(uint8_t* ptr, uint32_t* value) {
*value = (*ptr << 24) | (*(ptr + 1) << 16) | (*(ptr + 2) << 8) | *(ptr + 3);
return ptr + sizeof(uint32_t);
}
inline LzoChecksum header_type(int flags) {
return (flags & F_H_CRC32) ? CHECK_CRC32 : CHECK_ADLER;
}
inline LzoChecksum input_type(int flags) {
return (flags & F_CRC32_C) ? CHECK_CRC32 :
(flags & F_ADLER32_C) ? CHECK_ADLER : CHECK_NONE;
}
inline LzoChecksum output_type(int flags) {
return (flags & F_CRC32_D) ? CHECK_CRC32 :
(flags & F_ADLER32_D) ? CHECK_ADLER : CHECK_NONE;
}
Status parse_header_info(uint8_t* input, size_t input_len,
size_t* input_bytes_read,
size_t* more_bytes_needed);
Status checksum(LzoChecksum type, const std::string& source,
uint32_t expected,
uint8_t* ptr, size_t len);
private:
// lzop header info
struct HeaderInfo {
uint16_t version;
uint16_t lib_version;
uint16_t version_needed;
uint8_t method;
std::string filename;
uint32_t header_size;
LzoChecksum header_checksum_type;
LzoChecksum input_checksum_type;
LzoChecksum output_checksum_type;
};
struct HeaderInfo _header_info;
// true if header is decompressed and loaded
bool _is_header_loaded;
private:
const static uint8_t LZOP_MAGIC[9];
const static uint64_t LZOP_VERSION;
const static uint64_t MIN_LZO_VERSION;
const static uint32_t MIN_HEADER_SIZE;
const static uint32_t LZO_MAX_BLOCK_SIZE;
const static uint32_t CRC32_INIT_VALUE;
const static uint32_t ADLER32_INIT_VALUE;
const static uint64_t F_H_CRC32;
const static uint64_t F_MASK;
const static uint64_t F_OS_MASK;
const static uint64_t F_CS_MASK;
const static uint64_t F_RESERVED;
const static uint64_t F_MULTIPART;
const static uint64_t F_H_FILTER;
const static uint64_t F_H_EXTRA_FIELD;
const static uint64_t F_CRC32_C;
const static uint64_t F_ADLER32_C;
const static uint64_t F_CRC32_D;
const static uint64_t F_ADLER32_D;
};
} // namespace
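The decompress() contract above (consume some input, produce some output, ask for more via more_input_bytes / more_output_bytes) implies a driver loop like the following sketch. It is a simplified in-memory illustration, assuming the whole compressed input is already buffered; decompress_all is a hypothetical helper.

#include <memory>
#include <vector>
#include "exec/decompressor.h"

namespace palo {

// Hypothetical one-shot driver around the streaming decompress() interface.
Status decompress_all(CompressType type, uint8_t* input, size_t input_len,
                      std::vector<uint8_t>* out) {
    Decompressor* raw = nullptr;
    RETURN_IF_ERROR(Decompressor::create_decompressor(type, &raw));
    if (raw == nullptr) { // UNCOMPRESSED: the factory yields no decompressor
        out->assign(input, input + input_len);
        return Status::OK;
    }
    std::unique_ptr<Decompressor> dec(raw);
    std::vector<uint8_t> buf(64 * 1024);
    size_t consumed = 0;
    while (consumed < input_len) {
        size_t read = 0;
        size_t written = 0;
        size_t more_in = 0;
        size_t more_out = 0;
        bool stream_end = false; // marks the end of one stream/block, see above
        RETURN_IF_ERROR(dec->decompress(input + consumed, input_len - consumed, &read,
                                        buf.data(), buf.size(), &written, &stream_end,
                                        &more_in, &more_out));
        consumed += read;
        out->insert(out->end(), buf.data(), buf.data() + written);
        if (more_out > 0) {
            buf.resize(buf.size() + more_out); // grow the output buffer as requested
        }
        if (read == 0 && written == 0 && more_out == 0) {
            return Status("decompressor made no progress; input may be truncated");
        }
    }
    return Status::OK;
}

} // namespace palo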

be/src/exec/empty_set_node.cpp Normal file
@ -0,0 +1,35 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/empty_set_node.h"
namespace palo {
EmptySetNode::EmptySetNode(ObjectPool* pool, const TPlanNode& tnode,
const DescriptorTbl& descs)
: ExecNode(pool, tnode, descs) {
}
Status EmptySetNode::get_next(RuntimeState* state, RowBatch* row_batch, bool* eos) {
*eos = true;
return Status::OK;
}
}

be/src/exec/empty_set_node.h Normal file
@ -0,0 +1,35 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#pragma once
#include "exec/exec_node.h"
namespace palo {
/// Node that returns an empty result set, i.e., just sets eos_ in GetNext().
/// Corresponds to EmptySetNode.java in the FE.
class EmptySetNode : public ExecNode {
public:
EmptySetNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos) override;
};
}

be/src/exec/exchange_node.cpp Normal file
@ -0,0 +1,254 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/exchange_node.h"
#include <boost/scoped_ptr.hpp>
#include "runtime/data_stream_mgr.h"
#include "runtime/data_stream_recvr.h"
#include "runtime/runtime_state.h"
#include "runtime/row_batch.h"
#include "util/debug_util.h"
#include "util/runtime_profile.h"
#include "gen_cpp/PlanNodes_types.h"
namespace palo {
ExchangeNode::ExchangeNode(
ObjectPool* pool,
const TPlanNode& tnode,
const DescriptorTbl& descs) :
ExecNode(pool, tnode, descs),
_num_senders(0),
_stream_recvr(NULL),
_input_row_desc(descs, tnode.exchange_node.input_row_tuples,
std::vector<bool>(
tnode.nullable_tuples.begin(),
tnode.nullable_tuples.begin() + tnode.exchange_node.input_row_tuples.size())),
_next_row_idx(0),
_is_merging(tnode.exchange_node.__isset.sort_info),
_offset(tnode.exchange_node.__isset.offset ? tnode.exchange_node.offset : 0),
_num_rows_skipped(0) {
DCHECK_GE(_offset, 0);
DCHECK(_is_merging || (_offset == 0));
}
Status ExchangeNode::init(const TPlanNode& tnode) {
RETURN_IF_ERROR(ExecNode::init(tnode));
if (!_is_merging) {
return Status::OK;
}
RETURN_IF_ERROR(_sort_exec_exprs.init(tnode.exchange_node.sort_info, _pool));
_is_asc_order = tnode.exchange_node.sort_info.is_asc_order;
_nulls_first = tnode.exchange_node.sort_info.nulls_first;
return Status::OK;
}
Status ExchangeNode::prepare(RuntimeState* state) {
RETURN_IF_ERROR(ExecNode::prepare(state));
_convert_row_batch_timer = ADD_TIMER(runtime_profile(), "ConvertRowBatchTime");
// TODO: figure out appropriate buffer size
DCHECK_GT(_num_senders, 0);
_stream_recvr = state->exec_env()->stream_mgr()->create_recvr(
state, _input_row_desc,
state->fragment_instance_id(), _id,
_num_senders, config::exchg_node_buffer_size_bytes,
state->runtime_profile(), _is_merging);
if (_is_merging) {
RETURN_IF_ERROR(_sort_exec_exprs.prepare(
state, _row_descriptor, _row_descriptor, expr_mem_tracker()));
// AddExprCtxsToFree(_sort_exec_exprs);
}
return Status::OK;
}
Status ExchangeNode::open(RuntimeState* state) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_ERROR(ExecNode::open(state));
if (_is_merging) {
RETURN_IF_ERROR(_sort_exec_exprs.open(state));
TupleRowComparator less_than(_sort_exec_exprs, _is_asc_order, _nulls_first);
// create_merger() will populate its merging heap with batches from the _stream_recvr,
// so it is not necessary to call fill_input_row_batch().
RETURN_IF_ERROR(_stream_recvr->create_merger(less_than));
} else {
RETURN_IF_ERROR(fill_input_row_batch(state));
}
return Status::OK;
}
Status ExchangeNode::close(RuntimeState* state) {
if (is_closed()) {
return Status::OK;
}
if (_is_merging) {
_sort_exec_exprs.close(state);
}
if (_stream_recvr != NULL) {
_stream_recvr->close();
}
_stream_recvr.reset();
return ExecNode::close(state);
}
Status ExchangeNode::fill_input_row_batch(RuntimeState* state) {
DCHECK(!_is_merging);
Status ret_status;
{
// SCOPED_TIMER(state->total_network_receive_timer());
ret_status = _stream_recvr->get_batch(&_input_batch);
}
VLOG_FILE << "exch: has batch=" << (_input_batch == NULL ? "false" : "true")
<< " #rows=" << (_input_batch != NULL ? _input_batch->num_rows() : 0)
<< " is_cancelled=" << (ret_status.is_cancelled() ? "true" : "false")
<< " instance_id=" << state->fragment_instance_id();
return ret_status;
}
Status ExchangeNode::get_next(RuntimeState* state, RowBatch* output_batch, bool* eos) {
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::GETNEXT));
SCOPED_TIMER(_runtime_profile->total_time_counter());
if (reached_limit()) {
_stream_recvr->transfer_all_resources(output_batch);
*eos = true;
return Status::OK;
} else {
*eos = false;
}
if (_is_merging) {
return get_next_merging(state, output_batch, eos);
}
ExprContext* const* ctxs = &_conjunct_ctxs[0];
int num_ctxs = _conjunct_ctxs.size();
while (true) {
{
SCOPED_TIMER(_convert_row_batch_timer);
RETURN_IF_CANCELLED(state);
// RETURN_IF_ERROR(QueryMaintenance(state));
RETURN_IF_ERROR(state->check_query_state());
// copy rows until we hit the limit/capacity or until we exhaust _input_batch
while (!reached_limit() && !output_batch->at_capacity()
&& _input_batch != NULL && _next_row_idx < _input_batch->capacity()) {
TupleRow* src = _input_batch->get_row(_next_row_idx);
if (ExecNode::eval_conjuncts(ctxs, num_ctxs, src)) {
int j = output_batch->add_row();
TupleRow* dest = output_batch->get_row(j);
// if the input row is shorter than the output row, make sure not to leave
// uninitialized Tuple* around
output_batch->clear_row(dest);
// this works as expected if rows from input_batch form a prefix of
// rows in output_batch
_input_batch->copy_row(src, dest);
output_batch->commit_last_row();
++_num_rows_returned;
}
++_next_row_idx;
}
if (VLOG_ROW_IS_ON) {
VLOG_ROW << "ExchangeNode output batch: " << print_batch(output_batch);
}
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
if (reached_limit()) {
_stream_recvr->transfer_all_resources(output_batch);
*eos = true;
return Status::OK;
}
if (output_batch->at_capacity()) {
*eos = false;
return Status::OK;
}
}
// we need more rows
if (_input_batch != NULL) {
_input_batch->transfer_resource_ownership(output_batch);
}
RETURN_IF_ERROR(fill_input_row_batch(state));
*eos = (_input_batch == NULL);
if (*eos) {
return Status::OK;
}
_next_row_idx = 0;
DCHECK(_input_batch->row_desc().is_prefix_of(output_batch->row_desc()));
}
}
Status ExchangeNode::get_next_merging(RuntimeState* state, RowBatch* output_batch, bool* eos) {
DCHECK_EQ(output_batch->num_rows(), 0);
RETURN_IF_CANCELLED(state);
// RETURN_IF_ERROR(QueryMaintenance(state));
RETURN_IF_ERROR(state->check_query_state());
RETURN_IF_ERROR(_stream_recvr->get_next(output_batch, eos));
while (_num_rows_skipped < _offset) {
_num_rows_skipped += output_batch->num_rows();
// Throw away rows in the output batch until the offset is skipped.
int rows_to_keep = _num_rows_skipped - _offset;
if (rows_to_keep > 0) {
output_batch->copy_rows(0, output_batch->num_rows() - rows_to_keep, rows_to_keep);
output_batch->set_num_rows(rows_to_keep);
} else {
output_batch->set_num_rows(0);
}
if (rows_to_keep > 0 || *eos || output_batch->at_capacity()) {
break;
}
RETURN_IF_ERROR(_stream_recvr->get_next(output_batch, eos));
}
_num_rows_returned += output_batch->num_rows();
if (reached_limit()) {
output_batch->set_num_rows(output_batch->num_rows() - (_num_rows_returned - _limit));
*eos = true;
}
// On eos, transfer all remaining resources from the input batches maintained
// by the merger to the output batch.
if (*eos) {
_stream_recvr->transfer_all_resources(output_batch);
}
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
return Status::OK;
}
void ExchangeNode::debug_string(int indentation_level, std::stringstream* out) const {
*out << std::string(indentation_level * 2, ' ');
*out << "ExchangeNode(#senders=" << _num_senders;
ExecNode::debug_string(indentation_level, out);
*out << ")";
}
}

be/src/exec/exchange_node.h Normal file
@ -0,0 +1,117 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_EXCHANGE_NODE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_EXCHANGE_NODE_H
#include <boost/scoped_ptr.hpp>
#include "exec/exec_node.h"
#include "exec/sort_exec_exprs.h"
namespace palo {
class RowBatch;
class DataStreamRecvr;
// Receiver node for data streams. The data stream receiver is created in Prepare()
// and closed in Close().
// is_merging is set to indicate that rows from different senders must be merged
// according to the sort parameters in _sort_exec_exprs. (It is assumed that the rows
// received from the senders themselves are sorted.)
// If _is_merging is true, the exchange node creates a DataStreamRecvr with the
// _is_merging flag and retrieves rows from the receiver via calls to
// DataStreamRecvr::GetNext(). It also prepares, opens and closes the ordering exprs in
// its SortExecExprs member that are used to compare rows.
// If _is_merging is false, the exchange node directly retrieves batches from the row
// batch queue of the DataStreamRecvr via calls to DataStreamRecvr::GetBatch().
class ExchangeNode : public ExecNode {
public:
ExchangeNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual ~ExchangeNode() {}
virtual Status init(const TPlanNode& tnode);
virtual Status prepare(RuntimeState* state);
// Blocks until the first batch is available for consumption via GetNext().
virtual Status open(RuntimeState* state);
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos);
virtual Status close(RuntimeState* state);
// the number of senders needs to be set after the c'tor and before calling
// prepare(), because it's not recorded in TPlanNode
void set_num_senders(int num_senders) {
_num_senders = num_senders;
}
protected:
virtual void debug_string(int indentation_level, std::stringstream* out) const;
private:
// Implements GetNext() for the case where _is_merging is true. Delegates the GetNext()
// call to the underlying DataStreamRecvr.
Status get_next_merging(RuntimeState* state, RowBatch* output_batch, bool* eos);
// Resets _input_batch to the next batch from the from _stream_recvr's queue.
// Only used when _is_merging is false.
Status fill_input_row_batch(RuntimeState* state);
int _num_senders; // needed for _stream_recvr construction
// created in prepare() and owned by the RuntimeState
boost::shared_ptr<DataStreamRecvr> _stream_recvr;
// our input rows are a prefix of the rows we produce
RowDescriptor _input_row_desc;
// the size of our input batches does not necessarily match the capacity
// of our output batches, which means that we need to buffer the input
// Current batch of rows from the receiver queue being processed by this node.
// Only valid if _is_merging is false. (If _is_merging is true, GetNext() is
// delegated to the receiver). Owned by the stream receiver.
// boost::scoped_ptr<RowBatch> _input_batch;
RowBatch* _input_batch;
// Next row to copy from _input_batch. For non-merging exchanges, _input_batch
// is retrieved directly from the sender queue in the stream recvr, and rows from
// _input_batch must be copied to the output batch in GetNext().
int _next_row_idx;
// time spent reconstructing received rows
RuntimeProfile::Counter* _convert_row_batch_timer;
// True if this is a merging exchange node. If true, GetNext() is delegated to the
// underlying _stream_recvr, and _input_batch is not used/valid.
bool _is_merging;
// Sort expressions and parameters passed to the merging receiver.
SortExecExprs _sort_exec_exprs;
std::vector<bool> _is_asc_order;
std::vector<bool> _nulls_first;
// Offset specifying number of rows to skip.
int64_t _offset;
// Number of rows skipped so far.
int64_t _num_rows_skipped;
};
} // namespace palo
#endif

611
be/src/exec/exec_node.cpp Normal file
View File

@ -0,0 +1,611 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/exec_node.h"
#include <sstream>
#include <thrift/protocol/TDebugProtocol.h>
#include <unistd.h>
#include "codegen/llvm_codegen.h"
#include "codegen/codegen_anyval.h"
#include "common/object_pool.h"
#include "common/status.h"
#include "exprs/expr_context.h"
#include "exec/aggregation_node.h"
#include "exec/partitioned_aggregation_node.h"
#include "exec/csv_scan_node.h"
#include "exec/pre_aggregation_node.h"
#include "exec/hash_join_node.h"
#include "exec/broker_scan_node.h"
#include "exec/cross_join_node.h"
#include "exec/empty_set_node.h"
#include "exec/mysql_scan_node.h"
#include "exec/schema_scan_node.h"
#include "exec/exchange_node.h"
#include "exec/merge_join_node.h"
#include "exec/merge_node.h"
#include "exec/olap_rewrite_node.h"
#include "exec/olap_scan_node.h"
#include "exec/topn_node.h"
#include "exec/sort_node.h"
#include "exec/spill_sort_node.h"
#include "exec/analytic_eval_node.h"
#include "exec/select_node.h"
#include "exec/union_node.h"
#include "runtime/descriptors.h"
#include "runtime/mem_pool.h"
#include "runtime/mem_tracker.h"
#include "runtime/row_batch.h"
#include "runtime/runtime_state.h"
#include "util/debug_util.h"
#include "util/runtime_profile.h"
using llvm::Function;
using llvm::PointerType;
using llvm::Type;
using llvm::Value;
using llvm::LLVMContext;
using llvm::BasicBlock;
namespace palo {
const std::string ExecNode::ROW_THROUGHPUT_COUNTER = "RowsReturnedRate";
int ExecNode::get_node_id_from_profile(RuntimeProfile* p) {
return p->metadata();
}
ExecNode::RowBatchQueue::RowBatchQueue(int max_batches) :
BlockingQueue<RowBatch*>(max_batches) {
}
ExecNode::RowBatchQueue::~RowBatchQueue() {
DCHECK(cleanup_queue_.empty());
}
void ExecNode::RowBatchQueue::AddBatch(RowBatch* batch) {
if (!blocking_put(batch)) {
std::lock_guard<std::mutex> lock(lock_);
cleanup_queue_.push_back(batch);
}
}
bool ExecNode::RowBatchQueue::AddBatchWithTimeout(RowBatch* batch,
int64_t timeout_micros) {
// return blocking_put_with_timeout(batch, timeout_micros);
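// NOTE: the timeout path is currently disabled; this falls back to an
// unbounded blocking put, so 'timeout_micros' is ignored.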
return blocking_put(batch);
}
RowBatch* ExecNode::RowBatchQueue::GetBatch() {
RowBatch* result = NULL;
if (blocking_get(&result)) return result;
return NULL;
}
int ExecNode::RowBatchQueue::Cleanup() {
int num_io_buffers = 0;
// RowBatch* batch = NULL;
// while ((batch = GetBatch()) != NULL) {
// num_io_buffers += batch->num_io_buffers();
// delete batch;
// }
lock_guard<std::mutex> l(lock_);
for (std::list<RowBatch*>::iterator it = cleanup_queue_.begin();
it != cleanup_queue_.end(); ++it) {
// num_io_buffers += (*it)->num_io_buffers();
delete *it;
}
cleanup_queue_.clear();
return num_io_buffers;
}
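// Illustrative sketch (not part of the original source) of the intended
// producer/consumer protocol; the thread roles below are assumptions:
//
//   RowBatchQueue queue(10);          // at most 10 batches in flight
//   // producer thread:
//   queue.AddBatch(batch);            // blocks while full; after the queue is
//                                     // shut down, batches land in cleanup_queue_
//   // consumer thread:
//   RowBatch* b = queue.GetBatch();   // NULL once the queue is shut down
//   // teardown:
//   queue.Cleanup();                  // deletes the orphaned batches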
ExecNode::ExecNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs) :
_id(tnode.node_id),
_type(tnode.node_type),
_pool(pool),
_tuple_ids(tnode.row_tuples),
_row_descriptor(descs, tnode.row_tuples, tnode.nullable_tuples),
_debug_phase(TExecNodePhase::INVALID),
_debug_action(TDebugAction::WAIT),
_limit(tnode.limit),
_num_rows_returned(0),
_rows_returned_counter(NULL),
_rows_returned_rate(NULL),
_memory_used_counter(NULL),
_is_closed(false) {
init_runtime_profile(print_plan_node_type(tnode.node_type));
}
ExecNode::~ExecNode() {
}
void ExecNode::push_down_predicate(
RuntimeState* state, std::list<ExprContext*>* expr_ctxs) {
for (int i = 0; i < _children.size(); ++i) {
_children[i]->push_down_predicate(state, expr_ctxs);
if (expr_ctxs->size() == 0) {
return;
}
}
std::list<ExprContext*>::iterator iter = expr_ctxs->begin();
while (iter != expr_ctxs->end()) {
if ((*iter)->root()->is_bound(&_tuple_ids)) {
// LOG(INFO) << "push down success expr is " << (*iter)->debug_string()
// << " and node is " << debug_string();
(*iter)->prepare(state, row_desc(), _expr_mem_tracker.get());
(*iter)->open(state);
_conjunct_ctxs.push_back(*iter);
iter = expr_ctxs->erase(iter);
} else {
++iter;
}
}
}
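// Illustrative sketch (not part of the original source), assuming a
// hypothetical conjunct 'ctx' (e.g. col_a > 10) fully bound by a scan child:
//
//   std::list<ExprContext*> ctxs = { ctx };
//   root->push_down_predicate(state, &ctxs);
//   // 'ctxs' is now empty: the conjunct was prepared, opened and adopted
//   // by the deepest node whose _tuple_ids bind all of its slot refs;
//   // unbound conjuncts would remain in the list.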
Status ExecNode::init(const TPlanNode& tnode) {
RETURN_IF_ERROR(
Expr::create_expr_trees(_pool, tnode.conjuncts, &_conjunct_ctxs));
return Status::OK;
}
Status ExecNode::prepare(RuntimeState* state) {
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::PREPARE));
DCHECK(_runtime_profile.get() != NULL);
_rows_returned_counter =
ADD_COUNTER(_runtime_profile, "RowsReturned", TUnit::UNIT);
_memory_used_counter =
ADD_COUNTER(_runtime_profile, "MemoryUsed", TUnit::BYTES);
_rows_returned_rate = runtime_profile()->add_derived_counter(
ROW_THROUGHPUT_COUNTER, TUnit::UNIT_PER_SECOND,
boost::bind<int64_t>(&RuntimeProfile::units_per_second,
_rows_returned_counter,
runtime_profile()->total_time_counter()),
"");
_mem_tracker.reset(new MemTracker(-1, _runtime_profile->name(), state->instance_mem_tracker()));
_expr_mem_tracker.reset(new MemTracker(-1, "Exprs", _mem_tracker.get()));
RETURN_IF_ERROR(Expr::prepare(_conjunct_ctxs, state, row_desc(), expr_mem_tracker()));
// TODO(zc):
// AddExprCtxsToFree(_conjunct_ctxs);
for (int i = 0; i < _children.size(); ++i) {
RETURN_IF_ERROR(_children[i]->prepare(state));
}
return Status::OK;
}
Status ExecNode::open(RuntimeState* state) {
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::OPEN));
return Expr::open(_conjunct_ctxs, state);
}
Status ExecNode::close(RuntimeState* state) {
if (_is_closed) {
return Status::OK;
}
_is_closed = true;
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::CLOSE));
if (_rows_returned_counter != NULL) {
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
}
Status result;
for (int i = 0; i < _children.size(); ++i) {
result.add_error(_children[i]->close(state));
}
Expr::close(_conjunct_ctxs, state);
return result;
}
void ExecNode::add_runtime_exec_option(const std::string& str) {
lock_guard<mutex> l(_exec_options_lock);
if (_runtime_exec_options.empty()) {
_runtime_exec_options = str;
} else {
_runtime_exec_options.append(", ");
_runtime_exec_options.append(str);
}
runtime_profile()->add_info_string("ExecOption", _runtime_exec_options);
}
Status ExecNode::create_tree(ObjectPool* pool, const TPlan& plan,
const DescriptorTbl& descs, ExecNode** root) {
if (plan.nodes.size() == 0) {
*root = NULL;
return Status::OK;
}
int node_idx = 0;
RETURN_IF_ERROR(create_tree_helper(pool, plan.nodes, descs, NULL, &node_idx, root));
if (node_idx + 1 != plan.nodes.size()) {
// TODO: print thrift msg for diagnostic purposes.
return Status(
"Plan tree only partially reconstructed. Not all thrift nodes were used.");
}
return Status::OK;
}
Status ExecNode::create_tree_helper(
ObjectPool* pool,
const vector<TPlanNode>& tnodes,
const DescriptorTbl& descs,
ExecNode* parent,
int* node_idx,
ExecNode** root) {
// propagate error case
if (*node_idx >= tnodes.size()) {
// TODO: print thrift msg
return Status("Failed to reconstruct plan tree from thrift.");
}
const TPlanNode& tnode = tnodes[*node_idx];
int num_children = tnodes[*node_idx].num_children;
ExecNode* node = NULL;
RETURN_IF_ERROR(create_node(pool, tnodes[*node_idx], descs, &node));
// assert(parent != NULL || (node_idx == 0 && root_expr != NULL));
if (parent != NULL) {
parent->_children.push_back(node);
} else {
*root = node;
}
for (int i = 0; i < num_children; i++) {
++*node_idx;
RETURN_IF_ERROR(create_tree_helper(pool, tnodes, descs, node, node_idx, NULL));
// we are expecting a child, but have used all nodes
// this means we have been given a bad tree and must fail
if (*node_idx >= tnodes.size()) {
// TODO: print thrift msg
return Status("Failed to reconstruct plan tree from thrift.");
}
}
RETURN_IF_ERROR(node->init(tnode));
// build up tree of profiles; add children >0 first, so that when we print
// the profile, child 0 is printed last (makes the output more readable)
for (int i = 1; i < node->_children.size(); ++i) {
node->runtime_profile()->add_child(node->_children[i]->runtime_profile(), true, NULL);
}
if (!node->_children.empty()) {
node->runtime_profile()->add_child(node->_children[0]->runtime_profile(), false, NULL);
}
return Status::OK;
}
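// Illustrative sketch (not part of the original source): 'plan.nodes' is a
// depth-first preorder list, so a hypothetical plan
//
//   plan.nodes = [ HASH_JOIN (num_children=2), OLAP_SCAN, EXCHANGE ]
//
// is reconstructed by create_tree() as
//
//   HASH_JOIN
//       child 0: OLAP_SCAN
//       child 1: EXCHANGE
//
// and any unused trailing node makes create_tree() fail with
// "Plan tree only partially reconstructed".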
Status ExecNode::create_node(ObjectPool* pool, const TPlanNode& tnode,
const DescriptorTbl& descs, ExecNode** node) {
std::stringstream error_msg;
VLOG(2) << "tnode:\n" << apache::thrift::ThriftDebugString(tnode);
switch (tnode.node_type) {
case TPlanNodeType::CSV_SCAN_NODE:
*node = pool->add(new CsvScanNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::MYSQL_SCAN_NODE:
*node = pool->add(new MysqlScanNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::SCHEMA_SCAN_NODE:
*node = pool->add(new SchemaScanNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::OLAP_SCAN_NODE:
*node = pool->add(new OlapScanNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::AGGREGATION_NODE:
if (config::enable_partitioned_aggregation) {
*node = pool->add(new PartitionedAggregationNode(pool, tnode, descs));
} else {
*node = pool->add(new AggregationNode(pool, tnode, descs));
}
return Status::OK;
/*case TPlanNodeType::PRE_AGGREGATION_NODE:
*node = pool->add(new PreAggregationNode(pool, tnode, descs));
return Status::OK;*/
case TPlanNodeType::HASH_JOIN_NODE:
*node = pool->add(new HashJoinNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::CROSS_JOIN_NODE:
*node = pool->add(new CrossJoinNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::MERGE_JOIN_NODE:
*node = pool->add(new MergeJoinNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::EMPTY_SET_NODE:
*node = pool->add(new EmptySetNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::EXCHANGE_NODE:
*node = pool->add(new ExchangeNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::SELECT_NODE:
*node = pool->add(new SelectNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::OLAP_REWRITE_NODE:
*node = pool->add(new OlapRewriteNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::SORT_NODE:
if (tnode.sort_node.use_top_n) {
*node = pool->add(new TopNNode(pool, tnode, descs));
} else {
*node = pool->add(new SpillSortNode(pool, tnode, descs));
}
return Status::OK;
case TPlanNodeType::ANALYTIC_EVAL_NODE:
*node = pool->add(new AnalyticEvalNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::MERGE_NODE:
*node = pool->add(new MergeNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::UNION_NODE:
*node = pool->add(new UnionNode(pool, tnode, descs));
return Status::OK;
case TPlanNodeType::BROKER_SCAN_NODE:
*node = pool->add(new BrokerScanNode(pool, tnode, descs));
return Status::OK;
default:
map<int, const char*>::const_iterator i =
_TPlanNodeType_VALUES_TO_NAMES.find(tnode.node_type);
const char* str = "unknown node type";
if (i != _TPlanNodeType_VALUES_TO_NAMES.end()) {
str = i->second;
}
error_msg << str << " not implemented";
return Status(error_msg.str());
}
return Status::OK;
}
void ExecNode::set_debug_options(
int node_id, TExecNodePhase::type phase, TDebugAction::type action,
ExecNode* root) {
if (root->_id == node_id) {
root->_debug_phase = phase;
root->_debug_action = action;
return;
}
for (int i = 0; i < root->_children.size(); ++i) {
set_debug_options(node_id, phase, action, root->_children[i]);
}
}
std::string ExecNode::debug_string() const {
std::stringstream out;
this->debug_string(0, &out);
return out.str();
}
void ExecNode::debug_string(int indentation_level, std::stringstream* out) const {
*out << " conjuncts=" << Expr::debug_string(_conjuncts);
*out << " id=" << _id;
*out << " type=" << print_plan_node_type(_type);
*out << " tuple_ids=[";
for (auto id : _tuple_ids) {
*out << id << ", ";
}
*out << "]";
for (int i = 0; i < _children.size(); ++i) {
*out << "\n";
_children[i]->debug_string(indentation_level + 1, out);
}
}
bool ExecNode::eval_conjuncts(ExprContext* const* ctxs, int num_ctxs, TupleRow* row) {
for (int i = 0; i < num_ctxs; ++i) {
BooleanVal v = ctxs[i]->get_boolean_val(row);
if (v.is_null || !v.val) {
return false;
}
}
return true;
}
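// Illustrative sketch (not part of the original source): NULL conjunct
// results are treated as false, so a row survives only if every conjunct
// yields a non-null true. For hypothetical conjuncts (a > 10), (b < 5):
//
//   row { a = 12,   b = 3 }  -> true, true   -> kept
//   row { a = NULL, b = 3 }  -> is_null      -> filtered
//   row { a = 12,   b = 7 }  -> true, false  -> filtered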
void ExecNode::collect_nodes(TPlanNodeType::type node_type, vector<ExecNode*>* nodes) {
if (_type == node_type) {
nodes->push_back(this);
}
for (int i = 0; i < _children.size(); ++i) {
_children[i]->collect_nodes(node_type, nodes);
}
}
void ExecNode::collect_scan_nodes(vector<ExecNode*>* nodes) {
collect_nodes(TPlanNodeType::OLAP_SCAN_NODE, nodes);
collect_nodes(TPlanNodeType::BROKER_SCAN_NODE, nodes);
}
void ExecNode::init_runtime_profile(const std::string& name) {
std::stringstream ss;
ss << name << " (id=" << _id << ")";
_runtime_profile.reset(new RuntimeProfile(_pool, ss.str()));
_runtime_profile->set_metadata(_id);
}
Status ExecNode::exec_debug_action(TExecNodePhase::type phase) {
DCHECK(phase != TExecNodePhase::INVALID);
if (_debug_phase != phase) {
return Status::OK;
}
if (_debug_action == TDebugAction::FAIL) {
return Status(TStatusCode::INTERNAL_ERROR, "Debug Action: FAIL");
}
if (_debug_action == TDebugAction::WAIT) {
while (true) {
sleep(1);
}
}
return Status::OK;
}
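// Illustrative sketch (not part of the original source): forcing the node
// with id 3 to fail during its prepare() phase (the id and phase are examples):
//
//   ExecNode::set_debug_options(3, TExecNodePhase::PREPARE,
//                               TDebugAction::FAIL, root);
//   // prepare() on that node now fails with "Debug Action: FAIL";
//   // TDebugAction::WAIT would instead spin in the sleep(1) loop above.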
// Codegen for EvalConjuncts. For a node with two conjunct predicates, the
// generated function looks like:
// define i1 @EvalConjuncts(%"class.impala::ExprContext"** %ctxs, i32 %num_ctxs,
// %"class.impala::TupleRow"* %row) #20 {
// entry:
// %ctx_ptr = getelementptr %"class.impala::ExprContext"** %ctxs, i32 0
// %ctx = load %"class.impala::ExprContext"** %ctx_ptr
// %result = call i16 @Eq_StringVal_StringValWrapper3(
// %"class.impala::ExprContext"* %ctx, %"class.impala::TupleRow"* %row)
// %is_null = trunc i16 %result to i1
// %0 = ashr i16 %result, 8
// %1 = trunc i16 %0 to i8
// %val = trunc i8 %1 to i1
// %is_false = xor i1 %val, true
// %return_false = or i1 %is_null, %is_false
// br i1 %return_false, label %false, label %continue
//
// continue: ; preds = %entry
// %ctx_ptr2 = getelementptr %"class.impala::ExprContext"** %ctxs, i32 1
// %ctx3 = load %"class.impala::ExprContext"** %ctx_ptr2
// %result4 = call i16 @Gt_BigIntVal_BigIntValWrapper5(
// %"class.impala::ExprContext"* %ctx3, %"class.impala::TupleRow"* %row)
// %is_null5 = trunc i16 %result4 to i1
// %2 = ashr i16 %result4, 8
// %3 = trunc i16 %2 to i8
// %val6 = trunc i8 %3 to i1
// %is_false7 = xor i1 %val6, true
// %return_false8 = or i1 %is_null5, %is_false7
// br i1 %return_false8, label %false, label %continue1
//
// continue1: ; preds = %continue
// ret i1 true
//
// false: ; preds = %continue, %entry
// ret i1 false
// }
Function* ExecNode::codegen_eval_conjuncts(
RuntimeState* state, const std::vector<ExprContext*>& conjunct_ctxs, const char* name) {
Function* conjunct_fns[conjunct_ctxs.size()];
for (int i = 0; i < conjunct_ctxs.size(); ++i) {
Status status =
conjunct_ctxs[i]->root()->get_codegend_compute_fn(state, &conjunct_fns[i]);
if (!status.ok()) {
VLOG_QUERY << "Could not codegen EvalConjuncts: " << status.get_error_msg();
return NULL;
}
}
LlvmCodeGen* codegen = NULL;
if (!state->get_codegen(&codegen).ok()) {
return NULL;
}
// Construct function signature to match
// bool EvalConjuncts(Expr** exprs, int num_exprs, TupleRow* row)
Type* tuple_row_type = codegen->get_type(TupleRow::_s_llvm_class_name);
Type* expr_ctx_type = codegen->get_type(ExprContext::_s_llvm_class_name);
DCHECK(tuple_row_type != NULL);
DCHECK(expr_ctx_type != NULL);
PointerType* tuple_row_ptr_type = PointerType::get(tuple_row_type, 0);
PointerType* expr_ctx_ptr_type = PointerType::get(expr_ctx_type, 0);
LlvmCodeGen::FnPrototype prototype(codegen, name, codegen->get_type(TYPE_BOOLEAN));
prototype.add_argument(
LlvmCodeGen::NamedVariable("ctxs", PointerType::get(expr_ctx_ptr_type, 0)));
prototype.add_argument(
LlvmCodeGen::NamedVariable("num_ctxs", codegen->get_type(TYPE_INT)));
prototype.add_argument(LlvmCodeGen::NamedVariable("row", tuple_row_ptr_type));
LlvmCodeGen::LlvmBuilder builder(codegen->context());
Value* args[3];
Function* fn = prototype.generate_prototype(&builder, args);
Value* ctxs_arg = args[0];
Value* tuple_row_arg = args[2];
if (conjunct_ctxs.size() > 0) {
LLVMContext& context = codegen->context();
BasicBlock* false_block = BasicBlock::Create(context, "false", fn);
for (int i = 0; i < conjunct_ctxs.size(); ++i) {
BasicBlock* true_block = BasicBlock::Create(context, "continue", fn, false_block);
Value* ctx_arg_ptr = builder.CreateConstGEP1_32(ctxs_arg, i, "ctx_ptr");
Value* ctx_arg = builder.CreateLoad(ctx_arg_ptr, "ctx");
Value* expr_args[] = { ctx_arg, tuple_row_arg };
// Call conjunct_fns[i]
CodegenAnyVal result = CodegenAnyVal::create_call_wrapped(
codegen, &builder, conjunct_ctxs[i]->root()->type(),
conjunct_fns[i], expr_args, "result", NULL);
// Return false if result.is_null || !result
Value* is_null = result.get_is_null();
Value* is_false = builder.CreateNot(result.get_val(), "is_false");
Value* return_false = builder.CreateOr(is_null, is_false, "return_false");
builder.CreateCondBr(return_false, false_block, true_block);
// Set insertion point for continue/end
builder.SetInsertPoint(true_block);
}
builder.CreateRet(codegen->true_value());
builder.SetInsertPoint(false_block);
builder.CreateRet(codegen->false_value());
} else {
builder.CreateRet(codegen->true_value());
}
return codegen->finalize_function(fn);
}
}

336
be/src/exec/exec_node.h Normal file
View File

@ -0,0 +1,336 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_EXEC_NODE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_EXEC_NODE_H
#include <sstream>
#include <vector>
#include <mutex>
#include "common/status.h"
#include "gen_cpp/PlanNodes_types.h"
#include "runtime/descriptors.h"
#include "runtime/mem_tracker.h"
#include "util/runtime_profile.h"
#include "util/blocking_queue.hpp"
namespace llvm {
class Function;
}
namespace palo {
class Expr;
class ExprContext;
class ObjectPool;
class Counters;
class RowBatch;
class RuntimeState;
class TPlan;
class TupleRow;
class DataSink;
class MemTracker;
using std::string;
using std::stringstream;
using std::vector;
using std::map;
using boost::lock_guard;
using boost::mutex;
// Superclass of all executor nodes.
// All subclasses need to make sure to check RuntimeState::is_cancelled()
// periodically in order to ensure timely termination after the cancellation
// flag gets set.
class ExecNode {
public:
// Init conjuncts.
ExecNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual ~ExecNode();
/// Initializes this object from the thrift tnode desc. The subclass should
/// do any initialization that can fail in Init() rather than the ctor.
/// If overridden in subclass, must first call superclass's Init().
virtual Status init(const TPlanNode& tnode);
// Sets up internal structures, etc., without doing any actual work.
// Must be called prior to open(). Will only be called once in this
// node's lifetime.
// All code generation (adding functions to the LlvmCodeGen object) must happen
// in prepare(). Retrieving the jit compiled function pointer must happen in
// open().
// If overridden in subclass, must first call superclass's prepare().
virtual Status prepare(RuntimeState* state);
// Performs any preparatory work prior to calling get_next().
// Can be called repeatedly (after calls to close()).
// Caller must not be holding any io buffers; doing so can cause deadlock.
virtual Status open(RuntimeState* state);
// Retrieves rows and returns them via row_batch. Sets eos to true
// if subsequent calls will not retrieve any more rows.
// Data referenced by any tuples returned in row_batch must not be overwritten
// by the callee until close() is called. The memory holding that data
// can be returned via row_batch's tuple_data_pool (in which case it may be deleted
// by the caller) or held on to by the callee. The row_batch, including its
// tuple_data_pool, will be destroyed by the caller at some point prior to the final
// close() call.
// In other words, if the memory holding the tuple data will be referenced
// by the callee in subsequent get_next() calls, it must *not* be attached to the
// row_batch's tuple_data_pool.
// Caller must not be holding any io buffers; doing so can cause deadlock.
// TODO: AggregationNode and HashJoinNode cannot be "re-opened" yet.
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos) = 0;
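// Illustrative sketch (not part of the original source) of the consumption
// loop implied by this contract, with a hypothetical driver:
//
//   RowBatch batch(node->row_desc(), state->batch_size(), node->mem_tracker());
//   bool eos = false;
//   while (!eos) {
//       RETURN_IF_ERROR(node->get_next(state, &batch, &eos));
//       ... consume the rows in 'batch' ...
//       batch.reset();  // safe: data the node still references must not
//                       // have been attached to batch's tuple_data_pool
//   }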
// close() will get called for every exec node, regardless of what else is called and
// the status of these calls (i.e. prepare() may never have been called, or
// prepare()/open()/get_next() returned with an error).
// close() releases all resources that were allocated in open()/get_next(), even if the
// latter ended with an error. close() may be called on a node that has been prepared;
// calling it again on an already-closed node is a no-op.
// After calling close(), the caller calls open() again prior to subsequent calls to
// get_next(). The default implementation updates runtime profile counters and calls
// close() on the children. To ensure that close() is called on the entire plan tree,
// each implementation should start out by calling the default implementation.
virtual Status close(RuntimeState* state);
llvm::Function* codegen_eval_conjuncts(
RuntimeState* state, const std::vector<ExprContext*>& conjunct_ctxs, const char* name);
llvm::Function* codegen_eval_conjuncts(
RuntimeState* state, const std::vector<ExprContext*>& conjunct_ctxs) {
return codegen_eval_conjuncts(state, conjunct_ctxs, "EvalConjuncts");
}
// Creates exec node tree from list of nodes contained in plan via depth-first
// traversal. All nodes are placed in pool.
// Returns error if 'plan' is corrupted, otherwise success.
static Status create_tree(ObjectPool* pool, const TPlan& plan,
const DescriptorTbl& descs, ExecNode** root);
// Set debug action for node with given id in 'tree'
static void set_debug_options(int node_id, TExecNodePhase::type phase,
TDebugAction::type action, ExecNode* tree);
// Collect all nodes of given 'node_type' that are part of this subtree, and return in
// 'nodes'.
void collect_nodes(TPlanNodeType::type node_type, std::vector<ExecNode*>* nodes);
// Collect all scan node types.
void collect_scan_nodes(std::vector<ExecNode*>* nodes);
typedef bool (*EvalConjunctsFn)(ExprContext* const* ctxs, int num_ctxs, TupleRow* row);
// Evaluate exprs over row. Returns true if all exprs return true.
// TODO: This doesn't use the vector<Expr*> signature because I haven't figured
// out how to deal with declaring a templated std::vector type in IR
static bool eval_conjuncts(ExprContext* const* ctxs, int num_ctxs, TupleRow* row);
// Returns a string representation in DFS order of the plan rooted at this.
std::string debug_string() const;
virtual void push_down_predicate(RuntimeState* state, std::list<ExprContext*>* expr_ctxs);
// Recursive helper method for generating a string for debug_string().
// Implementations should call debug_string(int, std::stringstream*) on their children.
// Input parameters:
// indentation_level: Current level in plan tree.
// Output parameters:
// out: Stream to accumulate debug string.
virtual void debug_string(int indentation_level, std::stringstream* out) const;
const std::vector<ExprContext*>& conjunct_ctxs() const {
return _conjunct_ctxs;
}
int id() const {
return _id;
}
TPlanNodeType::type type() const {
return _type;
}
const RowDescriptor& row_desc() const {
return _row_descriptor;
}
int64_t rows_returned() const {
return _num_rows_returned;
}
int64_t limit() const {
return _limit;
}
bool reached_limit() {
return _limit != -1 && _num_rows_returned >= _limit;
}
const std::vector<TupleId>& get_tuple_ids() const {
return _tuple_ids;
}
RuntimeProfile* runtime_profile() {
return _runtime_profile.get();
}
RuntimeProfile::Counter* memory_used_counter() const {
return _memory_used_counter;
}
MemTracker* mem_tracker() const {
return _mem_tracker.get();
}
MemTracker* expr_mem_tracker() const {
return _expr_mem_tracker.get();
}
// Extract node id from p->name().
static int get_node_id_from_profile(RuntimeProfile* p);
// Names of counters shared by all exec nodes
static const std::string ROW_THROUGHPUT_COUNTER;
protected:
friend class DataSink;
/// Extends blocking queue for row batches. Row batches have a property that
/// they must be processed in the order they were produced, even in cancellation
/// paths. Preceding row batches can contain ptrs to memory in subsequent row batches
/// and we need to make sure those ptrs stay valid.
/// Row batches that are added after Shutdown() are queued in another queue, which can
/// be cleaned up during Close().
/// All functions are thread safe.
class RowBatchQueue : public BlockingQueue<RowBatch*> {
public:
/// max_batches is the maximum number of row batches that can be queued.
/// When the queue is full, producers will block.
RowBatchQueue(int max_batches);
~RowBatchQueue();
/// Adds a batch to the queue. This is blocking if the queue is full.
void AddBatch(RowBatch* batch);
/// Adds a batch to the queue. If the queue is full, this blocks until space becomes
/// available or 'timeout_micros' has elapsed.
/// Returns true if the element was added to the queue, false if it wasn't. If this
/// method returns false, the queue didn't take ownership of the batch and it must be
/// managed externally.
bool AddBatchWithTimeout(RowBatch* batch, int64_t timeout_micros);
/// Gets a row batch from the queue. Returns NULL if there are no more.
/// This function blocks.
/// Returns NULL after Shutdown().
RowBatch* GetBatch();
/// Deletes all row batches in cleanup_queue_. Not valid to call AddBatch()
/// after this is called.
/// Returns the number of io buffers that were released (for debug tracking)
int Cleanup();
private:
/// Lock protecting cleanup_queue_
// SpinLock lock_;
// TODO(dhc): need to modify spinlock
std::mutex lock_;
/// Queue of orphaned row batches
std::list<RowBatch*> cleanup_queue_;
};
int _id; // unique w/in single plan tree
TPlanNodeType::type _type;
ObjectPool* _pool;
std::vector<Expr*> _conjuncts;
std::vector<ExprContext*> _conjunct_ctxs;
std::vector<TupleId> _tuple_ids;
std::vector<ExecNode*> _children;
RowDescriptor _row_descriptor;
// debug-only: if _debug_action is not INVALID, node will perform action in
// _debug_phase
TExecNodePhase::type _debug_phase;
TDebugAction::type _debug_action;
int64_t _limit; // -1: no limit
int64_t _num_rows_returned;
boost::scoped_ptr<RuntimeProfile> _runtime_profile;
boost::scoped_ptr<MemTracker> _mem_tracker;
boost::scoped_ptr<MemTracker> _expr_mem_tracker;
RuntimeProfile::Counter* _rows_returned_counter;
RuntimeProfile::Counter* _rows_returned_rate;
// Account for peak memory used by this node
RuntimeProfile::Counter* _memory_used_counter;
// Execution options that are determined at runtime. This is added to the
// runtime profile at close(). Examples for options logged here would be
// "Codegen Enabled"
boost::mutex _exec_options_lock;
std::string _runtime_exec_options;
ExecNode* child(int i) {
return _children[i];
}
bool is_closed() const {
return _is_closed;
}
// TODO(zc)
/// Pointer to the containing SubplanNode or NULL if not inside a subplan.
/// Set by SubplanNode::Init(). Not owned.
// SubplanNode* containing_subplan_;
/// Returns true if this node is inside the right-hand side plan tree of a SubplanNode.
/// Valid to call in or after Prepare().
bool is_in_subplan() const { return false; }
// Create a single exec node derived from thrift node; place exec node in 'pool'.
static Status create_node(ObjectPool* pool, const TPlanNode& tnode,
const DescriptorTbl& descs, ExecNode** node);
static Status create_tree_helper(ObjectPool* pool, const std::vector<TPlanNode>& tnodes,
const DescriptorTbl& descs, ExecNode* parent, int* node_idx, ExecNode** root);
virtual bool is_scan_node() const {
return false;
}
void init_runtime_profile(const std::string& name);
// Executes _debug_action if phase matches _debug_phase.
// 'phase' must not be INVALID.
Status exec_debug_action(TExecNodePhase::type phase);
// Appends option to '_runtime_exec_options'
void add_runtime_exec_option(const std::string& option);
private:
bool _is_closed;
};
#define RETURN_IF_LIMIT_EXCEEDED(state) \
do { \
/* if (UNLIKELY(MemTracker::limit_exceeded(*(state)->mem_trackers()))) { */ \
if (UNLIKELY(state->instance_mem_tracker()->any_limit_exceeded())) { \
return Status::MEM_LIMIT_EXCEEDED; \
} \
} while (false)
}
#endif

43
be/src/exec/file_reader.h Normal file
View File

@ -0,0 +1,43 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#pragma once
#include <stdint.h>
#include "common/status.h"
namespace palo {
class FileReader {
public:
virtual ~FileReader() {
}
// Reads content into 'buf'; on input, 'buf_len' is the capacity of the buffer.
// Returns OK on success, with 'buf_len' set to the number of bytes read.
// When the end of the file is reached, 'eof' is set to true and 'buf_len'
// is set to zero.
virtual Status read(uint8_t* buf, size_t* buf_len, bool* eof) = 0;
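// Illustrative sketch (not part of the original source): a typical read
// loop over a hypothetical concrete 'reader':
//
//   uint8_t buf[4096];
//   bool eof = false;
//   while (!eof) {
//       size_t buf_len = sizeof(buf);  // in: buffer capacity
//       RETURN_IF_ERROR(reader->read(buf, &buf_len, &eof));
//       ... process buf[0, buf_len) ...  // out: bytes actually read
//   }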
virtual void close() = 0;
};
}

45
be/src/exec/file_writer.h Normal file
View File

@ -0,0 +1,45 @@
// Copyright (c) 2017, Baidu.com, Inc. All Rights Reserved
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_EXEC_FILE_WRITER_H
#define BDG_PALO_BE_SRC_EXEC_FILE_WRITER_H
#include <stdint.h>
#include "common/status.h"
namespace palo {
class FileWriter {
public:
virtual ~FileWriter() {
}
virtual Status open() = 0;
// Writes up to 'buf_len' bytes from the buffer pointed to by 'buf' to the file.
// NOTE: the number of bytes actually written may be less than 'buf_len'; it is
// returned via 'written_len'.
virtual Status write(const uint8_t* buf, size_t buf_len, size_t* written_len) = 0;
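// Illustrative sketch (not part of the original source): since a single
// call may write fewer than 'buf_len' bytes, a hypothetical caller loops
// until everything is flushed:
//
//   size_t pos = 0;
//   while (pos < len) {
//       size_t written = 0;
//       RETURN_IF_ERROR(writer->write(buf + pos, len - pos, &written));
//       pos += written;
//   }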
virtual void close() = 0;
};
} // end namespace palo
#endif // BDG_PALO_BE_SRC_EXEC_FILE_WRITER_H

962
be/src/exec/hash_join_node.cpp Normal file
View File

@ -0,0 +1,962 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/hash_join_node.h"
#include <sstream>
#include "codegen/llvm_codegen.h"
#include "exec/hash_table.hpp"
#include "exprs/expr.h"
#include "exprs/in_predicate.h"
#include "exprs/slot_ref.h"
#include "runtime/row_batch.h"
#include "runtime/runtime_state.h"
#include "util/debug_util.h"
#include "util/runtime_profile.h"
#include "gen_cpp/PlanNodes_types.h"
using llvm::Function;
using llvm::PointerType;
using llvm::Type;
using llvm::Value;
using llvm::BasicBlock;
using llvm::LLVMContext;
namespace palo {
const char* HashJoinNode::_s_llvm_class_name = "class.palo::HashJoinNode";
HashJoinNode::HashJoinNode(
ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs) :
ExecNode(pool, tnode, descs),
_join_op(tnode.hash_join_node.join_op),
_codegen_process_build_batch_fn(NULL),
_process_build_batch_fn(NULL),
_process_probe_batch_fn(NULL),
_anti_join_last_pos(NULL) {
_match_all_probe =
(_join_op == TJoinOp::LEFT_OUTER_JOIN || _join_op == TJoinOp::FULL_OUTER_JOIN);
_match_one_build = (_join_op == TJoinOp::LEFT_SEMI_JOIN);
_match_all_build =
(_join_op == TJoinOp::RIGHT_OUTER_JOIN || _join_op == TJoinOp::FULL_OUTER_JOIN);
_is_push_down = tnode.hash_join_node.is_push_down;
}
HashJoinNode::~HashJoinNode() {
// _probe_batch must be cleaned up in close() to ensure proper resource freeing.
DCHECK(_probe_batch == NULL);
}
Status HashJoinNode::init(const TPlanNode& tnode) {
RETURN_IF_ERROR(ExecNode::init(tnode));
DCHECK(tnode.__isset.hash_join_node);
const vector<TEqJoinCondition>& eq_join_conjuncts = tnode.hash_join_node.eq_join_conjuncts;
for (int i = 0; i < eq_join_conjuncts.size(); ++i) {
ExprContext* ctx = NULL;
RETURN_IF_ERROR(Expr::create_expr_tree(_pool, eq_join_conjuncts[i].left, &ctx));
_probe_expr_ctxs.push_back(ctx);
RETURN_IF_ERROR(Expr::create_expr_tree(_pool, eq_join_conjuncts[i].right, &ctx));
_build_expr_ctxs.push_back(ctx);
}
RETURN_IF_ERROR(
Expr::create_expr_trees(_pool, tnode.hash_join_node.other_join_conjuncts,
&_other_join_conjunct_ctxs));
return Status::OK;
}
Status HashJoinNode::prepare(RuntimeState* state) {
RETURN_IF_ERROR(ExecNode::prepare(state));
_build_pool.reset(new MemPool(mem_tracker()));
_build_timer =
ADD_TIMER(runtime_profile(), "BuildTime");
_push_down_timer =
ADD_TIMER(runtime_profile(), "PushDownTime");
_push_compute_timer =
ADD_TIMER(runtime_profile(), "PushDownComputeTime");
_probe_timer =
ADD_TIMER(runtime_profile(), "ProbeTime");
_build_row_counter =
ADD_COUNTER(runtime_profile(), "BuildRows", TUnit::UNIT);
_build_buckets_counter =
ADD_COUNTER(runtime_profile(), "BuildBuckets", TUnit::UNIT);
_probe_row_counter =
ADD_COUNTER(runtime_profile(), "ProbeRows", TUnit::UNIT);
_hash_tbl_load_factor_counter =
ADD_COUNTER(runtime_profile(), "LoadFactor", TUnit::DOUBLE_VALUE);
// build and probe exprs are evaluated in the context of the rows produced by our
// right and left children, respectively
RETURN_IF_ERROR(Expr::prepare(
_build_expr_ctxs, state, child(1)->row_desc(), expr_mem_tracker()));
RETURN_IF_ERROR(Expr::prepare(
_probe_expr_ctxs, state, child(0)->row_desc(), expr_mem_tracker()));
// _other_join_conjuncts are evaluated in the context of the rows produced by this node
RETURN_IF_ERROR(Expr::prepare(
_other_join_conjunct_ctxs, state, _row_descriptor, expr_mem_tracker()));
_result_tuple_row_size = _row_descriptor.tuple_descriptors().size() * sizeof(Tuple*);
int num_left_tuples = child(0)->row_desc().tuple_descriptors().size();
int num_build_tuples = child(1)->row_desc().tuple_descriptors().size();
_probe_tuple_row_size = num_left_tuples * sizeof(Tuple*);
_build_tuple_row_size = num_build_tuples * sizeof(Tuple*);
// pre-compute the tuple index of build tuples in the output row
_build_tuple_size = num_build_tuples;
_build_tuple_idx.reserve(_build_tuple_size);
for (int i = 0; i < _build_tuple_size; ++i) {
TupleDescriptor* build_tuple_desc = child(1)->row_desc().tuple_descriptors()[i];
_build_tuple_idx.push_back(_row_descriptor.get_tuple_idx(build_tuple_desc->id()));
}
// TODO: default buckets
_hash_tbl.reset(new HashTable(
_build_expr_ctxs, _probe_expr_ctxs, _build_tuple_size,
false, id(), mem_tracker(), 1024));
_probe_batch.reset(new RowBatch(child(0)->row_desc(), state->batch_size(), mem_tracker()));
if (state->codegen_level() > 0) {
if (_join_op == TJoinOp::LEFT_ANTI_JOIN) {
return Status::OK;
}
LlvmCodeGen* codegen = NULL;
RETURN_IF_ERROR(state->get_codegen(&codegen));
// Codegen for hashing rows
Function* hash_fn = _hash_tbl->codegen_hash_current_row(state);
if (hash_fn == NULL) {
return Status::OK;
}
// Codegen for build path
_codegen_process_build_batch_fn = codegen_process_build_batch(state, hash_fn);
if (_codegen_process_build_batch_fn != NULL) {
codegen->add_function_to_jit(
_codegen_process_build_batch_fn,
reinterpret_cast<void**>(&_process_build_batch_fn));
// AddRuntimeExecOption("Build Side Codegen Enabled");
}
// Codegen for probe path (only for left joins)
if (!_match_all_build) {
Function* codegen_process_probe_batch_fn = codegen_process_probe_batch(state, hash_fn);
if (codegen_process_probe_batch_fn != NULL) {
codegen->add_function_to_jit(codegen_process_probe_batch_fn,
reinterpret_cast<void**>(&_process_probe_batch_fn));
// AddRuntimeExecOption("Probe Side Codegen Enabled");
}
}
}
return Status::OK;
}
Status HashJoinNode::close(RuntimeState* state) {
if (is_closed()) {
return Status::OK;
}
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::CLOSE));
// Must reset _probe_batch in close() to release resources
_probe_batch.reset(NULL);
if (_memory_used_counter != NULL && _hash_tbl.get() != NULL) {
COUNTER_UPDATE(_memory_used_counter, _build_pool->peak_allocated_bytes());
COUNTER_UPDATE(_memory_used_counter, _hash_tbl->byte_size());
}
if (_hash_tbl.get() != NULL) {
_hash_tbl->close();
}
if (_build_pool.get() != NULL) {
_build_pool->free_all();
}
Expr::close(_build_expr_ctxs, state);
Expr::close(_probe_expr_ctxs, state);
Expr::close(_other_join_conjunct_ctxs, state);
#if 0
for (auto iter : _push_down_expr_ctxs) {
iter->close(state);
}
#endif
return ExecNode::close(state);
}
void HashJoinNode::build_side_thread(RuntimeState* state, boost::promise<Status>* status) {
status->set_value(construct_hash_table(state));
// Release the thread token as soon as possible (before the main thread joins
// on it). This way, if we had a chain of 10 joins using 1 additional thread,
// we'd keep the additional thread busy the whole time.
state->resource_pool()->release_thread_token(false);
}
Status HashJoinNode::construct_hash_table(RuntimeState* state) {
// Do a full scan of child(1) and store everything in _hash_tbl
// The hash join node needs to keep in memory all build tuples, including the tuple
// row ptrs. The row ptrs are copied into the hash table's internal structure so they
// don't need to be stored in the _build_pool.
RowBatch build_batch(child(1)->row_desc(), state->batch_size(), mem_tracker());
RETURN_IF_ERROR(child(1)->open(state));
while (true) {
RETURN_IF_CANCELLED(state);
bool eos = true;
RETURN_IF_ERROR(child(1)->get_next(state, &build_batch, &eos));
SCOPED_TIMER(_build_timer);
// take ownership of tuple data of build_batch
_build_pool->acquire_data(build_batch.tuple_data_pool(), false);
RETURN_IF_LIMIT_EXCEEDED(state);
// Call codegen version if possible
if (_process_build_batch_fn == NULL) {
process_build_batch(&build_batch);
} else {
_process_build_batch_fn(this, &build_batch);
}
VLOG_ROW << _hash_tbl->debug_string(true, &child(1)->row_desc());
COUNTER_SET(_build_row_counter, _hash_tbl->size());
COUNTER_SET(_build_buckets_counter, _hash_tbl->num_buckets());
COUNTER_SET(_hash_tbl_load_factor_counter, _hash_tbl->load_factor());
build_batch.reset();
if (eos) {
break;
}
}
return Status::OK;
}
Status HashJoinNode::open(RuntimeState* state) {
RETURN_IF_ERROR(ExecNode::open(state));
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::OPEN));
SCOPED_TIMER(_runtime_profile->total_time_counter());
RETURN_IF_CANCELLED(state);
RETURN_IF_ERROR(Expr::open(_build_expr_ctxs, state));
RETURN_IF_ERROR(Expr::open(_probe_expr_ctxs, state));
RETURN_IF_ERROR(Expr::open(_other_join_conjunct_ctxs, state));
_eos = false;
// TODO: fix problems with asynchronous cancellation
// Kick-off the construction of the build-side table in a separate
// thread, so that the left child can do any initialisation in parallel.
// Only do this if we can get a thread token. Otherwise, do this in the
// main thread
boost::promise<Status> thread_status;
if (state->resource_pool()->try_acquire_thread_token()) {
add_runtime_exec_option("Hash Table Built Asynchronously");
boost::thread(bind(&HashJoinNode::build_side_thread, this, state, &thread_status));
} else {
thread_status.set_value(construct_hash_table(state));
}
if (_children[0]->type() == TPlanNodeType::EXCHANGE_NODE
&& _children[1]->type() == TPlanNodeType::EXCHANGE_NODE) {
_is_push_down = false;
}
if (_is_push_down) {
// Blocks until ConstructHashTable has returned, after which
// the hash table is fully constructed and we can start the probe
// phase.
RETURN_IF_ERROR(thread_status.get_future().get());
if (_hash_tbl->size() == 0 && _join_op == TJoinOp::INNER_JOIN) {
// Hash table size is zero
LOG(INFO) << "No element need to push down, no need to read probe table";
RETURN_IF_ERROR(child(0)->open(state));
_probe_batch_pos = 0;
_hash_tbl_iterator = _hash_tbl->begin();
_eos = true;
return Status::OK;
}
if (_hash_tbl->size() > 500 * 1024) {
_is_push_down = false;
}
// TODO: this is used for code check; remove this later
if (_is_push_down || 0 != child(1)->conjunct_ctxs().size()) {
for (int i = 0; i < _probe_expr_ctxs.size(); ++i) {
TExprNode node;
node.__set_node_type(TExprNodeType::IN_PRED);
TScalarType tscalar_type;
tscalar_type.__set_type(TPrimitiveType::BOOLEAN);
TTypeNode ttype_node;
ttype_node.__set_type(TTypeNodeType::SCALAR);
ttype_node.__set_scalar_type(tscalar_type);
TTypeDesc t_type_desc;
t_type_desc.types.push_back(ttype_node);
node.__set_type(t_type_desc);
node.in_predicate.__set_is_not_in(false);
node.__set_opcode(TExprOpcode::FILTER_IN);
node.__isset.vector_opcode = true;
node.__set_vector_opcode(to_in_opcode(_probe_expr_ctxs[i]->root()->type().type));
// NOTE(zc): in predicate only used here, no need prepare.
InPredicate* in_pred = _pool->add(new InPredicate(node));
RETURN_IF_ERROR(in_pred->prepare(state, _probe_expr_ctxs[i]->root()->type()));
in_pred->add_child(Expr::copy(_pool, _probe_expr_ctxs[i]->root()));
ExprContext* ctx = _pool->add(new ExprContext(in_pred));
_push_down_expr_ctxs.push_back(ctx);
}
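// Illustrative sketch (not part of the original source): each pushed-down
// context behaves like 'probe_col IN (v1, v2, ...)'; e.g. a build side whose
// keys are {1, 5, 9} lets a probe-side scan skip any row whose key is not in
// that set. The values are collected from the hash table below.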
{
SCOPED_TIMER(_push_compute_timer);
HashTable::Iterator iter = _hash_tbl->begin();
while (iter.has_next()) {
TupleRow* row = iter.get_row();
std::list<ExprContext*>::iterator ctx_iter = _push_down_expr_ctxs.begin();
for (int i = 0; i < _build_expr_ctxs.size(); ++i, ++ctx_iter) {
void* val = _build_expr_ctxs[i]->get_value(row);
InPredicate* in_pre = (InPredicate*)((*ctx_iter)->root());
in_pre->insert(val);
}
SCOPED_TIMER(_build_timer);
iter.next<false>();
}
}
SCOPED_TIMER(_push_down_timer);
push_down_predicate(state, &_push_down_expr_ctxs);
}
// Open the probe-side child so that it may perform any initialisation in parallel.
// Don't exit even if we see an error, we still need to wait for the build thread
// to finish.
Status open_status = child(0)->open(state);
RETURN_IF_ERROR(open_status);
} else {
// Open the probe-side child so that it may perform any initialisation in parallel.
// Don't exit even if we see an error, we still need to wait for the build thread
// to finish.
Status open_status = child(0)->open(state);
// Blocks until ConstructHashTable has returned, after which
// the hash table is fully constructed and we can start the probe
// phase.
RETURN_IF_ERROR(thread_status.get_future().get());
// ISSUE-1247: check open_status only after the build thread has finished.
// If we returned here first, the build thread would use 'thread_status'
// after it had already been destructed, and then crash.
RETURN_IF_ERROR(open_status);
}
// seed probe batch and _current_probe_row, etc.
while (true) {
RETURN_IF_ERROR(child(0)->get_next(state, _probe_batch.get(), &_probe_eos));
COUNTER_UPDATE(_probe_row_counter, _probe_batch->num_rows());
_probe_batch_pos = 0;
if (_probe_batch->num_rows() == 0) {
if (_probe_eos) {
_hash_tbl_iterator = _hash_tbl->begin();
_eos = true;
break;
}
_probe_batch->reset();
continue;
} else {
_current_probe_row = _probe_batch->get_row(_probe_batch_pos++);
VLOG_ROW << "probe row: " << get_probe_row_output_string(_current_probe_row);
_matched_probe = false;
_hash_tbl_iterator = _hash_tbl->find(_current_probe_row);
break;
}
}
return Status::OK;
}
Status HashJoinNode::get_next(RuntimeState* state, RowBatch* out_batch, bool* eos) {
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::GETNEXT));
RETURN_IF_CANCELLED(state);
SCOPED_TIMER(_runtime_profile->total_time_counter());
if (reached_limit()) {
*eos = true;
return Status::OK;
}
// These cases are simpler and use a more efficient processing loop
if (!(_match_all_build || _join_op == TJoinOp::RIGHT_SEMI_JOIN
|| _join_op == TJoinOp::RIGHT_ANTI_JOIN)) {
if (_eos) {
*eos = true;
return Status::OK;
}
return left_join_get_next(state, out_batch, eos);
}
ExprContext* const* other_conjunct_ctxs = &_other_join_conjunct_ctxs[0];
int num_other_conjunct_ctxs = _other_join_conjunct_ctxs.size();
ExprContext* const* conjunct_ctxs = &_conjunct_ctxs[0];
int num_conjunct_ctxs = _conjunct_ctxs.size();
// Explicitly manage the timer counter to avoid measuring time in the child
// GetNext call.
ScopedTimer<MonotonicStopWatch> probe_timer(_probe_timer);
while (!_eos) {
// create output rows as long as:
// 1) we haven't already created an output row for the probe row and are doing
// a semi-join;
// 2) there are more matching build rows
VLOG_ROW << "probe row: " << get_probe_row_output_string(_current_probe_row);
while (_hash_tbl_iterator.has_next()) {
TupleRow* matched_build_row = _hash_tbl_iterator.get_row();
VLOG_ROW << "matched_build_row: " << print_row(matched_build_row, child(1)->row_desc());
if ((_join_op == TJoinOp::RIGHT_ANTI_JOIN || _join_op == TJoinOp::RIGHT_SEMI_JOIN)
&& _hash_tbl_iterator.matched()) {
// We have already matched this build row, continue to next match.
_hash_tbl_iterator.next<true>();
continue;
}
int row_idx = out_batch->add_row();
TupleRow* out_row = out_batch->get_row(row_idx);
// right anti join
// 1. find pos in hash table which meets equi-join
// 2. judge if set matched with other join predicates
// 3. scan the hash table to choose rows which aren't set matched and meet conjuncts
if (_join_op == TJoinOp::RIGHT_ANTI_JOIN) {
create_output_row(out_row, _current_probe_row, matched_build_row);
if (eval_conjuncts(other_conjunct_ctxs, num_other_conjunct_ctxs, out_row)) {
_hash_tbl_iterator.set_matched();
}
_hash_tbl_iterator.next<true>();
continue;
} else {
// right semi join
// 1. find pos in hash table which meets equi-join and set_matched
// 2. check if the row meets other join predicates
// 3. check if the row meets conjuncts
// right join and full join
// 1. find pos in hash table which meets equi-join
// 2. check if the row meets other join predicates
// 3. check if the row meets conjuncts
// 4. output rows where both sides meet the other join predicates and conjuncts
// 5. if full join, also output left rows whose right side does not meet the
// other join predicates and conjuncts
// 6. and output right rows whose left side does not meet the other join
// predicates and conjuncts
create_output_row(out_row, _current_probe_row, matched_build_row);
}
if (!eval_conjuncts(other_conjunct_ctxs, num_other_conjunct_ctxs, out_row)) {
_hash_tbl_iterator.next<true>();
continue;
}
if (_join_op == TJoinOp::RIGHT_SEMI_JOIN) {
_hash_tbl_iterator.set_matched();
}
// we have a match for the purpose of the (outer?) join as soon as we
// satisfy the JOIN clause conjuncts
_matched_probe = true;
if (_match_all_build) {
// remember that we matched this build row
_joined_build_rows.insert(matched_build_row);
VLOG_ROW << "joined build row: " << matched_build_row;
}
_hash_tbl_iterator.next<true>();
if (eval_conjuncts(conjunct_ctxs, num_conjunct_ctxs, out_row)) {
out_batch->commit_last_row();
VLOG_ROW << "match row: " << print_row(out_row, row_desc());
++_num_rows_returned;
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
if (out_batch->is_full() || reached_limit()) {
*eos = reached_limit();
return Status::OK;
}
}
}
// check whether we need to output the current probe row before
// getting a new probe batch
if (_match_all_probe && !_matched_probe) {
int row_idx = out_batch->add_row();
TupleRow* out_row = out_batch->get_row(row_idx);
create_output_row(out_row, _current_probe_row, NULL);
if (eval_conjuncts(conjunct_ctxs, num_conjunct_ctxs, out_row)) {
out_batch->commit_last_row();
VLOG_ROW << "match row: " << print_row(out_row, row_desc());
++_num_rows_returned;
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
_matched_probe = true;
if (out_batch->is_full() || reached_limit()) {
*eos = reached_limit();
return Status::OK;
}
}
}
if (_probe_batch_pos == _probe_batch->num_rows()) {
// pass on resources, out_batch might still need them
_probe_batch->transfer_resource_ownership(out_batch);
_probe_batch_pos = 0;
if (out_batch->is_full() || out_batch->at_resource_limit()) {
return Status::OK;
}
// get new probe batch
if (!_probe_eos) {
while (true) {
probe_timer.stop();
RETURN_IF_ERROR(child(0)->get_next(state, _probe_batch.get(), &_probe_eos));
probe_timer.start();
if (_probe_batch->num_rows() == 0) {
// Empty batches can still contain IO buffers, which need to be passed up to
// the caller; transferring resources can fill up out_batch.
_probe_batch->transfer_resource_ownership(out_batch);
if (_probe_eos) {
_eos = true;
break;
}
if (out_batch->is_full() || out_batch->at_resource_limit()) {
return Status::OK;
}
continue;
} else {
COUNTER_UPDATE(_probe_row_counter, _probe_batch->num_rows());
break;
}
}
} else {
_eos = true;
}
// finish up right outer join
if (_eos && (_match_all_build || _join_op == TJoinOp::RIGHT_ANTI_JOIN)) {
_hash_tbl_iterator = _hash_tbl->begin();
}
}
if (_eos) {
break;
}
// join remaining rows in the probe batch
_current_probe_row = _probe_batch->get_row(_probe_batch_pos++);
VLOG_ROW << "probe row: " << get_probe_row_output_string(_current_probe_row);
_matched_probe = false;
_hash_tbl_iterator = _hash_tbl->find(_current_probe_row);
}
*eos = true;
if (_match_all_build || _join_op == TJoinOp::RIGHT_ANTI_JOIN) {
// output remaining unmatched build rows
TupleRow* build_row = NULL;
if (_join_op == TJoinOp::RIGHT_ANTI_JOIN) {
if (_anti_join_last_pos != NULL) {
_hash_tbl_iterator = *_anti_join_last_pos;
} else {
_hash_tbl_iterator = _hash_tbl->begin();
}
}
while (!out_batch->is_full() && _hash_tbl_iterator.has_next()) {
build_row = _hash_tbl_iterator.get_row();
if (_match_all_build) {
if (_joined_build_rows.find(build_row) != _joined_build_rows.end()) {
_hash_tbl_iterator.next<false>();
continue;
}
} else if (_join_op == TJoinOp::RIGHT_ANTI_JOIN) {
if (_hash_tbl_iterator.matched()) {
_hash_tbl_iterator.next<false>();
continue;
}
}
int row_idx = out_batch->add_row();
TupleRow* out_row = out_batch->get_row(row_idx);
create_output_row(out_row, NULL, build_row);
if (eval_conjuncts(conjunct_ctxs, num_conjunct_ctxs, out_row)) {
out_batch->commit_last_row();
VLOG_ROW << "match row: " << print_row(out_row, row_desc());
++_num_rows_returned;
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
if (reached_limit()) {
*eos = true;
return Status::OK;
}
}
_hash_tbl_iterator.next<false>();
}
if (_join_op == TJoinOp::RIGHT_ANTI_JOIN) {
_anti_join_last_pos = &_hash_tbl_iterator;
}
// we're done if there are no more rows left to check
*eos = !_hash_tbl_iterator.has_next();
}
return Status::OK;
}
Status HashJoinNode::left_join_get_next(RuntimeState* state,
RowBatch* out_batch, bool* eos) {
*eos = _eos;
ScopedTimer<MonotonicStopWatch> probe_timer(_probe_timer);
while (!_eos) {
// Compute max rows that should be added to out_batch
int64_t max_added_rows = out_batch->capacity() - out_batch->num_rows();
if (limit() != -1) {
max_added_rows = std::min(max_added_rows, limit() - rows_returned());
}
// Continue processing this row batch
if (_process_probe_batch_fn == NULL) {
_num_rows_returned +=
process_probe_batch(out_batch, _probe_batch.get(), max_added_rows);
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
} else {
// Use codegen'd function
_num_rows_returned +=
_process_probe_batch_fn(this, out_batch, _probe_batch.get(), max_added_rows);
COUNTER_SET(_rows_returned_counter, _num_rows_returned);
}
if (reached_limit() || out_batch->is_full()) {
*eos = reached_limit();
break;
}
// Check to see if we're done processing the current probe batch
if (!_hash_tbl_iterator.has_next() && _probe_batch_pos == _probe_batch->num_rows()) {
_probe_batch->transfer_resource_ownership(out_batch);
_probe_batch_pos = 0;
if (out_batch->is_full() || out_batch->at_resource_limit()) {
break;
}
if (_probe_eos) {
*eos = _eos = true;
break;
} else {
probe_timer.stop();
RETURN_IF_ERROR(child(0)->get_next(state, _probe_batch.get(), &_probe_eos));
probe_timer.start();
COUNTER_UPDATE(_probe_row_counter, _probe_batch->num_rows());
}
}
}
return Status::OK;
}
string HashJoinNode::get_probe_row_output_string(TupleRow* probe_row) {
std::stringstream out;
out << "[";
int* _build_tuple_idx_ptr = &_build_tuple_idx[0];
for (int i = 0; i < row_desc().tuple_descriptors().size(); ++i) {
if (i != 0) {
out << " ";
}
int* is_build_tuple =
std::find(_build_tuple_idx_ptr, _build_tuple_idx_ptr + _build_tuple_size, i);
if (is_build_tuple != _build_tuple_idx_ptr + _build_tuple_size) {
out << print_tuple(NULL, *row_desc().tuple_descriptors()[i]);
} else {
out << print_tuple(probe_row->get_tuple(i), *row_desc().tuple_descriptors()[i]);
}
}
out << "]";
return out.str();
}
void HashJoinNode::debug_string(int indentation_level, std::stringstream* out) const {
*out << string(indentation_level * 2, ' ');
*out << "_hashJoin(eos=" << (_eos ? "true" : "false")
<< " probe_batch_pos=" << _probe_batch_pos
<< " hash_tbl=";
*out << string(indentation_level * 2, ' ');
*out << "HashTbl(";
// << " build_exprs=" << Expr::debug_string(_build_expr_ctxs)
// << " probe_exprs=" << Expr::debug_string(_probe_expr_ctxs);
*out << ")";
ExecNode::debug_string(indentation_level, out);
*out << ")";
}
// This function is replaced by codegen
void HashJoinNode::create_output_row(TupleRow* out, TupleRow* probe, TupleRow* build) {
uint8_t* out_ptr = reinterpret_cast<uint8_t*>(out);
if (probe == NULL) {
memset(out_ptr, 0, _probe_tuple_row_size);
} else {
memcpy(out_ptr, probe, _probe_tuple_row_size);
}
if (build == NULL) {
memset(out_ptr + _probe_tuple_row_size, 0, _build_tuple_row_size);
} else {
memcpy(out_ptr + _probe_tuple_row_size, build, _build_tuple_row_size);
}
}
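// Layout sketch for the memcpy/memset logic above (illustrative): a TupleRow
// here is just an array of Tuple*, so with two probe tuples and one build
// tuple on a 64-bit build,
//   _probe_tuple_row_size = 2 * sizeof(Tuple*) = 16 bytes
//   _build_tuple_row_size = 1 * sizeof(Tuple*) = 8 bytes
// and out_row is 24 contiguous bytes: probe ptrs first, build ptrs after.
// Passing NULL for probe or build zeroes the corresponding ptr slots instead.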
// This codegen'd function should only be used for left join cases so it assumes that
// the probe row is non-null. For a left outer join, the IR looks like:
// define void @CreateOutputRow(%"class.impala::HashBlockingNode"* %this_ptr,
// %"class.impala::TupleRow"* %out_arg,
// %"class.impala::TupleRow"* %probe_arg,
// %"class.impala::TupleRow"* %build_arg) {
// entry:
// %out = bitcast %"class.impala::TupleRow"* %out_arg to i8**
// %probe = bitcast %"class.impala::TupleRow"* %probe_arg to i8**
// %build = bitcast %"class.impala::TupleRow"* %build_arg to i8**
// %0 = bitcast i8** %out to i8*
// %1 = bitcast i8** %probe to i8*
// call void @llvm.memcpy.p0i8.p0i8.i32(i8* %0, i8* %1, i32 16, i32 16, i1 false)
// %is_build_null = icmp eq i8** %build, null
// br i1 %is_build_null, label %build_null, label %build_not_null
//
// build_not_null: ; preds = %entry
// %dst_tuple_ptr1 = getelementptr i8** %out, i32 1
// %src_tuple_ptr = getelementptr i8** %build, i32 0
// %2 = load i8** %src_tuple_ptr
// store i8* %2, i8** %dst_tuple_ptr1
// ret void
//
// build_null: ; preds = %entry
// %dst_tuple_ptr = getelementptr i8** %out, i32 1
// call void @llvm.memcpy.p0i8.p0i8.i32(
// i8* %dst_tuple_ptr, i8* %1, i32 16, i32 16, i1 false)
// ret void
// }
Function* HashJoinNode::codegen_create_output_row(LlvmCodeGen* codegen) {
Type* tuple_row_type = codegen->get_type(TupleRow::_s_llvm_class_name);
DCHECK(tuple_row_type != NULL);
PointerType* tuple_row_ptr_type = PointerType::get(tuple_row_type, 0);
Type* this_type = codegen->get_type(HashJoinNode::_s_llvm_class_name);
DCHECK(this_type != NULL);
PointerType* this_ptr_type = PointerType::get(this_type, 0);
// TupleRows are really just an array of pointers. Easier to work with them
// this way.
PointerType* tuple_row_working_type = PointerType::get(codegen->ptr_type(), 0);
// Construct function signature to match CreateOutputRow()
LlvmCodeGen::FnPrototype prototype(codegen, "CreateOutputRow", codegen->void_type());
prototype.add_argument(LlvmCodeGen::NamedVariable("this_ptr", this_ptr_type));
prototype.add_argument(LlvmCodeGen::NamedVariable("out_arg", tuple_row_ptr_type));
prototype.add_argument(LlvmCodeGen::NamedVariable("probe_arg", tuple_row_ptr_type));
prototype.add_argument(LlvmCodeGen::NamedVariable("build_arg", tuple_row_ptr_type));
LLVMContext& context = codegen->context();
LlvmCodeGen::LlvmBuilder builder(context);
Value* args[4];
Function* fn = prototype.generate_prototype(&builder, args);
Value* out_row_arg = builder.CreateBitCast(args[1], tuple_row_working_type, "out");
Value* probe_row_arg = builder.CreateBitCast(args[2], tuple_row_working_type, "probe");
Value* build_row_arg = builder.CreateBitCast(args[3], tuple_row_working_type, "build");
int num_probe_tuples = child(0)->row_desc().tuple_descriptors().size();
int num_build_tuples = child(1)->row_desc().tuple_descriptors().size();
// Copy probe row
codegen->codegen_memcpy(&builder, out_row_arg, probe_row_arg, _probe_tuple_row_size);
Value* build_row_idx[] = { codegen->get_int_constant(TYPE_INT, num_probe_tuples) };
Value* build_row_dst = builder.CreateGEP(out_row_arg, build_row_idx, "build_dst_ptr");
// Copy build row.
BasicBlock* build_not_null_block = BasicBlock::Create(context, "build_not_null", fn);
BasicBlock* build_null_block = NULL;
if (_match_all_probe) {
// build tuple can be null
build_null_block = BasicBlock::Create(context, "build_null", fn);
Value* is_build_null = builder.CreateIsNull(build_row_arg, "is_build_null");
builder.CreateCondBr(is_build_null, build_null_block, build_not_null_block);
// Set tuple build ptrs to NULL
// TODO: this should be replaced with memset() but I can't get the llvm intrinsic
// to work.
builder.SetInsertPoint(build_null_block);
for (int i = 0; i < num_build_tuples; ++i) {
Value* array_idx[] =
{ codegen->get_int_constant(TYPE_INT, i + num_probe_tuples) };
Value* dst = builder.CreateGEP(out_row_arg, array_idx, "dst_tuple_ptr");
builder.CreateStore(codegen->null_ptr_value(), dst);
}
builder.CreateRetVoid();
} else {
// build row can't be NULL
builder.CreateBr(build_not_null_block);
}
// Copy build tuple ptrs
builder.SetInsertPoint(build_not_null_block);
codegen->codegen_memcpy(&builder, build_row_dst, build_row_arg, _build_tuple_row_size);
builder.CreateRetVoid();
return codegen->finalize_function(fn);
}
Function* HashJoinNode::codegen_process_build_batch(RuntimeState* state, Function* hash_fn) {
LlvmCodeGen* codegen = NULL;
if (!state->get_codegen(&codegen).ok()) {
return NULL;
}
// Get cross compiled function
Function* process_build_batch_fn = codegen->get_function(
IRFunction::HASH_JOIN_PROCESS_BUILD_BATCH);
DCHECK(process_build_batch_fn != NULL);
// Codegen for evaluating build rows
Function* eval_row_fn = _hash_tbl->codegen_eval_tuple_row(state, true);
if (eval_row_fn == NULL) {
return NULL;
}
int replaced = 0;
// Replace call sites
process_build_batch_fn = codegen->replace_call_sites(
process_build_batch_fn, false, eval_row_fn, "eval_build_row", &replaced);
DCHECK_EQ(replaced, 1);
process_build_batch_fn = codegen->replace_call_sites(
process_build_batch_fn, false, hash_fn, "hash_current_row", &replaced);
DCHECK_EQ(replaced, 1);
return codegen->optimize_function_with_exprs(process_build_batch_fn);
}
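// The pattern used above (and again in codegen_process_probe_batch()): a
// generic function is cross-compiled to IR once, then its named callees are
// swapped for specialized codegen'd versions, roughly:
//   int replaced = 0;
//   fn = codegen->replace_call_sites(fn, false, specialized_fn,
//                                    "callee_name", &replaced);
//   DCHECK_EQ(replaced, expected_call_count);
// The DCHECK guards against a callee having been inlined away, which would
// leave no call site to replace (see the IR_NO_INLINE notes elsewhere).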
Function* HashJoinNode::codegen_process_probe_batch(RuntimeState* state, Function* hash_fn) {
LlvmCodeGen* codegen = NULL;
if (!state->get_codegen(&codegen).ok()) {
return NULL;
}
// Get cross compiled function
Function* process_probe_batch_fn =
codegen->get_function(IRFunction::HASH_JOIN_PROCESS_PROBE_BATCH);
DCHECK(process_probe_batch_fn != NULL);
// Codegen HashTable::Equals
Function* equals_fn = _hash_tbl->codegen_equals(state);
if (equals_fn == NULL) {
return NULL;
}
// Codegen for evaluating build rows
Function* eval_row_fn = _hash_tbl->codegen_eval_tuple_row(state, false);
if (eval_row_fn == NULL) {
return NULL;
}
// Codegen CreateOutputRow
Function* create_output_row_fn = codegen_create_output_row(codegen);
if (create_output_row_fn == NULL) {
return NULL;
}
// Codegen evaluating other join conjuncts
Function* eval_other_conjuncts_fn = ExecNode::codegen_eval_conjuncts(
state, _other_join_conjunct_ctxs, "EvalOtherConjuncts");
if (eval_other_conjuncts_fn == NULL) {
return NULL;
}
// Codegen evaluating conjuncts
Function* eval_conjuncts_fn = ExecNode::codegen_eval_conjuncts(state, _conjunct_ctxs);
if (eval_conjuncts_fn == NULL) {
return NULL;
}
// Replace all call sites with codegen version
int replaced = 0;
process_probe_batch_fn = codegen->replace_call_sites(
process_probe_batch_fn, false, hash_fn, "hash_current_row", &replaced);
DCHECK_EQ(replaced, 1);
process_probe_batch_fn = codegen->replace_call_sites(
process_probe_batch_fn, false, eval_row_fn, "eval_probe_row", &replaced);
DCHECK_EQ(replaced, 1);
process_probe_batch_fn = codegen->replace_call_sites(
process_probe_batch_fn, false, create_output_row_fn, "create_output_row", &replaced);
// TODO(zc): add semi join
DCHECK_EQ(replaced, 2);
process_probe_batch_fn = codegen->replace_call_sites(
process_probe_batch_fn, false, eval_conjuncts_fn, "eval_conjuncts", &replaced);
DCHECK_EQ(replaced, 2);
process_probe_batch_fn = codegen->replace_call_sites(
process_probe_batch_fn, false, eval_other_conjuncts_fn,
"eval_other_join_conjuncts", &replaced);
// TODO(zc): add semi join
DCHECK_EQ(replaced, 1);
process_probe_batch_fn = codegen->replace_call_sites(
process_probe_batch_fn, false, equals_fn, "equals", &replaced);
DCHECK_EQ(replaced, 2);
return codegen->optimize_function_with_exprs(process_probe_batch_fn);
}
}


@ -0,0 +1,200 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_HASH_JOIN_NODE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_HASH_JOIN_NODE_H
#include <boost/scoped_ptr.hpp>
#include <boost/unordered_set.hpp>
#include <boost/thread.hpp>
#include <string>
#include "exec/exec_node.h"
#include "exec/hash_table.h"
#include "gen_cpp/PlanNodes_types.h"
namespace palo {
class MemPool;
class RowBatch;
class TupleRow;
// Node for in-memory hash joins:
// - builds up a hash table with the rows produced by our right input
// (child(1)); build exprs are the rhs exprs of our equi-join predicates
// - for each row from our left input, probes the hash table to retrieve
// matching entries; the probe exprs are the lhs exprs of our equi-join predicates
//
// Row batches:
// - In general, we are not able to pass our output row batch on to our left child (when
// we're fetching the probe rows): if we have a 1xn join, our output will contain
// multiple rows per left input row
// - TODO: fix this, so in the case of 1x1/nx1 joins (for instance, fact to dimension tbl)
// we don't do these extra copies
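// A minimal sketch of the overall flow (pseudocode, not the actual API):
//   open():     for each batch B from child(1): process_build_batch(B)
//               // _hash_tbl now maps build-expr values to build rows
//   get_next(): for each probe row P from child(0):
//                 for each matching build row R in _hash_tbl->find(P):
//                   emit create_output_row(out, P, R) if the conjuncts pass
//                 emit (P, NULL) if there was no match and _match_all_probe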
class HashJoinNode : public ExecNode {
public:
HashJoinNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
~HashJoinNode();
// set up _build- and _probe_exprs
virtual Status init(const TPlanNode& tnode);
virtual Status prepare(RuntimeState* state);
virtual Status open(RuntimeState* state);
virtual Status get_next(RuntimeState* state, RowBatch* row_batch, bool* eos);
virtual Status close(RuntimeState* state);
static const char* _s_llvm_class_name;
protected:
void debug_string(int indentation_level, std::stringstream* out) const;
private:
boost::scoped_ptr<HashTable> _hash_tbl;
HashTable::Iterator _hash_tbl_iterator;
bool _is_push_down;
// for right outer joins, keep track of what's been joined
typedef boost::unordered_set<TupleRow*> BuildTupleRowSet;
BuildTupleRowSet _joined_build_rows;
TJoinOp::type _join_op;
// our equi-join predicates "<lhs> = <rhs>" are separated into
// _build_exprs (over child(1)) and _probe_exprs (over child(0))
std::vector<ExprContext*> _probe_expr_ctxs;
std::vector<ExprContext*> _build_expr_ctxs;
std::list<ExprContext*> _push_down_expr_ctxs;
// non-equi-join conjuncts from the JOIN clause
std::vector<ExprContext*> _other_join_conjunct_ctxs;
// derived from _join_op
bool _match_all_probe; // output all rows coming from the probe input
bool _match_one_build; // match at most one build row to each probe row
bool _match_all_build; // output all rows coming from the build input
bool _matched_probe; // if true, we have matched the current probe row
bool _eos; // if true, nothing left to return in get_next()
boost::scoped_ptr<MemPool> _build_pool; // holds everything referenced in _hash_tbl
// Size of the TupleRow (just the Tuple ptrs) from the build (right) and probe (left)
// sides. Set to zero if the build/probe tuples are not returned, e.g., for semi joins.
// Cached because it is used in the hot path.
int _probe_tuple_row_size;
int _build_tuple_row_size;
// _probe_batch must be cleared before calling get_next(). The child node
// does not initialize all tuple ptrs in the row, only the ones that it
// is responsible for.
boost::scoped_ptr<RowBatch> _probe_batch;
int _probe_batch_pos; // current scan pos in _probe_batch
bool _probe_eos; // if true, probe child has no more rows to process
TupleRow* _current_probe_row;
// _build_tuple_idx[i] is the tuple index of child(1)'s tuple[i] in the output row
std::vector<int> _build_tuple_idx;
int _build_tuple_size;
// byte size of result tuple row (sum of the tuple ptrs, not the tuple data).
// This should be the same size as the probe tuple row.
int _result_tuple_row_size;
/// llvm function for build batch
llvm::Function* _codegen_process_build_batch_fn;
// Function declaration for codegen'd function. Signature must match
// HashJoinNode::ProcessBuildBatch
typedef void (*ProcessBuildBatchFn)(HashJoinNode*, RowBatch*);
ProcessBuildBatchFn _process_build_batch_fn;
// Function type for the codegen'd probe-batch function. Signature must match
// HashJoinNode::process_probe_batch() exactly.
typedef int (*ProcessProbeBatchFn)(HashJoinNode*, RowBatch*, RowBatch*, int);
// Jitted ProcessProbeBatch function pointer. Null if codegen is disabled.
ProcessProbeBatchFn _process_probe_batch_fn;
// record anti join pos in get_next()
HashTable::Iterator* _anti_join_last_pos;
RuntimeProfile::Counter* _build_timer; // time to build hash table
RuntimeProfile::Counter* _push_down_timer; // time to push down build-side predicates
RuntimeProfile::Counter* _push_compute_timer;
RuntimeProfile::Counter* _probe_timer; // time to probe
RuntimeProfile::Counter* _build_row_counter; // num build rows
RuntimeProfile::Counter* _probe_row_counter; // num probe rows
RuntimeProfile::Counter* _build_buckets_counter; // num buckets in hash table
RuntimeProfile::Counter* _hash_tbl_load_factor_counter;
// Supervises ConstructHashTable in a separate thread, and
// returns its status in the promise parameter.
void build_side_thread(RuntimeState* state, boost::promise<Status>* status);
// We parallelize building the build side with open()'ing the
// probe side. If, for example, the probe-side child is another
// hash-join node, it can start to build its own build side at the
// same time.
Status construct_hash_table(RuntimeState* state);
// GetNext helper function for the common join cases: inner join, left semi,
// left anti and left outer joins.
Status left_join_get_next(RuntimeState* state, RowBatch* row_batch, bool* eos);
// Processes a probe batch for the common (non right-outer join) cases.
// out_batch: the batch for resulting tuple rows
// probe_batch: the probe batch to process. This function can be called to
// continue processing a batch in the middle
// max_added_rows: maximum rows that can be added to out_batch
// return the number of rows added to out_batch
int process_probe_batch(RowBatch* out_batch, RowBatch* probe_batch, int max_added_rows);
// Construct the build hash table, adding all the rows in 'build_batch'
void process_build_batch(RowBatch* build_batch);
// Write combined row, consisting of probe_row and build_row, to out_row.
// This is replaced by codegen.
void create_output_row(TupleRow* out_row, TupleRow* probe_row, TupleRow* build_row);
// Returns a debug string for probe_rows. Probe rows have tuple ptrs that are
// uninitialized; the left hand child only populates the tuple ptrs it is responsible
// for. This function outputs just the probe row values and leaves the build
// side values as NULL.
// This is only used for debugging and outputting the left child rows before
// doing the join.
std::string get_probe_row_output_string(TupleRow* probe_row);
/// Codegen function to create output row
llvm::Function* codegen_create_output_row(LlvmCodeGen* codegen);
/// Codegen processing build batches. Identical signature to ProcessBuildBatch.
/// hash_fn is the codegen'd function for computing hashes over tuple rows in the
/// hash table.
/// Returns NULL if codegen was not possible.
llvm::Function* codegen_process_build_batch(RuntimeState* state, llvm::Function* hash_fn);
/// Codegen processing probe batches. Identical signature to ProcessProbeBatch.
/// hash_fn is the codegen'd function for computing hashes over tuple rows in the
/// hash table.
/// Returns NULL if codegen was not possible.
llvm::Function* codegen_process_probe_batch(RuntimeState* state, llvm::Function* hash_fn);
};
}
#endif


@ -0,0 +1,150 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/hash_join_node.h"
#include "exec/hash_table.hpp"
#include "runtime/row_batch.h"
namespace palo {
// Functions in this file are cross compiled to IR with clang.
// Wrapper around ExecNode's eval conjuncts with a different function name.
// This lets us distinguish between the join conjuncts vs. non-join conjuncts
// for codegen.
// Note: don't declare this static. LLVM will pick the fastcc calling convention and
// we will not be able to replace the functions with codegen'd versions.
// TODO: explicitly set the calling convention?
// TODO: investigate using fastcc for all codegen internal functions?
bool IR_NO_INLINE eval_other_join_conjuncts(
ExprContext* const* ctxs, int num_ctxs, TupleRow* row) {
return ExecNode::eval_conjuncts(ctxs, num_ctxs, row);
}
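// For example, codegen_process_probe_batch() in hash_join_node.cc locates
// call sites by the literal names "eval_other_join_conjuncts" and
// "eval_conjuncts"; keeping this wrapper non-static, non-inlined and
// distinctly named is what makes that lookup possible.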
// CreateOutputRow, EvalOtherJoinConjuncts, and EvalConjuncts are replaced by
// codegen.
int HashJoinNode::process_probe_batch(RowBatch* out_batch, RowBatch* probe_batch,
int max_added_rows) {
// This path does not handle full outer or right outer joins
DCHECK(!_match_all_build);
int row_idx = out_batch->add_rows(max_added_rows);
DCHECK(row_idx != RowBatch::INVALID_ROW_INDEX);
uint8_t* out_row_mem = reinterpret_cast<uint8_t*>(out_batch->get_row(row_idx));
TupleRow* out_row = reinterpret_cast<TupleRow*>(out_row_mem);
int rows_returned = 0;
int probe_rows = probe_batch->num_rows();
ExprContext* const* other_conjunct_ctxs = &_other_join_conjunct_ctxs[0];
int num_other_conjunct_ctxs = _other_join_conjunct_ctxs.size();
ExprContext* const* conjunct_ctxs = &_conjunct_ctxs[0];
int num_conjunct_ctxs = _conjunct_ctxs.size();
while (true) {
// Create output row for each matching build row
while (_hash_tbl_iterator.has_next()) {
TupleRow* matched_build_row = _hash_tbl_iterator.get_row();
_hash_tbl_iterator.next<true>();
create_output_row(out_row, _current_probe_row, matched_build_row);
if (!eval_other_join_conjuncts(
other_conjunct_ctxs, num_other_conjunct_ctxs, out_row)) {
continue;
}
_matched_probe = true;
// left anti join: rows with an equal match are not returned
if (_join_op == TJoinOp::LEFT_ANTI_JOIN) {
_hash_tbl_iterator = _hash_tbl->end();
break;
}
if (eval_conjuncts(conjunct_ctxs, num_conjunct_ctxs, out_row)) {
++rows_returned;
// Filled up out batch or hit limit
if (UNLIKELY(rows_returned == max_added_rows)) {
goto end;
}
// Advance to next out row
out_row_mem += out_batch->row_byte_size();
out_row = reinterpret_cast<TupleRow*>(out_row_mem);
}
// Handle left semi-join
if (_match_one_build) {
_hash_tbl_iterator = _hash_tbl->end();
break;
}
}
// Handle left outer join and left anti join
if ((!_matched_probe && _match_all_probe) ||
(!_matched_probe && _join_op == TJoinOp::LEFT_ANTI_JOIN)) {
create_output_row(out_row, _current_probe_row, NULL);
_matched_probe = true;
if (ExecNode::eval_conjuncts(conjunct_ctxs, num_conjunct_ctxs, out_row)) {
++rows_returned;
if (UNLIKELY(rows_returned == max_added_rows)) {
goto end;
}
// Advance to next out row
out_row_mem += out_batch->row_byte_size();
out_row = reinterpret_cast<TupleRow*>(out_row_mem);
}
}
if (!_hash_tbl_iterator.has_next()) {
// Advance to the next probe row
if (UNLIKELY(_probe_batch_pos == probe_rows)) {
goto end;
}
_current_probe_row = probe_batch->get_row(_probe_batch_pos++);
_hash_tbl_iterator = _hash_tbl->find(_current_probe_row);
_matched_probe = false;
}
}
end:
if (_match_one_build && _matched_probe) {
_hash_tbl_iterator = _hash_tbl->end();
}
out_batch->commit_rows(rows_returned);
return rows_returned;
}
void HashJoinNode::process_build_batch(RowBatch* build_batch) {
// insert build row into our hash table
for (int i = 0; i < build_batch->num_rows(); ++i) {
_hash_tbl->insert(build_batch->get_row(i));
}
}
}

be/src/exec/hash_table.cpp Normal file

@ -0,0 +1,816 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include "exec/hash_table.hpp"
#include "codegen/codegen_anyval.h"
#include "codegen/llvm_codegen.h"
#include "exprs/expr.h"
#include "runtime/raw_value.h"
#include "runtime/string_value.hpp"
#include "runtime/mem_tracker.h"
#include "runtime/runtime_state.h"
#include "util/debug_util.h"
#include "util/palo_metrics.h"
using llvm::BasicBlock;
using llvm::Value;
using llvm::Function;
using llvm::Type;
using llvm::PointerType;
using llvm::LLVMContext;
using llvm::PHINode;
namespace palo {
const float HashTable::MAX_BUCKET_OCCUPANCY_FRACTION = 0.75f;
const char* HashTable::_s_llvm_class_name = "class.palo::HashTable";
HashTable::HashTable(const vector<ExprContext*>& build_expr_ctxs,
const vector<ExprContext*>& probe_expr_ctxs,
int num_build_tuples, bool stores_nulls, int32_t initial_seed,
MemTracker* mem_tracker, int64_t num_buckets) :
_build_expr_ctxs(build_expr_ctxs),
_probe_expr_ctxs(probe_expr_ctxs),
_num_build_tuples(num_build_tuples),
_stores_nulls(stores_nulls),
_initial_seed(initial_seed),
_node_byte_size(sizeof(Node) + sizeof(Tuple*) * _num_build_tuples),
_num_filled_buckets(0),
_nodes(NULL),
_num_nodes(0),
_exceeded_limit(false),
_mem_tracker(mem_tracker),
_mem_limit_exceeded(false) {
DCHECK(mem_tracker != NULL);
DCHECK_EQ(_build_expr_ctxs.size(), _probe_expr_ctxs.size());
DCHECK_EQ((num_buckets & (num_buckets - 1)), 0) << "num_buckets must be a power of 2";
_buckets.resize(num_buckets);
_num_buckets = num_buckets;
_num_buckets_till_resize = MAX_BUCKET_OCCUPANCY_FRACTION * _num_buckets;
_mem_tracker->consume(_buckets.capacity() * sizeof(Bucket));
// Compute the layout and buffer size to store the evaluated expr results
_results_buffer_size = Expr::compute_results_layout(_build_expr_ctxs,
&_expr_values_buffer_offsets, &_var_result_begin);
_expr_values_buffer = new uint8_t[_results_buffer_size];
memset(_expr_values_buffer, 0, sizeof(uint8_t) * _results_buffer_size);
_expr_value_null_bits = new uint8_t[_build_expr_ctxs.size()];
_nodes_capacity = 1024;
_nodes = reinterpret_cast<uint8_t*>(malloc(_nodes_capacity * _node_byte_size));
memset(_nodes, 0, _nodes_capacity * _node_byte_size);
if (PaloMetrics::hash_table_total_bytes() != NULL) {
PaloMetrics::hash_table_total_bytes()->increment(_nodes_capacity * _node_byte_size);
}
_mem_tracker->consume(_nodes_capacity * _node_byte_size);
if (_mem_tracker->limit_exceeded()) {
mem_limit_exceeded(_nodes_capacity * _node_byte_size);
}
}
HashTable::~HashTable() {
}
void HashTable::close() {
// TODO: use tr1::array?
delete[] _expr_values_buffer;
delete[] _expr_value_null_bits;
free(_nodes);
if (PaloMetrics::hash_table_total_bytes() != NULL) {
PaloMetrics::hash_table_total_bytes()->increment(-_nodes_capacity * _node_byte_size);
}
_mem_tracker->release(_nodes_capacity * _node_byte_size);
_mem_tracker->release(_buckets.size() * sizeof(Bucket));
}
bool HashTable::eval_row(TupleRow* row, const vector<ExprContext*>& ctxs) {
// Put a non-zero constant in the result location for NULL.
// We don't want (NULL, 1) to hash to the same as (0, 1).
// This needs to be as big as the biggest primitive type since the bytes
// get copied directly.
// The array length of 10 is an empirical value; it must be larger than
// sizeof(DecimalValue) / sizeof(int64_t) so that when a slot is NULL the
// null pattern can be copied for any type.
static int64_t null_value[10] = {HashUtil::FNV_SEED, HashUtil::FNV_SEED, 0};
bool has_null = false;
for (int i = 0; i < ctxs.size(); ++i) {
void* loc = _expr_values_buffer + _expr_values_buffer_offsets[i];
void* val = ctxs[i]->get_value(row);
if (val == NULL) {
// If the table doesn't store nulls, no reason to keep evaluating
if (!_stores_nulls) {
return true;
}
_expr_value_null_bits[i] = true;
val = &null_value;
has_null = true;
} else {
_expr_value_null_bits[i] = false;
}
RawValue::write(val, loc, _build_expr_ctxs[i]->root()->type(), NULL);
}
return has_null;
}
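// Worked example of the NULL handling above (illustrative): with build keys
// (int a, int b), the row (NULL, 1) is materialized into _expr_values_buffer
// as (seed, 1) with _expr_value_null_bits = {1, 0}, while the row (0, 1) is
// materialized as (0, 1) with null bits {0, 0}. The buffers differ, so the
// two rows hash differently even though NULL was written as a concrete
// byte pattern.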
uint32_t HashTable::hash_variable_len_row() {
uint32_t hash = _initial_seed;
// Hash the non-var length portions (if there are any)
if (_var_result_begin != 0) {
hash = HashUtil::hash(_expr_values_buffer, _var_result_begin, hash);
}
for (int i = 0; i < _build_expr_ctxs.size(); ++i) {
// non-string and null slots are already part of expr_values_buffer
if (_build_expr_ctxs[i]->root()->type().is_string_type()) {
void* loc = _expr_values_buffer + _expr_values_buffer_offsets[i];
if (_expr_value_null_bits[i]) {
// Hash the null random seed values at 'loc'
hash = HashUtil::hash(loc, sizeof(StringValue), hash);
} else {
// Hash the string
StringValue* str = reinterpret_cast<StringValue*>(loc);
hash = HashUtil::hash(str->ptr, str->len, hash);
}
} else if (_build_expr_ctxs[i]->root()->type().is_decimal_type()) {
void* loc = _expr_values_buffer + _expr_values_buffer_offsets[i];
if (_expr_value_null_bits[i]) {
// Hash the null random seed values at 'loc'
hash = HashUtil::hash(loc, sizeof(StringValue), hash);
} else {
DecimalValue* decimal = reinterpret_cast<DecimalValue*>(loc);
hash = decimal->hash(hash);
}
}
}
return hash;
}
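// Buffer layout example for the function above (illustrative offsets): with
// build exprs (int a, varchar s), the evaluated results might be laid out as
//   _expr_values_buffer = [ int32 a | StringValue s ],  _var_result_begin = 4
// The first HashUtil::hash() call covers the 4 fixed-width bytes; the string
// is then hashed through (s->ptr, s->len), or through the seeded bytes at its
// slot when the value was NULL.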
bool HashTable::equals(TupleRow* build_row) {
for (int i = 0; i < _build_expr_ctxs.size(); ++i) {
void* val = _build_expr_ctxs[i]->get_value(build_row);
if (val == NULL) {
if (!_stores_nulls) {
return false;
}
if (!_expr_value_null_bits[i]) {
return false;
}
continue;
}
void* loc = _expr_values_buffer + _expr_values_buffer_offsets[i];
if (!RawValue::eq(loc, val, _build_expr_ctxs[i]->root()->type())) {
return false;
}
}
return true;
}
void HashTable::resize_buckets(int64_t num_buckets) {
DCHECK_EQ((num_buckets & (num_buckets - 1)), 0) << "num_buckets must be a power of 2";
int64_t old_num_buckets = _num_buckets;
int64_t delta_bytes = (num_buckets - old_num_buckets) * sizeof(Bucket);
if (!_mem_tracker->try_consume(delta_bytes)) {
mem_limit_exceeded(delta_bytes);
return;
}
_buckets.resize(num_buckets);
// If we're doubling the number of buckets, all nodes in a particular bucket
// either remain there, or move down to an analogous bucket in the other half.
// In order to efficiently check which of the two buckets a node belongs in, the number
// of buckets must be a power of 2.
bool doubled_buckets = (num_buckets == old_num_buckets * 2);
for (int i = 0; i < _num_buckets; ++i) {
Bucket* bucket = &_buckets[i];
Bucket* sister_bucket = &_buckets[i + old_num_buckets];
Node* last_node = NULL;
int64_t node_idx = bucket->_node_idx;
while (node_idx != -1) {
Node* node = get_node(node_idx);
int64_t next_idx = node->_next_idx;
uint32_t hash = node->_hash;
bool node_must_move = true;
Bucket* move_to = NULL;
if (doubled_buckets) {
node_must_move = ((hash & old_num_buckets) != 0);
move_to = sister_bucket;
} else {
int64_t bucket_idx = hash & (num_buckets - 1);
node_must_move = (bucket_idx != i);
move_to = &_buckets[bucket_idx];
}
if (node_must_move) {
move_node(bucket, move_to, node_idx, node, last_node);
} else {
last_node = node;
}
node_idx = next_idx;
}
}
_num_buckets = num_buckets;
_num_buckets_till_resize = MAX_BUCKET_OCCUPANCY_FRACTION * _num_buckets;
}
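// Doubling example for the bit test above (illustrative): growing from
// old_num_buckets = 4 to num_buckets = 8, a node with hash 0b1101 currently
// lives in bucket 0b1101 & 3 = 1; since (0b1101 & 4) != 0 it must move to the
// sister bucket 1 + 4 = 5. A node with hash 0b1001 has (0b1001 & 4) == 0 and
// stays in bucket 1. Only the single bit selected by old_num_buckets decides.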
void HashTable::grow_node_array() {
int64_t old_size = _nodes_capacity * _node_byte_size;
_nodes_capacity = _nodes_capacity + _nodes_capacity / 2;
int64_t new_size = _nodes_capacity * _node_byte_size;
uint8_t* new_nodes = reinterpret_cast<uint8_t*>(malloc(new_size));
memset(new_nodes, 0, new_size);
memcpy(new_nodes, _nodes, old_size);
free(_nodes);
_nodes = new_nodes;
if (PaloMetrics::hash_table_total_bytes() != NULL) {
PaloMetrics::hash_table_total_bytes()->increment(new_size - old_size);
}
_mem_tracker->consume(new_size - old_size);
if (_mem_tracker->limit_exceeded()) {
mem_limit_exceeded(new_size - old_size);
}
}
void HashTable::mem_limit_exceeded(int64_t allocation_size) {
_mem_limit_exceeded = true;
_exceeded_limit = true;
// if (_state != NULL) {
// _state->set_mem_limit_exceeded(_mem_tracker, allocation_size);
// }
}
std::string HashTable::debug_string(bool skip_empty, const RowDescriptor* desc) {
std::stringstream ss;
ss << std::endl;
for (int i = 0; i < _buckets.size(); ++i) {
int64_t node_idx = _buckets[i]._node_idx;
bool first = true;
if (skip_empty && node_idx == -1) {
continue;
}
ss << i << ": ";
while (node_idx != -1) {
Node* node = get_node(node_idx);
if (!first) {
ss << ",";
}
if (desc == NULL) {
ss << node_idx << "(" << (void*)node->data() << ")";
} else {
ss << (void*)node->data() << " " << print_row(node->data(), *desc);
}
node_idx = node->_next_idx;
first = false;
}
ss << std::endl;
}
return ss.str();
}
// Helper function to store a value into the results buffer if the expr
// evaluated to NULL. We don't want (NULL, 1) to hash to the same as (0,1) so
// we'll pick a more random value.
static void codegen_assign_null_value(
LlvmCodeGen* codegen, LlvmCodeGen::LlvmBuilder* builder,
Value* dst, const TypeDescriptor& type) {
int64_t fvn_seed = HashUtil::FNV_SEED;
if (type.type == TYPE_CHAR || type.type == TYPE_VARCHAR) {
Value* dst_ptr = builder->CreateStructGEP(dst, 0, "string_ptr");
Value* dst_len = builder->CreateStructGEP(dst, 1, "string_len");
Value* null_len = codegen->get_int_constant(TYPE_INT, fvn_seed);
Value* null_ptr = builder->CreateIntToPtr(null_len, codegen->ptr_type());
builder->CreateStore(null_ptr, dst_ptr);
builder->CreateStore(null_len, dst_len);
return;
} else {
Value* null_value = NULL;
// Get a type specific representation of fvn_seed
switch (type.type) {
case TYPE_BOOLEAN:
// In results, booleans are stored as 1 byte
dst = builder->CreateBitCast(dst, codegen->ptr_type());
null_value = codegen->get_int_constant(TYPE_TINYINT, fvn_seed);
break;
case TYPE_TINYINT:
case TYPE_SMALLINT:
case TYPE_INT:
case TYPE_BIGINT:
null_value = codegen->get_int_constant(type.type, fvn_seed);
break;
case TYPE_FLOAT: {
// Don't care about the value, just the bit pattern
float fvn_seed_float = *reinterpret_cast<float*>(&fvn_seed);
null_value = llvm::ConstantFP::get(
codegen->context(), llvm::APFloat(fvn_seed_float));
break;
}
case TYPE_DOUBLE: {
// Don't care about the value, just the bit pattern
double fvn_seed_double = *reinterpret_cast<double*>(&fvn_seed);
null_value = llvm::ConstantFP::get(
codegen->context(), llvm::APFloat(fvn_seed_double));
break;
}
default:
DCHECK(false);
}
builder->CreateStore(null_value, dst);
}
}
// Codegen for evaluating a tuple row over either _build_expr_ctxs or _probe_expr_ctxs.
// For the case where we are joining on a single int, the IR looks like
// define i1 @EvalBuildRow(%"class.impala::HashTable"* %this_ptr,
// %"class.impala::TupleRow"* %row) {
// entry:
// %null_ptr = alloca i1
// %0 = bitcast %"class.palo::TupleRow"* %row to i8**
// %eval = call i32 @SlotRef(i8** %0, i8* null, i1* %null_ptr)
// %1 = load i1* %null_ptr
// br i1 %1, label %null, label %not_null
//
// null: ; preds = %entry
// ret i1 true
//
// not_null: ; preds = %entry
// store i32 %eval, i32* inttoptr (i64 46146336 to i32*)
// br label %continue
//
// continue: ; preds = %not_null
// %2 = zext i1 %1 to i8
// store i8 %2, i8* inttoptr (i64 46146248 to i8*)
// ret i1 false
// }
// For each expr, we create 3 code blocks. The null, not null and continue blocks.
// Both the null and not null branch into the continue block. The continue block
// becomes the start of the next block for codegen (either the next expr or just the
// end of the function).
Function* HashTable::codegen_eval_tuple_row(RuntimeState* state, bool build) {
// TODO: codegen_assign_null_value() can't handle TYPE_TIMESTAMP or TYPE_DECIMAL yet
const std::vector<ExprContext*>& ctxs = build ? _build_expr_ctxs : _probe_expr_ctxs;
for (int i = 0; i < ctxs.size(); ++i) {
PrimitiveType type = ctxs[i]->root()->type().type;
if (type == TYPE_DATE || type == TYPE_DATETIME
|| type == TYPE_DECIMAL || type == TYPE_CHAR) {
return NULL;
}
}
LlvmCodeGen* codegen = NULL;
if (!state->get_codegen(&codegen).ok()) {
return NULL;
}
// Get types to generate function prototype
Type* tuple_row_type = codegen->get_type(TupleRow::_s_llvm_class_name);
DCHECK(tuple_row_type != NULL);
PointerType* tuple_row_ptr_type = PointerType::get(tuple_row_type, 0);
Type* this_type = codegen->get_type(HashTable::_s_llvm_class_name);
DCHECK(this_type != NULL);
PointerType* this_ptr_type = PointerType::get(this_type, 0);
LlvmCodeGen::FnPrototype prototype(
codegen, build ? "eval_build_row" : "eval_probe_row", codegen->get_type(TYPE_BOOLEAN));
prototype.add_argument(LlvmCodeGen::NamedVariable("this_ptr", this_ptr_type));
prototype.add_argument(LlvmCodeGen::NamedVariable("row", tuple_row_ptr_type));
LLVMContext& context = codegen->context();
LlvmCodeGen::LlvmBuilder builder(context);
Value* args[2];
Function* fn = prototype.generate_prototype(&builder, args);
Value* row = args[1];
Value* has_null = codegen->false_value();
// Aggregation with no grouping exprs also uses the hash table interface for
// code simplicity. In that case, there are no build exprs.
if (!_build_expr_ctxs.empty()) {
const std::vector<ExprContext*>& ctxs = build ? _build_expr_ctxs : _probe_expr_ctxs;
for (int i = 0; i < ctxs.size(); ++i) {
// TODO: refactor this to somewhere else? This is not hash table specific
// except for the null handling bit and would be used for anyone that needs
// to materialize a vector of exprs
// Convert result buffer to llvm ptr type
void* loc = _expr_values_buffer + _expr_values_buffer_offsets[i];
Value* llvm_loc = codegen->cast_ptr_to_llvm_ptr(
codegen->get_ptr_type(ctxs[i]->root()->type()), loc);
BasicBlock* null_block = BasicBlock::Create(context, "null", fn);
BasicBlock* not_null_block = BasicBlock::Create(context, "not_null", fn);
BasicBlock* continue_block = BasicBlock::Create(context, "continue", fn);
// Call expr
Function* expr_fn = NULL;
Status status = ctxs[i]->root()->get_codegend_compute_fn(state, &expr_fn);
if (!status.ok()) {
std::stringstream ss;
ss << "Problem with codegen: " << status.get_error_msg();
// TODO(zc )
// state->LogError(ErrorMsg(TErrorCode::GENERAL, ss.str()));
fn->eraseFromParent(); // deletes function
return NULL;
}
Value* ctx_arg = codegen->cast_ptr_to_llvm_ptr(
codegen->get_ptr_type(ExprContext::_s_llvm_class_name), ctxs[i]);
Value* expr_fn_args[] = { ctx_arg, row };
CodegenAnyVal result = CodegenAnyVal::create_call_wrapped(
codegen, &builder, ctxs[i]->root()->type(),
expr_fn, expr_fn_args, "result", NULL);
Value* is_null = result.get_is_null();
// Set null-byte result
Value* null_byte = builder.CreateZExt(is_null, codegen->get_type(TYPE_TINYINT));
uint8_t* null_byte_loc = &_expr_value_null_bits[i];
Value* llvm_null_byte_loc =
codegen->cast_ptr_to_llvm_ptr(codegen->ptr_type(), null_byte_loc);
builder.CreateStore(null_byte, llvm_null_byte_loc);
builder.CreateCondBr(is_null, null_block, not_null_block);
// Null block
builder.SetInsertPoint(null_block);
if (!_stores_nulls) {
// hash table doesn't store nulls, no reason to keep evaluating exprs
builder.CreateRet(codegen->true_value());
} else {
codegen_assign_null_value(codegen, &builder, llvm_loc, ctxs[i]->root()->type());
has_null = codegen->true_value();
builder.CreateBr(continue_block);
}
// Not null block
builder.SetInsertPoint(not_null_block);
result.to_native_ptr(llvm_loc);
builder.CreateBr(continue_block);
builder.SetInsertPoint(continue_block);
}
}
builder.CreateRet(has_null);
return codegen->finalize_function(fn);
}
// Codegen for hashing the current row. In the case with both string and non-string data
// (group by int_col, string_col), the IR looks like:
// define i32 @hash_current_row(%"class.impala::HashTable"* %this_ptr) {
// entry:
// %0 = call i32 @IrCrcHash(i8* inttoptr (i64 51107808 to i8*), i32 16, i32 0)
// %1 = load i8* inttoptr (i64 29500112 to i8*)
// %2 = icmp ne i8 %1, 0
// br i1 %2, label %null, label %not_null
//
// null: ; preds = %entry
// %3 = call i32 @IrCrcHash(i8* inttoptr (i64 51107824 to i8*), i32 16, i32 %0)
// br label %continue
//
// not_null: ; preds = %entry
// %4 = load i8** getelementptr inbounds (
// %"struct.impala::StringValue"* inttoptr
// (i64 51107824 to %"struct.impala::StringValue"*), i32 0, i32 0)
// %5 = load i32* getelementptr inbounds (
// %"struct.impala::StringValue"* inttoptr
// (i64 51107824 to %"struct.impala::StringValue"*), i32 0, i32 1)
// %6 = call i32 @IrCrcHash(i8* %4, i32 %5, i32 %0)
// br label %continue
//
// continue: ; preds = %not_null, %null
// %7 = phi i32 [ %6, %not_null ], [ %3, %null ]
// ret i32 %7
// }
// TODO: can this be cross-compiled?
Function* HashTable::codegen_hash_current_row(RuntimeState* state) {
for (int i = 0; i < _build_expr_ctxs.size(); ++i) {
// Disable codegen for CHAR
if (_build_expr_ctxs[i]->root()->type().type == TYPE_CHAR) {
return NULL;
}
}
LlvmCodeGen* codegen = NULL;
if (!state->get_codegen(&codegen).ok()) {
return NULL;
}
// Get types to generate function prototype
Type* this_type = codegen->get_type(HashTable::_s_llvm_class_name);
DCHECK(this_type != NULL);
PointerType* this_ptr_type = PointerType::get(this_type, 0);
LlvmCodeGen::FnPrototype prototype(codegen, "hash_current_row", codegen->get_type(TYPE_INT));
prototype.add_argument(LlvmCodeGen::NamedVariable("this_ptr", this_ptr_type));
LLVMContext& context = codegen->context();
LlvmCodeGen::LlvmBuilder builder(context);
Value* this_arg = NULL;
Function* fn = prototype.generate_prototype(&builder, &this_arg);
Value* hash_result = codegen->get_int_constant(TYPE_INT, _initial_seed);
Value* data = codegen->cast_ptr_to_llvm_ptr(codegen->ptr_type(), _expr_values_buffer);
if (_var_result_begin == -1) {
// No variable length slots, just hash what is in '_expr_values_buffer'
if (_results_buffer_size > 0) {
Function* hash_fn = codegen->get_hash_function(_results_buffer_size);
Value* len = codegen->get_int_constant(TYPE_INT, _results_buffer_size);
hash_result = builder.CreateCall3(hash_fn, data, len, hash_result);
}
} else {
if (_var_result_begin > 0) {
Function* hash_fn = codegen->get_hash_function(_var_result_begin);
Value* len = codegen->get_int_constant(TYPE_INT, _var_result_begin);
hash_result = builder.CreateCall3(hash_fn, data, len, hash_result);
}
// Hash string slots
for (int i = 0; i < _build_expr_ctxs.size(); ++i) {
if (_build_expr_ctxs[i]->root()->type().type != TYPE_CHAR
&& _build_expr_ctxs[i]->root()->type().type != TYPE_VARCHAR) {
continue;
}
BasicBlock* null_block = NULL;
BasicBlock* not_null_block = NULL;
BasicBlock* continue_block = NULL;
Value* str_null_result = NULL;
void* loc = _expr_values_buffer + _expr_values_buffer_offsets[i];
// If the hash table stores nulls, we need to check if the stringval
// evaluated to NULL
if (_stores_nulls) {
null_block = BasicBlock::Create(context, "null", fn);
not_null_block = BasicBlock::Create(context, "not_null", fn);
continue_block = BasicBlock::Create(context, "continue", fn);
uint8_t* null_byte_loc = &_expr_value_null_bits[i];
Value* llvm_null_byte_loc =
codegen->cast_ptr_to_llvm_ptr(codegen->ptr_type(), null_byte_loc);
Value* null_byte = builder.CreateLoad(llvm_null_byte_loc);
Value* is_null = builder.CreateICmpNE(
null_byte, codegen->get_int_constant(TYPE_TINYINT, 0));
builder.CreateCondBr(is_null, null_block, not_null_block);
// For null, we just want to call the hash function on the portion of
// the data
builder.SetInsertPoint(null_block);
Function* null_hash_fn = codegen->get_hash_function(sizeof(StringValue));
Value* llvm_loc = codegen->cast_ptr_to_llvm_ptr(codegen->ptr_type(), loc);
Value* len = codegen->get_int_constant(TYPE_INT, sizeof(StringValue));
str_null_result = builder.CreateCall3(null_hash_fn, llvm_loc, len, hash_result);
builder.CreateBr(continue_block);
builder.SetInsertPoint(not_null_block);
}
// Convert _expr_values_buffer loc to llvm value
Value* str_val = codegen->cast_ptr_to_llvm_ptr(
codegen->get_ptr_type(TYPE_VARCHAR), loc);
Value* ptr = builder.CreateStructGEP(str_val, 0, "ptr");
Value* len = builder.CreateStructGEP(str_val, 1, "len");
ptr = builder.CreateLoad(ptr);
len = builder.CreateLoad(len);
// Call hash(ptr, len, hash_result);
Function* general_hash_fn = codegen->get_hash_function();
Value* string_hash_result =
builder.CreateCall3(general_hash_fn, ptr, len, hash_result);
if (_stores_nulls) {
builder.CreateBr(continue_block);
builder.SetInsertPoint(continue_block);
// Use phi node to reconcile that we could have come from the string-null
// path and string not null paths.
PHINode* phi_node = builder.CreatePHI(codegen->get_type(TYPE_INT), 2);
phi_node->addIncoming(string_hash_result, not_null_block);
phi_node->addIncoming(str_null_result, null_block);
hash_result = phi_node;
} else {
hash_result = string_hash_result;
}
}
}
builder.CreateRet(hash_result);
return codegen->finalize_function(fn);
}
// Codegen for HashTable::Equals. For a hash table with two exprs (string,int), the
// IR looks like:
//
// define i1 @Equals(%"class.impala::OldHashTable"* %this_ptr,
// %"class.impala::TupleRow"* %row) {
// entry:
// %result = call i64 @get_slot_ref(%"class.impala::ExprContext"* inttoptr
// (i64 146381856 to %"class.impala::ExprContext"*),
// %"class.impala::TupleRow"* %row)
// %0 = trunc i64 %result to i1
// br i1 %0, label %null, label %not_null
//
// false_block: ; preds = %not_null2, %null1, %not_null, %null
// ret i1 false
//
// null: ; preds = %entry
// br i1 false, label %continue, label %false_block
//
// not_null: ; preds = %entry
// %1 = load i32* inttoptr (i64 104774368 to i32*)
// %2 = ashr i64 %result, 32
// %3 = trunc i64 %2 to i32
// %cmp_raw = icmp eq i32 %3, %1
// br i1 %cmp_raw, label %continue, label %false_block
//
// continue: ; preds = %not_null, %null
// %result4 = call { i64, i8* } @get_slot_ref(
// %"class.impala::ExprContext"* inttoptr
// (i64 146381696 to %"class.impala::ExprContext"*),
// %"class.impala::TupleRow"* %row)
// %4 = extractvalue { i64, i8* } %result4, 0
// %5 = trunc i64 %4 to i1
// br i1 %5, label %null1, label %not_null2
//
// null1: ; preds = %continue
// br i1 false, label %continue3, label %false_block
//
// not_null2: ; preds = %continue
// %6 = extractvalue { i64, i8* } %result4, 0
// %7 = ashr i64 %6, 32
// %8 = trunc i64 %7 to i32
// %result5 = extractvalue { i64, i8* } %result4, 1
// %cmp_raw6 = call i1 @_Z11StringValEQPciPKN6impala11StringValueE(
// i8* %result5, i32 %8, %"struct.impala::StringValue"* inttoptr
// (i64 104774384 to %"struct.impala::StringValue"*))
// br i1 %cmp_raw6, label %continue3, label %false_block
//
// continue3: ; preds = %not_null2, %null1
// ret i1 true
// }
Function* HashTable::codegen_equals(RuntimeState* state) {
for (int i = 0; i < _build_expr_ctxs.size(); ++i) {
// Disable codegen for CHAR
if (_build_expr_ctxs[i]->root()->type().type == TYPE_CHAR) {
return NULL;
}
}
LlvmCodeGen* codegen = NULL;
if (!state->get_codegen(&codegen).ok()) {
return NULL;
}
// Get types to generate function prototype
Type* tuple_row_type = codegen->get_type(TupleRow::_s_llvm_class_name);
DCHECK(tuple_row_type != NULL);
PointerType* tuple_row_ptr_type = PointerType::get(tuple_row_type, 0);
Type* this_type = codegen->get_type(HashTable::_s_llvm_class_name);
DCHECK(this_type != NULL);
PointerType* this_ptr_type = PointerType::get(this_type, 0);
LlvmCodeGen::FnPrototype prototype(codegen, "equals", codegen->get_type(TYPE_BOOLEAN));
prototype.add_argument(LlvmCodeGen::NamedVariable("this_ptr", this_ptr_type));
prototype.add_argument(LlvmCodeGen::NamedVariable("row", tuple_row_ptr_type));
LLVMContext& context = codegen->context();
LlvmCodeGen::LlvmBuilder builder(context);
Value* args[2];
Function* fn = prototype.generate_prototype(&builder, args);
Value* row = args[1];
if (!_build_expr_ctxs.empty()) {
BasicBlock* false_block = BasicBlock::Create(context, "false_block", fn);
for (int i = 0; i < _build_expr_ctxs.size(); ++i) {
BasicBlock* null_block = BasicBlock::Create(context, "null", fn);
BasicBlock* not_null_block = BasicBlock::Create(context, "not_null", fn);
BasicBlock* continue_block = BasicBlock::Create(context, "continue", fn);
// call GetValue on build_exprs[i]
Function* expr_fn = NULL;
Status status = _build_expr_ctxs[i]->root()->get_codegend_compute_fn(state, &expr_fn);
if (!status.ok()) {
std::stringstream ss;
ss << "Problem with codegen: " << status.get_error_msg();
// TODO(zc)
// state->LogError(ErrorMsg(TErrorCode::GENERAL, ss.str()));
fn->eraseFromParent(); // deletes function
return NULL;
}
Value* ctx_arg = codegen->cast_ptr_to_llvm_ptr(
codegen->get_ptr_type(ExprContext::_s_llvm_class_name), _build_expr_ctxs[i]);
Value* expr_fn_args[] = { ctx_arg, row };
CodegenAnyVal result = CodegenAnyVal::create_call_wrapped(
codegen, &builder, _build_expr_ctxs[i]->root()->type(),
expr_fn, expr_fn_args, "result", NULL);
Value* is_null = result.get_is_null();
// Determine if probe is null (i.e. _expr_value_null_bits[i] == true). In
// the case where the hash table does not store nulls, this is always false.
Value* probe_is_null = codegen->false_value();
uint8_t* null_byte_loc = &_expr_value_null_bits[i];
if (_stores_nulls) {
Value* llvm_null_byte_loc =
codegen->cast_ptr_to_llvm_ptr(codegen->ptr_type(), null_byte_loc);
Value* null_byte = builder.CreateLoad(llvm_null_byte_loc);
probe_is_null = builder.CreateICmpNE(
null_byte, codegen->get_int_constant(TYPE_TINYINT, 0));
}
// Get llvm value for probe_val from '_expr_values_buffer'
void* loc = _expr_values_buffer + _expr_values_buffer_offsets[i];
Value* probe_val = codegen->cast_ptr_to_llvm_ptr(
codegen->get_ptr_type(_build_expr_ctxs[i]->root()->type()), loc);
// Branch for GetValue() returning NULL
builder.CreateCondBr(is_null, null_block, not_null_block);
// Null block
builder.SetInsertPoint(null_block);
builder.CreateCondBr(probe_is_null, continue_block, false_block);
// Not-null block
builder.SetInsertPoint(not_null_block);
if (_stores_nulls) {
BasicBlock* cmp_block = BasicBlock::Create(context, "cmp", fn);
// First need to compare that probe expr[i] is not null
builder.CreateCondBr(probe_is_null, false_block, cmp_block);
builder.SetInsertPoint(cmp_block);
}
// Check result == probe_val
Value* is_equal = result.eq_to_native_ptr(probe_val);
builder.CreateCondBr(is_equal, continue_block, false_block);
builder.SetInsertPoint(continue_block);
}
builder.CreateRet(codegen->true_value());
builder.SetInsertPoint(false_block);
builder.CreateRet(codegen->false_value());
} else {
builder.CreateRet(codegen->true_value());
}
return codegen->finalize_function(fn);
}
}

be/src/exec/hash_table.h Normal file

@ -0,0 +1,444 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_HASH_TABLE_H
#define BDG_PALO_BE_SRC_QUERY_EXEC_HASH_TABLE_H
#include <vector>
#include <boost/cstdint.hpp>
#include "codegen/palo_ir.h"
#include "common/logging.h"
#include "util/hash_util.hpp"
namespace llvm {
class Function;
}
namespace palo {
class Expr;
class ExprContext;
class LlvmCodeGen;
class RowDescriptor;
class Tuple;
class TupleRow;
class MemTracker;
class RuntimeState;
using std::vector;
// Hash table implementation designed for hash aggregation and hash joins. This is not
// templatized and is tailored to the usage pattern for aggregation and joins. The
// hash table stores TupleRows and allows for different exprs for insertions and finds.
// This is the pattern we use for joins and aggregation where the input/build tuple
// row descriptor is different from the find/probe descriptor.
// The table is optimized for the query engine's use case as much as possible and is not
// intended to be a generic hash table implementation. The API loosely mimics the
// std::unordered_set API.
//
// The hash table stores evaluated expr results for the current row being processed
// when possible into a contiguous memory buffer. This allows for very efficient
// computation for hashing. The implementation is also designed to allow codegen
// for some paths.
//
// The hash table does not support removes. The hash table is not thread safe.
//
// The implementation is based on the boost multiset. The hashtable is implemented by
// two data structures: a vector of buckets and a vector of nodes. Inserted values
// are stored as nodes (in the order they are inserted). The buckets (indexed by the
// mod of the hash) contain pointers to the node vector. Nodes that fall in the same
// bucket are linked together (the bucket pointer gets you the head of that linked list).
// When growing the hash table, the number of buckets is doubled, and nodes from a
// particular bucket either stay in place or move to an analogous bucket in the second
// half of buckets. This behavior allows us to avoid moving about half the nodes each
// time, and maintains good cache properties by only accessing 2 buckets at a time.
// The node vector is modified in place.
// Due to the doubling nature of the buckets, we require that the number of buckets is a
// power of 2. This allows us to determine if a node needs to move by simply checking a
// single bit, and further allows us to initially hash nodes using a bitmask.
//
// TODO: this is not a fancy hash table in terms of memory access patterns (cuckoo-hashing
// or something that spills to disk). We will likely want to invest more time into this.
// TODO: hash-join and aggregation have very different access patterns. Joins insert
// all the rows and then call scan to find them. Aggregation interleaves find() and
// insert(). We may want to optimize joins more heavily for insert() (in particular
// growing).
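// Illustrative picture of the two structures (hypothetical contents): after
// inserting rows r0, r1, r2 with 4 buckets, where r0 and r2 hash to bucket 1
// and r1 to bucket 3:
//   _buckets: [0]=-1  [1]=2  [2]=-1  [3]=1
//   _nodes:   n0{_next_idx=-1, r0}  n1{_next_idx=-1, r1}  n2{_next_idx=0, r2}
// New nodes are prepended to a bucket's chain, so bucket 1 yields n2 then n0.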
class HashTable {
private:
struct Node;
public:
class Iterator;
// Create a hash table.
// - build_exprs are the exprs that should be used to evaluate rows during insert().
// - probe_exprs are used during find()
// - num_build_tuples: number of Tuples in the build tuple row
// - stores_nulls: if false, TupleRows with nulls are ignored during Insert
// - num_buckets: number of buckets that the hash table should be initialized to
// - mem_tracker: all memory allocations for nodes and buckets are tracked
// against this tracker; it must remain valid until close() is called
// - initial_seed: Initial seed value to use when computing hashes for rows
HashTable(
const std::vector<ExprContext*>& build_exprs,
const std::vector<ExprContext*>& probe_exprs,
int num_build_tuples, bool stores_nulls, int32_t initial_seed,
MemTracker* mem_tracker,
int64_t num_buckets);
~HashTable();
// Call to cleanup any resources. Must be called once.
void close();
// Insert row into the hash table. Row will be evaluated over _build_expr_ctxs
// This will grow the hash table if necessary
void IR_ALWAYS_INLINE insert(TupleRow* row) {
if (_num_filled_buckets > _num_buckets_till_resize) {
// TODO: next prime instead of double?
resize_buckets(_num_buckets * 2);
}
insert_impl(row);
}
// Returns the start iterator for all rows that match 'probe_row'. 'probe_row' is
// evaluated with _probe_expr_ctxs. The iterator can be iterated until HashTable::end()
// to find all the matching rows.
// Only one scan may be in progress at any time (i.e. it is not legal to call
// find(), begin iterating through all the matches, call another find(),
// and continue iterating from the first scan's iterator).
// Advancing the returned iterator will go to the next matching row. The matching
// rows are evaluated lazily (i.e. computed as the Iterator is moved).
// Returns HashTable::end() if there is no match.
Iterator IR_ALWAYS_INLINE find(TupleRow* probe_row);
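// Typical probe-side usage (sketch, mirroring process_probe_batch()):
//   HashTable::Iterator it = tbl->find(probe_row);
//   while (it.has_next()) {
//     TupleRow* build_row = it.get_row();
//     it.next<true>();  // <true>: advance only to rows matching the probe keys
//     // ... join build_row with probe_row ...
//   }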
// Returns number of elements in the hash table
int64_t size() {
return _num_nodes;
}
// Returns the number of buckets
int64_t num_buckets() {
return _buckets.size();
}
// true if the memory limit was exceeded
bool exceeded_limit() const {
return _exceeded_limit;
}
// Returns the load factor (the fraction of non-empty buckets)
float load_factor() {
return _num_filled_buckets / static_cast<float>(_buckets.size());
}
// Returns the number of bytes allocated to the hash table
int64_t byte_size() const {
return _node_byte_size * _nodes_capacity + sizeof(Bucket) * _buckets.size();
}
// Returns the results of the exprs at 'expr_idx' evaluated over the last row
// processed by the HashTable.
// This value is invalid if the expr evaluated to NULL.
// TODO: this is an awkward abstraction but aggregation node can take advantage of
// it and save some expr evaluation calls.
void* last_expr_value(int expr_idx) const {
return _expr_values_buffer + _expr_values_buffer_offsets[expr_idx];
}
// Returns if the expr at 'expr_idx' evaluated to NULL for the last row.
bool last_expr_value_null(int expr_idx) const {
return _expr_value_null_bits[expr_idx];
}
// Return beginning of hash table. Advancing this iterator will traverse all
// elements.
Iterator begin();
// Returns end marker
Iterator end() {
return Iterator();
}
/// Codegen for evaluating a tuple row. Codegen'd function matches the signature
/// for EvalBuildRow and EvalTupleRow.
/// if build_row is true, the codegen uses the build_exprs, otherwise the probe_exprs
llvm::Function* codegen_eval_tuple_row(RuntimeState* state, bool build_row);
/// Codegen for hashing the expr values in '_expr_values_buffer'. Function
/// prototype matches hash_current_row identically.
llvm::Function* codegen_hash_current_row(RuntimeState* state);
/// Codegen for evaluating a TupleRow and comparing equality against
/// '_expr_values_buffer'. Function signature matches HashTable::Equals()
llvm::Function* codegen_equals(RuntimeState* state);
static const char* _s_llvm_class_name;
// Dump out the entire hash table to string. If skip_empty, empty buckets are
// skipped. If build_desc is non-null, the build rows will be output. Otherwise
// just the build row addresses.
std::string debug_string(bool skip_empty, const RowDescriptor* build_desc);
// stl-like iterator interface.
class Iterator {
public:
Iterator() : _table(NULL), _bucket_idx(-1), _node_idx(-1) {
}
// Iterates to the next element. In the case where the iterator was
// from a Find, this will lazily evaluate that bucket, only returning
// TupleRows that match the current scan row.
template<bool check_match>
void IR_ALWAYS_INLINE next();
// Returns the current row or NULL if at end.
TupleRow* get_row() {
if (_node_idx == -1) {
return NULL;
}
return _table->get_node(_node_idx)->data();
}
// Returns true if there are more rows to iterate over (i.e. not at the end)
bool has_next() {
return _node_idx != -1;
}
// Returns true if this iterator is at the end, i.e. get_row() cannot be called.
bool at_end() {
return _node_idx == -1;
}
// Sets the node currently pointed to by the iterator as matched. The iterator
// cannot be at_end().
void set_matched() {
DCHECK(!at_end());
Node *node = _table->get_node(_node_idx);
node->matched = true;
}
bool matched() {
DCHECK(!at_end());
Node *node = _table->get_node(_node_idx);
return node->matched;
}
bool operator==(const Iterator& rhs) {
return _bucket_idx == rhs._bucket_idx && _node_idx == rhs._node_idx;
}
bool operator!=(const Iterator& rhs) {
return _bucket_idx != rhs._bucket_idx || _node_idx != rhs._node_idx;
}
private:
friend class HashTable;
Iterator(HashTable* table, int64_t bucket_idx, int64_t node, uint32_t hash) :
_table(table),
_bucket_idx(bucket_idx),
_node_idx(node),
_scan_hash(hash) {
}
HashTable* _table;
// Current bucket idx
int64_t _bucket_idx;
// Current node idx (within current bucket)
int64_t _node_idx;
// Cached hash value for the row passed to find()
uint32_t _scan_hash;
};
private:
friend class Iterator;
friend class HashTableTest;
// Header portion of a Node. The node data (TupleRow) is stored immediately
// after the Node header to maximize cache hits.
struct Node {
int64_t _next_idx; // chain to next node for collisions
uint32_t _hash; // Cached hash of the row data stored after this node
bool matched;
Node() : _next_idx(-1), _hash(-1), matched(false) {
}
TupleRow* data() {
uint8_t* mem = reinterpret_cast<uint8_t*>(this);
DCHECK_EQ(reinterpret_cast<uint64_t>(mem) % 8, 0);
return reinterpret_cast<TupleRow*>(mem + sizeof(Node));
}
};
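// Likely memory layout of a single entry, as implied by data() and
// _node_byte_size (a sketch, assuming no extra padding between entries):
//   [ Node header | Tuple* x _num_build_tuples ]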
struct Bucket {
int64_t _node_idx;
Bucket() {
_node_idx = -1;
}
};
// Returns the next non-empty bucket and updates 'bucket_idx' to its index.
// If there are no more buckets, returns NULL and sets 'bucket_idx' to -1.
Bucket* next_bucket(int64_t* bucket_idx);
// Returns the node at 'idx'. Tracking structures use indices rather than
// pointers because node addresses change when '_nodes' is realloc'd.
Node* get_node(int64_t idx) {
DCHECK_NE(idx, -1);
return reinterpret_cast<Node*>(_nodes + _node_byte_size * idx);
}
// Resize the hash table to 'num_buckets'
void resize_buckets(int64_t num_buckets);
// Insert row into the hash table
void IR_ALWAYS_INLINE insert_impl(TupleRow* row);
// Chains the node at 'node_idx' to 'bucket'. Nodes in a bucket are chained
// as a linked list; this places the new node at the beginning of the list.
void add_to_bucket(Bucket* bucket, int64_t node_idx, Node* node);
// Moves a node from one bucket to another. 'previous_node' refers to the
// node (if any) that's chained before this node in from_bucket's linked list.
void move_node(Bucket* from_bucket, Bucket* to_bucket, int64_t node_idx, Node* node,
Node* previous_node);
// Evaluate the exprs over row and cache the results in '_expr_values_buffer'.
// Returns whether any expr evaluated to NULL
// This will be replaced by codegen
bool eval_row(TupleRow* row, const std::vector<ExprContext*>& exprs);
// Evaluate 'row' over _build_expr_ctxs caching the results in '_expr_values_buffer'
// This will be replaced by codegen. We do not want this function inlined when
// cross compiled because we need to be able to differentiate between eval_build_row
// and eval_probe_row by name, and the _build_expr_ctxs/_probe_expr_ctxs are baked into
// the codegen'd function.
bool IR_NO_INLINE eval_build_row(TupleRow* row) {
return eval_row(row, _build_expr_ctxs);
}
// Evaluate 'row' over _probe_expr_ctxs caching the results in '_expr_values_buffer'
// This will be replaced by codegen.
bool IR_NO_INLINE eval_probe_row(TupleRow* row) {
return eval_row(row, _probe_expr_ctxs);
}
// Compute the hash of the values in _expr_values_buffer.
// This will be replaced by codegen. We don't want this inlined so that the
// function keeps its name and can be replaced by the codegen'd version.
uint32_t IR_NO_INLINE hash_current_row() {
if (_var_result_begin == -1) {
// This handles NULLs implicitly since a constant seed value was put
// into results buffer for nulls.
return HashUtil::hash(_expr_values_buffer, _results_buffer_size, _initial_seed);
} else {
return hash_variable_len_row();
}
}
// Compute the hash of the values in _expr_values_buffer for rows with variable length
// fields (e.g. strings)
uint32_t hash_variable_len_row();
// Returns true if the values of build_exprs evaluated over 'build_row' equal
// the values cached in _expr_values_buffer
// This will be replaced by codegen.
bool equals(TupleRow* build_row);
// Grow the node array.
void grow_node_array();
// Sets _mem_limit_exceeded to true and marks the query as MEM_LIMIT_EXCEEDED.
// 'allocation_size' is the attempted size of the allocation that would have
// brought us over the memory limit.
void mem_limit_exceeded(int64_t allocation_size);
// Load factor that will trigger growing the hash table on insert. This is
// defined as the number of non-empty buckets / total_buckets
static const float MAX_BUCKET_OCCUPANCY_FRACTION;
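// Illustrative example (the fraction's actual value lives in the .cc file and
// is assumed here): with 1024 buckets and a fraction of 0.75,
// _num_buckets_till_resize would be 768, so the insert that fills the 768th
// bucket triggers a rehash into a larger bucket array.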
const std::vector<ExprContext*>& _build_expr_ctxs;
const std::vector<ExprContext*>& _probe_expr_ctxs;
// Number of Tuple* in the build tuple row
const int _num_build_tuples;
const bool _stores_nulls;
const int32_t _initial_seed;
// Size of hash table nodes. This includes a fixed size header and the Tuple*'s that
// follow.
const int _node_byte_size;
// Number of non-empty buckets. Used to determine when to grow and rehash
int64_t _num_filled_buckets;
// Memory to store node data. This is not allocated from a pool to take advantage
// of realloc.
// TODO: integrate with mem pools
uint8_t* _nodes;
// number of nodes stored (i.e. size of hash table)
int64_t _num_nodes;
// max number of nodes that can be stored in '_nodes' before realloc
int64_t _nodes_capacity;
bool _exceeded_limit; // true if _mem_tracker->limit_exceeded()
MemTracker* _mem_tracker;
// Set to true if the hash table exceeds the memory limit. If this is set,
// subsequent calls to insert() will be ignored.
bool _mem_limit_exceeded;
std::vector<Bucket> _buckets;
// Cached copy of _buckets.size(); avoids calling size() in hot code paths
int64_t _num_buckets;
// The number of filled buckets to trigger a resize. This is cached for efficiency
int64_t _num_buckets_till_resize;
// Cache of expr values for the current row being evaluated. This can either
// be a build row (during insert()) or a probe row (during find()).
std::vector<int> _expr_values_buffer_offsets;
// byte offset into _expr_values_buffer that begins the variable length results
int _var_result_begin;
// byte size of '_expr_values_buffer'
int _results_buffer_size;
// buffer to store evaluated expr results. This address must not change once
// allocated since the address is baked into the codegen
uint8_t* _expr_values_buffer;
// Use bytes instead of bools to be compatible with llvm. This address must
// not change once allocated.
uint8_t* _expr_value_null_bits;
};
}
#endif

176
be/src/exec/hash_table.hpp Normal file

@ -0,0 +1,176 @@
// Modifications copyright (C) 2017, Baidu.com, Inc.
// Copyright 2017 The Apache Software Foundation
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#ifndef BDG_PALO_BE_SRC_QUERY_EXEC_HASH_TABLE_HPP
#define BDG_PALO_BE_SRC_QUERY_EXEC_HASH_TABLE_HPP
#include "exec/hash_table.h"
namespace palo {
inline HashTable::Iterator HashTable::find(TupleRow* probe_row) {
bool has_nulls = eval_probe_row(probe_row);
if (!_stores_nulls && has_nulls) {
return end();
}
uint32_t hash = hash_current_row();
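// _num_buckets is presumably kept a power of two (see resize_buckets), so the
// mask below is equivalent to (hash % _num_buckets) without a division.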
int64_t bucket_idx = hash & (_num_buckets - 1);
Bucket* bucket = &_buckets[bucket_idx];
int64_t node_idx = bucket->_node_idx;
while (node_idx != -1) {
Node* node = get_node(node_idx);
if (node->_hash == hash && equals(node->data())) {
return Iterator(this, bucket_idx, node_idx, hash);
}
node_idx = node->_next_idx;
}
return end();
}
inline HashTable::Iterator HashTable::begin() {
int64_t bucket_idx = -1;
Bucket* bucket = next_bucket(&bucket_idx);
if (bucket != NULL) {
return Iterator(this, bucket_idx, bucket->_node_idx, 0);
}
return end();
}
inline HashTable::Bucket* HashTable::next_bucket(int64_t* bucket_idx) {
++*bucket_idx;
for (; *bucket_idx < _num_buckets; ++*bucket_idx) {
if (_buckets[*bucket_idx]._node_idx != -1) {
return &_buckets[*bucket_idx];
}
}
*bucket_idx = -1;
return NULL;
}
inline void HashTable::insert_impl(TupleRow* row) {
bool has_null = eval_build_row(row);
if (!_stores_nulls && has_null) {
return;
}
uint32_t hash = hash_current_row();
int64_t bucket_idx = hash & (_num_buckets - 1);
if (_num_nodes == _nodes_capacity) {
grow_node_array();
}
Node* node = get_node(_num_nodes);
TupleRow* data = node->data();
node->_hash = hash;
memcpy(data, row, sizeof(Tuple*) * _num_build_tuples);
add_to_bucket(&_buckets[bucket_idx], _num_nodes, node);
++_num_nodes;
}
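// Typical build-side usage (a sketch; assumes the public insert() wrapper that
// performs the memory-limit check before delegating to insert_impl(), and a
// RowBatch 'batch' from the build child):
//
//   for (int i = 0; i < batch->num_rows(); ++i) {
//       ht.insert(batch->get_row(i));
//   }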
inline void HashTable::add_to_bucket(Bucket* bucket, int64_t node_idx, Node* node) {
if (bucket->_node_idx == -1) {
++_num_filled_buckets;
}
node->_next_idx = bucket->_node_idx;
bucket->_node_idx = node_idx;
}
inline void HashTable::move_node(Bucket* from_bucket, Bucket* to_bucket,
int64_t node_idx, Node* node, Node* previous_node) {
int64_t next_idx = node->_next_idx;
if (previous_node != NULL) {
previous_node->_next_idx = next_idx;
} else {
// Update bucket directly
from_bucket->_node_idx = next_idx;
if (next_idx == -1) {
--_num_filled_buckets;
}
}
add_to_bucket(to_bucket, node_idx, node);
}
template<bool check_match>
inline void HashTable::Iterator::next() {
if (_bucket_idx == -1) {
return;
}
// TODO: this should prefetch the next TupleRow
Node* node = _table->get_node(_node_idx);
// If 'check_match' is true, the iterator came from a find() rather than a full
// table scan, so equality must be evaluated here. Only the current bucket needs
// to be scanned; '_expr_values_buffer' holds the results for the current probe row.
if (check_match) {
// TODO: this should prefetch the next node
int64_t next_idx = node->_next_idx;
while (next_idx != -1) {
node = _table->get_node(next_idx);
if (node->_hash == _scan_hash && _table->equals(node->data())) {
_node_idx = next_idx;
return;
}
next_idx = node->_next_idx;
}
*this = _table->end();
} else {
// Move onto the next chained node
if (node->_next_idx != -1) {
_node_idx = node->_next_idx;
return;
}
// Move onto the next bucket
Bucket* bucket = _table->next_bucket(&_bucket_idx);
if (bucket == NULL) {
_bucket_idx = -1;
_node_idx = -1;
} else {
_node_idx = bucket->_node_idx;
}
}
}
}
#endif

Some files were not shown because too many files have changed in this diff.