ScalarArrayOpExprs with "useOr=true" and a set of Consts on the righthand
side have traditionally been evaluated by using a linear search over the
array. When these arrays contain large numbers of elements then this
linear search could become a significant part of execution time.
Here we add a new method of evaluating ScalarArrayOpExpr expressions to
allow them to be evaluated by first building a hash table containing each
element, then on subsequent evaluations, we just probe that hash table to
determine if there is a match.
The planner is in charge of determining when this optimization is possible
and it enables it by setting hashfuncid in the ScalarArrayOpExpr. The
executor will only perform the hash table evaluation when the hashfuncid
is set.
This means that not all cases are optimized. For example CHECK constraints
containing an IN clause won't go through the planner, so won't get the
hashfuncid set. We could maybe do something about that at some later
date. The reason we're not doing it now is from fear that we may slow
down cases where the expression is evaluated only once. Those cases can
be common, for example, a single row INSERT to a table with a CHECK
constraint containing an IN clause.
In the planner, we enable this when there are suitable hash functions for
the ScalarArrayOpExpr's operator and only when there is at least
MIN_ARRAY_SIZE_FOR_HASHED_SAOP elements in the array. The threshold is
currently set to 9.
Author: James Coleman, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAaqYe8x62+=wn0zvNKCj55tPpg-JBHzhZFFc6ANovdqFw7-dA@mail.gmail.com
There was some code in gen_prune_steps_from_opexps that needlessly
checked a list was not empty when it clearly had to contain at least one
item. This prompted a further cleanup operation in partprune.c.
Additionally, the previous code could end up adding additional needless
INTERSECT steps. However, those do not appear to be able to cause any
misbehavior.
gen_prune_steps_from_opexps is now no longer in charge of generating
combine pruning steps. Instead, gen_partprune_steps_internal, which
already does some combine step creation has been given the sole
responsibility of generating all combine steps. This means that when
we recursively call gen_partprune_steps_internal, since it always now adds
a combine step when it produces multiple steps, we can just pay attention
to the final step returned.
In passing, do quite a bit of work on the comments to try to more clearly
explain the role of both gen_partprune_steps_internal and
gen_prune_steps_from_opexps. This is fairly complex code so some extra
effort to give any new readers an overview of how things work seems like
a good idea.
Author: Amit Langote
Reported-by: Andy Fan
Reviewed-by: Kyotaro Horiguchi, Andy Fan, Ryan Lambert, David Rowley
Discussion: https://postgr.es/m/CAKU4AWqWoVii+bRTeBQmeVW+PznkdO8DfbwqNsu9Gj4ubt9A6w@mail.gmail.com
This adds support for writing CREATE FUNCTION and CREATE PROCEDURE
statements for language SQL with a function body that conforms to the
SQL standard and is portable to other implementations.
Instead of the PostgreSQL-specific AS $$ string literal $$ syntax,
this allows writing out the SQL statements making up the body
unquoted, either as a single statement:
CREATE FUNCTION add(a integer, b integer) RETURNS integer
LANGUAGE SQL
RETURN a + b;
or as a block
CREATE PROCEDURE insert_data(a integer, b integer)
LANGUAGE SQL
BEGIN ATOMIC
INSERT INTO tbl VALUES (a);
INSERT INTO tbl VALUES (b);
END;
The function body is parsed at function definition time and stored as
expression nodes in a new pg_proc column prosqlbody. So at run time,
no further parsing is required.
However, this form does not support polymorphic arguments, because
there is no more parse analysis done at call time.
Dependencies between the function and the objects it uses are fully
tracked.
A new RETURN statement is introduced. This can only be used inside
function bodies. Internally, it is treated much like a SELECT
statement.
psql needs some new intelligence to keep track of function body
boundaries so that it doesn't send off statements when it sees
semicolons that are inside a function body.
Tested-by: Jaime Casanova <jcasanov@systemguards.com.ec>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/1c11f1eb-f00c-43b7-799d-2d44132c02d7@2ndquadrant.com
Documentation and comments in code and tests have been using the terms
sensitive/insensitive cursor incorrectly relative to the SQL standard.
(Cursor sensitivity is only relevant for changes made in the same
transaction as the cursor, not for concurrent changes in other
sessions.) Moreover, some of the behavior of PostgreSQL is incorrect
according to the SQL standard, confusing the issue further. (WHERE
CURRENT OF changes are not visible in insensitive cursors, but they
should be.)
This change corrects the terminology and removes the claim that
sensitive cursors are supported. It also adds a test case that checks
the insensitive behavior in a "correct" way, using a change command
not using WHERE CURRENT OF. Finally, it adds the ASENSITIVE cursor
option to select the default asensitive behavior, per SQL standard.
There are no changes to cursor behavior in this patch.
Discussion: https://www.postgresql.org/message-id/flat/96ee8b30-9889-9e1b-b053-90e10c050e85%40enterprisedb.com
Delay creation of the projections for INSERT and UPDATE tuples
until they're needed. This saves a pretty fair amount of work
when only some of the partitions are actually touched.
The logic associated with identifying junk columns in UPDATE/DELETE
is moved to another loop, allowing removal of one loop over the
target relations; but it didn't actually change at all.
Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.
Amit Langote, reviewed at different times by Heikki Linnakangas and
myself
Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
Arrange to do some things on-demand, rather than immediately during
executor startup, because there's a fair chance of never having to do
them at all:
* Don't open result relations' indexes until needed.
* Don't initialize partition tuple routing, nor the child-to-root
tuple conversion map, until needed.
This wins in UPDATEs on partitioned tables when only some of the
partitions will actually receive updates; with larger partition
counts the savings is quite noticeable. Also, we can remove some
sketchy heuristics in ExecInitModifyTable about whether to set up
tuple routing.
Also, remove execPartition.c's private hash table tracking which
partitions were already opened by the ModifyTable node. Instead
use the hash added to ModifyTable itself by commit 86dc90056.
To allow lazy computation of the conversion maps, we now set
ri_RootResultRelInfo in all child ResultRelInfos. We formerly set it
only in some, not terribly well-defined, cases. This has user-visible
side effects in that now more error messages refer to the root
relation instead of some partition (and provide error data in the
root's column order, too). It looks to me like this is a strict
improvement in consistency, so I don't have a problem with the
output changes visible in this commit.
Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.
Amit Langote, reviewed at different times by Heikki Linnakangas and
myself
Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
Commit 3e98c0bafb added pg_backend_memory_contexts view to display
the memory contexts of the backend process. However its target process
is limited to the backend that is accessing to the view. So this is
not so convenient when investigating the local memory bloat of other
backend process. To improve this situation, this commit adds
pg_log_backend_memory_contexts() function that requests to log
the memory contexts of the specified backend process.
This information can be also collected by calling
MemoryContextStats(TopMemoryContext) via a debugger. But
this technique cannot be used in some environments because no debugger
is available there. So, pg_log_backend_memory_contexts() allows us to
see the memory contexts of specified backend more easily.
Only superusers are allowed to request to log the memory contexts
because allowing any users to issue this request at an unbounded rate
would cause lots of log messages and which can lead to denial of service.
On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its memory contexts at LOG_SERVER_ONLY level,
so that these memory contexts will appear in the server log but not
be sent to the client. It logs one message per memory context.
Because if it buffers all memory contexts into StringInfo to log them
as one message, which may require the buffer to be enlarged very much
and lead to OOM error since there can be a large number of memory
contexts in a backend.
When a backend process is consuming huge memory, logging all its
memory contexts might overrun available disk space. To prevent this,
now this patch limits the number of child contexts to log per parent
to 100. As with MemoryContextStats(), it supposes that practical cases
where the log gets long will typically be huge numbers of siblings
under the same parent context; while the additional debugging value
from seeing details about individual siblings beyond 100 will not be large.
There was another proposed patch to add the function to return
the memory contexts of specified backend as the result sets,
instead of logging them, in the discussion. However that patch is
not included in this commit because it had several issues to address.
Thanks to Tatsuhito Kasahara, Andres Freund, Tom Lane, Tomas Vondra,
Michael Paquier, Kyotaro Horiguchi and Zhihong Yu for the discussion.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Kyotaro Horiguchi, Zhihong Yu, Fujii Masao
Discussion: https://postgr.es/m/0271f440ac77f2a4180e0e56ebd944d1@oss.nttdata.com
Move the planner-control flags up so that there is more room for parse
options. Some pending patches need some room there, so do this
renumbering separately so that there is less potential for conflicts.
Here we add a new executor node type named "Result Cache". The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins. This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again. Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.
For certain data sets, this can significantly improve the performance of
joins. The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join. In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch. Merge joins would have to
skip over all of the unmatched rows. If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join. The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large. Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join. This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does. The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables. Smaller hash tables generally perform better.
The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size. We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.
For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node. We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be. Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.
For now, the planner will only consider using a result cache for
parameterized nested loop joins. This works for both normal joins and
also for LATERAL type joins to subqueries. It is possible to use this new
node for other uses in the future. For example, to cache results from
correlated subqueries. However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio. Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.
The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations. With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be. In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%. Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join. However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values. If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join. Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature. Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.
For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache. However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default. There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression. Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default. It remains to be seen if we'll
maintain that setting for the release.
Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch. Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people. If there's some consensus on a better name, then we can
change it before the release. Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.
Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
This removes "Add Result Cache executor node". It seems that something
weird is going on with the tracking of cache hits and misses as
highlighted by many buildfarm animals. It's not yet clear what the
problem is as other parts of the plan indicate that the cache did work
correctly, it's just the hits and misses that were being reported as 0.
This is especially a bad time to have the buildfarm so broken, so
reverting before too many more animals go red.
Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
Here we add a new executor node type named "Result Cache". The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins. This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again. Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.
For certain data sets, this can significantly improve the performance of
joins. The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join. In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch. Merge joins would have to
skip over all of the unmatched rows. If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join. The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large. Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join. This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does. The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables. Smaller hash tables generally perform better.
The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size. We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.
For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node. We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be. Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.
For now, the planner will only consider using a result cache for
parameterized nested loop joins. This works for both normal joins and
also for LATERAL type joins to subqueries. It is possible to use this new
node for other uses in the future. For example, to cache results from
correlated subqueries. However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio. Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.
The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations. With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be. In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%. Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join. However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values. If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join. Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature. Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.
For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache. However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default. There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression. Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default. It remains to be seen if we'll
maintain that setting for the release.
Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch. Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people. If there's some consensus on a better name, then we can
change it before the release. Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.
Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
This patch makes two closely related sets of changes:
1. For UPDATE, the subplan of the ModifyTable node now only delivers
the new values of the changed columns (i.e., the expressions computed
in the query's SET clause) plus row identity information such as CTID.
ModifyTable must re-fetch the original tuple to merge in the old
values of any unchanged columns. The core advantage of this is that
the changed columns are uniform across all tables of an inherited or
partitioned target relation, whereas the other columns might not be.
A secondary advantage, when the UPDATE involves joins, is that less
data needs to pass through the plan tree. The disadvantage of course
is an extra fetch of each tuple to be updated. However, that seems to
be very nearly free in context; even worst-case tests don't show it to
add more than a couple percent to the total query cost. At some point
it might be interesting to combine the re-fetch with the tuple access
that ModifyTable must do anyway to mark the old tuple dead; but that
would require a good deal of refactoring and it seems it wouldn't buy
all that much, so this patch doesn't attempt it.
2. For inherited UPDATE/DELETE, instead of generating a separate
subplan for each target relation, we now generate a single subplan
that is just exactly like a SELECT's plan, then stick ModifyTable
on top of that. To let ModifyTable know which target relation a
given incoming row refers to, a tableoid junk column is added to
the row identity information. This gets rid of the horrid hack
that was inheritance_planner(), eliminating O(N^2) planning cost
and memory consumption in cases where there were many unprunable
target relations.
Point 2 of course requires point 1, so that there is a uniform
definition of the non-junk columns to be returned by the subplan.
We can't insist on uniform definition of the row identity junk
columns however, if we want to keep the ability to have both
plain and foreign tables in a partitioning hierarchy. Since
it wouldn't scale very far to have every child table have its
own row identity column, this patch includes provisions to merge
similar row identity columns into one column of the subplan result.
In particular, we can merge the whole-row Vars typically used as
row identity by FDWs into one column by pretending they are type
RECORD. (It's still okay for the actual composite Datums to be
labeled with the table's rowtype OID, though.)
There is more that can be done to file down residual inefficiencies
in this patch, but it seems to be committable now.
FDW authors should note several API changes:
* The argument list for AddForeignUpdateTargets() has changed, and so
has the method it must use for adding junk columns to the query. Call
add_row_identity_var() instead of manipulating the parse tree directly.
You might want to reconsider exactly what you're adding, too.
* PlanDirectModify() must now work a little harder to find the
ForeignScan plan node; if the foreign table is part of a partitioning
hierarchy then the ForeignScan might not be the direct child of
ModifyTable. See postgres_fdw for sample code.
* To check whether a relation is a target relation, it's no
longer sufficient to compare its relid to root->parse->resultRelation.
Instead, check it against all_result_relids or leaf_result_relids,
as appropriate.
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/CA+HiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ@mail.gmail.com
This implements asynchronous execution, which runs multiple parts of a
non-parallel-aware Append concurrently rather than serially to improve
performance when possible. Currently, the only node type that can be
run concurrently is a ForeignScan that is an immediate child of such an
Append. In the case where such ForeignScans access data on different
remote servers, this would run those ForeignScans concurrently, and
overlap the remote operations to be performed simultaneously, so it'll
improve the performance especially when the operations involve
time-consuming ones such as remote join and remote aggregation.
We may extend this to other node types such as joins or aggregates over
ForeignScans in the future.
This also adds the support for postgres_fdw, which is enabled by the
table-level/server-level option "async_capable". The default is false.
Robert Haas, Kyotaro Horiguchi, Thomas Munro, and myself. This commit
is mostly based on the patch proposed by Robert Haas, but also uses
stuff from the patch proposed by Kyotaro Horiguchi and from the patch
proposed by Thomas Munro. Reviewed by Kyotaro Horiguchi, Konstantin
Knizhnik, Andrey Lepikhov, Movead Li, Thomas Munro, Justin Pryzby, and
others.
Discussion: https://postgr.es/m/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com
Discussion: https://postgr.es/m/20200228.170650.667613673625155850.horikyota.ntt%40gmail.com
Here we aim to reduce duplicate work done by contain_volatile_functions()
by caching whether PathTargets and RestrictInfos contain any volatile
functions the first time contain_volatile_functions() is called for them.
Any future calls for these nodes just use the cached value rather than
going to the trouble of recursively checking the sub-node all over again.
Thanks to Tom Lane for the idea.
Any locations in the code which make changes to a PathTarget or
RestrictInfo which could change the outcome of the volatility check must
change the cached value back to VOLATILITY_UNKNOWN again.
contain_volatile_functions() is the only code in charge of setting the
cache value to either VOLATILITY_VOLATILE or VOLATILITY_NOVOLATILE.
Some existing code does benefit from this additional caching, however,
this change is mainly aimed at an upcoming patch that must check for
volatility during the join search. Repeated volatility checks in that
case can become very expensive when the join search contains more than a
few relations.
Author: David Rowley
Discussion: https://postgr.es/m/3795226.1614059027@sss.pgh.pa.us
Allow defining extended statistics on expressions, not just just on
simple column references. With this commit, expressions are supported
by all existing extended statistics kinds, improving the same types of
estimates. A simple example may look like this:
CREATE TABLE t (a int);
CREATE STATISTICS s ON mod(a,10), mod(a,20) FROM t;
ANALYZE t;
The collected statistics are useful e.g. to estimate queries with those
expressions in WHERE or GROUP BY clauses:
SELECT * FROM t WHERE mod(a,10) = 0 AND mod(a,20) = 0;
SELECT 1 FROM t GROUP BY mod(a,10), mod(a,20);
This introduces new internal statistics kind 'e' (expressions) which is
built automatically when the statistics object definition includes any
expressions. This represents single-expression statistics, as if there
was an expression index (but without the index maintenance overhead).
The statistics is stored in pg_statistics_ext_data as an array of
composite types, which is possible thanks to 79f6a942bd.
CREATE STATISTICS allows building statistics on a single expression, in
which case in which case it's not possible to specify statistics kinds.
A new system view pg_stats_ext_exprs can be used to display expression
statistics, similarly to pg_stats and pg_stats_ext views.
ALTER TABLE ... ALTER COLUMN ... TYPE now treats indexes the same way it
treats indexes, i.e. it drops and recreates the statistics. This means
all statistics are reset, and we no longer try to preserve at least the
functional dependencies. This should not be a major issue in practice,
as the functional dependencies actually rely on per-column statistics,
which were always reset anyway.
Author: Tomas Vondra
Reviewed-by: Justin Pryzby, Dean Rasheed, Zhihong Yu
Discussion: https://postgr.es/m/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
To allow inserts in parallel-mode this feature has to ensure that all the
constraints, triggers, etc. are parallel-safe for the partition hierarchy
which is costly and we need to find a better way to do that. Additionally,
we could have used existing cached information in some cases like indexes,
domains, etc. to determine the parallel-safety.
List of commits reverted, in reverse chronological order:
ed62d3737c Doc: Update description for parallel insert reloption.
c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode.
c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f.
e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f.
e4e87a32cc Fix valgrind issue in commit 05c8482f7f.
05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...".
Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
A couple error messages and comments used 'statistic kind', not the
correct 'statistics kind'. Fix and backpatch all the way back to 10,
where extended statistics were introduced.
Backpatch-through: 10
There is now a per-column COMPRESSION option which can be set to pglz
(the default, and the only option in up until now) or lz4. Or, if you
like, you can set the new default_toast_compression GUC to lz4, and
then that will be the default for new table columns for which no value
is specified. We don't have lz4 support in the PostgreSQL code, so
to use lz4 compression, PostgreSQL must be built --with-lz4.
In general, TOAST compression means compression of individual column
values, not the whole tuple, and those values can either be compressed
inline within the tuple or compressed and then stored externally in
the TOAST table, so those properties also apply to this feature.
Prior to this commit, a TOAST pointer has two unused bits as part of
the va_extsize field, and a compessed datum has two unused bits as
part of the va_rawsize field. These bits are unused because the length
of a varlena is limited to 1GB; we now use them to indicate the
compression type that was used. This means we only have bit space for
2 more built-in compresison types, but we could work around that
problem, if necessary, by introducing a new vartag_external value for
any further types we end up wanting to add. Hopefully, it won't be
too important to offer a wide selection of algorithms here, since
each one we add not only takes more coding but also adds a build
dependency for every packager. Nevertheless, it seems worth doing
at least this much, because LZ4 gets better compression than PGLZ
with less CPU usage.
It's possible for LZ4-compressed datums to leak into composite type
values stored on disk, just as it is for PGLZ. It's also possible for
LZ4-compressed attributes to be copied into a different table via SQL
commands such as CREATE TABLE AS or INSERT .. SELECT. It would be
expensive to force such values to be decompressed, so PostgreSQL has
never done so. For the same reasons, we also don't force recompression
of already-compressed values even if the target table prefers a
different compression method than was used for the source data. These
architectural decisions are perhaps arguable but revisiting them is
well beyond the scope of what seemed possible to do as part of this
project. However, it's relatively cheap to recompress as part of
VACUUM FULL or CLUSTER, so this commit adjusts those commands to do
so, if the configured compression method of the table happens not to
match what was used for some column value stored therein.
Dilip Kumar. The original patches on which this work was based were
written by Ildus Kurbangaliev, and those were patches were based on
even earlier work by Nikita Glukhov, but the design has since changed
very substantially, since allow a potentially large number of
compression methods that could be added and dropped on a running
system proved too problematic given some of the architectural issues
mentioned above; the choice of which specific compression method to
add first is now different; and a lot of the code has been heavily
refactored. More recently, Justin Przyby helped quite a bit with
testing and reviewing and this version also includes some code
contributions from him. Other design input and review from Tomas
Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander
Korotkov, and me.
Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain
Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
With grouping sets, it's possible that some of the grouping sets are
duplicate. This is especially common with CUBE and ROLLUP clauses. For
example GROUP BY CUBE (a,b), CUBE (b,c) is equivalent to
GROUP BY GROUPING SETS (
(a, b, c),
(a, b, c),
(a, b, c),
(a, b),
(a, b),
(a, b),
(a),
(a),
(a),
(c, a),
(c, a),
(c, a),
(c),
(b, c),
(b),
()
)
Some of the grouping sets are calculated multiple times, which is mostly
unnecessary. This commit implements a new GROUP BY DISTINCT feature, as
defined in the SQL standard, which eliminates the duplicate sets.
Author: Vik Fearing
Reviewed-by: Erik Rijkers, Georgios Kokolatos, Tomas Vondra
Discussion: https://postgr.es/m/bf3805a8-d7d1-ae61-fece-761b7ff41ecc@postgresfriends.org
Parallel SELECT can't be utilized for INSERT in the following cases:
- INSERT statement uses the ON CONFLICT DO UPDATE clause
- Target table has a parallel-unsafe: trigger, index expression or
predicate, column default expression or check constraint
- Target table has a parallel-unsafe domain constraint on any column
- Target table is a partitioned table with a parallel-unsafe partition key
expression or support function
The planner is updated to perform additional parallel-safety checks for
the cases listed above, for determining whether it is safe to run INSERT
in parallel-mode with an underlying parallel SELECT. The planner will
consider using parallel SELECT for "INSERT INTO ... SELECT ...", provided
nothing unsafe is found from the additional parallel-safety checks, or
from the existing parallel-safety checks for SELECT.
While checking parallel-safety, we need to check it for all the partitions
on the table which can be costly especially when we decide not to use a
parallel plan. So, in a separate patch, we will introduce a GUC and or a
reloption to enable/disable parallelism for Insert statements.
Prior to entering parallel-mode for the execution of INSERT with parallel
SELECT, a TransactionId is acquired and assigned to the current
transaction state. This is necessary to prevent the INSERT from attempting
to assign the TransactionId whilst in parallel-mode, which is not allowed.
This approach has a disadvantage in that if the underlying SELECT does not
return any rows, then the TransactionId is not used, however that
shouldn't happen in practice in many cases.
Author: Greg Nancarrow, Amit Langote, Amit Kapila
Reviewed-by: Amit Langote, Hou Zhijie, Takayuki Tsunakawa, Antonin Houska, Bharath Rupireddy, Dilip Kumar, Vignesh C, Zhihong Yu, Amit Kapila
Tested-by: Tang, Haiying
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com
Discussion: https://postgr.es/m/CAJcOf-fAdj=nDKMsRhQzndm-O13NY4dL6xGcEvdX5Xvbbi0V7g@mail.gmail.com
Commit 0aa8a01d04 extends the output plugin API to allow decoding of
prepared xacts and allowed the user to enable/disable the two-phase option
via pg_logical_slot_get_changes(). This can lead to a problem such that
the first time when it gets changes via pg_logical_slot_get_changes()
without two_phase option enabled it will not get the prepared even though
prepare is after consistent snapshot. Now next time during getting changes,
if the two_phase option is enabled it can skip prepare because by that
time start decoding point has been moved. So the user will only get commit
prepared.
Allow to enable/disable this option at the create slot time and default
will be false. It will break the existing slots which is fine in a major
release.
Author: Ajin Cherian
Reviewed-by: Amit Kapila and Vignesh C
Discussion: https://postgr.es/m/d0f60d60-133d-bf8d-bd70-47784d8fabf3@enterprisedb.com
This adds a new executor node named TID Range Scan. The query planner
will generate paths for TID Range scans when quals are discovered on base
relations which search for ranges on the table's ctid column. These
ranges may be open at either end. For example, WHERE ctid >= '(10,0)';
will return all tuples on page 10 and over.
To support this, two new optional callback functions have been added to
table AM. scan_set_tidrange is used to set the scan range to just the
given range of TIDs. scan_getnextslot_tidrange fetches the next tuple
in the given range.
For AMs were scanning ranges of TIDs would not make sense, these functions
can be set to NULL in the TableAmRoutine. The query planner won't
generate TID Range Scan Paths in that case.
Author: Edmund Horner, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu
Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
If ExecGetInsertedCols(), ExecGetUpdatedCols() or ExecGetExtraUpdatedCols()
were called with a ResultRelInfo that's not in the range table and isn't a
partition routing target, the functions would dereference a NULL pointer,
relinfo->ri_RootResultRelInfo. Such ResultRelInfos are created when firing
RI triggers in tables that are not modified directly. None of the current
callers of these functions pass such relations, so this isn't a live bug,
but let's make them more robust.
Also update comment in ResultRelInfo; after commit 6214e2b228,
ri_RangeTableIndex is zero for ResultRelInfos created for partition tuple
routing.
Noted by Coverity. Backpatch down to v11, like commit 6214e2b228.
Reviewed-by: Tom Lane, Amit Langote
If a cross-partition UPDATE violates a constraint on the target partition,
and the columns in the new partition are in different physical order than
in the parent, the error message can reveal columns that the user does not
have SELECT permission on. A similar bug was fixed earlier in commit
804b6b6db4.
The cause of the bug is that the callers of the
ExecBuildSlotValueDescription() function got confused when constructing
the list of modified columns. If the tuple was routed from a parent, we
converted the tuple to the parent's format, but the list of modified
columns was grabbed directly from the child's RTE entry.
ExecUpdateLockMode() had a similar issue. That lead to confusion on which
columns are key columns, leading to wrong tuple lock being taken on tables
referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
UPDATE. A new isolation test is added for that corner case.
With this patch, the ri_RangeTableIndex field is no longer set for
partitions that don't have an entry in the range table. Previously, it was
set to the RTE entry of the parent relation, but that was confusing.
NOTE: This modifies the ResultRelInfo struct, replacing the
ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
backpatch, because it breaks any extensions accessing the field. The
change that ri_RangeTableIndex is not set for partitions could potentially
break extensions, too. The ResultRelInfos are visible to FDWs at least,
and this patch required small changes to postgres_fdw. Nevertheless, this
seem like the least bad option. I don't think these fields widely used in
extensions; I don't think there are FDWs out there that uses the FDW
"direct update" API, other than postgres_fdw. If there is, you will get a
compilation error, so hopefully it is caught quickly.
Backpatch to 11, where support for both cross-partition UPDATEs, and unique
indexes on partitioned tables, were added.
Reviewed-by: Amit Langote
Security: CVE-2021-3393
It turns out that the calculation of [Merge]AppendPath.partitioned_rels
in allpaths.c is faulty and sometimes omits relevant non-leaf partitions,
allowing an assertion added by commit a929e17e5a8 to trigger. Rather
than fix that, it seems better to get rid of those fields altogether.
We don't really need the info until create_plan time, and calculating
it once for the selected plan should be cheaper than calculating it
for each append path we consider.
The preceding two commits did away with all use of the partitioned_rels
values; this commit just mechanically removes the fields and the code
that calculated them.
Discussion: https://postgr.es/m/87sg8tqhsl.fsf@aurora.ydns.eu
Discussion: https://postgr.es/m/CAJKUy5gCXDSmFs2c=R+VGgn7FiYcLCsEFEuDNNLGfoha=pBE_g@mail.gmail.com
The SQL standard allows a GRANTED BY clause on GRANT and
REVOKE (privilege) statements that can specify CURRENT_USER or
CURRENT_ROLE. In PostgreSQL, both of these are the default behavior.
Since we already have all the parsing support for this for the
GRANT (role) statement, we might as well add basic support for this
for the privilege variant as well. This allows us to check off SQL
feature T332. In the future, perhaps more interesting things could be
done with this, too.
Reviewed-by: Simon Riggs <simon@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/f2feac44-b4c5-f38f-3699-2851d6a76dc9@2ndquadrant.com
Extends the FDW API to allow batching inserts into foreign tables. That
is usually much more efficient than inserting individual rows, due to
high latency for each round-trip to the foreign server.
It was possible to implement something similar in the regular FDW API,
but it was inconvenient and there were issues with reporting the number
of actually inserted rows etc. This extends the FDW API with two new
functions:
* GetForeignModifyBatchSize - allows the FDW picking optimal batch size
* ExecForeignBatchInsert - inserts a batch of rows at once
Currently, only INSERT queries support batching. Support for DELETE and
UPDATE may be added in the future.
This also implements batching for postgres_fdw. The batch size may be
specified using "batch_size" option both at the server and table level.
The initial patch version was written by me, but it was rewritten and
improved in many ways by Takayuki Tsunakawa.
Author: Takayuki Tsunakawa
Reviewed-by: Tomas Vondra, Amit Langote
Discussion: https://postgr.es/m/20200628151002.7x5laxwpgvkyiu3q@development
Invent new RawParseModes that allow the core grammar to handle
pl/pgsql expressions and assignments directly, and thereby get rid
of a lot of hackery in pl/pgsql's parser. This moves a good deal
of knowledge about pl/pgsql into the core code: notably, we have to
invent a CoercionContext that matches pl/pgsql's (rather dubious)
historical behavior for assignment coercions. That's getting away
from the original idea of pl/pgsql as an arm's-length extension of
the core, but really we crossed that bridge a long time ago.
The main advantage of doing this is that we can now use the core
parser to generate FieldStore and/or SubscriptingRef nodes to handle
assignments to pl/pgsql variables that are records or arrays. That
fixes a number of cases that had never been implemented in pl/pgsql
assignment, such as nested records and array slicing, and it allows
pl/pgsql assignment to support the datatype-specific subscripting
behaviors introduced in commit c7aba7c14.
There are cosmetic benefits too: when a syntax error occurs in a
pl/pgsql expression, the error report no longer includes the confusing
"SELECT" keyword that used to get prefixed to the expression text.
Also, there seem to be some small speed gains.
Discussion: https://postgr.es/m/4165684.1607707277@sss.pgh.pa.us
This patch generalizes the subscripting infrastructure so that any
data type can be subscripted, if it provides a handler function to
define what that means. Traditional variable-length (varlena) arrays
all use array_subscript_handler(), while the existing fixed-length
types that support subscripting use raw_array_subscript_handler().
It's expected that other types that want to use subscripting notation
will define their own handlers. (This patch provides no such new
features, though; it only lays the foundation for them.)
To do this, move the parser's semantic processing of subscripts
(including coercion to whatever data type is required) into a
method callback supplied by the handler. On the execution side,
replace the ExecEvalSubscriptingRef* layer of functions with direct
calls to callback-supplied execution routines. (Thus, essentially
no new run-time overhead should be caused by this patch. Indeed,
there is room to remove some overhead by supplying specialized
execution routines. This patch does a little bit in that line,
but more could be done.)
Additional work is required here and there to remove formerly
hard-wired assumptions about the result type, collation, etc
of a SubscriptingRef expression node; and to remove assumptions
that the subscript values must be integers.
One useful side-effect of this is that we now have a less squishy
mechanism for identifying whether a data type is a "true" array:
instead of wiring in weird rules about typlen, we can look to see
if pg_type.typsubscript == F_ARRAY_SUBSCRIPT_HANDLER. For this
to be bulletproof, we have to forbid user-defined types from using
that handler directly; but there seems no good reason for them to
do so.
This patch also removes assumptions that the number of subscripts
is limited to MAXDIM (6), or indeed has any hard-wired limit.
That limit still applies to types handled by array_subscript_handler
or raw_array_subscript_handler, but to discourage other dependencies
on this constant, I've moved it from c.h to utils/array.h.
Dmitry Dolgov, reviewed at various times by Tom Lane, Arthur Zakirov,
Peter Eisentraut, Pavel Stehule
Discussion: https://postgr.es/m/CA+q6zcVDuGBv=M0FqBYX8DPebS3F_0KQ6OVFobGJPM507_SZ_w@mail.gmail.com
Discussion: https://postgr.es/m/CA+q6zcVovR+XY4mfk-7oNk-rF91gH0PebnNfuUjuuDsyHjOcVA@mail.gmail.com
This GUC was always intended as a temporary solution to help with
finding 9.4-to-9.5 migration issues. Now that all pre-9.5 branches
are out of support, and 9.5 will be too before v14 is released,
it seems like it's okay to drop it. Doing so allows removal of
several hundred lines of poorly-tested code in parse_expr.c,
which have been a fertile source of bugs when people did use this.
Discussion: https://postgr.es/m/2234320.1607117945@sss.pgh.pa.us
This changes CLUSTER and REINDEX so as a parenthesized grammar becomes
possible for options, while unifying the grammar parsing rules for
option lists with the existing ones.
This is a follow-up of the work done in 873ea9e for VACUUM, ANALYZE and
EXPLAIN. This benefits REINDEX for a potential backend-side filtering
for collatable-sensitive indexes and TABLESPACE, while CLUSTER would
benefit from the latter.
Author: Alexey Kondratov, Justin Pryzby
Discussion: https://postgr.es/m/8a8f5f73-00d3-55f8-7583-1375ca8f6a91@postgrespro.ru
As it stood, expandTableLikeClause() re-did the same relation_openrv
call that transformTableLikeClause() had done. However there are
scenarios where this would not find the same table as expected.
We hold lock on the LIKE source table, so it can't be renamed or
dropped, but another table could appear before it in the search path.
This explains the odd behavior reported in bug #16758 when cloning a
table as a temp table of the same name. This case worked as expected
before commit 502898192 introduced the need to open the source table
twice, so we should fix it.
To make really sure we get the same table, let's re-open it by OID not
name. That requires adding an OID field to struct TableLikeClause,
which is a little nervous-making from an ABI standpoint, but as long
as it's at the end I don't think there's any serious risk.
Per bug #16758 from Marc Boeren. Like the previous patch,
back-patch to all supported branches.
Discussion: https://postgr.es/m/16758-840e84a6cfab276d@postgresql.org
For debugging purposes, Path nodes are supposed to have outfuncs
support, but this was overlooked in the original incremental sort patch.
While at it, clean up a couple other minor oversights, as well as
bizarre choice of return type for create_incremental_sort_path().
(All the existing callers just cast it to "Path *" immediately, so
they don't care, but some future caller might care.)
outfuncs.c fix by Zhijie Hou, the rest by me
Discussion: https://postgr.es/m/324c4d81d8134117972a5b1f6cdf9560@G08CNEXMBPEKD05.g08.fujitsu.local
Previously this code assumed that all IndexScan nodes supported
mark/restore, which is not true since it depends on optional index AM
support functions. This could lead to errors about missing support
functions in rare edge cases of mergejoins with no sort keys, where an
unordered non-btree index scan was placed on the inner path without a
protecting Materialize node. (Normally, the fact that merge join
requires ordered input would avoid this error.)
Backpatch all the way since this bug is ancient.
Per report from Eugen Konkov on irc.
Discussion: https://postgr.es/m/87o8jn50be.fsf@news-spur.riddles.org.uk
This feature was added a long time ago, in 7c1e67bd5 and eb121ba2c,
but never documented in any user-facing way. (Documentation added
in 6126d3e70 was commented out almost immediately, in 8272fc3f7.)
That's because, while this syntax is defined by SQL:99, our
implementation is only vaguely related to the standard's semantics.
The standard appears to intend a run-time not parse-time test, and
it definitely intends that the test should understand subtype
relationships.
No one has stepped up to fix that in the intervening years, but
people keep coming across the code and asking why it's not documented.
Let's just get rid of it: if anyone ever wants to make it work per
spec, they can easily recover whatever parts of this code are still
of value from our git history.
If there's anyone out there who's actually using this despite its
undocumented status, they can switch to using pg_typeof() instead,
eg. "pg_typeof(something) = 'mytype'::regtype". That gives
essentially the same semantics as what our IS OF code did.
(We didn't have that function last time this was discussed, or
we would have ripped out IS OF then.)
Discussion: https://postgr.es/m/CAKFQuwZ2pTc-DSkOiTfjauqLYkNREeNZvWmeg12Q-_69D+sYZA@mail.gmail.com
Discussion: https://postgr.es/m/BAY20-F23E9F2B4DAB3E4E88D3623F99B0@phx.gbl
Discussion: https://postgr.es/m/3E7CF81D.1000203@joeconway.com
This is mostly straightforward. However, we disallow replacing
constraint triggers or changing the is-constraint property; perhaps
that can be added later, but the complexity versus benefit tradeoff
doesn't look very good.
Also, no special thought is taken here for whether replacing an
existing trigger should result in changes to queued-but-not-fired
trigger actions. We just document that if you're surprised by the
results, too bad, don't do that. (Note that any such pending trigger
activity would have to be within the current session.)
Takamichi Osumi, reviewed at various times by Surafel Temesgen,
Peter Smith, and myself
Discussion: https://postgr.es/m/0DDF369B45A1B44B8A687ED43F06557C010BC362@G01JPEXMBYT03
The SQL spec calls out nonstandard syntax for certain function calls,
for example substring() with numeric position info is supposed to be
spelled "SUBSTRING(string FROM start FOR count)". We accept many
of these things, but up to now would not print them in the same format,
instead simplifying down to "substring"(string, start, count).
That's long annoyed me because it creates an interoperability
problem: we're gratuitously injecting Postgres-specific syntax into
what might otherwise be a perfectly spec-compliant view definition.
However, the real reason for addressing it right now is to support
a planned change in the semantics of EXTRACT() a/k/a date_part().
When we switch that to returning numeric, we'll have the parser
translate EXTRACT() to some new function name (might as well be
"extract" if you ask me) and then teach ruleutils.c to reverse-list
that per SQL spec. In this way existing calls to date_part() will
continue to have the old semantics.
To implement this, invent a new CoercionForm value COERCE_SQL_SYNTAX,
and make the parser insert that rather than COERCE_EXPLICIT_CALL when
the input has SQL-spec decoration. (But if the input has the form of
a plain function call, continue to mark it COERCE_EXPLICIT_CALL, even
if it's calling one of these functions.) Then ruleutils.c recognizes
COERCE_SQL_SYNTAX as a cue to emit SQL call syntax. It can know
which decoration to emit using hard-wired knowledge about the
functions that could be called this way. (While this solution isn't
extensible without manual additions, neither is the grammar, so this
doesn't seem unmaintainable.) Notice that this solution will
reverse-list a function call with SQL decoration only if it was
entered that way; so dump-and-reload will not by itself produce any
changes in the appearance of views.
This requires adding a CoercionForm field to struct FuncCall.
(I couldn't resist the temptation to rearrange that struct's
field order a tad while I was at it.) FuncCall doesn't appear
in stored rules, so that change isn't a reason for a catversion
bump, but I did one anyway because the new enum value for
CoercionForm fields could confuse old backend code.
Possible future work:
* Perhaps CoercionForm should now be renamed to DisplayForm,
or something like that, to reflect its more general meaning.
This'd require touching a couple hundred places, so it's not
clear it's worth the code churn.
* The SQLValueFunction node type, which was invented partly for
the same goal of improving SQL-compatibility of view output,
could perhaps be replaced with regular function calls marked
with COERCE_SQL_SYNTAX. It's unclear if this would be a net
code savings, however.
Discussion: https://postgr.es/m/42b73d2d-da12-ba9f-570a-420e0cce19d9@phystech.edu
Previously we only tagged on the required information to allow the
executor to perform run-time partition pruning for Append/MergeAppend
nodes belonging to base relations. It was thought that nested
Append/MergeAppend nodes were just about always pulled up into the
top-level Append/MergeAppend and that making the run-time pruning info for
any sub Append/MergeAppend nodes was a waste of time. However, that was
likely badly thought through.
Some examples of cases we're unable to pullup nested Append/MergeAppends
are: 1) Parallel Append nodes with a mix of parallel and non-parallel
paths into a Parallel Append. 2) When planning an ordered Append scan a
sub-partition which is unordered may require a nested MergeAppend path to
ensure sub-partitions don't mix up the order of tuples being fed into the
top-level Append.
Unfortunately, it was not just as simple as removing the lines in
createplan.c which were purposefully not building the run-time pruning
info for anything but RELOPT_BASEREL relations. The code in
add_paths_to_append_rel() was far too sloppy about which partitioned_rels
it included for the Append/MergeAppend paths. The original code there
would always assume accumulate_append_subpath() would pull each sub-Append
and sub-MergeAppend path into the top-level path. While it does not
appear that there were any actual bugs caused by having the additional
partitioned table RT indexes recorded, what it did mean is that later in
planning, when we built the run-time pruning info that we wasted effort
and built PartitionedRelPruneInfos for partitioned tables that we had no
subpaths for the executor to run-time prune.
Here we tighten that up so that partitioned_rels only ever contains the RT
index for partitioned tables which actually have subpaths in the given
Append/MergeAppend. We can now Assert that every PartitionedRelPruneInfo
has a non-empty present_parts. That should allow us to catch any weird
corner cases that have been missed.
In passing, it seems there is no longer a good reason to have the
AppendPath and MergeAppendPath's partitioned_rel fields a List of IntList.
We can simply have a List of Relids instead. This is more compact in
memory and faster to add new members to. We still know which is the root
level partition as these always have a lower relid than their children.
Previously this field was used for more things, but run-time partition
pruning now remains the only user of it and it has no need for a List of
IntLists.
Here we also get rid of the RelOptInfo partitioned_child_rels field. This
is what was previously used to (sometimes incorrectly) set the
Append/MergeAppend path's partitioned_rels field. That was the only usage
of that field, so we can happily just remove it.
I also couldn't resist changing some nearby code to make use of the newly
added for_each_from macro so we can skip the first element in the list
without checking if the current item was the first one on each
iteration.
A bug report from Andreas Kretschmer prompted all this work, however,
after some consideration, I'm not personally classing this as a bug fix.
So no backpatch. In Andreas' test case, it just wasn't that clear that
there was a nested Append since the top-level Append just had a single
sub-path which was pulled up a level, per 8edd0e794.
Author: David Rowley
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/flat/CAApHDvqSchs%2BubdybcfFaSPB%2B%2BEA7kqMaoqajtP0GtZvzOOR3g%40mail.gmail.com
It's unsafe to do this at parse time because addition of generated
columns to a table would not invalidate stored rules containing
UPDATEs on the table ... but there might now be dependent generated
columns that were not there when the rule was made. This also fixes
an oversight that rewriteTargetView failed to update extraUpdatedCols
when transforming an UPDATE on an updatable view. (Since the new
calculation is downstream of that, rewriteTargetView doesn't actually
need to do anything; but before, there was a demonstrable bug there.)
In v13 and HEAD, this leads to easily-visible bugs because (since
commit c6679e4fc) we won't recalculate generated columns that aren't
listed in extraUpdatedCols. In v12 this bitmap is mostly just used
for trigger-firing decisions, so you'd only notice a problem if a
trigger cared whether a generated column had been updated.
I'd complained about this back in May, but then forgot about it
until bug #16671 from Michael Paul Killian revived the issue.
Back-patch to v12 where this field was introduced. If existing
stored rules contain any extraUpdatedCols values, they'll be
ignored because the rewriter will overwrite them, so the bug will
be fixed even for existing rules. (But note that if someone were
to update to 13.1 or 12.5, store some rules with UPDATEs on tables
having generated columns, and then downgrade to a prior minor version,
they might observe issues similar to what this patch fixes. That
seems unlikely enough to not be worth going to a lot of effort to fix.)
Discussion: https://postgr.es/m/10206.1588964727@sss.pgh.pa.us
Discussion: https://postgr.es/m/16671-2fa55851859fb166@postgresql.org