The nested-arrays code path in ExecEvalArrayExpr() used palloc to
allocate the result array, whereas every other array-creating function
has used palloc0 since 18c0b4ecc. This mostly works, but unused bits
past the end of the nulls bitmap may end up undefined. That causes
valgrind complaints with -DWRITE_READ_PARSE_PLAN_TREES, and could
cause planner misbehavior as cited in 18c0b4ecc. There seems no very
good reason why we should strive to avoid palloc0 in just this one case,
so fix it the easy way with s/palloc/palloc0/.
While looking at that I noted that we also failed to check for overflow
of "nbytes" and "nitems" while summing the sizes of the sub-arrays,
potentially allowing a crash due to undersized output allocation.
For "nbytes", follow the policy used by other array-munging code of
checking for overflow after each addition. (As elsewhere, the last
addition of the array's overhead space doesn't need an extra check,
since palloc itself will catch a value between 1Gb and 2Gb.)
For "nitems", there's no very good reason to sum the inputs at all,
since we can perfectly well use ArrayGetNItems' result instead of
ignoring it.
Per discussion of this bug, also remove redundant zeroing of the
nulls bitmap in array_set_element and array_set_slice.
Patch by Alexander Lakhin and myself, per bug #17858 from Alexander
Lakhin; thanks also to Richard Guo. These bugs are a dozen years old,
so back-patch to all supported branches.
Discussion: https://postgr.es/m/17858-8fd287fd3663d051@postgresql.org
We fail to apply updates and deletes when the REPLICA IDENTITY FULL is
used for the table having dropped columns. We didn't use to ignore dropped
columns while doing tuple comparison among the tuples from the publisher
and subscriber during apply of updates and deletes.
Author: Onder Kalaci, Shi yu
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CACawEhVQC9WoofunvXg12aXtbqKnEgWxoRx3+v8q32AWYsdpGg@mail.gmail.com
With unlucky timing and parallel_leader_participation=off (not the
default), PHJ could attempt to access per-batch shared state just as it
was being freed. There was code intended to prevent that by checking
for a cleared pointer, but it was racy. Fix, by introducing an extra
barrier phase. The new phase PHJ_BUILD_RUNNING means that it's safe to
access the per-batch state to find a batch to help with, and
PHJ_BUILD_DONE means that it is too late. The last to detach will free
the array of per-batch state as before, but now it will also atomically
advance the phase, so that late attachers can avoid the hazard. This
mirrors the way per-batch hash tables are freed (see phases
PHJ_BATCH_PROBING and PHJ_BATCH_DONE).
An earlier attempt to fix this (commit 3b8981b6, later reverted) missed
one special case. When the inner side is empty (the "empty inner
optimization), the build barrier would only make it to
PHJ_BUILD_HASHING_INNER phase before workers attempted to detach from
the hashtable. In that case, fast-forward the build barrier to
PHJ_BUILD_RUNNING before proceeding, so that our later assertions hold
and we can still negotiate who is cleaning up.
Revealed by build farm failures, where BarrierAttach() failed a sanity
check assertion, because the memory had been clobbered by dsa_free().
In non-assert builds, the result could be a segmentation fault.
Back-patch to all supported releases.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Michael Paquier <michael@paquier.xyz>
Reported-by: David Geier <geidav.pg@gmail.com>
Tested-by: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/20200929061142.GA29096%40paquier.xyz
We already tried to fix this in commits 3f7323cbb et al (and follow-on
fixes), but now it emerges that there are still unfixed cases;
moreover, these cases affect all branches not only pre-v14. I thought
we had eliminated all cases of making multiple clones of an UPDATE's
target list when we nuked inheritance_planner. But it turns out we
still do that in some partitioned-UPDATE cases, notably including
INSERT ... ON CONFLICT UPDATE, because ExecInitPartitionInfo thinks
it's okay to clone and modify the parent's targetlist.
This fix is based on a suggestion from Andres Freund: let's stop
abusing the ParamExecData.execPlan mechanism, which was only ever
meant to handle initplans, and instead solve the execution timing
problem by having the expression compiler move MULTIEXPR_SUBLINK steps
to the front of their expression step lists. This is feasible because
(a) all branches still in support compile the entire targetlist of
an UPDATE into a single ExprState, and (b) we know that all
MULTIEXPR_SUBLINKs do need to be evaluated --- none could be buried
inside a CASE, for example. There is a minor semantics change
concerning the order of execution of the MULTIEXPR's subquery versus
other parts of the parent targetlist, but that seems like something
we can get away with. By doing that, we no longer need to worry
about whether different clones of a MULTIEXPR_SUBLINK share output
Params; their usage of that data structure won't overlap.
Per bug #17800 from Alexander Lakhin. Back-patch to all supported
branches. In v13 and earlier, we can revert 3f7323cbb and follow-on
fixes; however, I chose to keep the SubPlan.subLinkId field added
in ccbb54c72. We don't need that anymore in the core code, but it's
cheap enough to fill, and removing a plan node field in a minor
release seems like it'd be asking for trouble.
Andres Freund and Tom Lane
Discussion: https://postgr.es/m/17800-ff90866b3906c964@postgresql.org
When an aggregate function is used as a WindowFunc and a tuple transitions
out of the window frame, we ordinarily try to make use of the aggregate
function's inverse transition function to "unaggregate" the exiting tuple.
This optimization is disabled for various cases, including when the
aggregate contains a volatile function. In such a case we'd be unable to
ensure that the transition value was calculated to the same value during
transitions and inverse transitions. Unfortunately, we did this check by
calling contain_volatile_functions() which does not recursively search
SubPlans for volatile functions. If the aggregate function's arguments or
its FILTER clause contained a subplan with volatile functions then we'd
fail to notice this.
Here we fix this by just disabling the optimization when the WindowFunc
contains any subplans. Volatile functions are not the only reason that a
subplan may have nonrepeatable results.
Bug: #17777
Reported-by: Anban Company
Discussion: https://postgr.es/m/17777-860b739b6efde977%40postgresql.org
Reviewed-by: Tom Lane
Backpatch-through: 11
When considering an empty grouping set, we fetched
phasedata->eqfunctions[-1]. Because the eqfunctions array is
palloc'd, that would always be an aset pointer in released versions,
and thus the code accidentally failed to malfunction (since it would
do nothing unless it found a null pointer). Nonetheless this seems
like trouble waiting to happen, so add a check for length == 0.
It's depressing that our valgrind testing did not catch this.
Maybe we should reconsider the choice to not mark that word NOACCESS?
Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-vZuuPOZsKOYnSAaPYGKhmacxhki+vpOKk0O7rymccXQ@mail.gmail.com
The API contract for planstate_tree_walker() callbacks is that they
take a PlanState pointer and a context pointer. Somebody figured
they could save a couple lines of code by ignoring that, and passing
ExecShutdownNode itself as the walker even though it has but one
argument. Somewhat remarkably, we've gotten away with that so far.
However, it seems clear that the upcoming C2x standard means to
forbid such cases, and compilers that actively break such code
likely won't be far behind. So spend the extra few lines of code
to do it honestly with a separate walker function.
In HEAD, we might as well go further and remove ExecShutdownNode's
useless return value. I left that as-is in back branches though,
to forestall complaints about ABI breakage.
Back-patch, with the thought that this might become of practical
importance before our stable branches are all out of service.
It doesn't seem to be fixing any live bug on any currently known
platform, however.
Discussion: https://postgr.es/m/208054.1663534665@sss.pgh.pa.us
Prior to v14, if we have a MULTIEXPR SubPlan (that is, use of the syntax
UPDATE ... SET (c1, ...) = (SELECT ...)) in an UPDATE with an inherited
or partitioned target table, inheritance_planner() will clone the
targetlist and therefore also the MULTIEXPR SubPlan and the Param nodes
referencing it for each child target table. Up to now, we've allowed
all the clones to share the underlying subplan as well as the output
parameter IDs -- that is, the runtime ParamExecData slots. That
technique is borrowed from the far older code that supports initplans,
and it works okay in that case because the cloned SubPlan nodes are
essentially identical. So it doesn't matter which one of the clones
the shared ParamExecData.execPlan field might point to.
However, this fails to hold for MULTIEXPR SubPlans, because they can
have nonempty "args" lists (values to be passed into the subplan), and
those lists could get mutated to different states in the various clones.
In the submitted reproducer, as well as the test case added here, one
clone contains Vars with varno OUTER_VAR where another has INNER_VAR,
because the child tables are respectively on the outer or inner side of
the join. Sharing the execPlan pointer can result in trying to evaluate
an args list that doesn't match the local execution state, with mayhem
ensuing. The result often is to trigger consistency checks in the
executor, but I believe this could end in a crash or incorrect updates.
To fix, assign new Param IDs to each of the cloned SubPlans, so that
they don't share ParamExecData slots at runtime. It still seems fine
for the clones to share the underlying subplan, and extra ParamExecData
slots are cheap enough that this fix shouldn't cost much.
This has been busted since we invented MULTIEXPR SubPlans in 9.5.
Probably the lack of previous reports is because query plans in which
the different clones of a MULTIEXPR mutate to effectively-different
states are pretty rare. There's no issue in v14 and later, because
without inheritance_planner() there's never a reason to clone
MULTIEXPR SubPlans.
Per bug #17596 from Andre Lin. Patch v10-v13 only.
Discussion: https://postgr.es/m/17596-c5357f61427a81dc@postgresql.org
fmgr_sql must make expanded-datum arguments read-only, because
it's possible that the function body will pass the argument to
more than one callee function. If one of those functions takes
the datum's R/W property as license to scribble on it, then later
callees will see an unexpected value, leading to wrong answers.
From a performance standpoint, it'd be nice to skip this in the
common case that the argument value is passed to only one callee.
However, detecting that seems fairly hard, and certainly not
something that I care to attempt in a back-patched bug fix.
Per report from Adam Mackler. This has been broken since we
invented expanded datums, so back-patch to all supported branches.
Discussion: https://postgr.es/m/WScDU5qfoZ7PB2gXwNqwGGgDPmWzz08VdydcPFLhOwUKZcdWbblbo-0Lku-qhuEiZoXJ82jpiQU4hOjOcrevYEDeoAvz6nR0IU4IHhXnaCA=@mackler.email
Discussion: https://postgr.es/m/187436.1660143060@sss.pgh.pa.us
Ordinarily the functions called in this loop ought to have plenty
of CFIs themselves; but we've now seen a case where no such CFI is
reached, making the loop uninterruptible. Even though that's from
a recently-introduced bug, it seems prudent to install a CFI at
the loop level in all branches.
Per discussion of bug #17558 from Andrew Kesper (an actual fix for
that bug will follow).
Discussion: https://postgr.es/m/17558-3f6599ffcf52fd4a@postgresql.org
SPI_commit previously left it up to the caller to recover from any error
occurring during commit. Since that's complicated and requires use of
low-level xact.c facilities, it's not too surprising that no caller got
it right. Let's move the responsibility for cleanup into spi.c. Doing
that requires redefining SPI_commit as starting a new transaction, so
that it becomes equivalent to SPI_commit_and_chain except that you get
default transaction characteristics instead of preserving the prior
transaction's characteristics. We can make this pretty transparent
API-wise by redefining SPI_start_transaction() as a no-op. Callers
that expect to do something in between might be surprised, but
available evidence is that no callers do so.
Having made that API redefinition, we can fix this mess by having
SPI_commit[_and_chain] trap errors and start a new, clean transaction
before re-throwing the error. Likewise for SPI_rollback[_and_chain].
Some cleanup is also needed in AtEOXact_SPI, which was nowhere near
smart enough to deal with SPI contexts nested inside a committing
context.
While plperl and pltcl need no changes beyond removing their now-useless
SPI_start_transaction() calls, plpython needs some more work because it
hadn't gotten the memo about catching commit/rollback errors in the
first place. Such an error resulted in longjmp'ing out of the Python
interpreter, which leaks Python stack entries at present and is reported
to crash Python 3.11 altogether. Add the missing logic to catch such
errors and convert them into Python exceptions.
This is a back-patch of commit 2e517818f. That's now aged long enough
to reduce the concerns about whether it will break something, and we
do need to ensure that supported branches will work with Python 3.11.
Peter Eisentraut and Tom Lane
Discussion: https://postgr.es/m/3375ffd8-d71c-2565-e348-a597d6e739e3@enterprisedb.com
Discussion: https://postgr.es/m/17416-ed8fe5d7213d6c25@postgresql.org
In commit ec62cb0aa, I foolishly replaced ExecEvalWholeRowVar's
lookup_rowtype_tupdesc_domain call with just lookup_rowtype_tupdesc,
because I didn't see how a domain could be involved there, and
there were no regression test cases to jog my memory. But the
existing code was correct, so revert that change and add a test
case showing why it's necessary. (Note: per comment in struct
DatumTupleFields, it is correct to produce an output tuple that's
labeled with the base composite type, not the domain; hence just
blindly looking through the domain is correct here.)
Per bug #17515 from Dan Kubb. Back-patch to v11 where domains over
composites became a thing.
Discussion: https://postgr.es/m/17515-a24737438363aca0@postgresql.org
In commit bf7ca1587, I had the bright idea that we could make the
result of a whole-row Var (that is, foo.*) track any column aliases
that had been applied to the FROM entry the Var refers to. However,
that's not terribly logically consistent, because now the output of
the Var is no longer of the named composite type that the Var claims
to emit. bf7ca1587 tried to handle that by changing the output
tuple values to be labeled with a blessed RECORD type, but that's
really pretty disastrous: we can wind up storing such tuples onto
disk, whereupon they're not readable by other sessions.
The only practical fix I can see is to give up on what bf7ca1587
tried to do, and say that the column names of tuples produced by
a whole-row Var are always those of the underlying named composite
type, query aliases or no. While this introduces some inconsistencies,
it removes others, so it's not that awful in the abstract. What *is*
kind of awful is to make such a behavioral change in a back-patched
bug fix. But corrupt data is worse, so back-patched it will be.
(A workaround available to anyone who's unhappy about this is to
introduce an extra level of sub-SELECT, so that the whole-row Var is
referring to the sub-SELECT's output and not to a named table type.
Then the Var is of type RECORD to begin with and there's no issue.)
Per report from Miles Delahunty. The faulty commit dates to 9.5,
so back-patch to all supported branches.
Discussion: https://postgr.es/m/2950001.1638729947@sss.pgh.pa.us
There are two Asserts in nodeMergejoin.c that are reachable if
the input data is not in the expected order. This seems way too
fragile. Alexander Lakhin reported a case where the assertions
could be triggered with misconfigured foreign-table partitions,
and bitter experience with unstable operating system collation
definitions suggests another easy route to hitting them. Neither
Assert is in a place where we can't afford one more test-and-branch,
so replace 'em with plain test-and-elog logic.
Per bug #17395. While the reported symptom is relatively recent,
collation changes could happen anytime, so back-patch to all
supported branches.
Discussion: https://postgr.es/m/17395-8c326292078d1a57@postgresql.org
Commit 4ace45677 failed to fix the problem fully, because the
same issue of attempting to fetch a non-returnable index column
can occur when rechecking the indexqual after using a lossy index
operator. Moreover, it broke EXPLAIN for such indexquals (which
indicates a gap in our test cases :-().
Revert the code changes of 4ace45677 in favor of adding a new field
to struct IndexOnlyScan, containing a version of the indexqual that
can be executed against the index-returned tuple without using any
non-returnable columns. (The restrictions imposed by check_index_only
guarantee this is possible, although we may have to recompute indexed
expressions.) Support construction of that during setrefs.c
processing by marking IndexOnlyScan.indextlist entries as resjunk
if they can't be returned, rather than removing them entirely.
(We could alternatively require setrefs.c to look up the IndexOptInfo
again, but abusing resjunk this way seems like a reasonably safe way
to avoid needing to do that.)
This solution isn't great from an API-stability standpoint: if there
are any extensions out there that build IndexOnlyScan structs directly,
they'll be broken in the next minor releases. However, only a very
invasive extension would be likely to do such a thing. There's no
change in the Path representation, so typical planner extensions
shouldn't have a problem.
As before, back-patch to all supported branches.
Discussion: https://postgr.es/m/3179992.1641150853@sss.pgh.pa.us
Discussion: https://postgr.es/m/17350-b5bdcf476e5badbb@postgresql.org
An update such as "UPDATE ... SET fld[n].subfld = whatever"
failed if the array elements were domains rather than plain
composites. That's because isAssignmentIndirectionExpr()
failed to cope with the CoerceToDomain node that would appear
in the expression tree in this case. The result would typically
be a crash, and even if we accidentally didn't crash, we'd not
correctly preserve other fields of the same array element.
Per report from Onder Kalaci. Back-patch to v11 where arrays of
domains came in.
Discussion: https://postgr.es/m/PH0PR21MB132823A46AA36F0685B7A29AD8BD9@PH0PR21MB1328.namprd21.prod.outlook.com
Most data-corruption reports mention the location of the problem, but
this one failed to. Add it.
Backpatch all the way back. In 12 and older, also assign the
ERRCODE_DATA_CORRUPTED error code as was done in commit fd6ec93bf890 for
13 and later.
Discussion: https://postgr.es/m/202108191637.oqyzrdtnheir@alvherre.pgsql
COMMIT/ROLLBACK necessarily destroys all snapshots within the session.
The original implementation of intra-procedure transactions just
cavalierly did that, ignoring the fact that this left us executing in
a rather different environment than normal. In particular, it turns
out that handling of toasted datums depends rather critically on there
being an outer ActiveSnapshot: otherwise, when SPI or the core
executor pop whatever snapshot they used and return, it's unsafe to
dereference any toasted datums that may appear in the query result.
It's possible to demonstrate "no known snapshots" and "missing chunk
number N for toast value" errors as a result of this oversight.
Historically this outer snapshot has been held by the Portal code,
and that seems like a good plan to preserve. So add infrastructure
to pquery.c to allow re-establishing the Portal-owned snapshot if it's
not there anymore, and add enough bookkeeping support that we can tell
whether it is or not.
We can't, however, just re-establish the Portal snapshot as part of
COMMIT/ROLLBACK. As in normal transaction start, acquiring the first
snapshot should wait until after SET and LOCK commands. Hence, teach
spi.c about doing this at the right time. (Note that this patch
doesn't fix the problem for any PLs that try to run intra-procedure
transactions without using SPI to execute SQL commands.)
This makes SPI's no_snapshots parameter rather a misnomer, so in HEAD,
rename that to allow_nonatomic.
replication/logical/worker.c also needs some fixes, because it wasn't
careful to hold a snapshot open around AFTER trigger execution.
That code doesn't use a Portal, which I suspect someday we're gonna
have to fix. But for now, just rearrange the order of operations.
This includes back-patching the recent addition of finish_estate()
to centralize the cleanup logic there.
This also back-patches commit 2ecfeda3e into v13, to improve the
test coverage for worker.c (it was that test that exposed that
worker.c's snapshot management is wrong).
Per bug #15990 from Andreas Wicht. Back-patch to v11 where
intra-procedure COMMIT was added.
Discussion: https://postgr.es/m/15990-eee2ac466b11293d@postgresql.org
It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE
list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present.
If it happens, the ON CONFLICT UPDATE code path would end up storing
tuples that include the values of the extra resjunk columns. That's
fairly harmless in the short run, but if new columns are added to
the table then the values would become accessible, possibly leading
to malfunctions if they don't match the datatypes of the new columns.
This had escaped notice through a confluence of missing sanity checks,
including
* There's no cross-check that a tuple presented to heap_insert or
heap_update matches the table rowtype. While it's difficult to
check that fully at reasonable cost, we can easily add assertions
that there aren't too many columns.
* The output-column-assignment cases in execExprInterp.c lacked
any sanity checks on the output column numbers, which seems like
an oversight considering there are plenty of assertion checks on
input column numbers. Add assertions there too.
* We failed to apply nodeModifyTable's ExecCheckPlanOutput() to
the ON CONFLICT UPDATE tlist. That wouldn't have caught this
specific error, since that function is chartered to ignore resjunk
columns; but it sure seems like a bad omission now that we've seen
this bug.
In HEAD, the right way to fix this is to make the processing of
ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists
now do, that is don't add "SET x = x" entries, and use
ExecBuildUpdateProjection to evaluate the tlist and combine it with
old values of the not-set columns. This adds a little complication
to ExecBuildUpdateProjection, but allows removal of a comparable
amount of now-dead code from the planner.
In the back branches, the most expedient solution seems to be to
(a) use an output slot for the ON CONFLICT UPDATE projection that
actually matches the target table, and then (b) invent a variant of
ExecBuildProjectionInfo that can be told to not store values resulting
from resjunk columns, so it doesn't try to store into nonexistent
columns of the output slot. (We can't simply ignore the resjunk columns
altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.)
This works back to v10. In 9.6, projections work much differently and
we can't cheaply give them such an option. The 9.6 version of this
patch works by inserting a JunkFilter when it's necessary to get rid
of resjunk columns.
In addition, v11 and up have the reverse problem when trying to
perform ON CONFLICT UPDATE on a partitioned table. Through a
further oversight, adjust_partition_tlist() discarded resjunk columns
when re-ordering the ON CONFLICT UPDATE tlist to match a partition.
This accidentally prevented the storing-bogus-tuples problem, but
at the cost that MULTIEXPR_SUBLINK cases didn't work, typically
crashing if more than one row has to be updated. Fix by preserving
resjunk columns in that routine. (I failed to resist the temptation
to add more assertions there too, and to do some minor code
beautification.)
Per report from Andres Freund. Back-patch to all supported branches.
Security: CVE-2021-32028
While we were (mostly) careful about ensuring that the dimensions of
arrays aren't large enough to cause integer overflow, the lower bound
values were generally not checked. This allows situations where
lower_bound + dimension overflows an integer. It seems that that's
harmless so far as array reading is concerned, except that array
elements with subscripts notionally exceeding INT_MAX are inaccessible.
However, it confuses various array-assignment logic, resulting in a
potential for memory stomps.
Fix by adding checks that array lower bounds aren't large enough to
cause lower_bound + dimension to overflow. (Note: this results in
disallowing cases where the last subscript position would be exactly
INT_MAX. In principle we could probably allow that, but there's a lot
of code that computes lower_bound + dimension and would need adjustment.
It seems doubtful that it's worth the trouble/risk to allow it.)
Somewhat independently of that, array_set_element() was careless
about possible overflow when checking the subscript of a fixed-length
array, creating a different route to memory stomps. Fix that too.
Security: CVE-2021-32027
If the source and destination partitions don't have identical
rowtypes (for example, one has dropped columns the other lacks),
then the planSlot contents will be different because of that.
If the query has a RETURNING list that tries to return resjunk
columns out of the planSlot, that is columns from tables that
were joined to the target table, we'd get errors or wrong answers.
That's because we used the RETURNING list generated for the
destination partition, which expects a planSlot matching that
partition's subplan.
The most practical fix seems to be to convert the updated destination
tuple back to the source partition's rowtype, and then apply the
RETURNING list generated for the source partition. This avoids making
fragile assumptions about whether the per-subpartition subplans
generated all the resjunk columns in the same order.
This has been broken since v11 introduced cross-partition UPDATE.
The lack of field complaints shows that non-identical partitions
aren't a common case; therefore, don't stress too hard about
making the conversion efficient.
There's no such bug in HEAD, because commit 86dc90056 got rid of
per-target-relation variance in the contents of the planSlot.
Hence, patch v11-v13 only.
Amit Langote and Etsuro Fujita, small changes by me
Discussion: https://postgr.es/m/CA+HiwqE_UK1jTSNrjb8mpTdivzd3dum6mK--xqKq0Y9VmfwWQA@mail.gmail.com
Previously, get_cached_rowtype() cached a pointer to a reference-counted
tuple descriptor from the typcache, relying on the ExprContextCallback
mechanism to release the tupdesc refcount when the expression tree
using the tupdesc was destroyed. This worked fine when it was designed,
but the introduction of within-DO-block COMMITs broke it. The refcount
is logged in a transaction-lifespan resource owner, but plpgsql won't
destroy simple expressions made within the DO block (before its first
commit) until the DO block is exited. That results in a warning about
a leaked tupdesc refcount when the COMMIT destroys the original resource
owner, and then an error about the active resource owner not holding a
matching refcount when the expression is destroyed.
To fix, get rid of the need to have a shutdown callback at all, by
instead caching a pointer to the relevant typcache entry. Those
survive for the life of the backend, so we needn't worry about the
pointer becoming stale. (For registered RECORD types, we can still
cache a pointer to the tupdesc, knowing that it won't change for the
life of the backend.) This mechanism has been in use in plpgsql
and expandedrecord.c since commit 4b93f5799, and seems to work well.
This change requires modifying the ExprEvalStep structs used by the
relevant expression step types, which is slightly worrisome for
back-patching. However, there seems no good reason for extensions
to be familiar with the details of these particular sub-structs.
Per report from Rohit Bhogate. Back-patch to v11 where within-DO-block
COMMITs became a thing.
Discussion: https://postgr.es/m/CAAV6ZkQRCVBh8qAY+SZiHnz+U+FqAGBBDaDTjF2yiKa2nJSLKg@mail.gmail.com
With very unlucky timing and parallel_leader_participation off, PHJ
could attempt to access per-batch state just as it was being freed.
There was code intended to prevent that by checking for a cleared
pointer, but it was buggy.
Fix, by introducing an extra barrier phase. The new phase
PHJ_BUILD_RUNNING means that it's safe to access the per-batch state to
find a batch to help with, and PHJ_BUILD_DONE means that it is too late.
The last to detach will free the array of per-batch state as before, but
now it will also atomically advance the phase at the same time, so that
late attachers can avoid the hazard, without the data race. This
mirrors the way per-batch hash tables are freed (see phases
PHJ_BATCH_PROBING and PHJ_BATCH_DONE).
Revealed by a one-off build farm failure, where BarrierAttach() failed a
sanity check assertion, because the memory had been clobbered by
dsa_free().
Back-patch to 11, where the code arrived.
Reported-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/20200929061142.GA29096%40paquier.xyz
If ExecGetInsertedCols(), ExecGetUpdatedCols() or ExecGetExtraUpdatedCols()
were called with a ResultRelInfo that's not in the range table and isn't a
partition routing target, the functions would dereference a NULL pointer,
relinfo->ri_RootResultRelInfo. Such ResultRelInfos are created when firing
RI triggers in tables that are not modified directly. None of the current
callers of these functions pass such relations, so this isn't a live bug,
but let's make them more robust.
Also update comment in ResultRelInfo; after commit 6214e2b228,
ri_RangeTableIndex is zero for ResultRelInfos created for partition tuple
routing.
Noted by Coverity. Backpatch down to v11, like commit 6214e2b228.
Reviewed-by: Tom Lane, Amit Langote
If a cross-partition UPDATE violates a constraint on the target partition,
and the columns in the new partition are in different physical order than
in the parent, the error message can reveal columns that the user does not
have SELECT permission on. A similar bug was fixed earlier in commit
804b6b6db4.
The cause of the bug is that the callers of the
ExecBuildSlotValueDescription() function got confused when constructing
the list of modified columns. If the tuple was routed from a parent, we
converted the tuple to the parent's format, but the list of modified
columns was grabbed directly from the child's RTE entry.
ExecUpdateLockMode() had a similar issue. That lead to confusion on which
columns are key columns, leading to wrong tuple lock being taken on tables
referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
UPDATE. A new isolation test is added for that corner case.
With this patch, the ri_RangeTableIndex field is no longer set for
partitions that don't have an entry in the range table. Previously, it was
set to the RTE entry of the parent relation, but that was confusing.
NOTE: This modifies the ResultRelInfo struct, replacing the
ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
backpatch, because it breaks any extensions accessing the field. The
change that ri_RangeTableIndex is not set for partitions could potentially
break extensions, too. The ResultRelInfos are visible to FDWs at least,
and this patch required small changes to postgres_fdw. Nevertheless, this
seem like the least bad option. I don't think these fields widely used in
extensions; I don't think there are FDWs out there that uses the FDW
"direct update" API, other than postgres_fdw. If there is, you will get a
compilation error, so hopefully it is caught quickly.
Backpatch to 11, where support for both cross-partition UPDATEs, and unique
indexes on partitioned tables, were added.
Reviewed-by: Amit Langote
Security: CVE-2021-3393
perform_pruning_combine_step() was not taught about the number of
partition indexes used in hash partitioning; more embarrassingly,
get_matching_hash_bounds() also had it wrong. These errors are masked
in the common case where all the partitions have the same modulus
and no partition is missing. However, with missing or unequal-size
partitions, we could erroneously prune some partitions that need
to be scanned, leading to silently wrong query answers.
While a minimal-footprint fix for this could be to export
get_partition_bound_num_indexes and make the incorrect functions use it,
I'm of the opinion that that function should never have existed in the
first place. It's not reasonable data structure design that
PartitionBoundInfoData lacks any explicit record of the length of
its indexes[] array. Perhaps that was all right when it could always
be assumed equal to ndatums, but something should have been done about
it as soon as that stopped being true. Putting in an explicit
"nindexes" field makes both partition_bounds_equal() and
partition_bounds_copy() simpler, safer, and faster than before,
and removes explicit knowledge of the number-of-partition-indexes
rules from some other places too.
This change also makes get_hash_partition_greatest_modulus obsolete.
I left that in place in case any external code uses it, but no core
code does anymore.
Per bug #16840 from Michał Albrycht. Back-patch to v11 where the
hash partitioning code came in. (In the back branches, add the new
field at the end of PartitionBoundInfoData to minimize ABI risks.)
Discussion: https://postgr.es/m/16840-571a22976f829ad4@postgresql.org
When building aggregate expression steps, strict checks need a bailout
jump for when a null value is encountered, so there is a list of steps
that require later adjustment. Adding entries to that list for steps
that aren't actually strict would be harmless, except that there is an
Assert which catches them. This leads to spurious errors on asserts
builds, for data sets that trigger parallel aggregation of an
aggregate with a non-strict deserialization function (no such
aggregates exist in the core system).
Repair by not adding the adjustment entry when it's not needed.
Backpatch back to 11 where the code was introduced.
Per a report from Darafei (Komzpa) of the PostGIS project; analysis
and patch by me.
Discussion: https://postgr.es/m/87mty7peb3.fsf@news-spur.riddles.org.uk
Somebody extended search_plan_tree() to treat MergeAppend exactly
like Append, which is 100% wrong, because unlike Append we can't
assume that only one input node is actively returning tuples.
Hence a cursor using a MergeAppend across a UNION ALL or inheritance
tree could falsely match a WHERE CURRENT OF query at a row that
isn't actually the cursor's current output row, but coincidentally
has the same TID (in a different table) as the current output row.
Delete the faulty code; this means that such a case will now return
an error like 'cursor "foo" is not a simply updatable scan of table
"bar"', instead of silently misbehaving. Users should not find that
surprising though, as the same cursor query could have failed that way
already depending on the chosen plan. (It would fail like that if the
sort were done with an explicit Sort node instead of MergeAppend.)
Expand the clearly-inadequate commentary to be more explicit about
what this code is doing, in hopes of forestalling future mistakes.
It's been like this for awhile, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/482865.1611075182@sss.pgh.pa.us
execCurrent.c's search_plan_tree() assumed that ForeignScanStates
and CustomScanStates necessarily have a valid ss_currentRelation.
This is demonstrably untrue for postgres_fdw's remote join and
remote aggregation plans, and non-leaf custom scans might not have
an identifiable scan relation either. Avoid crashing by ignoring
such nodes when the field is null.
This solution will lead to errors like 'cursor "foo" is not a
simply updatable scan of table "bar"' in cases where maybe we
could have allowed WHERE CURRENT OF to work. That's not an issue
for postgres_fdw's usages, since joins or aggregations would render
WHERE CURRENT OF invalid anyway. But an otherwise-transparent
upper level custom scan node might find this annoying. When and if
someone cares to expend work on such a scenario, we could invent a
custom-scan-provider callback to determine what's safe.
Report and patch by David Geier, commentary by me. It's been like
this for awhile, so back-patch to all supported branches.
Discussion: https://postgr.es/m/0253344d-9bdd-11c4-7f0d-d88c02cd7991@swarm64.com
Previously this code assumed that all IndexScan nodes supported
mark/restore, which is not true since it depends on optional index AM
support functions. This could lead to errors about missing support
functions in rare edge cases of mergejoins with no sort keys, where an
unordered non-btree index scan was placed on the inner path without a
protecting Materialize node. (Normally, the fact that merge join
requires ordered input would avoid this error.)
Backpatch all the way since this bug is ancient.
Per report from Eugen Konkov on irc.
Discussion: https://postgr.es/m/87o8jn50be.fsf@news-spur.riddles.org.uk
If the planner erroneously puts a non-parallel-safe SubPlan into
a parallelized portion of the query tree, nodeSubplan.c will fail
in the worker processes because it finds a null in es_subplanstates,
which it's unable to cope with. It seems worth a test-and-elog to
make that an error case rather than a core dump case.
This probably should have been included in commit 16ebab688, which
was responsible for allowing nulls to appear in es_subplanstates
to begin with. So, back-patch to v10 where that came in.
Discussion: https://postgr.es/m/924226.1604422326@sss.pgh.pa.us
nodeSubplan.c expects that the testexpr for a hashable ANY SubPlan
has the form of one or more OpExprs whose LHS is an expression of the
outer query's, while the RHS is an expression over Params representing
output columns of the subquery. However, the planner only went as far
as verifying that the clauses were all binary OpExprs. This works
99.99% of the time, because the clauses have the right shape when
emitted by the parser --- but it's possible for function inlining to
break that, as reported by PegoraroF10. To fix, teach the planner
to check that the LHS and RHS contain the right things, or more
accurately don't contain the wrong things. Given that this has been
broken for years without anyone noticing, it seems sufficient to just
give up hashing when it happens, rather than go to the trouble of
commuting the clauses back again (which wouldn't necessarily work
anyway).
While poking at that, I also noticed that nodeSubplan.c had a baked-in
assumption that the number of hash clauses is identical to the number
of subquery output columns. Again, that's fine as far as parser output
goes, but it's not hard to break it via function inlining. There seems
little reason for that assumption though --- AFAICS, the only thing
it's buying us is not having to store the number of hash clauses
explicitly. Adding code to the planner to reject such cases would take
more code than getting nodeSubplan.c to cope, so I fixed it that way.
This has been broken for as long as we've had hashable SubPlans,
so back-patch to all supported branches.
Discussion: https://postgr.es/m/1549209182255-0.post@n3.nabble.com
Commit 85c9d347 addressed a similar problem for Gather and Gather
Merge nodes but forgot to account for nodes above parallel nodes. This
still works for nodes above Gather node because we shut down the workers
for Gather node as soon as there are no more tuples. We can do a similar
thing for Gather Merge as well but it seems better to account for stats
during nodes shutdown after completing the execution.
Reported-by: Stéphane Lorek, Jehan-Guillaume de Rorthais
Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Amit Kapila
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/20200718160206.584532a2@firost
In a logical replication subscriber, a table using REPLICA IDENTITY FULL
which has a primary key would try to use the primary key's index
available to scan for a tuple, but an assertion only assumed as correct
the case of an index associated to REPLICA IDENTITY USING INDEX. This
commit corrects the assertion so as the use of a primary key index is a
valid case.
Reported-by: Dilip Kumar
Analyzed-by: Dilip Kumar
Author: Euler Taveira
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w@mail.gmail.com
Backpatch-through: 10
Working on commit 1c455078b led me to check through FunctionCallInvoke
call sites to see if every one was being honest about (a) making sure
that fcinfo.isnull is initially false, and (b) checking its state after
the call. Sure enough, I found some violations.
The main one is that finalize_partialaggregate re-used serialfn_fcinfo
without resetting isnull, even though it clearly intends to cater for
serialfns that return NULL. There would only be an issue with a
non-strict serialfn, since it's unlikely that a serialfn would return
NULL for non-null input. We have no non-strict serialfns in core, and
there may be none in the wild either, which would account for the lack
of complaints. Still, it's clearly wrong, so back-patch that fix to
9.6 where finalize_partialaggregate was introduced.
Also, arrayfuncs.c and rowtypes.c contained various callers that were
not bothering to check for result nulls. While what's being called is
a comparison or hash function that probably *shouldn't* return null,
that's a lousy excuse for not having any check at all. There are
existing places that just Assert(!fcinfo->isnull) in comparable
situations, so I added that to the places that were calling btree
comparison or hash support functions. In the places calling
boolean-returning equality functions, it's quite cheap to have them
treat isnull as FALSE, so make those places do that. Also remove some
"locfcinfo->isnull = false" assignments that are unnecessary given the
assumption that no previous call returned null. These changes seem like
mostly neatnik-ism or debugging support, so I didn't back-patch.
ExecReScanHashJoin will destroy the join's hash table if it expects
that the inner relation will produce different rows on rescan.
Up to now it's not bothered to clear the additional pointer to that
hash table that exists in the child HashState node. However, it's
possible for the query to terminate without building a fresh hash
table (this happens if the outer relation is found to be empty
during the final rescan). So we can end with a dangling pointer
to a deleted hash table. That was harmless originally, but since
9.0 EXPLAIN ANALYZE has used that pointer to print hash table
statistics. In debug builds this reproducibly results in garbage
statistics. In non-debug builds there's frequently no ill effects,
but in principle one could get wrong EXPLAIN ANALYZE output, or
perhaps even a crash if free() has released the hashtable memory
back to the OS.
To fix, just make sure we clear the additional pointer when destroying
the hash table. In problematic cases, EXPLAIN ANALYZE will then print
no hashtable statistics (reverting to its pre-9.0 behavior). This isn't
ideal, but since the problem manifests only in unusual corner cases,
it's hard to justify taking any risks to do better in the back
branches. A follow-on patch will improve matters in HEAD.
Konstantin Knizhnik and Tom Lane, per diagnosis by Thomas Munro
of a trouble report from Alvaro Herrera.
Discussion: https://postgr.es/m/20200323165059.GA24950@alvherre.pgsql
During the reporting of constraint violations for partitioned tables,
ExecPartitionCheckEmitError(), ExecConstraints(),
ExecWithCheckOptions() set the slot descriptor of the input slot to
the root partition's tuple desc. That's generally problematic when
the slot could be used by other routines, but can cause crashes after
the introduction of slots with "fixed" tuple descriptors in
ad7dbee368a.
The problem likely escaped detection so far for two reasons: First,
currently the only known way that these routines are used with a
partitioned table that is not "owned" by partitioning code is when
"fast defaults" are used for the child partition. Second, as an error
is raised afterwards, an "external" slot that had its descriptor
changed, is very unlikely to continue being used.
Even though the issue currently is only known to cause a crash for
11 (as that has both fast defaults and "fixed" slot descriptors), it
seems worth applying the fix to 10 too. Potentially changing random
slots is hazardous.
Regression tests will be added in a separate commit, as it seems best
to add them for master and 12 too.
Reported-By: Daniel WM
Author: Andres Freund
Bug: #16293
Discussion: https://postgr.es/m/16293-26f5777d10143a66@postgresql.org
Backpatch: 11, 10 only
Commit 356687bd8 omitted to remove leftover code for destroying
a hashed subplan's hash tables, with the result that the tables
were always rebuilt not reused; this leads to severe memory
leakage if a hashed subplan is re-executed enough times.
Moreover, the code for reusing the hashnulls table had a typo
that would have made it do the wrong thing if it were reached.
Looking at the code coverage report shows severe under-coverage
of the potential callers of ResetTupleHashTable, so add some test
cases that exercise them.
Andreas Karlsson and Tom Lane, per reports from Ranier Vilela
and Justin Pryzby.
Backpatch to v11, as the faulty commit was.
Discussion: https://postgr.es/m/edb62547-c453-c35b-3ed6-a069e4d6b937@proxel.se
Discussion: https://postgr.es/m/CAEudQAo=DCebm1RXtig9OH+QivpS97sMkikt0A9qHmMUs+g6ZA@mail.gmail.com
Discussion: https://postgr.es/m/20200210032547.GA1412@telsasoft.com
When replica identity is FULL (an admittedly unusual case), the loop
that searches for tuples in execReplication.c didn't stop scanning the
table when once a matching tuple was found. Add the missing 'break'.
Note slight behavior change: we now return the first matching tuple
rather than the last one. They are supposed to be indistinguishable
anyway, so this shouldn't matter.
Author: Konstantin Knizhnik
Discussion: https://postgr.es/m/379743f6-ae91-b866-f7a2-5624e6d2b0a4@postgrespro.ru
Currently, Parallel Hash Join cannot be used for full/right joins,
so there is no point in setting the match flag. It turns out that
the cache coherence traffic generated by those writes slows down
large systems running many-core joins, so let's stop doing that.
In future, if we need to use match bits in parallel joins, we might
want to consider setting them only if not already set.
Back-patch to 11, where Parallel Hash Join arrived.
Reported-by: Deng, Gang
Discussion: https://postgr.es/m/0F44E799048C4849BAE4B91012DB910462E9897A%40SHSMSX103.ccr.corp.intel.com
The code checking whether an aggregate transition value needs to be
reparented into the current context has always only compared the
transition return value with the previous transition value by datum,
i.e. without regard for NULLness. This normally works, because when
the transition function returns NULL (via fcinfo->isnull), it'll
return a value that won't be the same as its input value.
But there's no hard requirement that that's the case. And it turns
out, it's possible to hit this case (see discussion or reproducers),
leading to a non-null transition value not being reparented, followed
by a crash caused by that.
Instead of adding another comparison of NULLness, instead have
ExecAggTransReparent() ensure that pergroup->transValue ends up as 0
when the new transition value is NULL. That avoids having to add an
additional branch to the much more common cases of the transition
function returning the old transition value (which is a pointer in
this case), and when the new value is different, but not NULL.
In branches since 69c3936a149, also deduplicate the reparenting code
between the expression evaluation based transitions, and the path for
ordered aggregates.
Reported-By: Teodor Sigaev, Nikita Glukhov
Author: Andres Freund
Discussion: https://postgr.es/m/bd34e930-cfec-ea9b-3827-a8bc50891393@sigaev.ru
Backpatch: 9.4-, this issue has existed since at least 7.4
Commit 9b63c13f0 turns out to have been fundamentally misguided:
the parent node's subPlan list is by no means the only way in which
a child SubPlan node can be hooked into the outer execution state.
As shown in bug #16213 from Matt Jibson, we can also get short-lived
tuple table slots added to the outer es_tupleTable list. At this point
I have little faith that there aren't other possible connections as
well; the long time it took to notice this problem shows that this
isn't a heavily-exercised situation.
Therefore, revert that fix, returning to the coding that passed a
NULL parent plan pointer down to the transiently-built subexpressions.
That gives us a pretty good guarantee that they won't hook into the
outer executor state in any way. But then we need some other solution
to make SubPlans work. Adopt the solution speculated about in the
previous commit's log message: do expression initialization at plan
startup for just those VALUES rows containing SubPlans, abandoning the
goal of reclaiming memory intra-query for those rows. In practice it
seems unlikely that queries containing a vast number of VALUES rows
would be using SubPlans in them, so this should not give up much.
(BTW, this test case also refutes my claim in connection with the prior
commit that the issue only arises with use of LATERAL. That was just
wrong: some variants of SubLink always produce SubPlans.)
As with previous patch, back-patch to all supported branches.
Discussion: https://postgr.es/m/16213-871ac3bc208ecf23@postgresql.org
A view with conditional INSTEAD rules and no unconditional INSTEAD
rules or INSTEAD OF triggers is not auto-updatable. Previously we
relied on a check in the executor to catch this, but that's
problematic since the planner may fail to properly handle such a query
and thus return a particularly unhelpful error to the user, before
reaching the executor check.
Instead, trap this in the rewriter and report the correct error there.
Doing so also allows us to include more useful error detail than the
executor check can provide. This doesn't change the existing behaviour
of updatable views; it merely ensures that useful error messages are
reported when a view isn't updatable.
Per report from Pengzhou Tang, though not adopting that suggested fix.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAG4reAQn+4xB6xHJqWdtE0ve_WqJkdyCV4P=trYr4Kn8_3_PEA@mail.gmail.com