If we try to run a parallel plan in serial mode because, for example,
it's going to be scanned via a cursor, but for some reason we're
already in parallel mode (for example because an outer query is
running in parallel), we'd incorrectly try to launch workers.
Fix by adding a flag to the EState, so that we can be certain that
ExecutePlan() and ExecGather()/ExecGatherMerge() will have the same
idea about whether we are executing serially or in parallel.
Report and fix by Amit Kapila with help from Kuntal Ghosh. A few
tweaks by me.
Discussion: http://postgr.es/m/CAA4eK1+_BuZrmVCeua5Eqnm4Co9DAXdM5HPAOE2J19ePbR912Q@mail.gmail.com
This is epecially useful in the case or "VARIADIC ANY" functions. The
caller can get the artguments and types regardless of whether or not and
explicit VARIADIC array argument has been used. The function also
provides an option to convert arguments on type "unknown" to to "text".
Michael Paquier and me, reviewed by Tom Lane.
Backpatch to 9.4 in order to support the following json bug fix.
An aggregate's input expression(s) are not supposed to be evaluated
at all for a row where its FILTER test fails ... but commit 8ed3f11bb
overlooked that requirement. Reshuffle so that aggregates having a
filter clause evaluate their arguments separately from those without.
This still gets the benefit of doing only one ExecProject in the
common case of multiple Aggrefs, none of which have filters.
While at it, arrange for filter clauses to be included in the common
ExecProject evaluation, thus perhaps buying a little bit even when
there are filters.
Back-patch to v10 where the bug was introduced.
Discussion: https://postgr.es/m/30065.1508161354@sss.pgh.pa.us
If we merge the transition calculations for two different aggregates,
it's reasonable to assume that the transition function should not care
which of those Aggref structs it gets from AggGetAggref(). It is not
reasonable to make the same assumption about an aggregate final function,
however. Commit 804163bc2 broke this, as it will pass whichever Aggref
was first associated with the transition state in both cases.
This doesn't create an observable bug so far as the core system is
concerned, because the only existing uses of AggGetAggref() are in
ordered-set aggregates that happen to not pay attention to anything
but the input properties of the Aggref; and besides that, we disabled
sharing of transition calculations for OSAs yesterday. Nonetheless,
if some third-party code were using AggGetAggref() in a normal aggregate,
they would be entitled to call this a bug. Hence, back-patch the fix
to 9.6 where the problem was introduced.
In passing, improve some of the comments about transition state sharing.
Discussion: https://postgr.es/m/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A@mail.gmail.com
Up to now async.c has used TransactionIdIsInProgress() to detect whether
a notify message's source transaction is still running. However, that
function has a quick-exit path that reports that XIDs before RecentXmin
are no longer running. If a listening backend is doing nothing but
listening, and not running any queries, there is nothing that will advance
its value of RecentXmin. Once 2 billion transactions elapse, the
RecentXmin check causes active transactions to be reported as not running.
If they aren't committed yet according to CLOG, async.c decides they
aborted and discards their messages. The timing for that is a bit tight
but it can happen when multiple backends are sending notifies concurrently.
The net symptom therefore is that a sufficiently-long-surviving
listen-only backend starts to miss some fraction of NOTIFY traffic,
but only under heavy load.
The only function that updates RecentXmin is GetSnapshotData().
A brute-force fix would therefore be to take a snapshot before
processing incoming notify messages. But that would add cycles,
as well as contention for the ProcArrayLock. We can be smarter:
having taken the snapshot, let's use that to check for running
XIDs, and not call TransactionIdIsInProgress() at all. In this
way we reduce the number of ProcArrayLock acquisitions from one
per message to one per notify interrupt; that's the same under
light load but should be a benefit under heavy load. Light testing
says that this change is a wash performance-wise for normal loads.
I looked around for other callers of TransactionIdIsInProgress()
that might be at similar risk, and didn't find any; all of them
are inside transactions that presumably have already taken a
snapshot.
Problem report and diagnosis by Marko Tiikkaja, patch by me.
Back-patch to all supported branches, since it's been like this
since 9.0.
Discussion: https://postgr.es/m/20170926182935.14128.65278@wrigleys.postgresql.org
The logical decoding functions do BeginInternalSubTransaction and
RollbackAndReleaseCurrentSubTransaction to clean up after themselves.
It turns out that AtEOSubXact_SPI has an unrecognized assumption that
we always need to cancel the active SPI operation in the SPI context
that surrounds the subtransaction (if there is one). That's true
when the RollbackAndReleaseCurrentSubTransaction call is coming from
the SPI-using function itself, but not when it's happening inside
some unrelated function invoked by a SPI query. In practice the
affected callers are the various PLs.
To fix, record the current subtransaction ID when we begin a SPI
operation, and clean up only if that ID is the subtransaction being
canceled.
Also, remove AtEOSubXact_SPI's assertion that it must have cleaned
up the surrounding SPI context's active tuptable. That's proven
wrong by the same test case.
Also clarify (or, if you prefer, reinterpret) the calling conventions
for _SPI_begin_call and _SPI_end_call. The memory context cleanup
in the latter means that these have always had the flavor of a matched
resource-management pair, but they weren't documented that way before.
Per report from Ben Chobot.
Back-patch to 9.4 where logical decoding came in. In principle,
the SPI changes should go all the way back, since the problem dates
back to commit 7ec1c5a86. But given the lack of field complaints
it seems few people are using internal subtransactions in this way.
So I don't feel a need to take any risks in 9.2/9.3.
Discussion: https://postgr.es/m/73FBA179-C68C-4540-9473-71E865408B15@silentmedia.com
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken. This
breaks HOT (and probably other things). A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed. This can be seen because
index creation (or REINDEX) complain like
ERROR: XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"
Because of relfrozenxid contraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.
This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.
Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax. But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best. Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.
I didn't optimize the new function for performance. The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.
This is a followup after 20b655224249 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.
Reported-by: Peter Geoghegan
Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
Yi Wen Wong, Álvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
Commit 597a87ccc introduced a latch pointer variable to replace use
of a long-lived shared latch in the shared WalRcvData structure.
This was not well thought out, because there are now hazards of the
pointer variable changing while it's being inspected by another
process. This could obviously lead to a core dump in code like
if (WalRcv->latch)
SetLatch(WalRcv->latch);
and there's a more remote risk of a torn read, if we have any
platforms where reading/writing a pointer is not atomic.
An actual problem would occur only if the walreceiver process
exits (gracefully) while the startup process is trying to
signal it, but that seems well within the realm of possibility.
To fix, treat the pointer variable (not the referenced latch)
as being protected by the WalRcv->mutex spinlock. There
remains a race condition that we could apply SetLatch to a
process latch that no longer belongs to the walreceiver, but
I believe that's harmless: at worst it'd cause an extra wakeup
of the next process to use that PGPROC structure.
Back-patch to v10 where the faulty code was added.
Discussion: https://postgr.es/m/22735.1507048202@sss.pgh.pa.us
This reverts commit 15bc038f9, along with the followon commits 1635e80d3
and 984c92074 that tried to clean up the problems exposed by bug #14825.
The result was incomplete because it failed to address parallel-query
requirements. With 10.0 release so close upon us, now does not seem like
the time to be adding more code to fix that. I hope we can un-revert this
code and add the missing parallel query support during the v11 cycle.
Back-patch to v10.
Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside
transaction blocks, by disallowing the use of the added value later
in the same transaction, except under limited circumstances. However,
the test for "limited circumstances" was heuristic and could reject
references to enum values that were created during CREATE TYPE AS ENUM,
not just later. This breaks the use-case of restoring pg_dump scripts
in a single transaction, as reported in bug #14825 from Balazs Szilfai.
We can improve this by keeping a "blacklist" table of enum value OIDs
created by ALTER TYPE ADD VALUE during the current transaction. Any
visible-but-uncommitted value whose OID is not in the blacklist must
have been created by CREATE TYPE AS ENUM, and can be used safely
because it could not have a lifespan shorter than its parent enum type.
This change also removes the restriction that a renamed enum value
can't be used before being committed (unless it was on the blacklist).
Andrew Dunstan, with cosmetic improvements by me.
Back-patch to v10.
Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
On Linux, shared memory segments created with shm_open() are backed by
swap files created in tmpfs. If the swap file needs to be extended,
but there's no tmpfs space left, you get a very unfriendly SIGBUS trap.
To avoid this, force allocation of the full request size when we create
the segment. This adds a few cycles, but none that we wouldn't expend
later anyway, assuming the request isn't hugely bigger than the actual
need.
Make this code #ifdef __linux__, because (a) there's not currently a
reason to think the same problem exists on other platforms, and (b)
applying posix_fallocate() to an FD created by shm_open() isn't very
portable anyway.
Back-patch to 9.4 where the DSM code came in.
Thomas Munro, per a bug report from Amul Sul
Discussion: https://postgr.es/m/1002664500.12301802.1471008223422.JavaMail.yahoo@mail.yahoo.com
The standard says that all changes of the same kind (insert, update, or
delete) caused in one table by a single SQL statement should be reported
in a single transition table; and by that, they mean to include foreign key
enforcement actions cascading from the statement's direct effects. It's
also reasonable to conclude that if the standard had wCTEs, they would say
that effects of wCTEs applying to the same table as each other or the outer
statement should be merged into one transition table. We weren't doing it
like that.
Hence, arrange to merge tuples from multiple update actions into a single
transition table as much as we can. There is a problem, which is that if
the firing of FK enforcement triggers and after-row triggers with
transition tables is interspersed, we might need to report more tuples
after some triggers have already seen the transition table. It seems like
a bad idea for the transition table to be mutable between trigger calls.
There's no good way around this without a major redesign of the FK logic,
so for now, resolve it by opening a new transition table each time this
happens.
Also, ensure that AFTER STATEMENT triggers fire just once per statement,
or once per transition table when we're forced to make more than one.
Previous versions of Postgres have allowed each FK enforcement query
to cause an additional firing of the AFTER STATEMENT triggers for the
referencing table, but that's certainly not per spec. (We're still
doing multiple firings of BEFORE STATEMENT triggers, though; is that
something worth changing?)
Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature.
The transition tables are now managed by the AfterTriggers data structures,
rather than being directly the responsibility of ModifyTable nodes. This
removes a subtransaction-lifespan memory leak introduced by my previous
band-aid patch 3c4359521.
In passing, refactor the AfterTriggers data structures to reduce the
management overhead for them, by using arrays of structs rather than
several parallel arrays for per-query-level and per-subtransaction state.
I failed to resist the temptation to do some copy-editing on the SGML
docs about triggers, above and beyond merely documenting the effects
of this patch.
Back-patch to v10, because we don't want the semantics of transition
tables to change post-release.
Patch by me, with help and review from Thomas Munro.
Discussion: https://postgr.es/m/20170909064853.25630.12825@wrigleys.postgresql.org
The order in which GRANTs are output is important as GRANTs which have
been GRANT'd by individuals via WITH GRANT OPTION GRANTs have to come
after the GRANT which included the WITH GRANT OPTION. This happens
naturally in the backend during normal operation as we only change
existing ACLs in-place, only add new ACLs to the end, and when removing
an ACL we remove any which depend on it also.
Also, adjust the comments in acl.h to make this clear.
Unfortunately, the updates to pg_dump to handle initial privileges
involved pulling apart ACLs and then combining them back together and
could end up putting them back together in an invalid order, leading to
dumps which wouldn't restore.
Fix this by adjusting the queries used by pg_dump to ensure that the
ACLs are rebuilt in the same order in which they were originally.
Back-patch to 9.6 where the changes for initial privileges were done.
This doesn't allow routing tuple to the foreign partitions themselves,
but it permits tuples to be routed to regular partitions despite the
presence of foreign partitions in the same inheritance hierarchy.
Etsuro Fujita, reviewed by Amit Langote and by me.
Discussion: http://postgr.es/m/bc3db4c1-1693-3b8a-559f-33ad2b50b7ad@lab.ntt.co.jp
The NAMEDTUPLESTORE patch piggybacked on the infrastructure for
TABLEFUNC/VALUES/CTE RTEs, none of which can ever have dropped columns,
so the possibility was ignored most places. Fix that, including adding a
specification to parsenodes.h about what it's supposed to look like.
In passing, clean up assorted comments that hadn't been maintained
properly by said patch.
Per bug #14799 from Philippe Beaudoin. Back-patch to v10.
Discussion: https://postgr.es/m/20170906120005.25630.84360@wrigleys.postgresql.org
Move the responsibility for creating/destroying TupleQueueReaders into
execParallel.c, to avoid duplicative coding in nodeGather.c and
nodeGatherMerge.c. Also, instead of having DestroyTupleQueueReader do
shm_mq_detach, do it in the caller (which is now only ExecParallelFinish).
This means execParallel.c does both the attaching and detaching of the
tuple-queue-reader shm_mqs, which seems less weird than the previous
arrangement.
These changes also eliminate a vestigial memory leak (of the pei->tqueue
array). It's now demonstrable that rescans of Gather or GatherMerge don't
leak memory.
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
Instead of using a cast to force the constant to be the right width,
assume we can plaster on an L, UL, LL, or ULL suffix as appropriate.
The old approach to this is very hoary, dating from before we were
willing to require compilers to have working int64 types.
This fix makes the PG_INT64_MIN, PG_INT64_MAX, and PG_UINT64_MAX
constants safe to use in preprocessor conditions, where a cast
doesn't work. Other symbolic constants that might be defined using
[U]INT64CONST are likewise safer than before.
Also fix the SIZE_MAX macro to be similarly safe, if we are forced
to provide a definition for that. The test added in commit 2e70d6b5e
happens to do what we want even with the hack "(size_t) -1" definition,
but we could easily get burnt on other tests in future.
Back-patch to all supported branches, like the previous commits.
Discussion: https://postgr.es/m/15883.1504278595@sss.pgh.pa.us
Pre-C99 platforms may lack <stdint.h> and thereby SIZE_MAX. We have
a couple of places using the hack "(size_t) -1" as a fallback, but
it wasn't universally available; which means the code added in commit
2e70d6b5e fails to compile everywhere. Move that hack to c.h so that
we can rely on having SIZE_MAX everywhere.
Per discussion, it'd be a good idea to make the macro's value safe
for use in #if-tests, but that will take a bit more work. This is
just a quick expedient to get the buildfarm green again.
Back-patch to all supported branches, like the previous commit.
Discussion: https://postgr.es/m/15883.1504278595@sss.pgh.pa.us
Commit 9915de6c1cb2 changed the default behavior of
DROP_REPLICATION_SLOT so that it would wait until any session holding
the slot active would release it, instead of raising an error. But
users are already depending on the original behavior, so revert to it by
default and add a WAIT option to invoke the new behavior.
Per complaint from Simone Gotti, in
Discussion: https://postgr.es/m/CAEvsy6Wgdf90O6pUvg2wSVXL2omH5OPC-38OD4Zzgk-FXavj3Q@mail.gmail.com
Rescanning a GatherMerge led to leaking some memory in the executor's
query-lifespan context, because most of the node's working data structures
were simply abandoned and rebuilt from scratch. In practice, this might
never amount to much, given the cost of relaunching worker processes ---
but it's still pretty messy, so let's fix it.
We can rearrange things so that the tuple arrays are simply cleared and
reused, and we don't need to rebuild the TupleTableSlots either, just
clear them. One small complication is that because we might get a
different number of workers on each iteration, we can't keep the old
convention that the leader's gm_slots[] entry is the last one; the leader
might clobber a TupleTableSlot that we need for a worker in a future
iteration. Hence, adjust the logic so that the leader has slot 0 always,
while the active workers have slots 1..n.
Back-patch to v10 to keep all the existing versions of nodeGatherMerge.c
in sync --- because of the renumbering of the slots, there would otherwise
be a very large risk that any future backpatches in this module would
introduce bugs.
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
The logic around shm_mq_detach was a few bricks shy of a load, because
(contrary to the comments for shm_mq_attach) all it did was update the
shared shm_mq state. That left us leaking a bit of process-local
memory, but much worse, the on_dsm_detach callback for shm_mq_detach
was still armed. That means that whenever we ultimately detach from
the DSM segment, we'd run shm_mq_detach again for already-detached,
possibly long-dead queues. This accidentally fails to fail today,
because we only ever re-use a shm_mq's memory for another shm_mq, and
multiple detach attempts on the last such shm_mq are fairly harmless.
But it's gonna bite us someday, so let's clean it up.
To do that, change shm_mq_detach's API so it takes a shm_mq_handle
not the underlying shm_mq. This makes the callers simpler in most
cases anyway. Also fix a few places in parallel.c that were just
pfree'ing the handle structs rather than doing proper cleanup.
Back-patch to v10 because of the risk that the revenant shm_mq_detach
callbacks would cause a live bug sometime. Since this is an API
change, it's too late to do it in 9.6. (We could make a variant
patch that preserves API, but I'm not excited enough to do that.)
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
Comment the fields of GatherMergeState, and organize them a bit more
sensibly. Comment GMReaderTupleBuffer more usefully too. Improve
assorted other comments that were obsolete or just not very good English.
Get rid of the use of a GMReaderTupleBuffer for the leader process;
that was confusing, since only the "done" field was used, and that
in a way redundant with need_to_scan_locally.
In gather_merge_init, avoid calling load_tuple_array for
already-known-exhausted workers. I'm not sure if there's a live bug there,
but the case is unlikely to be well tested due to timing considerations.
Remove some useless code, such as duplicating the tts_isempty test done by
TupIsNull.
Remove useless initialization of ps.qual, replacing that with an assertion
that we have no qual to check. (If we did, the code would fail to check
it.)
Avoid applying heap_copytuple to a null tuple. While that fails to crash,
it's confusing and it makes the code less legible not more so IMO.
Propagate a couple of these changes into nodeGather.c, as well.
Back-patch to v10, partly because of the possibility that the
gather_merge_init change is fixing a live bug, but mostly to keep
the branches in sync to ease future bug fixes.
Previously, the parallel executor logic did reinitialization of shared
state within the ExecReScan code for parallel-aware scan nodes. This is
problematic, because it means that the ExecReScan call has to occur
synchronously (ie, during the parent Gather node's ReScan call). That is
swimming very much against the tide so far as the ExecReScan machinery is
concerned; the fact that it works at all today depends on a lot of fragile
assumptions, such as that no plan node between Gather and a parallel-aware
scan node is parameterized. Another objection is that because ExecReScan
might be called in workers as well as the leader, hacky extra tests are
needed in some places to prevent unwanted shared-state resets.
Hence, let's separate this code into two functions, a ReInitializeDSM
call and the ReScan call proper. ReInitializeDSM is called only in
the leader and is guaranteed to run before we start new workers.
ReScan is returned to its traditional function of resetting only local
state, which means that ExecReScan's usual habits of delaying or
eliminating child rescan calls are safe again.
As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary
to make these changes in 9.6, which is a good thing because the FDW and
CustomScan APIs are impacted.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
The ExecReScan machinery contains various optimizations for postponing
or skipping rescans of plan subtrees; for example a HashAgg node may
conclude that it can re-use the table it built before, instead of
re-reading its input subtree. But that is wrong if the input contains
a parallel-aware table scan node, since the portion of the table scanned
by the leader process is likely to vary from one rescan to the next.
This explains the timing-dependent buildfarm failures we saw after
commit a2b70c89c.
The established mechanism for showing that a plan node's output is
potentially variable is to mark it as depending on some runtime Param.
Hence, to fix this, invent a dummy Param (one that has a PARAM_EXEC
parameter number, but carries no actual value) associated with each Gather
or GatherMerge node, mark parallel-aware nodes below that node as dependent
on that Param, and arrange for ExecReScanGather[Merge] to flag that Param
as changed whenever the Gather[Merge] node is rescanned.
This solution breaks an undocumented assumption made by the parallel
executor logic, namely that all rescans of nodes below a Gather[Merge]
will happen synchronously during the ReScan of the top node itself.
But that's fundamentally contrary to the design of the ExecReScan code,
and so was doomed to fail someday anyway (even if you want to argue
that the bug being fixed here wasn't a failure of that assumption).
A follow-on patch will address that issue. In the meantime, the worst
that's expected to happen is that given very bad timing luck, the leader
might have to do all the work during a rescan, because workers think
they have nothing to do, if they are able to start up before the eventual
ReScan of the leader's parallel-aware table scan node has reset the
shared scan state.
Although this problem exists in 9.6, there does not seem to be any way
for it to manifest there. Without GatherMerge, it seems that a plan tree
that has a rescan-short-circuiting node below Gather will always also
have one above it that will short-circuit in the same cases, preventing
the Gather from being rescanned. Hence we won't take the risk of
back-patching this change into 9.6. But v10 needs it.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
2cd70845240 / c6293249d change the way individual attributes in a
TupleDesc are stored / accessed. To reduce the effort of making
extensions compatible with postgresql 11, and to ease future
backpatching, backpatch introduction of TupleDescAttr() to all
releases. Do not backpatch change in storage, as that'd be a breaking
change for existing and working extensions.
Author: Andres Freund
Discussion: https://postgr.es/m/20170820181723.tdswdinzptbcwhrr@alap3.anarazel.de
Backpatch: 9.2-
Add a new EState member es_leaf_result_relations, so that the trigger
code knows about ResultRelInfos created by tuple routing. Also make
sure ExplainPrintTriggers knows about partition-related
ResultRelInfos.
Etsuro Fujita, reviewed by Amit Langote
Discussion: http://postgr.es/m/57163e18-8e56-da83-337a-22f2c0008051@lab.ntt.co.jp
Instead, lock them in the caller using find_all_inheritors so that
they get locked in the standard order, minimizing deadlock risks.
Also in RelationGetPartitionDispatchInfo, avoid opening tables which
are not partitioned; there's no need.
Amit Langote, reviewed by Ashutosh Bapat and Amit Khandekar
Discussion: http://postgr.es/m/91b36fa1-c197-b72f-ca6e-56c593bae68c@lab.ntt.co.jp
The API for WaitLatch and friends followed the Unix convention in which
waiting for a socket connection to complete is identical to waiting for
the socket to accept a write. While Windows provides a select(2)
emulation that agrees with that, the native WaitForMultipleObjects API
treats them as quite different --- and for some bizarre reason, it will
report a not-yet-connected socket as write-ready. libpq itself has so
far escaped dealing with this because it waits with select(), but in
libpqwalreceiver.c we want to wait using WaitLatchOrSocket. The semantics
mismatch resulted in replication connection failures on Windows, but only
for remote connections (apparently, localhost connections complete
immediately, or at least too fast for anyone to have noticed the problem
in single-machine testing).
To fix, introduce an additional WL_SOCKET_CONNECTED wait flag for
WaitLatchOrSocket, which is identical to WL_SOCKET_WRITEABLE on
non-Windows, but results in waiting for FD_CONNECT events on Windows.
Ideally, we would also distinguish the two conditions in the API for
PQconnectPoll(), but changing that API at this point seems infeasible.
Instead, cheat by checking for PQstatus() == CONNECTION_STARTED to
determine that we're still waiting for the connection to complete.
(This is a cheat mainly because CONNECTION_STARTED is documented as an
internal state rather than something callers should rely on. Perhaps
we ought to change the documentation ... but this patch doesn't.)
Per reports from Jobin Augustine and Igor Neyman. Back-patch to v10
where commit 1e8a85009 exposed this longstanding shortcoming.
Andres Freund, minor fix and some code review/beautification by me
Discussion: https://postgr.es/m/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1. We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds. There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.
However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.
There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run. That's pretty scary though: it could easily create
more problems than it solves. Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.
There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing. We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it. It hasn't found any bugs for us in many years.
Per report from Jeevan Chalke. Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
The previous message didn't mention the name of the table or the
bounds. Put the table name in the primary error message and the
bounds in the detail message.
Amit Langote, changed slightly by me. Suggestions on the exac
phrasing from Tom Lane, David G. Johnston, and Dean Rasheed.
Discussion: http://postgr.es/m/CA+Tgmoae6bpwVa-1BMaVcwvCCeOoJ5B9Q9-RHWo-1gJxfPBZ5Q@mail.gmail.com
Similar to what was fixed in commit 9915de6c1cb2 for replication slots,
but this time it's related to replication origins: DROP SUBSCRIPTION
attempts to drop the replication origin, but that fails if the
replication worker process hasn't yet marked it unused. This causes
failures in the buildfarm:
ERROR: could not drop replication origin with OID 1, in use by PID 34069
Like the aforementioned commit, fix by having the process running DROP
SUBSCRIPTION sleep until the worker marks the the replication origin
struct as free. This uses a condition variable on each replication
origin shmem state struct, so that the session trying to drop can sleep
and expect to be awakened by the process keeping the origin open.
Also fix a SGML markup in the previous commit.
Discussion: https://postgr.es/m/20170808001433.rozlseaf4m2wkw3n@alvherre.pgsql
In commit 9915de6c1cb2, we introduced a new wait point for replication
slots and incorrectly labelled it as wait event PG_WAIT_LOCK. That's
wrong, so invent an appropriate new wait event instead, and document it
properly.
While at it, fix numerous other problems in the vicinity:
- two different walreceiver wait events were being mixed up in a single
wait event (which wasn't documented either); split it out so that they
can be distinguished, and document the new events properly.
- ParallelBitmapPopulate was documented but didn't exist.
- ParallelBitmapScan was not documented (I think this should be called
"ParallelBitmapScanInit" instead.)
- Logical replication wait events weren't documented
- various symbols had been added in dartboard order in various places.
Put them in alphabetical order instead, as was originally intended.
Discussion: https://postgr.es/m/20170808181131.mu4fjepuh5m75cyq@alvherre.pgsql
Per a report from AP, it's not that hard to exhaust the supply of
bitmap pages if you create a table with a hash index and then insert a
few billion rows - and then you start getting errors when you try to
insert additional rows. In the particular case reported by AP,
there's another fix that we can make to improve recycling of overflow
pages, which is another way to avoid the error, but there may be other
cases where this problem happens and that fix won't help. So let's
buy ourselves as much headroom as we can without rearchitecting
anything.
The comments claim that the old limit was 64GB, but it was really
only 32GB, because we didn't use all the bits in the page for bitmap
bits - only the largest power of 2 that could fit after deducting
space for the page header and so forth. Thus, we have 4kB per page
for bitmap bits, not 8kB. The new limit is thus actually 8 times the
old *real* limit but only 4 times the old *purported* limit.
Since this breaks on-disk compatibility, bump HASH_VERSION. We've
already done this earlier in this release cycle, so this doesn't cause
any incremental inconvenience for people using pg_upgrade from
releases prior to v10. However, users who use pg_upgrade to reach
10beta3 or later from 10beta2 or earlier will need to REINDEX any hash
indexes again.
Amit Kapila and Robert Haas
Discussion: http://postgr.es/m/20170704105728.mwb72jebfmok2nm2@zip.com.au
Otherwise, partitioned tables with RETURNING expressions or subject
to a WITH CHECK OPTION do not work properly.
Amit Langote, reviewed by Amit Khandekar and Etsuro Fujita. A few
comment changes by me.
Discussion: http://postgr.es/m/9a39df80-871e-6212-0684-f93c83be4097@lab.ntt.co.jp
XLByteToSeg and XLByteToPrevSeg calculate only a segment number. The
definition of these macros were modified by commit
dfda6ebaec6763090fb78b458a979b558c50b39b but the comment remain
unchanged.
Patch by Yugo Nagata. Back patched to 9.3 and beyond.
1024 bits is considered weak these days, but OpenSSL always passes 1024 as
the key length to the tmp_dh callback. All the code to handle other key
lengths is, in fact, dead.
To remedy those issues:
* Only include hard-coded 2048-bit parameters.
* Set the parameters directly with SSL_CTX_set_tmp_dh(), without the
callback
* The name of the file containing the DH parameters is now a GUC. This
replaces the old hardcoded "dh1024.pem" filename. (The files for other
key lengths, dh512.pem, dh2048.pem, etc. were never actually used.)
This is not a new problem, but it doesn't seem worth the risk and churn to
backport. If you care enough about the strength of the DH parameters on
old versions, you can create custom DH parameters, with as many bits as you
wish, and put them in the "dh1024.pem" file.
Per report by Nicolas Guini and Damian Quiroga. Reviewed by Michael Paquier.
Discussion: https://www.postgresql.org/message-id/CAMxBoUyjOOautVozN6ofzym828aNrDjuCcOTcCquxjwS-L2hGQ@mail.gmail.com
This allows us to add stack-depth checks the first time an executor
node is called, and skip that overhead on following
calls. Additionally it yields a nice speedup.
While it'd probably have been a good idea to have that check all
along, it has become more important after the new expression
evaluation framework in b8d7f053c5c2bf2a7e - there's no stack depth
check in common paths anymore now. We previously relied on
ExecEvalExpr() being executed somewhere.
We should move towards that model for further routines, but as this is
required for v10, it seems better to only do the necessary (which
already is quite large).
Author: Andres Freund, Tom Lane
Reported-By: Julien Rouhaud
Discussion:
https://postgr.es/m/22833.1490390175@sss.pgh.pa.ushttps://postgr.es/m/b0af9eaa-130c-60d0-9e4e-7a135b1e0c76@dalibo.com