The file handling functions from fd.c were called with a diverse mix of
notations for the file permissions when they were opening new files.
Almost all files created by the server should have the same permissions
set. So change the API so that e.g. OpenTransientFile() automatically
uses the standard permissions set, and OpenTransientFilePerm() is a new
function that takes an explicit permissions set for the few cases where
it is needed. This also saves an unnecessary argument for call sites
that are just opening an existing file.
While we're reviewing these APIs, get rid of the FileName typedef and
use the standard const char * for the file name and mode_t for the file
mode. This makes these functions match other file handling functions
and removes an unnecessary layer of mysteriousness. We can also get rid
of a few casts that way.
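A minimal sketch of the resulting call conventions (illustrative, not
the committed diff):

    /* opening an existing file no longer needs a mode argument */
    fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);

    /* the few call sites needing explicit permissions use the new variant */
    fd = OpenTransientFilePerm(path, O_RDWR | O_CREAT | PG_BINARY,
                               S_IRUSR | S_IWUSR);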
Author: David Steele <david@pgmasters.net>
Tuples can have type RECORDOID and a typmod number that identifies a blessed
TupleDesc in a backend-private cache. To support the sharing of such tuples
through shared memory and temporary files, provide a typmod registry in
shared memory.
To achieve that, introduce per-session DSM segments, created on demand when a
backend first runs a parallel query. The per-session DSM segment has a
table-of-contents just like the per-query DSM segment, and initially the
contents are a shared record typmod registry and a DSA area to provide the
space it needs to grow.
State relating to the current session is accessed via a Session object
reached through the global variable CurrentSession. This arrangement
may require significant redesign further down the road as we figure out
what else needs to be shared or remodelled.
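A rough sketch of how this is reached (function names per this commit,
as I understand them; treat as illustrative):

    /* Leader: create the per-session segment on first use and get its
     * handle to pass to workers. */
    dsm_handle  session_dsm = GetSessionDsmHandle();

    /* Worker: attach, making the shared record typmod registry usable. */
    AttachSession(session_dsm);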
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
Commit 0e141c0fbb211bdd23783afa731e3eef95c9ad7a introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around. A previous attempt to introduce a similar
mechanism for CLogControlLock in ccce90b398673d55b0387b3de66639b1b30d451b
crashed and burned, but the design problem which resulted in those
failures is believed to have been corrected in this version.
Amit Kapila, with some cosmetic changes by me. See the previous commit
message for additional credits.
Discussion: http://postgr.es/m/CAA4eK1KudxzgWhuywY_X=yeSAhJMT4DwCjroV5Ay60xaeB2Eew@mail.gmail.com
The logic around shm_mq_detach was a few bricks shy of a load, because
(contrary to the comments for shm_mq_attach) all it did was update the
shared shm_mq state. That left us leaking a bit of process-local
memory, but much worse, the on_dsm_detach callback for shm_mq_detach
was still armed. That means that whenever we ultimately detach from
the DSM segment, we'd run shm_mq_detach again for already-detached,
possibly long-dead queues. This accidentally fails to fail today,
because we only ever re-use a shm_mq's memory for another shm_mq, and
multiple detach attempts on the last such shm_mq are fairly harmless.
But it's gonna bite us someday, so let's clean it up.
To do that, change shm_mq_detach's API so it takes a shm_mq_handle
not the underlying shm_mq. This makes the callers simpler in most
cases anyway. Also fix a few places in parallel.c that were just
pfree'ing the handle structs rather than doing proper cleanup.
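A minimal sketch of the new calling convention (surrounding code is
illustrative):

    shm_mq_handle *mqh = shm_mq_attach(mq, seg, NULL);
    /* ... use the queue ... */
    shm_mq_detach(mqh);     /* frees local state and disarms the
                             * on_dsm_detach callback */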
Back-patch to v10 because of the risk that the revenant shm_mq_detach
callbacks would cause a live bug sometime. Since this is an API
change, it's too late to do it in 9.6. (We could make a variant
patch that preserves API, but I'm not excited enough to do that.)
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
Periodically while the server is running, and at shutdown, write out a
list of blocks in shared buffers. When the server reaches consistency
-- unfortunately, we can't do it before that point without breaking
things -- reload those blocks into any still-unused shared buffers.
Mithun Cy and Robert Haas, reviewed and tested by Beena Emerson,
Amit Kapila, Jim Nasby, and Rafia Sabih.
Discussion: http://postgr.es/m/CAD__OugubOs1Vy7kgF6xTjmEqTR4CrGAv8w+ZbaY_+MZeitukw@mail.gmail.com
The API for WaitLatch and friends followed the Unix convention in which
waiting for a socket connection to complete is identical to waiting for
the socket to accept a write. While Windows provides a select(2)
emulation that agrees with that, the native WaitForMultipleObjects API
treats them as quite different --- and for some bizarre reason, it will
report a not-yet-connected socket as write-ready. libpq itself has so
far escaped dealing with this because it waits with select(), but in
libpqwalreceiver.c we want to wait using WaitLatchOrSocket. The semantics
mismatch resulted in replication connection failures on Windows, but only
for remote connections (apparently, localhost connections complete
immediately, or at least too fast for anyone to have noticed the problem
in single-machine testing).
To fix, introduce an additional WL_SOCKET_CONNECTED wait flag for
WaitLatchOrSocket, which is identical to WL_SOCKET_WRITEABLE on
non-Windows, but results in waiting for FD_CONNECT events on Windows.
Ideally, we would also distinguish the two conditions in the API for
PQconnectPoll(), but changing that API at this point seems infeasible.
Instead, cheat by checking for PQstatus() == CONNECTION_STARTED to
determine that we're still waiting for the connection to complete.
(This is a cheat mainly because CONNECTION_STARTED is documented as an
internal state rather than something callers should rely on. Perhaps
we ought to change the documentation ... but this patch doesn't.)
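A hedged sketch of the resulting logic (simplified from what
libpqwalreceiver needs to do; the wait-event value is illustrative):

    int         io_flag;

    if (PQstatus(conn) == CONNECTION_STARTED)
        io_flag = WL_SOCKET_CONNECTED;  /* waits for FD_CONNECT on Windows */
    else
        io_flag = WL_SOCKET_WRITEABLE;

    rc = WaitLatchOrSocket(MyLatch,
                           WL_LATCH_SET | io_flag,
                           PQsocket(conn),
                           -1L,                 /* no timeout */
                           PG_WAIT_EXTENSION);  /* illustrative wait event */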
Per reports from Jobin Augustine and Igor Neyman. Back-patch to v10
where commit 1e8a85009 exposed this longstanding shortcoming.
Andres Freund, minor fix and some code review/beautification by me
Discussion: https://postgr.es/m/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:
* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
than the expected column 33.
On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list. This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.
There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses. I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
Previously we required every exported transaction to have an xid
assigned. That was used to check that the exporting transaction is
still running, which in turn is needed to guarantee that the
necessary rows haven't been removed in between exporting and importing
the snapshot.
The exported xid caused unnecessary problems with logical decoding,
because slot creation has to wait for all concurrent xids to finish,
which in turn serializes concurrent slot creation. It also
prohibited snapshots from being exported on hot-standby replicas.
Instead export the virtual transaction id, which avoids the unnecessary
serialization and the inability to export snapshots on standbys. This
changes the file name of the exported snapshot, but since we never
documented what that name means, that seems OK.
Author: Petr Jelinek, slightly editorialized by me
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f598b4b8-8cd7-0d54-0939-adda763d8c34@2ndquadrant.com
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.
To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or running an SQL query, this causes the backend to shut down;
if logical replication is in progress, all existing WAL records are
processed and then the backend shuts down. Otherwise, the walsender
switches to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs the walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.
Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
Declare the toc_nentry field as uint32 not Size. Since shm_toc_lookup()
reads the field without any lock, it has to be atomically readable, and
we do not assume that for fields wider than 32 bits. Performance would
be impossibly bad for entry counts approaching 2^32 anyway, so there is
no need to try to preserve maximum width here.
This is probably an academic issue, because even if reading int64 isn't
atomic, the high order half would never change in practice. Still, it's
a coding rule violation, so let's fix it.
Adjust some other not-terribly-well-chosen data types too, and copy-edit
some comments. Make shm_toc_attach's Asserts consistent with
shm_toc_create's.
None of this looks to be a live bug, so no need for back-patch.
Discussion: https://postgr.es/m/16984.1496679541@sss.pgh.pa.us
Given the possibility of race conditions and so on, it seems entirely
unsafe to just assume that shm_toc_lookup() always finds the key it's
looking for --- but that was exactly what all but one call site were
doing. To fix, add a "bool noError" argument, similarly to what we
have in many other functions, and throw an error on an unexpected
lookup failure. Remove now-redundant Asserts that a rather random
subset of call sites had.
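Sketch of the new calling convention (key names are illustrative):

    /* error out if the key is unexpectedly missing */
    ptr = shm_toc_lookup(toc, MY_KEY_FIXED, false);

    /* optional entry: returns NULL instead of throwing */
    opt = shm_toc_lookup(toc, MY_KEY_OPTIONAL, true);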
I doubt this will throw any light on buildfarm member lorikeet's
recent failures, because if an unnoticed lookup failure were involved,
you'd kind of expect a null-pointer-dereference crash rather than the
observed symptom. But you never know ... and this is better coding
practice even if it never catches anything.
Discussion: https://postgr.es/m/9697.1496675981@sss.pgh.pa.us
The logical decoding machinery already preserved all the required
catalog tuples, which is sufficient in the course of normal logical
decoding, but did not guarantee that non-catalog tuples were preserved
during computation of the initial snapshot when creating a slot over
the replication protocol.
This could cause a corrupted initial snapshot to be exported. The
time window for issues is usually not terribly large, but on a busy
server it's perfectly possible to hit it. Ongoing decoding is not
affected by this bug.
To avoid increased overhead for the SQL API, only retain additional
tuples when a logical slot is being created over the replication
protocol. To do so this commit changes the signature of
CreateInitDecodingContext(), but it seems unlikely that it's being
used in an extension, so that's probably ok.
In a drive-by fix, fix handling of
ReplicationSlotsComputeRequiredXmin's already_locked argument, which
should only apply to ProcArrayLock, not ReplicationSlotControlLock.
Reported-By: Erik Rijkers
Analyzed-By: Petr Jelinek
Author: Petr Jelinek, heavily editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/9a897b86-46e1-9915-ee4c-da02e4ff6a95@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
This used to mean "Visual C++ except in those parts where Borland C++
was supported where it meant one of those". Now that we don't support
Borland C++ anymore, simplify by using _MSC_VER which is the normal way
to detect Visual C++.
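Compiler-specific sections now read simply:

    #ifdef _MSC_VER
    /* Visual C++-specific code */
    #endif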
Commit 4deb41381 modified isolationtester's query to see whether a
session is blocked to also check for waits occurring in GetSafeSnapshot.
However, it did that in a way that enormously increased the query's
runtime under CLOBBER_CACHE_ALWAYS, causing the buildfarm members
that use that to run about four times slower than before, and in some
cases fail entirely. To fix, push the entire logic into a dedicated
backend function. This should actually reduce the CLOBBER_CACHE_ALWAYS
runtime from what it was previously, though I've not checked that.
In passing, expose a SQL function to check for safe-snapshot blockage,
comparable to pg_blocking_pids. This is more or less free given the
infrastructure built to solve the other problem, so we might as well.
Thomas Munro
Discussion: https://postgr.es/m/20170407165749.pstcakbc637opkax@alap3.anarazel.de
Defaults match the fixed behavior of prior releases, but now DBAs
have better options to tune serializable workloads.
It might be nice to be able to set this per relation, but that part
will need to wait for another release.
Author: Dagfinn Ilmari Mannsåker
There are no functional changes here; this simply encapsulates knowledge
of the ItemPointerData struct so that a future patch can change things
without more breakage.
All direct users of ip_blkid and ip_posid are changed to use existing
macros ItemPointerGetBlockNumber and ItemPointerGetOffsetNumber
respectively. For callers where that's inappropriate (because those
macros Assert that the item pointer is valid-looking), add
ItemPointerGetBlockNumberNoCheck and ItemPointerGetOffsetNumberNoCheck,
which lack the assertion but are otherwise identical.
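For instance (a sketch, with tup a HeapTuple):

    BlockNumber  blkno  = ItemPointerGetBlockNumber(&tup->t_self);
    OffsetNumber offnum = ItemPointerGetOffsetNumber(&tup->t_self);

    /* where the pointer may legitimately be invalid: */
    blkno = ItemPointerGetBlockNumberNoCheck(&tup->t_self);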
Author: Pavan Deolasee
Discussion: https://postgr.es/m/CABOikdNnFon4cJiL=h1mZH3bgUeU+sWHuU4Yr8AB=j3A2p1GiA@mail.gmail.com
If a file was created or deleted just before a power loss, it's
possible that the file system will lose that change. To prevent this,
call fsync() where creating or unlinking a file is critical.
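The general pattern is roughly (error handling elided; not the
committed code):

    fd = open(path, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
    write(fd, buf, len);
    fsync(fd);              /* make the file contents durable */
    close(fd);

    dirfd = open(dirpath, O_RDONLY);
    fsync(dirfd);           /* make the create/unlink itself durable */
    close(dirfd);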
Author: Michael Paquier
Reviewed-by: Ashutosh Bapat, Takayuki Tsunakawa, me
Previously, auxiliary processes and background workers not connected
to a database (such as the logical replication launcher) weren't
shown. Include them, so that we can see the associated wait state
information. Add a new column to identify the process type, so that
people can filter them out easily using SQL if they wish.
Before this patch was written, there was discussion about whether we
should expose this information in a separate view, so as to avoid
contaminating pg_stat_activity with things people might not want to
see. But putting everything in pg_stat_activity was a more popular
choice, so that's what the patch does.
Kuntal Ghosh, reviewed by Amit Langote and Michael Paquier. Some
revisions and bug fixes by me.
Discussion: http://postgr.es/m/CA+TgmoYES5nhkEGw9nZXU8_FhA8XEm8NTm3-SO+3ML1B81Hkww@mail.gmail.com
If the upstream walsender is using a physical replication slot, store the
catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
slot and has only a PGPROC entry, behaviour doesn't change, as we store the
combined xmin and catalog_xmin in the PGPROC entry.
Author: Craig Ringer
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
This could result in corruption of the init fork of an unlogged index
if the ambuildempty routine for that index used shared buffers to
create the init fork, which was true for brin, gin, gist, and hash
indexes.
Patch by me, based on an earlier patch by Michael Paquier, who also
reviewed this one. This also incorporates an idea from Artur
Zakirov.
Discussion: http://postgr.es/m/CACYUyc8yccE4xfxhqxfh_Mh38j7dRFuxfaK1p6dSNAEUakxUyQ@mail.gmail.com
This reverts commit ccce90b398673d55b0387b3de66639b1b30d451b. This
optimization is unsafe, at least in the presence of rollbacks and
rollbacks to savepoints, but I'm concerned there may be other
problematic cases as
well. Therefore, I've decided to revert this pending further
investigation.
Commit 0e141c0fbb211bdd23783afa731e3eef95c9ad7a introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around. Use a similar mechanism to reduce
contention on CLogControlLock. Testing shows that this very
significantly reduces the amount of time waiting for CLogControlLock
on high-concurrency pgbench tests run on large multi-socket
machines; whether that translates into a TPS improvement depends on
how much of that contention is simply shifted to some other lock,
particularly WALWriteLock.
Amit Kapila, with some cosmetic changes by me. Extensively reviewed,
tested, and benchmarked over a period of about 15 months by Simon
Riggs, Robert Haas, Andres Freund, Jesper Pedersen, and especially by
Tomas Vondra and Dilip Kumar.
Discussion: http://postgr.es/m/CAA4eK1L_snxM_JcrzEstNq9P66++F4kKFce=1r5+D1vzPofdtg@mail.gmail.com
Discussion: http://postgr.es/m/CAA4eK1LyR2A+m=RBSZ6rcPEwJ=rVi1ADPSndXHZdjn56yqO6Vg@mail.gmail.com
Discussion: http://postgr.es/m/91d57161-d3ea-0cc2-6066-80713e4f90d7@2ndquadrant.com
When a shared iterator is used, each call to tbm_shared_iterate()
returns a result that has not yet been returned to any process
attached to the shared iterator. In other words, each cooperating
process gets a disjoint subset of the full result set, but all
results are returned exactly once.
This is infrastructure for parallel bitmap heap scan.
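Each cooperating process runs a loop along these lines (a sketch;
process_block() is a stand-in):

    TBMIterateResult *tbmres;

    while ((tbmres = tbm_shared_iterate(iterator)) != NULL)
    {
        /* each block is returned to exactly one process */
        process_block(tbmres->blockno);
    }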
Dilip Kumar. The larger patch set of which this is a part has been
reviewed and tested by (at least) Andres Freund, Amit Khandekar,
Tushar Ahuja, Rafia Sabih, Haribabu Kommi, and Thomas Munro.
Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
In preparation for adding write-ahead logging to hash indexes,
refactor _hash_freeovflpage and _hash_squeezebucket so that all
related page modifications happen in a single section of code. The
previous coding assumed that it would be fine to move tuples one at a
time, and also that the various operations involved in freeing an
overflow page didn't necessarily all need to be done together, all
of which is true if you don't care about write-ahead logging.
Amit Kapila, with slight changes by me.
Doing so doesn't seem to be within the purpose of the per-user
connection limits, and has particularly unfortunate effects in
conjunction with parallel queries.
Backpatch to 9.6 where parallel queries were introduced.
David Rowley, reviewed by Robert Haas and Albe Laurenz.
Gen_fmgrtab.pl creates a new file fmgrprotos.h, which contains
prototypes for all functions registered in pg_proc.h. This avoids
having to manually maintain these prototypes across a random variety of
header files. It also automatically enforces a correct function
signature, and since there are warnings about missing prototypes, it
will detect functions that are defined but not registered in
pg_proc.h (or otherwise used).
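The generated header contains one prototype per pg_proc.h entry, along
these lines (illustrative excerpt):

    extern Datum int4pl(PG_FUNCTION_ARGS);
    extern Datum textcat(PG_FUNCTION_ARGS);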
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
The typical size of an LWLock is now 16 bytes even on 64-bit platforms,
and the size of slock_t is now irrelevant. But pg_atomic_uint32 can
(perhaps surprisingly) still be larger than 4 bytes, so there's still
some marginal point to allowing LWLOCK_MINIMAL_SIZE == 64.
Commit 008608b9d51061b1f598c197477b3dc7be9c4a64 made the changes
that led to the need for these updates.
If we do not reset the FD_READ event, WaitForMultipleObjects won't
return it again unless we've meanwhile read from the socket,
which is generally true but not guaranteed. WaitEventSetWaitBlock
itself may fail to return the event to the caller if the latch is
also set, and even if we changed that, the caller isn't obliged to
handle all returned events at once. On non-Windows systems, the
socket-read event is purely level-triggered, so this issue does
not exist. To fix, make Windows reset the event when needed.
This bug was introduced by 98a64d0bd713cb89e61bef6432befc4b7b5da59e,
and causes hangs when trying to use the pldebugger extension.
Patch by Amit Kapila. Reported and tested by Ashutosh Sharma, who
also provided some analysis. Further analysis by Michael Paquier.
This was added in 105409746499657acdffc109db9d343b464bda1f, but has
never been used for anything as far as I can tell. There seems to
be no reason to keep it.
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche. So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached. For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset. This is arguably more useful,
and certainly a lot cheaper. Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.
The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".
The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query. There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
in use simultaneously. So if we didn't get rid of this mechanism we'd
have to make it even more complicated. By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.
Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
Previously, the "sem" field of PGPROC varied in size depending on which
kernel semaphore API we were using. That was okay as long as there was
only one likely choice per platform, but in the wake of commit ecb0d20a9,
that assumption seems rather shaky. It doesn't seem out of the question
anymore that an extension compiled against one API choice might be loaded
into a postmaster built with another choice. Moreover, this prevents any
possibility of selecting the semaphore API at postmaster startup, which
might be something we want to do in future.
Hence, change PGPROC.sem to be PGSemaphore (i.e. a pointer) for all Unix
semaphore APIs, and turn the pointed-to data into an opaque struct whose
contents are only known within the responsible modules.
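Concretely, callers now go through the stored pointer (a sketch; proc
being the target process's PGPROC):

    /* sleep until some other process unlocks our semaphore */
    PGSemaphoreLock(MyProc->sem);

    /* and to wake another process: */
    PGSemaphoreUnlock(proc->sem);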
For the SysV and unnamed-POSIX APIs, the pointed-to data has to be
allocated elsewhere in shared memory, which takes a little bit of
rejiggering of the InitShmemAllocation code sequence. (I invented a
ShmemAllocUnlocked() function to make that a little cleaner than it used
to be. That function is not meant for any uses other than the ones it
has now, but it beats having InitShmemAllocation() know explicitly about
allocation of space for semaphores and spinlocks.) This change means an
extra indirection to access the semaphore data, but since we only touch
that when blocking or awakening a process, there shouldn't be any
meaningful performance penalty. Moreover, at least for the unnamed-POSIX
case on Linux, the sem_t type is quite a bit wider than a pointer, so this
reduces sizeof(PGPROC) which seems like a good thing.
For the named-POSIX API, there's effectively no change: the PGPROC.sem
field was and still is a pointer to something returned by sem_open() in
the postmaster's memory space. Document and check the pre-existing
limitation that this case can't work in EXEC_BACKEND mode.
It did not seem worth unifying the Windows semaphore ABI with the Unix
cases, since there's no likelihood of needing ABI compatibility much less
runtime switching across those cases. However, we can simplify the Windows
code a bit if we define PGSemaphore as being directly a HANDLE, rather than
pointer to HANDLE, so let's do that while we're here. (This also ends up
being no change in what's physically stored in PGPROC.sem. We're just
moving the HANDLE fetch from callees to callers.)
It would take a bunch of additional code shuffling to get to the point of
actually choosing a semaphore API at postmaster start, but the effects
of that would now be localized in the port/XXX_sema.c files, so it seems
like fit material for a separate patch. The need for it is unproven as
yet, anyhow, whereas the ABI risk to extensions seems real enough.
Discussion: https://postgr.es/m/4029.1481413370@sss.pgh.pa.us
Instead of confusingly stating platform-dependent defaults for these
parameters in the comments in postgresql.conf.sample (with the main
entry being a lie on Linux), teach initdb to install the correct
platform-dependent value in postgresql.conf, similarly to the way
we handle other platform-dependent defaults. This won't do anything
for existing 9.6 installations, but since it's effectively only a
documentation improvement, that seems OK.
Since this requires initdb to have access to the default values,
move the #define's for those to pg_config_manual.h; the original
placement in bufmgr.h is unworkable because that file can't be
included by frontend programs.
Adjust the default value for wal_writer_flush_after so that it is 1MB
regardless of XLOG_BLCKSZ, conforming to what is stated in both the
SGML docs and postgresql.conf. (We could alternatively make it scale
with XLOG_BLCKSZ, but I'm not sure I see the point.)
Copy-edit related SGML documentation.
Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra.
Discussion: <30ebc6e3-8358-09cf-44a8-578252938424@2ndquadrant.com>
A new thing also called a "barrier" is proposed, but whether we decide
to take that patch or not, this file seems to have outlived its
usefulness.
Thomas Munro
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true. In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time. However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set. Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.
Robert Haas and Thomas Munro
Previously, the handle for the control segment could not be zero, but
some other DSM segment could potentially have a handle value of zero.
However, that means that if someone wanted to store a dsm_handle that
might or might not be valid, they would need a separate boolean to
keep track of whether the associated value is legal. That's annoying,
so change things so that no DSM segment can ever have a handle of 0 -
or as we call it here, DSM_HANDLE_INVALID.
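So a "maybe absent" reference no longer needs a separate flag (sketch):

    dsm_handle  handle = DSM_HANDLE_INVALID;

    /* ... handle possibly assigned somewhere ... */

    if (handle != DSM_HANDLE_INVALID)
        seg = dsm_attach(handle);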
Thomas Munro. This was submitted as part of a much larger patch to
add an malloc-like allocator for dynamic shared memory, but this part
seems like a good idea independently of the rest of the patch.
LockBufferForCleanup() acquires a cleanup lock unconditionally, and
ConditionalLockBufferForCleanup() acquires a cleanup lock if it is
possible to do so without waiting; this patch adds a new API,
IsBufferCleanupOK(), which tests whether an exclusive lock already
held happens to be a cleanup lock. This is possible because a cleanup
lock simply means an exclusive lock plus the assurance that any other
pins on the buffer are newer than our own pin. Therefore, just as the
existing functions decide that the exclusive lock that they've just
taken is a cleanup lock if they observe the pin count to be 1, this
new function allows us to observe that the pin count is 1 on a buffer
we've already locked.
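A sketch of the intended usage:

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    /* ... ordinary modifications ... */
    if (IsBufferCleanupOK(buf))
    {
        /* the lock we already hold is a cleanup lock; do cleanup too */
    }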
This is useful in situations where a backend definitely wishes to
modify the buffer and also wishes to perform cleanup operations if
possible. The patch to eliminate heavyweight locking by hash indexes
uses this, and it may have other applications as well.
Amit Kapila, per a suggestion from me. Some comment adjustments by me
as well.
SCO OpenServer and SCO UnixWare are more or less dead platforms.
We have never had a buildfarm member testing the "sco" port, and
the last "unixware" member was last heard from in 2012, so it's
fair to doubt that the code even compiles anymore on either one.
Remove both ports. We can always undo this if someone shows up
with an interest in maintaining and testing these platforms.
Discussion: <17177.1476136994@sss.pgh.pa.us>
I somehow had assumed that in the spinlock (in turn possibly using
semaphores) based fallback atomics implementation 32 bit writes could be
done without a lock. As far as the write goes that's correct, since
postgres supports only platforms with single-copy atomicity for aligned
32bit writes. But writing without holding the spinlock breaks
read-modify-write operations like pg_atomic_compare_exchange_u32(),
since they'll potentially "miss" a concurrent write, which can't happen
in actual hardware implementations.
In 9.6+ when using the fallback atomics implementation this could lead
to buffer header locks not being properly marked as released, and
potentially some related state corruption. I don't see a related danger
in 9.5 (earliest release with the API), because pg_atomic_write_u32()
wasn't used in a concurrent manner there.
The state variables of local buffers were, before this change,
manipulated using pg_atomic_write_u32() to avoid unnecessary
synchronization overhead. As that would no longer be the case once
pg_atomic_write_u32() takes the lock, introduce and use
pg_atomic_unlocked_write_u32(), which explicitly does not interact
correctly with RMW operations.
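For example (illustrative; local buffer headers are backend-private,
so an unlocked write is safe there):

    pg_atomic_unlocked_write_u32(&bufHdr->state, state);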
This bug only caused issues when postgres is compiled on platforms
without atomics support (i.e., no mainstream modern platform), or when
compiled with --disable-atomics, which explains why this wasn't noticed in
testing.
Reported-By: Tom Lane
Discussion: <14947.1475690465@sss.pgh.pa.us>
Backpatch: 9.5-, where the atomic operations API was introduced.
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now take an
additional wait_event_info parameter; legal values are defined in
pgstat.h. This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.
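For example, a latch wait in an extension might now look like this
(a sketch; flags, timeout, and the constant spelling are illustrative):

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   1000L,              /* timeout in ms */
                   PG_WAIT_EXTENSION); /* new wait_event_info argument */
    ResetLatch(MyLatch);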
Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.
Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity. We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.
Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.