Upcoming patches are going to address performance issues that involve
slow system provided ntohs/htons etc. To address that expand
pg_bswap.h to provide pg_ntoh{16,32,64}, pg_hton{16,32,64} and
optimize their respective implementations by using compiler intrinsics
for gcc compatible compilers and msvc. Fall back to manual
implementations using shifts etc otherwise.
Additionally remove multiple evaluation hazards from the existing
BSWAP32/64 macros, by replacing them with inline functions when
necessary. In the course of that the naming scheme is changed to
pg_bswap16/32/64.
Author: Andres Freund
Discussion: https://postgr.es/m/20170927172019.gheidqy6xvlxb325@alap3.anarazel.de
Add bgw_type field to background worker structure. It is intended to be
set to the same value for all workers of the same type, so they can be
grouped in pg_stat_activity, for example.
The backend_type column in pg_stat_activity now shows bgw_type for a
background worker. The ps listing also no longer calls out that a
process is a background worker but just show the bgw_type. That way,
being a background worker is more of an implementation detail now that
is not shown to the user. However, most log messages still refer to
'background worker "%s"'; otherwise constructing sensible and
translatable log messages would become tricky.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
At the time replacement_sort_tuples was introduced, there were still
cases where replacement selection sort noticeably outperformed using
quicksort even for the first run. However, those cases seem to have
evaporated as a result of further improvements made since that time
(and perhaps also advances in CPU technology). So remove replacement
selection and the controlling GUC entirely. This makes tuplesort.c
noticeably simpler and probably paves the way for further
optimizations someone might want to do later.
Peter Geoghegan, with review and testing by Tomas Vondra and me.
Discussion: https://postgr.es/m/CAH2-WzmmNjG_K0R9nqYwMq3zjyJJK+hCbiZYNGhAy-Zyjs64GQ@mail.gmail.com
This reverts commit 15bc038f9, along with the followon commits 1635e80d3
and 984c92074 that tried to clean up the problems exposed by bug #14825.
The result was incomplete because it failed to address parallel-query
requirements. With 10.0 release so close upon us, now does not seem like
the time to be adding more code to fix that. I hope we can un-revert this
code and add the missing parallel query support during the v11 cycle.
Back-patch to v10.
Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside
transaction blocks, by disallowing the use of the added value later
in the same transaction, except under limited circumstances. However,
the test for "limited circumstances" was heuristic and could reject
references to enum values that were created during CREATE TYPE AS ENUM,
not just later. This breaks the use-case of restoring pg_dump scripts
in a single transaction, as reported in bug #14825 from Balazs Szilfai.
We can improve this by keeping a "blacklist" table of enum value OIDs
created by ALTER TYPE ADD VALUE during the current transaction. Any
visible-but-uncommitted value whose OID is not in the blacklist must
have been created by CREATE TYPE AS ENUM, and can be used safely
because it could not have a lifespan shorter than its parent enum type.
This change also removes the restriction that a renamed enum value
can't be used before being committed (unless it was on the blacklist).
Andrew Dunstan, with cosmetic improvements by me.
Back-patch to v10.
Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
This was intended as infrastructure for weakening VACUUM's locking
requirements, similar to what was done for btree indexes in commit
2ed5b87f96d473962ec5230fd820abfeaccb2069. However, for hash indexes,
it seems that the improvements which are possible are actually
extremely marginal. Furthermore, performing the LSN cross-check will
end up skipping cleanup far more often than is necessary; we only care
about page modifications due to a VACUUM, but the LSN check will fail
if ANY modification has occurred. So, rather than pressing forward
with that "optimization", just rip the LSN field out.
Patch by me, reviewed by Ashutosh Sharma and Amit Kapila
Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com
On Linux, shared memory segments created with shm_open() are backed by
swap files created in tmpfs. If the swap file needs to be extended,
but there's no tmpfs space left, you get a very unfriendly SIGBUS trap.
To avoid this, force allocation of the full request size when we create
the segment. This adds a few cycles, but none that we wouldn't expend
later anyway, assuming the request isn't hugely bigger than the actual
need.
Make this code #ifdef __linux__, because (a) there's not currently a
reason to think the same problem exists on other platforms, and (b)
applying posix_fallocate() to an FD created by shm_open() isn't very
portable anyway.
Back-patch to 9.4 where the DSM code came in.
Thomas Munro, per a bug report from Amul Sul
Discussion: https://postgr.es/m/1002664500.12301802.1471008223422.JavaMail.yahoo@mail.yahoo.com
The file handling functions from fd.c were called with a diverse mix of
notations for the file permissions when they were opening new files.
Almost all files created by the server should have the same permissions
set. So change the API so that e.g. OpenTransientFile() automatically
uses the standard permissions set, and OpenTransientFilePerm() is a new
function that takes an explicit permissions set for the few cases where
it is needed. This also saves an unnecessary argument for call sites
that are just opening an existing file.
While we're reviewing these APIs, get rid of the FileName typedef and
use the standard const char * for the file name and mode_t for the file
mode. This makes these functions match other file handling functions
and removes an unnecessary layer of mysteriousness. We can also get rid
of a few casts that way.
Author: David Steele <david@pgmasters.net>
Commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272 added a similar
optimization to btree back in 2006, but nobody bothered to implement
the same thing for hash indexes, probably because they weren't
WAL-logged and had lots of other performance problems as well. As
with the corresponding btree case, this eliminates the problem of
potentially needing to refind our position within the page, and cuts
down on pin/unpin traffic as well.
Ashutosh Sharma, reviewed by Alexander Korotkov, Jesper Pedersen,
Amit Kapila, and me. Some final edits to comments and README by
me.
Discussion: http://postgr.es/m/CAE9k0Pm3KTx93K8_5j6VMzG4h5F+SyknxUwXrN-zqSZ9X8ZS3w@mail.gmail.com
These functions are required by SUS v2, which is our minimum baseline
for Unix platforms, and are present on all interesting Windows versions
as well. Even our oldest buildfarm members have them. Thus, we were not
testing the "!USE_WIDE_UPPER_LOWER" code paths, which explains why the bug
fixed in commit e6023ee7f escaped detection. Per discussion, there seems
to be no more real-world value in maintaining this option. Hence, remove
the configure-time tests for wcstombs() and towlower(), remove the
USE_WIDE_UPPER_LOWER symbol, and remove all the !USE_WIDE_UPPER_LOWER code.
There's not actually all that much of the latter, but simplifying the #if
nests is a win in itself.
Discussion: https://postgr.es/m/20170921052928.GA188913@rfd.leadboat.com
"\if :{?variable_name}" will be translated to "\if TRUE" if the variable
exists and "\if FALSE" otherwise. Thus it will be possible to execute code
conditionally on the existence of the variable, regardless of its value.
Fabien Coelho, with some review by Robins Tharakan and some light text
editing by me.
Discussion: https://postgr.es/m/alpine.DEB.2.20.1708260835520.3627@lancre
This is not used for anything yet, but it is necessary infrastructure
for partition-wise join and for partition pruning without constraint
exclusion.
Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
mostly cosmetic, by me. Additional review and testing of this patch
series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
Raghuwanshi, Thomas Munro, and Dilip Kumar.
Discussion: http://postgr.es/m/CAFjFpRfneFG3H+F6BaiXemMrKF+FY-POpx3Ocy+RiH3yBmXSNw@mail.gmail.com
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
The preceding patch allowed us to remove useless GiST support functions.
This patch actually does that for all the no-op cases in the core GiST
code. This buys us whatever performance gain is to be had, and more
importantly exercises the preceding patch.
There remain no-op functions in the contrib GiST opclasses, but those
will take more work to remove.
Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com
The use of strnlen rather than strlen was just paranoia. Instead of
giving up on the paranoia, just implement the safeguard
differently. And add a comment explaining why we're careful.
Author: Andres Freund
Discussion: https://postgr.es/m/E1duOkJ-0001Mc-U5@gemulon.postgresql.org
Previously multi-byte aware truncation was done on every
pgstat_report_activity() call - proving to be a bottleneck for
workloads with long query strings that execute quickly.
Instead move the truncation to the read side, which commonly is
executed far less frequently. That's possible because all server
encodings allow to determine the length of a multi-byte string from
the first byte.
Rename PgBackendStatus.st_activity to st_activity_raw so existing
extension users of the field break - their code has to be adjusted to
use pgstat_clip_activity().
Author: Andres Freund
Tested-By: Khuntal Ghosh
Reviewed-By: Robert Haas, Tom Lane
Discussion: https://postgr.es/m/20170912071948.pa7igbpkkkviecpz@alap3.anarazel.de
The bug was caused by not re-reading the control file during crash
recovery restarts, which lead to an attempt to pfree() shared memory
contents. The fix is to re-read the control file, which seems good
anyway.
It's unclear as of this moment, whether we want to keep the
refactoring introduced in the commit referenced above, or come up with
an alternative approach. But fixing the bug in the mean time seems
like a good idea regardless.
A followup commit will introduce regression test coverage for crash
restarts.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/14134.1505572349@sss.pgh.pa.us
Make the btree page-flags test macros (P_ISLEAF and friends) return clean
boolean values, rather than values that might not fit in a bool. Use them
in a few places that were randomly referencing the flag bits directly.
In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
BT_READ. (Some think we should go the other way, but as long as we have
BT_READ/BT_WRITE, let's use them consistently.)
Masahiko Sawada, reviewed by Doug Doole
Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com
By project convention, these names should include "P" when dealing with a
pointer type; that is, if the result of a GETARG macro is of type FOO *,
it should be called PG_GETARG_FOO_P not just PG_GETARG_FOO. Some newer
types such as JSONB and ranges had not followed the convention, and a
number of contrib modules hadn't gotten that memo either. Rename the
offending macros to improve consistency.
In passing, fix a few places that thought PG_DETOAST_DATUM() returns
a Datum; it does not, it returns "struct varlena *". Applying
DatumGetPointer to that happens not to cause any bad effects today,
but it's formally wrong. Also, adjust an ltree macro that was designed
without any thought for what pgindent would do with it.
This is all cosmetic and shouldn't have any impact on generated code.
Mark Dilger, some further tweaks by me
Discussion: https://postgr.es/m/EA5676F4-766F-4F38-8348-ECC7DB427C6A@gmail.com
The standard says that all changes of the same kind (insert, update, or
delete) caused in one table by a single SQL statement should be reported
in a single transition table; and by that, they mean to include foreign key
enforcement actions cascading from the statement's direct effects. It's
also reasonable to conclude that if the standard had wCTEs, they would say
that effects of wCTEs applying to the same table as each other or the outer
statement should be merged into one transition table. We weren't doing it
like that.
Hence, arrange to merge tuples from multiple update actions into a single
transition table as much as we can. There is a problem, which is that if
the firing of FK enforcement triggers and after-row triggers with
transition tables is interspersed, we might need to report more tuples
after some triggers have already seen the transition table. It seems like
a bad idea for the transition table to be mutable between trigger calls.
There's no good way around this without a major redesign of the FK logic,
so for now, resolve it by opening a new transition table each time this
happens.
Also, ensure that AFTER STATEMENT triggers fire just once per statement,
or once per transition table when we're forced to make more than one.
Previous versions of Postgres have allowed each FK enforcement query
to cause an additional firing of the AFTER STATEMENT triggers for the
referencing table, but that's certainly not per spec. (We're still
doing multiple firings of BEFORE STATEMENT triggers, though; is that
something worth changing?)
Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature.
The transition tables are now managed by the AfterTriggers data structures,
rather than being directly the responsibility of ModifyTable nodes. This
removes a subtransaction-lifespan memory leak introduced by my previous
band-aid patch 3c4359521.
In passing, refactor the AfterTriggers data structures to reduce the
management overhead for them, by using arrays of structs rather than
several parallel arrays for per-query-level and per-subtransaction state.
I failed to resist the temptation to do some copy-editing on the SGML
docs about triggers, above and beyond merely documenting the effects
of this patch.
Back-patch to v10, because we don't want the semantics of transition
tables to change post-release.
Patch by me, with help and review from Thomas Munro.
Discussion: https://postgr.es/m/20170909064853.25630.12825@wrigleys.postgresql.org
Tuples can have type RECORDOID and a typmod number that identifies a blessed
TupleDesc in a backend-private cache. To support the sharing of such tuples
through shared memory and temporary files, provide a typmod registry in
shared memory.
To achieve that, introduce per-session DSM segments, created on demand when a
backend first runs a parallel query. The per-session DSM segment has a
table-of-contents just like the per-query DSM segment, and initially the
contents are a shared record typmod registry and a DSA area to provide the
space it needs to grow.
State relating to the current session is accessed via a Session object
reached through global variable CurrentSession that may require significant
redesign further down the road as we figure out what else needs to be shared
or remodelled.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
Previously we read the control file in multiple places. But soon the
segment size will be configurable and stored in the control file, and
that needs to be available earlier than it currently is needed.
Instead of adding yet another place where it's read, refactor things
so there's a single processing of the control file during startup (in
EXEC_BACKEND that's every individual backend's startup).
Author: Andres Freund
Discussion: http://postgr.es/m/20170913092828.aozd3gvvmw67gmyc@alap3.anarazel.de
Flattening the partitioning hierarchy at this stage makes various
desirable optimizations difficult. The original use case for this
patch was partition-wise join, which wants to match up the partitions
in one partitioning hierarchy with those in another such hierarchy.
However, it now seems that it will also be useful in making partition
pruning work using the PartitionDesc rather than constraint exclusion,
because with a flattened expansion, we have no easy way to figure out
which PartitionDescs apply to which leaf tables in a multi-level
partition hierarchy.
As it turns out, we end up creating both rte->inh and !rte->inh RTEs
for each intermediate partitioned table, just as we previously did for
the root table. This seems unnecessary since the partitioned tables
have no storage and are not scanned. We might want to go back and
rejigger things so that no partitioned tables (including the parent)
need !rte->inh RTEs, but that seems to require some adjustments not
related to the core purpose of this patch.
Ashutosh Bapat, reviewed by me and by Amit Langote. Some final
adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRd=1venqLL7oGU=C1dEkuvk2DJgvF+7uKbnPHaum1mvHQ@mail.gmail.com
The order in which GRANTs are output is important as GRANTs which have
been GRANT'd by individuals via WITH GRANT OPTION GRANTs have to come
after the GRANT which included the WITH GRANT OPTION. This happens
naturally in the backend during normal operation as we only change
existing ACLs in-place, only add new ACLs to the end, and when removing
an ACL we remove any which depend on it also.
Also, adjust the comments in acl.h to make this clear.
Unfortunately, the updates to pg_dump to handle initial privileges
involved pulling apart ACLs and then combining them back together and
could end up putting them back together in an invalid order, leading to
dumps which wouldn't restore.
Fix this by adjusting the queries used by pg_dump to ensure that the
ACLs are rebuilt in the same order in which they were originally.
Back-patch to 9.6 where the changes for initial privileges were done.
Historically, the selectivity functions have simply not distinguished
< from <=, or > from >=, arguing that the fraction of the population that
satisfies the "=" aspect can be considered to be vanishingly small, if the
comparison value isn't any of the most-common-values for the variable.
(If it is, the code path that executes the operator against each MCV will
take care of things properly.) But that isn't really true unless we're
dealing with a continuum of variable values, and in practice we seldom are.
If "x = const" would estimate a nonzero number of rows for a given const
value, then it follows that we ought to estimate different numbers of rows
for "x < const" and "x <= const", even if the const is not one of the MCVs.
Handling this more honestly makes a significant difference in edge cases,
such as the estimate for a tight range (x BETWEEN y AND z where y and z
are close together).
Hence, split scalarltsel into scalarltsel/scalarlesel, and similarly
split scalargtsel into scalargtsel/scalargesel. Adjust <= and >=
operator definitions to reference the new selectivity functions.
Improve the core ineq_histogram_selectivity() function to make a
correction for equality. (Along the way, I learned quite a bit about
exactly why that function gives good answers, which I tried to memorialize
in improved comments.)
The corresponding join selectivity functions were, and remain, just stubs.
But I chose to split them similarly, to avoid confusion and to prevent the
need for doing this exercise again if someone ever makes them less stubby.
In passing, change ineq_histogram_selectivity's clamp for extreme
probability estimates so that it varies depending on the histogram
size, instead of being hardwired at 0.0001. With the default histogram
size of 100 entries, you still get the old clamp value, but bigger
histograms should allow us to put more faith in edge values.
Tom Lane, reviewed by Aleksander Alekseev and Kuntal Ghosh
Discussion: https://postgr.es/m/12232.1499140410@sss.pgh.pa.us
Before, only filters of the form "(<ldapsearchattribute>=<user>)"
could be used to search an LDAP server. Introduce ldapsearchfilter
so that more general filters can be configured using patterns, like
"(|(uid=$username)(mail=$username))" and "(&(uid=$username)
(objectClass=posixAccount))". Also allow search filters to be included
in an LDAP URL.
Author: Thomas Munro
Reviewed-By: Peter Eisentraut, Mark Cave-Ayland, Magnus Hagander
Discussion: https://postgr.es/m/CAEepm=0XTkYvMci0WRubZcf_1am8=gP=7oJErpsUfRYcKF2gwg@mail.gmail.com
This code isn't used, and there's no clear reason why anybody would ever
want to use it. These traversal mechanisms don't yield a visitation order
that is semantically meaningful for any external purpose, nor are they
any faster or simpler than the left-to-right or right-to-left traversals.
(In fact, some rough testing suggests they are slower :-(.) Moreover,
these mechanisms are impossible to test in any arm's-length fashion; doing
so requires knowledge of the red-black tree's internal implementation.
Hence, let's just jettison them.
Discussion: https://postgr.es/m/17735.1505003111@sss.pgh.pa.us
In commit fccebe421, we hacked get_actual_variable_range() to scan the
index with SnapshotDirty, so that if there are many uncommitted tuples
at the end of the index range, it wouldn't laboriously scan through all
of them looking for a live value to return. However, that didn't fix it
for the case of many recently-dead tuples at the end of the index;
SnapshotDirty recognizes those as committed dead and so we're back to
the same problem.
To improve the situation, invent a "SnapshotNonVacuumable" snapshot type
and use that instead. The reason this helps is that, if the snapshot
rejects a given index entry, we know that the indexscan will mark that
index entry as killed. This means the next get_actual_variable_range()
scan will proceed past that entry without visiting the heap, making the
scan a lot faster. We may end up accepting a recently-dead tuple as
being the estimated extremal value, but that doesn't seem much worse than
the compromise we made before to accept not-yet-committed extremal values.
The cost of the scan is still proportional to the number of dead index
entries at the end of the range, so in the interval after a mass delete
but before VACUUM's cleaned up the mess, it's still possible for
get_actual_variable_range() to take a noticeable amount of time, if you've
got enough such dead entries. But the constant factor is much much better
than before, since all we need to do with each index entry is test its
"killed" bit.
We chose to back-patch commit fccebe421 at the time, but I'm hesitant to
do so here, because this form of the problem seems to affect many fewer
people. Also, even when it happens, it's less bad than the case fixed
by commit fccebe421 because we don't get the contention effects from
expensive TransactionIdIsInProgress tests.
Dmitriy Sarafannikov, reviewed by Andrey Borodin
Discussion: https://postgr.es/m/05C72CF7-B5F6-4DB9-8A09-5AC897653113@yandex.ru
It is equivalent in ANSI C to write (*funcptr) () and funcptr(). These
two styles have been applied inconsistently. After discussion, we'll
use the more verbose style for plain function pointer variables, to make
it clear that it's a variable, and the shorter style when the function
pointer is in a struct (s.func() or s->func()), because then it's clear
that it's not a plain function name, and otherwise the excessive
punctuation makes some of those invocations hard to read.
Discussion: https://www.postgresql.org/message-id/f52c16db-14ed-757d-4b48-7ef360b1631d@2ndquadrant.com
This doesn't allow routing tuple to the foreign partitions themselves,
but it permits tuples to be routed to regular partitions despite the
presence of foreign partitions in the same inheritance hierarchy.
Etsuro Fujita, reviewed by Amit Langote and by me.
Discussion: http://postgr.es/m/bc3db4c1-1693-3b8a-559f-33ad2b50b7ad@lab.ntt.co.jp
Issuing a savepoint-related command in a Query message that contains
multiple SQL statements led to a FATAL exit with a complaint about
"unexpected state STARTED". This is a shortcoming of commit 4f896dac1,
which attempted to prevent such misbehaviors in multi-statement strings;
its quick hack of marking the individual statements as "not top-level"
does the wrong thing in this case, and isn't a very accurate description
of the situation anyway.
To fix, let's introduce into xact.c an explicit model of what happens for
multi-statement Query strings. This is an "implicit transaction block
in progress" state, which for many purposes works like the normal
TBLOCK_INPROGRESS state --- in particular, IsTransactionBlock returns true,
causing the desired result that PreventTransactionChain will throw error.
But in case of error abort it works like TBLOCK_STARTED, allowing the
transaction to be cancelled without need for an explicit ROLLBACK command.
Commit 4f896dac1 is reverted in toto, so that we go back to treating the
individual statements as "top level". We could have left it as-is, but
this allows sharpening the error message for PreventTransactionChain
calls inside functions.
Except for getting a normal error instead of a FATAL exit for savepoint
commands, this patch should result in no user-visible behavioral change
(other than that one error message rewording). There are some things
we might want to do in the line of changing the appearance or wording of
error and warning messages around this behavior, which would be much
simpler to do now that it's an explicitly modeled state. But I haven't
done them here.
Although this fixes a long-standing bug, no backpatch. The consequences
of the bug don't seem severe enough to justify the risk that this commit
itself creates some new issue.
Patch by me, but it owes something to previous investigation by
Takayuki Tsunakawa, who also reported the bug in the first place.
Also thanks to Michael Paquier for reviewing.
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F6BE40D@G01JPEXMBYT05
In the generic atomic ops that rely on a loop around a CAS primitive,
there's no need to force the initial read of the "old" value to be atomic.
In the typically-rare case that we get a torn value, that simply means
that the first CAS attempt will fail; but it will update "old" to the
atomically-read value, so the next attempt has a chance of succeeding.
It was already being done that way in pg_atomic_exchange_u64_impl(),
but let's duplicate the approach in the rest.
(Given the current coding of the pg_atomic_read functions, this change
is a no-op anyway on popular platforms; it only makes a difference where
pg_atomic_read_u64_impl() is implemented as a CAS.)
In passing, also remove unnecessary take-a-pointer-and-dereference-it
coding in the pg_atomic_read functions. That seems to have been based
on a misunderstanding of what the C standard requires. What actually
matters is that the pointer be declared as pointing to volatile, which
it is.
I don't believe this will change the assembly code at all on x86
platforms (even ignoring the likelihood that these implementations
get overridden by others); but it may help on less-mainstream CPUs.
Discussion: https://postgr.es/m/13707.1504718238@sss.pgh.pa.us
Index columns are referenced by ordinal number rather than name, e.g.
CREATE INDEX coord_idx ON measured (x, y, (z + t));
ALTER INDEX coord_idx ALTER COLUMN 3 SET STATISTICS 1000;
Incompatibility note for release notes:
\d+ for indexes now also displays Stats Target
Authors: Alexander Korotkov, with contribution by Adrien NAYRAT
Review: Adrien NAYRAT, Simon Riggs
Wordsmith: Simon Riggs
In addition to __sync_fetch_and_add, gcc offers __sync_fetch_and_sub,
__sync_fetch_and_and, and __sync_fetch_and_or, which correspond directly
to primitive atomic ops that we want. Testing shows that in some cases
they generate better code than our generic implementations, so use them.
We've assumed that configure's test for __sync_val_compare_and_swap is
sufficient to allow assuming that __sync_fetch_and_add is available, so
make the same assumption for these functions. Should that prove to be
wrong, we can add more configure tests.
Yura Sokolov, reviewed by Jesper Pedersen and myself
Discussion: https://postgr.es/m/7f65886daca545067f82bf2b463b218d@postgrespro.ru
The pg_atomic_compare_exchange_xxx functions are defined to update
*expected to whatever they read from the target variable. Therefore,
there's no need to do additional explicit reads after we've initialized
the "old" variable. The actual benefit of this is somewhat debatable,
but it seems fairly unlikely to hurt anything, especially since we
will override the generic implementations in most performance-sensitive
cases.
Yura Sokolov, reviewed by Jesper Pedersen and myself
Discussion: https://postgr.es/m/7f65886daca545067f82bf2b463b218d@postgrespro.ru
The NAMEDTUPLESTORE patch piggybacked on the infrastructure for
TABLEFUNC/VALUES/CTE RTEs, none of which can ever have dropped columns,
so the possibility was ignored most places. Fix that, including adding a
specification to parsenodes.h about what it's supposed to look like.
In passing, clean up assorted comments that hadn't been maintained
properly by said patch.
Per bug #14799 from Philippe Beaudoin. Back-patch to v10.
Discussion: https://postgr.es/m/20170906120005.25630.84360@wrigleys.postgresql.org
It has not been used in a long time, and it doesn't seem safe anyway, so
drop it.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Ryan Murphy <ryanfmurphy@gmail.com>
Some compilers complain, not unreasonably, about left-shifting an
int32 "1" and then assigning the result to an int64. In practice
I sure hope that this data structure never gets large enough that
an overflow would actually occur; but let's cast the constant to
the right type to avoid the hazard.
In passing, fix a typo in dshash.h.
Amit Kapila, adjusted as per comment from Thomas Munro.
Discussion: https://postgr.es/m/CAA4eK1+5vfVMYtjK_NX8O3-42yM3o80qdqWnQzGquPrbq6mb+A@mail.gmail.com
Move the responsibility for creating/destroying TupleQueueReaders into
execParallel.c, to avoid duplicative coding in nodeGather.c and
nodeGatherMerge.c. Also, instead of having DestroyTupleQueueReader do
shm_mq_detach, do it in the caller (which is now only ExecParallelFinish).
This means execParallel.c does both the attaching and detaching of the
tuple-queue-reader shm_mqs, which seems less weird than the previous
arrangement.
These changes also eliminate a vestigial memory leak (of the pei->tqueue
array). It's now demonstrable that rescans of Gather or GatherMerge don't
leak memory.
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us