Our documentation states that our maximum field size is 1 GB, and that
our maximum row size of 1.6 TB. However, while this might be attainable
in theory with enough contortions, it is not workable in practice; for
starters, pg_dump fails to dump tables containing rows larger than 1 GB,
even if individual columns are well below the limit; and even if one
does manage to manufacture a dump file containing a row that large, the
server refuses to load it anyway.
This commit enables dumping and reloading of such tuples, provided two
conditions are met:
1. no single column is larger than 1 GB (in output size -- for bytea
this includes the formatting overhead)
2. the whole row is not larger than 2 GB
There are three related changes to enable this:
a. StringInfo's API now has two additional functions that allow creating
a string that grows beyond the typical 1GB limit (and "long" string).
ABI compatibility is maintained. We still limit these strings to 2 GB,
though, for reasons explained below.
b. COPY now uses long StringInfos, so that pg_dump doesn't choke
trying to emit rows longer than 1GB.
c. heap_form_tuple now uses the MCXT_ALLOW_HUGE flag in its allocation
for the input tuple, which means that large tuples are accepted on
input. Note that at this point we do not apply any further limit to the
input tuple size.
The main reason to limit to 2 GB is that the FE/BE protocol uses 32 bit
length words to describe each row; and because the documentation is
ambiguous on its signedness and libpq does consider it signed, we cannot
use the highest-order bit. Additionally, the StringInfo API uses "int"
(which is 4 bytes wide in most platforms) in many places, so we'd need
to change that API too in order to improve, which has lots of fallout.
Backpatch to 9.5, which is the oldest that has
MemoryContextAllocExtended, a necessary piece of infrastructure. We
could apply to 9.4 with very minimal additional effort, but any further
than that would require backpatching "huge" allocations too.
This is the largest set of changes we could find that can be
back-patched without breaking compatibility with existing systems.
Fixing a bigger set of problems (for example, dumping tuples bigger than
2GB, or dumping fields bigger than 1GB) would require changing the FE/BE
protocol and/or changing the StringInfo API in an ABI-incompatible way,
neither of which would be back-patchable.
Authors: Daniel Vérité, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql
The whole walreceiver API is now wrapped into a struct, like most of our
other loadable module APIs. The libpq connection is no longer a global
variable in libpqwalreceiver. Instead, it is encapsulated into a struct
that is passed around the functions. This allows multiple walreceivers
to run at the same time.
Add some rudimentary support for logical replication connections to
libpqwalreceiver.
These changes are mostly cosmetic and are going to be useful for the
future logical replication patches.
From: Petr Jelinek <petr@2ndquadrant.com>
Replace use of poll()/select() by WaitLatchOrSocket(), which is more
portable and flexible.
Also change walreceiver to use its procLatch instead of a custom latch.
From: Petr Jelinek <petr@2ndquadrant.com>
So far the hashtable stored representative tuples in the form of its
input slot, with all columns in the hashtable that are not
needed (i.e. not grouped upon or functionally dependent) set to NULL.
Thats good for saving memory, but it turns out that having tuples full
of NULL isn't free. slot_deform_tuple is faster if there's no NULL
bitmap even if no NULLs are encountered, and skipping over leading NULLs
isn't free.
So compute a separate tuple descriptor that only contains the needed
columns. As columns have already been moved in/out the slot for the
hashtable that does not imply additional per-row overhead.
Author: Andres Freund
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/20161103110721.h5i5t5saxfk5eeik@alap3.anarazel.de
Previously we did a ExecProject() for each individual aggregate
argument. That turned out to be a performance bottleneck in queries with
multiple aggregates.
Doing all the argument computations in one ExecProject() is quite a bit
cheaper because ExecProject's fastpath can do the work at once in a
relatively tight loop, and because it can get all the required columns
with a single slot_getsomeattr and save some other redundant setup
costs.
Author: Andres Freund
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/20161103110721.h5i5t5saxfk5eeik@alap3.anarazel.de
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency. Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.
In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel. There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.
This patch also removes the unworldly assumption that a split will
never be interrupted. With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion. While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.
Amit Kapila. I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's. Also
reviewed by Jesper Pedersen, Jeff Janes, and others.
Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
Instead of confusingly stating platform-dependent defaults for these
parameters in the comments in postgresql.conf.sample (with the main
entry being a lie on Linux), teach initdb to install the correct
platform-dependent value in postgresql.conf, similarly to the way
we handle other platform-dependent defaults. This won't do anything
for existing 9.6 installations, but since it's effectively only a
documentation improvement, that seems OK.
Since this requires initdb to have access to the default values,
move the #define's for those to pg_config_manual.h; the original
placement in bufmgr.h is unworkable because that file can't be
included by frontend programs.
Adjust the default value for wal_writer_flush_after so that it is 1MB
regardless of XLOG_BLCKSZ, conforming to what is stated in both the
SGML docs and postgresql.conf. (We could alternatively make it scale
with XLOG_BLCKSZ, but I'm not sure I see the point.)
Copy-edit related SGML documentation.
Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra.
Discussion: <30ebc6e3-8358-09cf-44a8-578252938424@2ndquadrant.com>
A new thing also called a "barrier" is proposed, but whether we decide
to take that patch or not, this file seems to have outlived its
usefulness.
Thomas Munro
Previously, the right-hand side of a multiple-column assignment, if it
wasn't a sub-SELECT, had to be a simple parenthesized expression list,
because gram.y was responsible for "bursting" the construct into
independent column assignments. This had the minor defect that you
couldn't write ROW (though you should be able to, since the standard says
this is a row constructor), and the rather larger defect that unlike other
uses of row constructors, we would not expand a "foo.*" item into multiple
columns.
Fix that by changing the RHS to be just "a_expr" in the grammar, leaving
it to transformMultiAssignRef to separate the elements of a RowExpr;
which it will do only after performing standard transformation of the
RowExpr, so that "foo.*" behaves as expected.
The key reason we didn't do that before was the hard-wired handling of
DEFAULT tokens (SetToDefault nodes). This patch deals with that issue by
allowing DEFAULT in any a_expr and having parse analysis throw an error
if SetToDefault is found in an unexpected place. That's an improvement
anyway since the error can be more specific than just "syntax error".
The SQL standard suggests that the RHS could be any a_expr yielding a
suitable row value. This patch doesn't really move the goal posts in that
respect --- you're still limited to RowExpr or a sub-SELECT --- but it does
fix the grammar restriction, so it provides some tangible progress towards
a full implementation. And the limitation is now documented by an explicit
error message rather than an unhelpful "syntax error".
Discussion: <8542.1479742008@sss.pgh.pa.us>
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true. In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time. However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set. Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.
Robert Haas and Thomas Munro
Like pg_tables, pg_views, and others, this view contains information
about sequences in a way that is independent of the system catalog
layout but more comprehensive than the information schema.
To help implement the view, add a new internal function
pg_sequence_last_value() to return the last value of a sequence. This
is kept separate from pg_sequence_parameters() to separate querying
run-time state from catalog-like information.
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Previously, the handle for the control segment could not be zero, but
some other DSM segment could potentially have a handle value of zero.
However, that means that if someone wanted to store a dsm_handle that
might or might not be valid, they would need a separate boolean to
keep track of whether the associated value is legal. That's annoying,
so change things so that no DSM segment can ever have a handle of 0 -
or as we call it here, DSM_HANDLE_INVALID.
Thomas Munro. This was submitted as part of a much larger patch to
add an malloc-like allocator for dynamic shared memory, but this part
seems like a good idea independently of the rest of the patch.
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced. In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures. This effect appears to be the explanation for a recent
failure on buildfarm member piculet.
To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.
In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before. To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input. (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)
In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c. And improve some obsolete comments.
No regression test because I don't know a deterministic way to cause this
failure. But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.
Back-patch to 9.4 where MVCC catalog scans were introduced. (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4. But MVCC catalog scans were supposed to fix that.)
Discussion: <16447.1478818294@sss.pgh.pa.us>
Be sure to pull up the subquery's hasRowSecurity flag when flattening a
subquery in pull_up_simple_subquery(). This isn't a bug today because
we don't look at the hasRowSecurity flag during planning, but it could
easily be a bug tomorrow.
Likewise, make rewriteRuleAction() pull up the hasRowSecurity flag when
absorbing RTEs from a rule action. This isn't a bug either, for the
opposite reason: the flag should never be set yet. But again, it seems
like good future proofing.
Add a comment explaining why rewriteTargetView() should *not* set
hasRowSecurity when adding stuff to securityQuals.
Improve some nearby comments about securityQuals processing, and document
that field more completely in parsenodes.h.
Patch by me, analysis by Dean Rasheed.
Discussion: <CAEZATCXZ8tb2DV6f=bkhsMV6u_gRcZ0CZBw2J-qU84RxSukZog@mail.gmail.com>
The reloptions stuff allows this option to be set on a matview.
While it's questionable whether that is useful or was really intended,
it does work, and we shouldn't change that in minor releases. Commit
e3e66d8a9 disabled the option since I didn't realize that it was
possible for it to be set on a matview. Tweak the test to re-allow it.
Discussion: <19749.1478711862@sss.pgh.pa.us>
The idea behind SPI_push was to allow transitioning back into an
"unconnected" state when a SPI-using procedure calls unrelated code that
might or might not invoke SPI. That sounds good, but in practice the only
thing it does for us is to catch cases where a called SPI-using function
forgets to call SPI_connect --- which is a highly improbable failure mode,
since it would be exposed immediately by direct testing of said function.
As against that, we've had multiple bugs induced by forgetting to call
SPI_push/SPI_pop around code that might invoke SPI-using functions; these
are much harder to catch and indeed have gone undetected for years in some
cases. And we've had to band-aid around some problems of this ilk by
introducing conditional push/pop pairs in some places, which really kind
of defeats the purpose altogether; if we can't draw bright lines between
connected and unconnected code, what's the point?
Hence, get rid of SPI_push[_conditional], SPI_pop[_conditional], and the
underlying state variable _SPI_curid. It turns out SPI_restore_connection
can go away too, which is a nice side benefit since it was never more than
a kluge. Provide no-op macros for the deleted functions so as to avoid an
API break for external modules.
A side effect of this removal is that SPI_palloc and allied functions no
longer permit being called when unconnected; they'll throw an error
instead. The apparent usefulness of the previous behavior was a mirage
as well, because it was depended on by only a few places (which I fixed in
preceding commits), and it posed a risk of allocations being unexpectedly
long-lived if someone forgot a SPI_push call.
Discussion: <20808.1478481403@sss.pgh.pa.us>
Invent a new function heap_modify_tuple_by_cols() that is functionally
equivalent to SPI_modifytuple except that it always allocates its result
by simple palloc. I chose however to make the API details a bit more
like heap_modify_tuple: pass a tupdesc rather than a Relation, and use
bool convention for the isnull array.
Use this function in place of SPI_modifytuple at all call sites where the
intended behavior is to allocate in current context. (There actually are
only two call sites left that depend on the old behavior, which makes me
wonder if we should just drop this function rather than keep it.)
This new function is easier to use than heap_modify_tuple() for purposes
of replacing a single column (or, really, any fixed number of columns).
There are a number of places where it would simplify the code to change
over, but I resisted that temptation for the moment ... everywhere except
in plpgsql's exec_assign_value(); changing that might offer some small
performance benefit, so I did it.
This is on the way to removing SPI_push/SPI_pop, but it seems like
good code cleanup in its own right.
Discussion: <9633.1478552022@sss.pgh.pa.us>
We really ought to make StdRdOptions and the other decoded forms of
reloptions self-identifying, but for the moment, assume that only plain
relations could possibly be user_catalog_tables. Fixes problem with bogus
"ON CONFLICT is not supported on table ... used as a catalog table" error
when target is a view with cascade option.
Discussion: <26681.1477940227@sss.pgh.pa.us>
This reverts commit c8ead2a3974d3eada145a0e18940150039493cc9.
Seems there is no way to do this that doesn't cause MSVC to give
warnings, so let's just go back to the way we've been doing it.
Discussion: <11843.1478358206@sss.pgh.pa.us>
Second try at the change originally made in commit 8518583cd;
this time with contrib updates so that manual extern declarations
are also marked with PGDLLEXPORT. The release notes should point
this out as a significant source-code change for extension authors,
since they'll have to make similar additions to avoid trouble on Windows.
Laurenz Albe, doc change by me
Patch: <A737B7A37273E048B164557ADEF4A58B53962ED8@ntex2010a.host.magwien.gv.at>
This is infrastructure for the complete SQL standard feature. No
support is included at this point for execution nodes or PLs. The
intent is to add that soon.
As this patch leaves things, standard syntax can create tuplestores
to contain old and/or new versions of rows affected by a statement.
References to these tuplestores are in the TriggerData structure.
C triggers can access the tuplestores directly, so they are usable,
but they cannot yet be referenced within a SQL statement.
LockBufferForCleanup() acquires a cleanup lock unconditionally, and
ConditionalLockBufferForCleanup() acquires a cleanup lock if it is
possible to do so without waiting; this patch adds a new API,
IsBufferCleanupOK(), which tests whether an exclusive lock already
held happens to be a cleanup lock. This is possible because a cleanup
lock simply means an exclusive lock plus the assurance any other pins
on the buffer are newer than our own pin. Therefore, just as the
existing functions decide that the exclusive lock that they've just
taken is a cleanup lock if they observe the pin count to be 1, this
new function allows us to observe that the pin count is 1 on a buffer
we've already locked.
This is useful in situations where a backend definitely wishes to
modify the buffer and also wishes to perform cleanup operations if
possible. The patch to eliminate heavyweight locking by hash indexes
uses this, and it may have other applications as well.
Amit Kapila, per a suggestion from me. Some comment adjustments by me
as well.
Commit 2bd9e412f92bc6a68f3e8bcb18e04955cc35001d added these in error.
They were part of an earlier design for that patch and survived in the
committed version only by inadvertency.
Julien Rouhaud
This will write the received transaction log into a file called
pg_wal.tar(.gz) next to the other tarfiles instead of writing it to
base.tar. When using fetch mode, the transaction log is still written to
base.tar like before, and when used against a pre-10 server, the file
is named pg_xlog.tar.
To do this, implement a new concept of a "walmethod", which is
responsible for writing the WAL. Two implementations exist, one that
writes to a plain directory (which is also used by pg_receivexlog) and
one that writes to a tar file with optional compression.
Reviewed by Michael Paquier
"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believe that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences. So, rename the directory to "pg_wal".
This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.
We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal". However, that's still under discussion, so let's do just this
much for now.
Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com
Michael Paquier
This reverts commit 9e083fd4683294f41544e6d0d72f6e258ff3a77c. That was a
few bricks shy of a load:
* Query cancel stopped working
* Buildfarm member pademelon stopped working, because the box doesn't have
/dev/urandom nor /dev/random.
This clearly needs some more discussion, and a quite different patch, so
revert for now.
This adds a new routine, pg_strong_random() for generating random bytes,
for use in both frontend and backend. At the moment, it's only used in
the backend, but the upcoming SCRAM authentication patches need strong
random numbers in libpq as well.
pg_strong_random() is based on, and replaces, the existing implementation
in pgcrypto. It can acquire strong random numbers from a number of sources,
depending on what's available:
- OpenSSL RAND_bytes(), if built with OpenSSL
- On Windows, the native cryptographic functions are used
- /dev/urandom
- /dev/random
Original patch by Magnus Hagander, with further work by Michael Paquier
and me.
Discussion: <CAB7nPqRy3krN8quR9XujMVVHYtXJ0_60nqgVc6oUk8ygyVkZsA@mail.gmail.com>
The more efficient hashtable speeds up hash-aggregations with more than
a few hundred groups significantly. Improvements of over 120% have been
measured.
Due to the the different hash table queries that not fully
determined (e.g. GROUP BY without ORDER BY) may change their result
order.
The conversion is largely straight-forward, except that, due to the
static element types of simplehash.h type hashes, the additional data
some users store in elements (e.g. the per-group working data for hash
aggregaters) is now stored in TupleHashEntryData->additional. The
meaning of BuildTupleHashTable's entrysize (renamed to additionalsize)
has been changed to only be about the additionally stored size. That
size is only used for the initial sizing of the hash-table.
Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
dynahash.c hash tables aren't quite fast enough for some
use-cases. There are several reasons for lacking performance:
- the use of chaining for collision handling makes them cache
inefficient, that's especially an issue when the tables get bigger.
- as the element sizes for dynahash are only determined at runtime,
offset computations are somewhat expensive
- hash and element comparisons are indirect function calls, causing
unnecessary pipeline stalls
- it's two level structure has some benefits (somewhat natural
partitioning), but increases the number of indirections
to fix several of these the hash tables have to be adjusted to the
individual use-case at compile-time. C unfortunately doesn't provide a
good way to do compile code generation (like e.g. c++'s templates for
all their weaknesses do). Thus the somewhat ugly approach taken here is
to allow for code generation using a macro-templatized header file,
which generates functions and types based on a prefix and other
parameters.
Later patches use this infrastructure to use such hash tables for
tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation,
...). In queries where these use up a large fraction of the time, this
has been measured to lead to performance improvements of over 100%.
There are other cases where this could be useful (e.g. catcache.c).
The hash table design chosen is a variant of linear open-addressing. The
biggest disadvantage of simple linear addressing schemes are highly
variable lookup times due to clustering, and deletions leaving a lot of
tombstones around. To address these issues a variant of "robin hood"
hashing is employed. Robin hood hashing optimizes chaining lengths by
moving elements close to their optimal bucket ("rich" elements), out of
the way if a to-be-inserted element is further away from its optimal
position (i.e. it's "poor"). While that can make insertions slower, the
average lookup performance is a lot better, and higher fill factors can
be used in a still performant manner. To avoid tombstones - which
normally solve the issue that a deleted node's presence is relevant to
determine whether a lookup needs to continue looking or is done -
buckets following a deleted element are shifted backwards, unless
they're empty or already at their optimal position.
There's further possible improvements that can be made to this
implementation. Amongst others:
- Use distance as a termination criteria during searches. This is
generally a good idea, but I've been able to see the overhead of
distance calculations in some cases.
- Consider combining the 'empty' status into the hashvalue, and enforce
storing the hashvalue. That could, in some cases, increase memory
density and remove a few instructions.
- Experiment further with the, very conservatively choosen, fillfactor.
- Make maximum size of hashtable configurable, to allow storing very
very large tables. That'd require 64bit hash values to be more common
than now, though.
- some smaller memcpy calls could be optimized to copy larger chunks
But since the new implementation is already considerably faster than
dynahash it seem sensible to start using it.
Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
These are useful for very hot code paths. Because it's easy to guess
wrongly about likelihood, and because such likelihoods change over time,
they should be used sparingly.
Past tests have shown it'd be a good idea to use them in some places,
e.g. in error checks around ereports that ERROR out, but that's work for
later.
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
This turns out not to be as harmless as I thought: MSVC will complain
if it sees an "extern" declaration without PGDLLEXPORT and then one with.
(Seems fairly silly, given that this can be changed after the fact by the
linker, but there you have it.) Therefore, contrib modules that have
extern's for V1 functions in header files are falling over in the
buildfarm, since none of those externs are marked PGDLLEXPORT.
We might or might not conclude that we're willing to plaster those
declarations with PGDLLEXPORT in HEAD, but in any case there's no way we're
going to ship this change in the back branches. Third-party authors would
not thank us for breaking their code in a minor release. Hence, revert
the addition of PGDLLEXPORT (but let's keep the extra info in the comment).
If we do the other changes we can revert this commit in HEAD.
Per buildfarm.
These functions were originally added in commit d8cedf67a to support
use of int2vector columns as catcache lookup keys. However, there are
no catcaches that use such columns. (Indeed I now think it must always
have been dead code: a catcache with such a key column would need an
underlying unique index on the column, but we've never had an int2vector
btree opclass.)
Getting rid of the int2vector-specific operator and function does not
lose any functionality, because operations on int2vectors will now fall
back to the generic anyarray support. This avoids a wart that a btree
index on an int2vector column (made using anyarray_ops) would fail to
match equality searches, because int2vectoreq wasn't a member of the
opclass. We don't really care much about that, since int2vector is not
meant as a type for users to use, but it's silly to have extra code and
less functionality.
If we ever do want a catcache to be indexed by an int2vector column,
we'd need to put back full btree and hash opclasses for int2vector,
comparable to the support for oidvector. (The anyarray code can't be
used at such a low level, because it needs to do catcache lookups.)
But we'll deal with that if/when the need arises.
Also worth noting is that removal of the hash int2vector_ops opclass will
break any user-created hash indexes on int2vector columns. While hash
anyarray_ops would serve the same purpose, it would probably not compute
the same hash values and thus wouldn't be on-disk-compatible. Given that
int2vector isn't a user-facing type and we're planning other incompatible
changes in hash indexes for v10 anyway, this doesn't seem like something
to worry about, but it's probably worth mentioning here.
Amit Langote
Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
This isn't really necessary for our own code, because we use a .DEF file
in MSVC builds (see gendef.pl), or --export-all-symbols in MinGW and
Cygwin builds, to ensure that all global symbols in loadable modules
will be exported on Windows. However, third-party authors might use
different build processes that need this marker, and it's harmless
enough for our own builds.
To some extent, this is an oversight in commit e7128e8db, so back-patch
to 9.4 where that was added.
Laurenz Albe
Discussion: <A737B7A37273E048B164557ADEF4A58B539300BD@ntex2010a.host.magwien.gv.at>
Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
setting it earlier with the separate LogicTapeAssignReadBufferSize call.
This way, the buffer size is set closer to where it's actually used, which
makes the code easier to understand.
This makes the calculation for how much memory to use for the buffers less
precise. We now use the same amount of memory for every tape, rounded down
to the nearest BLCKSZ boundary, instead of using one more block for some
tapes, to get the total up to exact amount of memory available. That should
be OK, merging isn't too sensitive to the exact amount of memory used.
Reviewed by Peter Geoghegan
Discussion: <0f607c4b-df23-353e-bf56-c0389d28495f@iki.fi>
While this isn't a lot of code, it's been essentially untestable for
a very long time, because libpq doesn't support anything older than
protocol 2.0, and has not since release 6.3. There's no reason to
believe any other client-side code still uses that protocol, either.
Discussion: <2661.1475849167@sss.pgh.pa.us>
SCO OpenServer and SCO UnixWare are more or less dead platforms.
We have never had a buildfarm member testing the "sco" port, and
the last "unixware" member was last heard from in 2012, so it's
fair to doubt that the code even compiles anymore on either one.
Remove both ports. We can always undo this if someone shows up
with an interest in maintaining and testing these platforms.
Discussion: <17177.1476136994@sss.pgh.pa.us>
I somehow had assumed that in the spinlock (in turn possibly using
semaphores) based fallback atomics implementation 32 bit writes could be
done without a lock. As far as the write goes that's correct, since
postgres supports only platforms with single-copy atomicity for aligned
32bit writes. But writing without holding the spinlock breaks
read-modify-write operations like pg_atomic_compare_exchange_u32(),
since they'll potentially "miss" a concurrent write, which can't happen
in actual hardware implementations.
In 9.6+ when using the fallback atomics implementation this could lead
to buffer header locks not being properly marked as released, and
potentially some related state corruption. I don't see a related danger
in 9.5 (earliest release with the API), because pg_atomic_write_u32()
wasn't used in a concurrent manner there.
The state variable of local buffers, before this change, were
manipulated using pg_atomic_write_u32(), to avoid unnecessary
synchronization overhead. As that'd not be the case anymore, introduce
and use pg_atomic_unlocked_write_u32(), which does not correctly
interact with RMW operations.
This bug only caused issues when postgres is compiled on platforms
without atomics support (i.e. no common new platform), or when compiled
with --disable-atomics, which explains why this wasn't noticed in
testing.
Reported-By: Tom Lane
Discussion: <14947.1475690465@sss.pgh.pa.us>
Backpatch: 9.5-, where the atomic operations API was introduced.
Windows apparently has a constant named WAIT_TIMEOUT, and some of these
other names are pretty generic, too. Insert "PG_" at the front of each
name in order to disambiguate.
Michael Paquier
Buildfarm member mylodon doesn't like them. Actually, I don't like
them either, but I failed to notice these before pushing commit
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa.
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an
additional wait_event_info parameter; legal values are defined in
pgstat.h. This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.
Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.
Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity. We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.
Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.
Commit 88e982302 invented GUC_UNIT_XSEGS for min_wal_size and max_wal_size,
but neglected to make it display sensibly in pg_settings.unit (by adding a
case to the switch in GetConfigOptionByNum). Fix that, and adjust said
switch to throw a run-time error the next time somebody forgets.
In passing, avoid using a static buffer for the output string --- the rest
of this function pstrdup's from a local buffer, and I see no very good
reason why the units code should do it differently and less safely.
Per report from Otar Shavadze. Back-patch to 9.5 where the new unit type
was added.
Report: <CAG-jOyA=iNFhN+yB4vfvqh688B7Tr5SArbYcFUAjZi=0Exp-Lg@mail.gmail.com>