A new MAX_RATE option allows imposing a limit to the network transfer
rate from the server side. This is useful to limit the stress that
taking a base backup has on the server.
pg_basebackup is now able to specify a value to the server, too.
Author: Antonin Houska
Patch reviewed by Stefan Radomski, Andres Freund, Zoltán Böszörményi,
Fujii Masao, and Álvaro Herrera.
The functions in slotfuncs.c don't exist in any released version,
but the changes to xlogfuncs.c represent backward-incompatibilities.
Per discussion, we're hoping that the queries using these functions
are few enough and simple enough that this won't cause too much
breakage for users.
Michael Paquier, reviewed by Andres Freund and further modified
by me.
Change input function error messages to be more consistent with what is
done elsewhere. Remove a bunch of redundant type casts, so that the
compiler will warn us if we screw up. Don't pass LSNs by value on
platforms where a Datum is only 32 bytes, per buildfarm. Move macros
for packing and unpacking LSNs to pg_lsn.h so that we can include
access/xlogdefs.h, to avoid an unsatisfied dependency on XLogRecPtr.
Several functions, mostly type input functions, calculated an allocation
size such that the calculation wrapped to a small positive value when
arguments implied a sufficiently-large requirement. Writes past the end
of the inadvertent small allocation followed shortly thereafter.
Coverity identified the path_in() vulnerability; code inspection led to
the rest. In passing, add check_stack_depth() to prevent stack overflow
in related functions.
Back-patch to 8.4 (all supported versions). The non-comment hstore
changes touch code that did not exist in 8.4, so that part stops at 9.0.
Noah Misch and Heikki Linnakangas, reviewed by Tom Lane.
Security: CVE-2014-0064
Many server functions use the MAXDATELEN constant to size a buffer for
parsing or displaying a datetime value. It was much too small for the
longest possible interval output and slightly too small for certain
valid timestamp input, particularly input with a long timezone name.
The long input was rejected needlessly; the long output caused
interval_out() to overrun its buffer. ECPG's pgtypes library has a copy
of the vulnerable functions, which bore the same vulnerabilities along
with some of its own. In contrast to the server, certain long inputs
caused stack overflow rather than failing cleanly. Back-patch to 8.4
(all supported versions).
Reported by Daniel Schüssler, reviewed by Tom Lane.
Security: CVE-2014-0063
If the name lookups come to different conclusions due to concurrent
activity, we might perform some parts of the DDL on a different table
than other parts. At least in the case of CREATE INDEX, this can be
used to cause the permissions checks to be performed against a
different table than the index creation, allowing for a privilege
escalation attack.
This changes the calling convention for DefineIndex, CreateTrigger,
transformIndexStmt, transformAlterTableStmt, CheckIndexCompatible
(in 9.2 and newer), and AlterTable (in 9.1 and older). In addition,
CheckRelationOwnership is removed in 9.2 and newer and the calling
convention is changed in older branches. A field has also been added
to the Constraint node (FkConstraint in 8.4). Third-party code calling
these functions or using the Constraint node will require updating.
Report by Andres Freund. Patch by Robert Haas and Andres Freund,
reviewed by Tom Lane.
Security: CVE-2014-0062
The primary role of PL validators is to be called implicitly during
CREATE FUNCTION, but they are also normal functions that a user can call
explicitly. Add a permissions check to each validator to ensure that a
user cannot use explicit validator calls to achieve things he could not
otherwise achieve. Back-patch to 8.4 (all supported versions).
Non-core procedural language extensions ought to make the same two-line
change to their own validators.
Andres Freund, reviewed by Tom Lane and Noah Misch.
Security: CVE-2014-0061
These are needed in HEAD to make assorted contrib modules build on Windows.
Now that all the MSVC and Mingw buildfarm members seem to be on the same
page about the need for them, we can have some confidence that future
problems of this ilk will be detected promptly; there seems nothing more
to be learned by delaying this fix further.
I chose to mark QueryCancelPending as well, since it's easy to imagine code
that wants to touch ProcDiePending also caring about QueryCancelPending.
This is needed on Windows to support contrib/postgres_fdw. Although it's
been broken since last March, we didn't notice until recently because there
were no active buildfarm members that complained about missing PGDLLIMPORT
marking. Efforts are underway to improve that situation, in support of
which we're delaying fixing some other cases of global variables that
should be marked PGDLLIMPORT. However, this case affects 9.3, so we
can't wait any longer to fix it.
I chose to mark DateOrder as well, though it's not strictly necessary
for postgres_fdw.
We used to have externs for getopt() and its API variables scattered
all over the place. Now that we find we're going to need to tweak the
variable declarations for Cygwin, it seems like a good idea to have
just one place to tweak.
In this commit, the variables are declared "#ifndef HAVE_GETOPT_H".
That may or may not work everywhere, but we'll soon find out.
Andres Freund
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
If you have a GIN query like "rare & frequent", we currently fetch all the
items that match either rare or frequent, call the consistent function for
each item, and let the consistent function filter out items that only match
one of the terms. However, if we can deduce that "rare" must be present for
the overall qual to be true, we can scan all the rare items, and for each
rare item, skip over to the next frequent item with the same or greater TID.
That greatly speeds up "rare & frequent" type queries.
To implement that, introduce the concept of a tri-state consistent function,
where the 3rd value is MAYBE, indicating that we don't know if that term is
present. Operator classes only provide a boolean consistent function, so we
simulate the tri-state consistent function by calling the boolean function
several times, with the MAYBE arguments set to all combinations of TRUE and
FALSE. Testing all combinations is only feasible for a small number of MAYBE
arguments, but it is envisioned that we'll provide a way for operator
classes to provide a native tri-state consistent function, which can be much
more efficient. But that is not included in this patch.
We were already using that trick to for lossy pages, calling the consistent
function with the lossy entry set to TRUE and FALSE. Now that we have the
tri-state consistent function, use it for lossy pages too.
Alexander Korotkov, with fair amount of refactoring by me.
The temporary statistics files don't need to be included in the backup
because they are always reset at the beginning of the archive recovery.
This patch changes pg_basebackup so that it skips all files located in
$PGDATA/pg_stat_tmp or the directory specified by stats_temp_directory
parameter.
Replication slots are a crash-safe data structure which can be created
on either a master or a standby to prevent premature removal of
write-ahead log segments needed by a standby, as well as (with
hot_standby_feedback=on) pruning of tuples whose removal would cause
replication conflicts. Slots have some advantages over existing
techniques, as explained in the documentation.
In a few places, we refer to the type of replication slots introduced
by this patch as "physical" slots, because forthcoming patches for
logical decoding will also have slots, but with somewhat different
properties.
Andres Freund and Robert Haas
This doesn't work for prepared queries, but it's not too easy to get
the information in that case and there's some debate as to exactly
what the right thing to measure is, so just do this for now.
Andreas Karlsson, with slight doc changes by me.
When skipping over some items in a posting tree, re-find the new location
by descending the tree from root, rather than walking the right links.
This can save a lot of I/O.
Heavily modified from Alexander Korotkov's fast scan patch.
If we're skipping past a certain TID, avoid decoding posting list segments
that only contain smaller TIDs.
Extracted from Alexander Korotkov's fast scan patch, heavily modified.
This patch adds an option, huge_tlb_pages, which allows requesting the
shared memory segment to be allocated using huge pages, by using the
MAP_HUGETLB flag in mmap(). This can improve performance.
The default is 'try', which means that we will attempt using huge pages,
and fall back to non-huge pages if it doesn't work. Currently, only Linux
has MAP_HUGETLB. On other platforms, the default 'try' behaves the same as
'off'.
In the passing, don't try to round the mmap() size to a multiple of
pagesize. mmap() doesn't require that, and there's no particular reason for
PostgreSQL to do that either. When using MAP_HUGETLB, however, round the
request size up to nearest 2MB boundary. This is to work around a bug in
some Linux kernel versions, but also to avoid wasting memory, because the
kernel will round the size up anyway.
Many people were involved in writing this patch, including Christian Kruse,
Richard Poole, Abhijit Menon-Sen, reviewed by Peter Geoghegan, Andres Freund
and me.
json_build_array() and json_build_object allow for the construction of
arbitrarily complex json trees. json_object() turns a one or two
dimensional array, or two separate arrays, into a json_object of
name/value pairs, similarly to the hstore() function.
json_object_agg() aggregates its two arguments into a single json object
as name value pairs.
Catalog version bumped.
Andrew Dunstan, reviewed by Marko Tiikkaja.
It's worth distinguishing these cases from run-of-the-mill wrong-password
problems, since users have been known to waste lots of time pursuing the
wrong theory about what's failing. Now, our longstanding policy about how
to report authentication failures is that we don't really want to tell the
*client* such things, since that might be giving information to a bad guy.
But there's nothing wrong with reporting the details to the postmaster log,
and indeed the comments in this area of the code contemplate that
interesting details should be so reported. We just weren't handling these
particular interesting cases usefully.
To fix, add infrastructure allowing subroutines of ClientAuthentication()
to return a string to be added to the errdetail_log field of the main
authentication-failed error report. We might later want to use this to
report other subcases of authentication failure the same way, but for the
moment I just dealt with password cases.
Per discussion of a patch from Josh Drake, though this is not what
he proposed.
This makes it possible to store lwlocks as part of some other data
structure in the main shared memory segment, or in a dynamic shared
memory segment. There is still a main LWLock array and this patch does
not move anything out of it, but it provides necessary infrastructure
for doing that in the future.
This change is likely to increase the size of LWLockPadded on some
platforms, especially 32-bit platforms where it was previously only
16 bytes.
Patch by me. Review by Andres Freund and KaiGai Kohei.
Fix integer overflow issue noted by Magnus Hagander, as well as a bunch
of other infelicities in commit ee1e5662d8d8330726eaef7d3110cb7add24d058
and its unreasonably large number of followups.
This allows ending recovery as a consistent state has been reached. Without
this, there was no easy way to e.g restore an online backup, without
replaying any extra WAL after the backup ended.
MauMau and me.
Add the ability to specify the objects to move by who those objects are
owned by (as relowner) and change ALL to mean ALL objects. This
makes the command always operate against a well-defined set of objects
and not have the objects-to-be-moved based on the role of the user
running the command.
Per discussion with Simon and Tom.
Since C99, it's been standard for printf and friends to accept a "z" size
modifier, meaning "whatever size size_t has". Up to now we've generally
dealt with printing size_t values by explicitly casting them to unsigned
long and using the "l" modifier; but this is really the wrong thing on
platforms where pointers are wider than longs (such as Win64). So let's
start using "z" instead. To ensure we can do that on all platforms, teach
src/port/snprintf.c to understand "z", and add a configure test to force
use of that implementation when the platform's version doesn't handle "z".
Having done that, modify a bunch of places that were using the
unsigned-long hack to use "z" instead. This patch doesn't pretend to have
gotten everyplace that could benefit, but it catches many of them. I made
an effort in particular to ensure that all uses of the same error message
text were updated together, so as not to increase the number of
translatable strings.
It's possible that this change will result in format-string warnings from
pre-C99 compilers. We might have to reconsider if there are any popular
compilers that will warn about this; but let's start by seeing what the
buildfarm thinks.
Andres Freund, with a little additional work by me
Some cases were still reporting errors and aborting, instead of a NOTICE
that the object was being skipped. This makes it more difficult to
cleanly handle pg_dump --clean, so change that to instead skip missing
objects properly.
Per bug #7873 reported by Dave Rolsky; apparently this affects a large
number of users.
Authors: Pavel Stehule and Dean Rasheed. Some tweaks by Álvaro Herrera
GIN posting lists are now encoded using varbyte-encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used for both the lists stored in-line in entry
tree items, and in posting tree leaf pages.
To maintain backwards-compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with GIN_COMPRESSED flag, to distinguish old and new
format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.
This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
it might be come handy in the future, if we want to drop support for the
uncompressed format.
Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
Unlike our other array functions, this considers the total number of
elements across all dimensions, and returns 0 rather than NULL when the
array has no elements. But it seems that both of those behaviors are
almost universally disliked, so hopefully that's OK.
Marko Tiikkaja, reviewed by Dean Rasheed and Pavel Stehule
Previously the presence of a nextval() prevented the
use of batch-mode COPY. This patch introduces a
special case just for nextval() functions. In future
we will introduce a general case solution for
labelling volatile functions as safe for use.
krb5 has been deprecated since 8.3, and the recommended way to do
Kerberos authentication is using the GSSAPI authentication method
(which is still fully supported).
libpq retains the ability to identify krb5 authentication, but only
gives an error message about it being unsupported. Since all authentication
is initiated from the backend, there is no need to keep it at all
in the backend.
Tablespaces have a few options which can be set on them to give PG hints
as to how the tablespace behaves (perhaps it's faster for sequential
scans, or better able to handle random access, etc). These options were
only available through the ALTER TABLESPACE command.
This adds the ability to set these options at CREATE TABLESPACE time,
removing the need to do both a CREATE TABLESPACE and ALTER TABLESPACE to
get the correct options set on the tablespace.
Vik Fearing, reviewed by Michael Paquier.
Historically, VACUUM has just reported its new_rel_tuples estimate
(the same thing it puts into pg_class.reltuples) to the stats collector.
That number counts both live and dead-but-not-yet-reclaimable tuples.
This behavior may once have been right, but modern versions of the
pgstats code track live and dead tuple counts separately, so putting
the total into n_live_tuples and zero into n_dead_tuples is surely
pretty bogus. Fix it to report live and dead tuple counts separately.
This doesn't really do much for situations where updating transactions
commit concurrently with a VACUUM scan (possibly causing double-counting or
omission of the tuples they add or delete); but it's clearly an improvement
over what we were doing before.
Hari Babu, reviewed by Amit Kapila
This adds a 'MOVE' sub-command to ALTER TABLESPACE which allows moving sets of
objects from one tablespace to another. This can be extremely handy and avoids
a lot of error-prone scripting. ALTER TABLESPACE ... MOVE will only move
objects the user owns, will notify the user if no objects were found, and can
be used to move ALL objects or specific types of objects (TABLES, INDEXES, or
MATERIALIZED VIEWS).
These changes should generally improve correctness/maintainability.
A nice side benefit is that several kilobytes move from initialized
data to text segment, allowing them to be shared across processes and
probably reducing copy-on-write overhead while forking a new backend.
Unfortunately this doesn't seem to help libpq in the same way (at least
not when it's compiled with -fpic on x86_64), but we can hope the linker
at least collects all nominally-const data together even if it's not
actually part of the text segment.
Also, make pg_encname_tbl[] static in encnames.c, since there seems
no very good reason for any other code to use it; per a suggestion
from Wim Lewis, who independently submitted a patch that was mostly
a subset of this one.
Oskari Saarenmaa, with some editorialization by me