This includes two new kinds of postmaster child processes: walsenders and
the walreceiver. The walreceiver is responsible for connecting to the primary
server and streaming the WAL it receives to disk, while a walsender runs in
the primary server and streams WAL from disk to the client.
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceiver.
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
Fujii Masao, with additional hacking by me
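For monitoring, a client can simply poll the two new functions with SQL;
here is a minimal libpq sketch (the connection parameters and surrounding
program are illustrative only, not part of the patch):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* connection string is a placeholder */
        PGconn     *conn = PQconnectdb("host=standby dbname=postgres");
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        res = PQexec(conn,
                     "SELECT pg_last_xlog_receive_location(), "
                     "pg_last_xlog_replay_location()");
        if (res && PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("received up to %s, replayed up to %s\n",
                   PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));
        else
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

        if (res)
            PQclear(res);
        PQfinish(conn);
        return 0;
    }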
This silences some warnings on Win64. Not using the proper SOCKET datatype
was actually wrong on Win32 as well, but didn't cause any warnings there.
Also add a PGINVALID_SOCKET define to indicate an invalid/nonexistent
socket, instead of using a hardcoded -1 value.
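As a sketch of the idea (these exact definitions are illustrative, not
copied from the patch):

    #ifdef WIN32
    #include <winsock2.h>
    typedef SOCKET pgsocket;        /* native socket type on Win32/Win64 */
    #define PGINVALID_SOCKET INVALID_SOCKET
    #else
    typedef int pgsocket;           /* plain file descriptor on Unix */
    #define PGINVALID_SOCKET (-1)
    #endif

    /* callers compare against the sentinel rather than a bare -1 */
    static int
    socket_is_valid(pgsocket sock)
    {
        return sock != PGINVALID_SOCKET;
    }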
decisions about when to auto-analyze.
The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.
To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table. This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation. Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector. There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.
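In rough terms, the new trigger test looks like this (names are paraphrased
for illustration; the real logic lives in autovacuum.c and pgstat.c):

    #include <stdbool.h>

    /* paraphrased stats entry, showing only the field relevant here */
    typedef struct StatTabEntrySketch
    {
        long    changes_since_analyze;  /* committed ins+upd+del since
                                         * last ANALYZE */
    } StatTabEntrySketch;

    static bool
    needs_autoanalyze(const StatTabEntrySketch *tabentry,
                      double reltuples,         /* pg_class.reltuples */
                      int anl_base_thresh,      /* analyze_threshold */
                      double anl_scale_factor)  /* analyze_scale_factor */
    {
        double  anlthresh = anl_base_thresh + anl_scale_factor * reltuples;

        /*
         * The old left-hand side was n_live_tuples + n_dead_tuples -
         * last_anl_tuples, all three of which could be bad estimates from
         * ANALYZE itself.  The new counter is tracked exactly, so the same
         * threshold is far more trustworthy.
         */
        return tabentry->changes_since_analyze > anlthresh;
    }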
Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.
The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
The temporary hash tables made by pgstat_collect_oids should be allocated
in a short-term memory context, which is not the default behavior of
hash_create. Noted while looking through hash_create calls in connection
with Robert Haas' recent complaint.
This is a pre-existing bug, but it doesn't seem important enough to
back-patch. The hash table is not so large that it would matter unless this
happened many times within a session, which seems quite unlikely.
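The fix follows the usual dynahash recipe for putting a table into a
caller-chosen context; roughly like this (the setup shown is condensed, but
the flag and field names are the standard hash_create ones):

    /* fragment: assumes utils/hsearch.h et al are included */
    HASHCTL     hash_ctl;
    HTAB       *htab;

    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
    hash_ctl.keysize = sizeof(Oid);
    hash_ctl.entrysize = sizeof(Oid);
    hash_ctl.hash = oid_hash;
    hash_ctl.hcxt = CurrentMemoryContext;  /* caller's short-term context */

    htab = hash_create("Temporary table of OIDs",
                       64,      /* initial size guess */
                       &hash_ctl,
                       HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);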
Enabled by recovery_connections = on (default) and forcing archive recovery
using a recovery.conf. Recovery processing now emulates the original
transactions as they are replayed, providing full locking and MVCC behaviour
for read-only queries. Recovery must enter a consistent state before
connections are allowed, so there is a delay, typically short, before
connections succeed. Replay of recovering transactions can conflict with,
and in some cases deadlock with, queries during recovery; these result in
query cancellation after max_standby_delay seconds have expired.
Infrastructure changes have minor effects on normal running, though they
introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command
behaviour on a standby server while in recovery. Typical and extreme dynamic
behaviours have been checked via code inspection and manual testing. Few
port-specific behaviours have been used, though primary testing has so far
been on Linux only.
This commit is the basic patch. Additional changes will follow in this
release to enhance some aspects of behaviour, notably improved handling of
conflicts, deadlock detection, and query cancellation. Changes to VACUUM
FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas,
including a streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure,
Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas,
Tatsuo Ishii, and Hiroyuki Yamada, plus support and feedback from many other
community members.
output filename if CSV logging was enabled and only one of the two possible
output files got rotated during a particular call (which would, in fact,
typically be the case during a size-based rotation). This would amount to
about MAXPGPATH (1KB) per rotation, and it's been there since the CSV
code was put in, so it's surprising that nobody noticed it before.
Per bug #5196 from Thomas Poindessous.
adopted for EXPLAIN. This will allow additional options to be implemented
in future without having to make them fully-reserved keywords. The old syntax
remains available for existing options, however.
Itagaki Takahiro
build actually attempts to advertise itself via Bonjour. Formerly it always
did so, which meant that packagers had to decide for their users whether
this behavior was wanted or not. The default is "off" to be on the safe
side, though this represents a change in the default behavior of a
Bonjour-enabled build. Per discussion.
with the not-so-deprecated DNSServiceRegister. This patch shouldn't change
any user-visible behavior; it just gets rid of a deprecation warning in
--with-bonjour builds. The new code will fail on OS X releases before 10.3,
but it seems unlikely that anyone will want to run Postgres 8.5 on 10.2.
Formerly, these message types would be discarded unless there was already
a stats hash table entry for the target table. However, the intent of
saving hash table space for unused tables was subverted by the fact that
the physical I/O done by the vacuum or analyze would result in an immediately
following tabstat message, which would create the hash table entry anyway.
All that we had left was surprising loss of statistical data, as in a recent
complaint from Jaime Casanova.
It seems unlikely that a real database would have many tables that go totally
untouched over the long haul, so the consensus is that this "optimization"
serves little purpose anyhow. Remove it, and just create the hash table
entry on demand in all cases.
via the "flat files" facility. This requires making it enough like a backend
to be able to run transactions; it's no longer an "auxiliary process" but
more like the autovacuum worker processes. Also, its signal handling has
to be brought into line with backends/workers. In particular, since it
now has to handle procsignal.c processing, the special autovac-launcher-only
signal conditions are moved to SIGUSR2.
Alvaro, with some cleanup from Tom
(That flat file is now completely useless, but removal will come later.)
To do this, postpone client authentication into the startup transaction
that's run by InitPostgres. We still collect the startup packet and do
SSL initialization (if needed) at the same time we did before. The
AuthenticationTimeout is applied separately to startup packet collection
and the actual authentication cycle. (This is a bit annoying, since it
means a couple extra syscalls; but the signal handling requirements inside
and outside a transaction are sufficiently different that it seems best
to treat the timeouts as completely independent.)
A small security disadvantage is that if the given database name is invalid,
this will be reported to the client before any authentication happens.
We could work around that by connecting to database "postgres" instead,
but consensus seems to be that it's not worth introducing such surprising
behavior.
Processing of all command-line switches and GUC options received from the
client is now postponed until after authentication. This means that
PostAuthDelay is much less useful than it used to be --- if you need to
investigate problems during InitPostgres you'll have to set PreAuthDelay
instead. However, allowing an unauthenticated user to set any GUC options
whatever seems a bit too risky, so we'll live with that.
PostgresMain switch. In point of fact, FrontendProtocol is already set
in a backend process, since ProcessStartupPacket() is executed inside
the backend --- it hasn't been run by the postmaster for many years.
And if it were, we'd still certainly want FrontendProtocol to be set before
we get as far as PostgresMain, so that startup errors get reported in the
right protocol.
-v might have some future use in standalone backends, so I didn't go so
far as to remove the switch outright.
Also, initialize FrontendProtocol to 0 not PG_PROTOCOL_LATEST. The only
likely result of presetting it like that is to mask failure-to-set-it
mistakes.
In the original coding, setting a single reloption would cause default
values to be used for all the other reloptions. This is a problem
particularly for autovacuum reloptions.
Itagaki Takahiro
Instead of sending stdout/stderr to /dev/null after forking away from the
terminal, send them to postmaster.log within the data directory. Since
this opens the door to indefinite logfile bloat, recommend even more
strongly that log output be redirected when using silent_mode.
Move the postmaster's initial calls of load_hba() and load_ident() down
to after we have started the log collector, if we are going to. This
is so that errors reported by them will appear in the "usual" place.
Reclassify silent_mode as a LOGGING_WHERE, not LOGGING_WHEN, parameter,
since it's got absolutely nothing to do with the latter category.
In passing, fix some obsolete references to -S ... this option hasn't
had that switch letter for a long time.
Back-patch to 8.4, since as of 8.4 load_hba() and load_ident() are more
picky (and thus more likely to fail) than they used to be. This entire
change was driven by a complaint about those errors disappearing into
the bit bucket.
This causes problems when the system load is high, per report from Zdenek
Kotala in <1250860954.1239.114.camel@localhost>; instead of calling kill
directly, have the signal handler set a flag which is checked in ServerLoop.
This way, the handler can return before being called again by a subsequent
signal sent from the autovacuum launcher. Also, increase the sleep in the
launcher in this failure path to 1 second.
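The pattern is the usual one of deferring work out of the signal handler;
a condensed sketch with paraphrased names (the real code is in
postmaster.c):

    #include <signal.h>
    #include <stdbool.h>

    static volatile sig_atomic_t start_autovac_worker = false;

    /* handler: only record the request; never call kill() from here */
    static void
    autovac_request_handler(int postgres_signal_arg)
    {
        start_autovac_worker = true;
    }

    /* checked once per iteration of the postmaster's ServerLoop */
    static void
    check_autovac_request(void)
    {
        if (start_autovac_worker)
        {
            start_autovac_worker = false;
            /* fork an autovacuum worker here */
        }
    }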
Backpatch to 8.3, which is when the signalling between autovacuum
launcher/postmaster was introduced.
Also, add a couple of ReleasePostmasterChildSlot calls in error paths; this
part backpatched to 8.4, which is when the child slot stuff was introduced.
To make this work in the base case, pg_database now has a nailed-in-cache
relation descriptor that is initialized using hardwired knowledge in
relcache.c. This means pg_database is added to the set of relations that
need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this
path is taken, we'll have to do a seqscan of pg_database to find the row
we need.
In the normal case, we are able to do an indexscan to find the database's row
by name. This is made possible by storing a global relcache init file that
describes only the shared catalogs and their indexes (and therefore is usable
by all backends in any database). A new backend loads this cache file,
finds its database OID after an indexscan on pg_database, and then loads
the local relcache init file for that database.
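In outline, backend startup now proceeds roughly as follows (the function
names here are paraphrased, not the actual relcache.c/postinit.c entry
points):

    /* 1. read the global init file: shared catalogs and their indexes */
    load_shared_relcache_init_file();

    /*
     * 2. with those descriptors available, indexscan pg_database by name;
     * if the global init file is missing, fall back to a seqscan using
     * the nailed-in-cache pg_database descriptor
     */
    MyDatabaseId = get_database_oid_by_name(dbname);

    /* 3. switch to the database directory and load its own init file */
    SetDatabasePath(GetDatabasePath(MyDatabaseId, MyDatabaseTableSpace));
    load_local_relcache_init_file();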
This change should effectively eliminate the number of databases as a factor
in backend startup time, even with large numbers of databases. However,
the real reason for doing it is as a first step towards getting rid of
the flat files altogether. There are still several other sub-projects
to be tackled before that can happen.
if a smart shutdown is already in progress. Backpatch to 8.3; this was broken
in the patch that introduced "dead-end backends".
Per report by Itagaki Takahiro, patch by Fujii Masao.
This patch gets us out from under the Unix limitation of two user-defined
signal types. We already had done something similar for signals directed to
the postmaster process; this adds multiplexing for signals directed to
backends and auxiliary processes (so long as they're connected to shared
memory).
As proof of concept, replace the former usage of SIGUSR1 and SIGUSR2
for backends with use of the multiplexing mechanism. There are still some
hard-wired definitions of SIGUSR1 and SIGUSR2 for other process types,
but getting rid of those doesn't seem interesting at the moment.
Fujii Masao
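Usage of the multiplexing facility on the sending side looks roughly like
this (fragment; the reason code shown is one of the existing ones, and error
handling is abbreviated):

    #include "storage/procsignal.h"

    /*
     * Ask another backend to process a catchup interrupt.  All such
     * requests now share one real signal; the recipient looks up the
     * reason flags in its ProcSignal slot.
     */
    if (SendProcSignal(target_pid, PROCSIG_CATCHUP_INTERRUPT,
                       InvalidBackendId) < 0)
        elog(DEBUG3, "could not signal backend with PID %d",
             (int) target_pid);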
that memory allocated by starting third-party DLLs doesn't end up
conflicting with it.
Hopefully this solves the long-time issue with "could not reattach
to shared memory" errors on Win32.
Patch from Tsutomu Yamada and me, based on an idea from Trevor Talbot.
LC_CTYPE settings to children via BackendParameters. Per discussion,
the postmaster is now just using system defaults anyway, so we might as
well save a few cycles during backend startup.
archive recovery. Invent a separate state variable and inquiry function
for XLogInsertAllowed() to clarify some tests and make the management of
writing the end-of-recovery checkpoint less klugy. Fix several places
that were incorrectly testing InRecovery when they should be looking at
RecoveryInProgress or XLogInsertAllowed (because they will now be executed
in the bgwriter not startup process). Clarify handling of bad LSNs passed
to XLogFlush during recovery. Use a spinlock for setting/testing
SharedRecoveryInProgress. Improve quite a lot of comments.
Heikki and Tom
during it:
When bgwriter is active, the startup process can't perform mdsync() correctly
because it won't see the fsync requests accumulated in bgwriter's private
pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
checkpoint as well, when it's active.
When bgwriter is active (= archive recovery), the startup process must not
accumulate fsync requests to its own pendingOpsTable, since bgwriter won't
see them there when it performs restartpoints. Make startup process drop its
pendingOpsTable when bgwriter is launched to avoid that.
Update minimum recovery point one last time when leaving archive recovery.
It won't be updated by the end-of-recovery checkpoint because XLogFlush()
sees us as out of recovery already.
This fixes bug #4879 reported by Fujii Masao.
behavior in cases where we don't know the heap tuple count accurately; in
particular partial vacuum, but this also makes the API a bit more useful
for ANALYZE. This patch adds "estimated_count" flags to both structs so
that an approximate count can be flagged as such, and adjusts the logic
so that approximate counts are not used for updating pg_class.reltuples.
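Schematically, the two structs gain a flag of this shape (trimmed to the
relevant members; see genam.h for the full definitions):

    typedef struct IndexVacuumInfo
    {
        /* ... other fields omitted ... */
        bool        estimated_count;   /* num_heap_tuples is an estimate */
        double      num_heap_tuples;   /* tuples remaining in the heap */
    } IndexVacuumInfo;

    typedef struct IndexBulkDeleteResult
    {
        /* ... other fields omitted ... */
        bool        estimated_count;   /* num_index_tuples is an estimate */
        double      num_index_tuples;  /* tuples remaining in the index */
    } IndexBulkDeleteResult;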
This fixes my previous complaint that VACUUM was putting ridiculous values
into pg_class.reltuples for indexes. The actual impact of that bug is
limited, because the planner only pays attention to reltuples for an index
if the index is partial; which probably explains why beta testers hadn't
noticed a degradation in plan quality from it. But it needs to be fixed.
The whole thing is a bit messy and should be redesigned in future, because
reltuples now has the potential to drift quite far away from reality when
a long period elapses with no non-partial vacuums. But this is as good as
it's going to get for 8.4.
by extending the ereport() API to cater for pluralization directly. This
is better than the original method of calling ngettext outside the elog.c
code because (1) it avoids double translation, which wastes cycles and in
the worst case could give a wrong result; and (2) it avoids having to use
a different coding method in PL code than in the core backend. The
client-side uses of ngettext are not touched since neither of these concerns
is very pressing in the client environment. Per my proposal of yesterday.
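With the extension, a count-dependent message can be written in a single
call, for instance (the message text here is made up for illustration):

    ereport(NOTICE,
            (errmsg_plural("%d index row removed",
                           "%d index rows removed",
                           nremoved,
                           nremoved)));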
copies?) to ensure they really don't run proc_exit/shmem_exit callbacks,
as was intended. I broke this behavior recently by installing atexit
callbacks without thinking about the one case where we truly don't want
to run those callback functions. Noted in an example from Dave Page.
a backend has done exit(0) or exit(1) without having disengaged itself
from shared memory. We are at risk for this whenever third-party code is
loaded into a backend, since such code might not know it's supposed to go
through proc_exit() instead. Also, it is reported that under Windows
there are ways to externally kill a process that cause the status code
returned to the postmaster to be indistinguishable from a voluntary exit
(thank you, Microsoft). If this does happen then the system is probably
hosed --- for instance, the dead session might still be holding locks.
So the best recovery method is to treat this like a backend crash.
The dead man switch is armed for a particular child process when it
acquires a regular PGPROC, and disarmed when the PGPROC is released;
these should be the first and last touches of shared memory resources
in a backend, or close enough anyway. This choice means there is no
coverage for auxiliary processes, but I doubt we need that, since they
shouldn't be executing any user-provided code anyway.
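In sketch form, assuming the arm/disarm helpers live in pmsignal.c as
MarkPostmasterChildActive/Inactive (the surrounding calls are abbreviated):

    /* in InitProcess(), just after a regular PGPROC has been acquired */
    MarkPostmasterChildActive();    /* arm the dead man switch */

    /* ... the backend runs normally ... */

    /* in ProcKill(), as the PGPROC is released during proc_exit() */
    MarkPostmasterChildInactive();  /* disarm it; a child that exits
                                     * without getting here is treated
                                     * as a crash by the postmaster */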
This patch also improves the management of the EXEC_BACKEND
ShmemBackendArray array a bit, by reducing search costs.
Although this problem is of long standing, the lack of field complaints
seems to mean it's not critical enough to risk back-patching; at least
not till we get some more testing of this mechanism.
error message if the installation directory layout is messed up (or at least,
something more useful than the behavior exhibited in bug #4787). During
postmaster startup, check that get_pkglib_path resolves as a readable
directory; and if ParseTzFile() fails to open the expected timezone
abbreviation file, check the possibility that the directory is missing rather
than just the specified file. In case of either failure, issue a hint
suggesting that the installation is broken. These two checks cover the lib/
and share/ trees of a full installation, which should take care of most
scenarios where a sysadmin decides to get cute.
are using our own ports of getopt or getopt_long, those will define
the variable for themselves; and if not, we don't need these, because
we never touch the variable anyway.
temp relations; this is no more expensive than before, now that we have
pg_class.relistemp. Insert tests into bufmgr.c to prevent attempting
to fetch pages from nonlocal temp relations. This provides a low-level
defense against bugs-of-omission allowing temp pages to be loaded into shared
buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
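The bufmgr.c defense amounts to a check of roughly this shape (condensed;
the macro is one of the new relcache tests mentioned just below):

    /* refuse to pull another session's temp pages into shared buffers */
    if (RELATION_IS_OTHER_TEMP(reln))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot access temporary tables of other sessions")));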
While at it, tweak a bunch of places to use new relcache tests (instead of
expensive probes into pg_namespace) to detect local or nonlocal temp tables.
In the backend, I changed only a handful of exemplary or important-looking
instances to make use of the plural support; there is probably more work
there. For the rest of the source, this should cover all relevant cases.
the cause of the "could not write to log file: Bad file descriptor"
errors reported at
http://archives.postgresql.org//pgsql-general/2008-06/msg00193.php
Backpatch to 8.3; the race condition was introduced by the CSV logging
patch.
Analysis and patch by Gurjeet Singh.
recovery: if the background writer or pgstat process dies during recovery
(or any other child process, but those two are the only ones running), send
SIGQUIT to the startup process using the correct pid.