Commit Graph

1444 Commits

Author SHA1 Message Date
3c0fd64fec Split ProcSleep function into JoinWaitQueue and ProcSleep
Split ProcSleep into two functions: JoinWaitQueue and ProcSleep.
JoinWaitQueue is called while holding the partition lock, and inserts
the current process to the wait queue, while ProcSleep() does the
actual sleeping. ProcSleep() is now called without holding the
partition lock, and it no longer re-acquires the partition lock before
returning. That makes the wakeup a little cheaper. Once upon a time,
re-acquiring the partition lock was needed to prevent a signal handler
from longjmping out at a bad time, but these days our signal handlers
just set flags, and longjmping can only happen at points where we
explicitly run CHECK_FOR_INTERRUPTS().

If JoinWaitQueue detects an "early deadlock" before even joining the
wait queue, it returns without changing the shared lock entry, leaving
the cleanup of the shared lock entry to the caller. This makes the
handling of an early deadlock the same as the dontWait=true case.

One small user-visible side-effect of this refactoring is that we now
only set the 'ps' title to say "waiting" when we actually enter the
sleep, not when the lock is skipped because dontWait=true, or when a
deadlock is detected early before entering the sleep.

This eliminates the 'lockAwaited' global variable in proc.c, which was
largely redundant with 'awaitedLock' in lock.c

Note: Updating the local lock table is now the caller's responsibility.
JoinWaitQueue and ProcSleep are now only responsible for modifying the
shared state. Seems a little nicer that way.

Based on Thomas Munro's earlier patch and observation that ProcSleep
doesn't really need to re-acquire the partition lock.

Reviewed-by: Maxim Orlov
Discussion: https://www.postgresql.org/message-id/7c2090cd-a72a-4e34-afaa-6dd2ef31440e@iki.fi
2024-11-04 17:59:24 +02:00
a9c546a5a3 Use ProcNumbers instead of direct Latch pointers to address other procs
This is in preparation for replacing Latches with a new abstraction.
That's still work in progress, but this seems a little tidier anyway,
so let's get this refactoring out of the way already.

Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c%40iki.fi
2024-11-01 13:47:20 +02:00
243e9b40f1 For inplace update, send nontransactional invalidations.
The inplace update survives ROLLBACK.  The inval didn't, so another
backend's DDL could then update the row without incorporating the
inplace update.  In the test this fixes, a mix of CREATE INDEX and ALTER
TABLE resulted in a table with an index, yet relhasindex=f.  That is a
source of index corruption.  Back-patch to v12 (all supported versions).
The back branch versions don't change WAL, because those branches just
added end-of-recovery SIResetAll().  All branches change the ABI of
extern function PrepareToInvalidateCacheTuple().  No PGXN extension
calls that, and there's no apparent use case in extensions.

Reviewed by Nitin Motiani and (in earlier versions) Andres Freund.

Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com
2024-10-25 06:51:02 -07:00
755a4c10d1 bufmgr/smgr: Don't cross segment boundaries in StartReadBuffers()
With real AIO it doesn't make sense to cross segment boundaries with one
IO. Add smgrmaxcombine() to allow upper layers to query which buffers can be
merged.

We could continue to cross segment boundaries when not using AIO, but it
doesn't really make sense, because md.c will never be able to perform the read
across the segment boundary in one system call. Which means we'll mark more
buffers as undergoing IO than really makes sense - if another backend desires
to read the same blocks, it'll be blocked longer than necessary. So it seems
better to just never cross the boundary.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
2024-10-08 11:37:45 -04:00
2bbc261ddb Use an shmem_exit callback to remove backend from PMChildFlags on exit
This seems nicer than having to duplicate the logic between
InitProcess() and ProcKill() for which child processes have a
PMChildFlags slot.

Move the MarkPostmasterChildActive() call earlier in InitProcess(),
out of the section protected by the spinlock.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/a102f15f-eac4-4ff2-af02-f9ff209ec66f@iki.fi
2024-10-08 15:06:34 +03:00
094ae07160 Remove unneeded #include
Unneeded since commit d72731a704.

Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c@iki.fi
2024-10-05 15:09:32 +03:00
10b721821d Use macro to define the number of enum values
Refactoring in the interest of code consistency, a follow-up to 2e068db56e31.

The argument against inserting a special enum value at the end of the enum
definition is that a switch statement might generate a compiler warning unless
it has a default clause.

Aleksander Alekseev, reviewed by Michael Paquier, Dean Rasheed, Peter Eisentraut

Discussion: https://postgr.es/m/CAJ7c6TMsiaV5urU_Pq6zJ2tXPDwk69-NKVh4AMN5XrRiM7N%2BGA%40mail.gmail.com
2024-10-01 09:30:24 -04:00
aac2c9b4fd For inplace update durability, make heap_update() callers wait.
The previous commit fixed some ways of losing an inplace update.  It
remained possible to lose one when a backend working toward a
heap_update() copied a tuple into memory just before inplace update of
that tuple.  In catalogs eligible for inplace update, use LOCKTAG_TUPLE
to govern admission to the steps of copying an old tuple, modifying it,
and issuing heap_update().  This includes MERGE commands.  To avoid
changing most of the pg_class DDL, don't require LOCKTAG_TUPLE when
holding a relation lock sufficient to exclude inplace updaters.
Back-patch to v12 (all supported versions).  In v13 and v12, "UPDATE
pg_class" or "UPDATE pg_database" can still lose an inplace update.  The
v14+ UPDATE fix needs commit 86dc90056dfdbd9d1b891718d2e5614e3e432f35,
and it wasn't worth reimplementing that fix without such infrastructure.

Reviewed by Nitin Motiani and (in earlier versions) Heikki Linnakangas.

Discussion: https://postgr.es/m/20231027214946.79.nmisch@google.com
2024-09-24 15:25:18 -07:00
c4d5cb71d2 Increase the number of fast-path lock slots
Replace the fixed-size array of fast-path locks with arrays, sized on
startup based on max_locks_per_transaction. This allows using fast-path
locking for workloads that need more locks.

The fast-path locking introduced in 9.2 allowed each backend to acquire
a small number (16) of weak relation locks cheaply. If a backend needs
to hold more locks, it has to insert them into the shared lock table.
This is considerably more expensive, and may be subject to contention
(especially on many-core systems).

The limit of 16 fast-path locks was always rather low, because we have
to lock all relations - not just tables, but also indexes, views, etc.
For planning we need to lock all relations that might be used in the
plan, not just those that actually get used in the final plan. So even
with rather simple queries and schemas, we often need significantly more
than 16 locks.

As partitioning gets used more widely, and the number of partitions
increases, this limit is trivial to hit. Complex queries may easily use
hundreds or even thousands of locks. For workloads doing a lot of I/O
this is not noticeable, but for workloads accessing only data in RAM,
the access to the shared lock table may be a serious issue.

This commit removes the hard-coded limit of the number of fast-path
locks. Instead, the size of the fast-path arrays is calculated at
startup, and can be set much higher than the original 16-lock limit.
The overall fast-path locking protocol remains unchanged.

The variable-sized fast-path arrays can no longer be part of PGPROC, but
are allocated as a separate chunk of shared memory and then references
from the PGPROC entries.

The fast-path slots are organized as a 16-way set associative cache. You
can imagine it as a hash table of 16-slot "groups". Each relation is
mapped to exactly one group using hash(relid), and the group is then
processed using linear search, just like the original fast-path cache.
With only 16 entries this is cheap, with good locality.

Treating this as a simple hash table with open addressing would not be
efficient, especially once the hash table gets almost full. The usual
remedy is to grow the table, but we can't do that here easily. The
access would also be more random, with worse locality.

The fast-path arrays are sized using the max_locks_per_transaction GUC.
We try to have enough capacity for the number of locks specified in the
GUC, using the traditional 2^n formula, with an upper limit of 1024 lock
groups (i.e. 16k locks). The default value of max_locks_per_transaction
is 64, which means those instances will have 64 fast-path slots.

The main purpose of the max_locks_per_transaction GUC is to size the
shared lock table. It is often set to the "average" number of locks
needed by backends, with some backends using significantly more locks.
This should not be a major issue, however. Some backens may have to
insert locks into the shared lock table, but there can't be too many of
them, limiting the contention.

The only solution is to increase the GUC, even if the shared lock table
already has sufficient capacity. That is not free, especially in terms
of memory usage (the shared lock table entries are fairly large). It
should only happen on machines with plenty of memory, though.

In the future we may consider a separate GUC for the number of fast-path
slots, but let's try without one first.

Reviewed-by: Robert Haas, Jakub Wartak
Discussion: https://postgr.es/m/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
2024-09-21 20:09:35 +02:00
70d38e3d8a Allow ReadStream to be consumed as raw block numbers.
Commits 041b9680 and 6377e12a changed the interface of
scan_analyze_next_block() to take a ReadStream instead of a BlockNumber
and a BufferAccessStrategy, and to return a value to indicate when the
stream has run out of blocks.

This caused integration problems for at least one known extension that
uses specially encoded BlockNumber values that map to different
underlying storage, because acquire_sample_rows() sets up the stream so
that read_stream_next_buffer() reads blocks from the main fork of the
relation's SMgrRelation.

Provide read_stream_next_block(), as a way for such an extension to
access the stream of raw BlockNumbers directly and forward them to its
own ReadBuffer() calls after decoding, as it could in earlier releases.
The new function returns the BlockNumber and BufferAccessStrategy that
were previously passed directly to scan_analyze_next_block().
Alternatively, an extension could wrap the stream of BlockNumbers in
another ReadStream with a callback that performs any decoding required
to arrive at real storage manager BlockNumber values, so that it could
benefit from the I/O combining and concurrency provided by
read_stream.c.

Another class of table access method that does nothing in
scan_analyze_next_block() because it is not block-oriented could use
this function to control the number of block sampling loops.  It could
match the previous behavior with "return read_stream_next_block(stream,
&bas) != InvalidBlockNumber".

Ongoing work is expected to provide better ANALYZE support for table
access methods that don't behave like heapam with respect to storage
blocks, but that will be for future releases.

Back-patch to 17.

Reported-by: Mats Kindahl <mats@timescale.com>
Reviewed-by: Mats Kindahl <mats@timescale.com>
Discussion: https://postgr.es/m/CA%2B14425%2BCcm07ocG97Fp%2BFrD9xUXqmBKFvecp0p%2BgV2YYR258Q%40mail.gmail.com
2024-09-18 11:34:28 +12:00
c582b75851 Add block_range_read_stream_cb(), to deduplicate code.
This replaces two functions for iterating over all blocks in a range.  A
pending patch will use this instead of adding a third.

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/20240820184742.f2.nmisch@google.com
2024-09-03 10:46:20 -07:00
2b5f57977f Add const qualifiers to XLogRegister*() functions
Add const qualifiers to XLogRegisterData() and XLogRegisterBufData().
Several unconstify() calls can be removed.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/dd889784-9ce7-436a-b4f1-52e4a5e577bd@eisentraut.org
2024-09-03 08:06:03 +02:00
4236825197 Fix typos and grammar in code comments and docs
Author: Alexander Lakhin
Discussion: https://postgr.es/m/f7e514cf-2446-21f1-a5d2-8c089a6e2168@gmail.com
2024-09-03 14:49:04 +09:00
478846e768 Rename some shared memory initialization routines
To make them follow the usual naming convention where
FoobarShmemSize() calculates the amount of shared memory needed by
Foobar subsystem, and FoobarShmemInit() performs the initialization.

I didn't rename CreateLWLocks() and InitShmmeIndex(), because they are
a little special. They need to be called before any of the other
ShmemInit() functions, because they set up the shared memory
bookkeeping itself. I also didn't rename InitProcGlobal(), because
unlike other Shmeminit functions, it's not called by individual
backends.

Reviewed-by: Andreas Karlsson
Discussion: https://www.postgresql.org/message-id/c09694ff-2453-47e5-b26c-32a16cd75ce6@iki.fi
2024-08-29 09:46:21 +03:00
fbce7dfc77 Refactor lock manager initialization to make it a bit less special
Split the shared and local initialization to separate functions, and
follow the common naming conventions. With this, we no longer create
the LockMethodLocalHash hash table in the postmaster process, which
was always pointless.

Reviewed-by: Andreas Karlsson
Discussion: https://www.postgresql.org/message-id/c09694ff-2453-47e5-b26c-32a16cd75ce6@iki.fi
2024-08-29 09:46:06 +03:00
5304fec4d8 Apply PGDLLIMPORT markings to some GUC variables
According to the commit message in 8ec569479, we must have all variables
in header files marked with PGDLLIMPORT. In commit d3cc5ffe81f6 some
variables were moved from launch_backend.c file to several header files.

This adds PGDLLIMPORT to moved variables.

Author: Sofia Kopikova <s.kopikova@postgrespro.ru>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/e0b17014-5319-4dd6-91cd-93d9c8fc9539%40postgrespro.ru
2024-08-14 11:36:12 +02:00
b5df24e520 Fix typo in bufpage.h.
Author: Senglee Choi
Reviewed-by: Tender Wang
Discussion: https://postgr.es/m/CACUsy79U0=S5zWEf6D57F=vB7rOEa86xFY6oovDZ58jRcROCxQ@mail.gmail.com
2024-08-05 14:38:00 +05:30
3c5db1d6b0 Implement pg_wal_replay_wait() stored procedure
pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records, implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
2024-08-02 21:16:56 +03:00
e25626677f Remove --disable-spinlocks.
A later change will require atomic support, so it wouldn't make sense
for a hypothetical new system not to be able to implement spinlocks.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (concept, not the patch)
Reviewed-by: Andres Freund <andres@anarazel.de> (concept, not the patch)
Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
2024-07-30 22:58:37 +12:00
9d9b9d46f3 Move cancel key generation to after forking the backend
Move responsibility of generating the cancel key to the backend
process. The cancel key is now generated after forking, and the
backend advertises it in the ProcSignal array. When a cancel request
arrives, the backend handling it scans the ProcSignal array to find
the target pid and cancel key. This is similar to how this previously
worked in the EXEC_BACKEND case with the ShmemBackendArray, just
reusing the ProcSignal array.

One notable change is that we no longer generate cancellation keys for
non-backend processes. We generated them before just to prevent a
malicious user from canceling them; the keys for non-backend processes
were never actually given to anyone. There is now an explicit flag
indicating whether a process has a valid key or not.

I wrote this originally in preparation for supporting longer cancel
keys, but it's a nice cleanup on its own.

Reviewed-by: Jelte Fennema-Nio
Discussion: https://www.postgresql.org/message-id/508d0505-8b7a-4864-a681-e7e5edfe32aa@iki.fi
2024-07-29 15:37:48 +03:00
774d47b6c0 Move all extern declarations for GUC variables to header files
Add extern declarations in appropriate header files for global
variables related to GUC.  In many cases, this was handled quite
inconsistently before, with some GUC variables declared in a header
file and some only pulled in via ad-hoc extern declarations in various
.c files.

Also add PGDLLIMPORT qualifications to those variables.  These were
previously missing because src/tools/mark_pgdllimport.pl has only been
used with header files.

This also fixes -Wmissing-variable-declarations warnings for GUC
variables (not yet part of the standard warning options).

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
2024-07-24 06:31:07 +02:00
d3cc5ffe81 Move extern declarations for EXEC_BACKEND to header files
This fixes warnings from -Wmissing-variable-declarations (not yet part
of the standard warning options) under EXEC_BACKEND.  The
NON_EXEC_STATIC variables need a suitable declaration in a header file
under EXEC_BACKEND.

Also fix the inconsistent application of the volatile qualifier for
PMSignalState, which was revealed by this change.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
2024-07-23 15:07:10 +02:00
a858be17c3 Add a way to create read stream object by using SMgrRelation.
Currently read stream object can be created only by using Relation.

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
2024-07-20 04:22:12 -07:00
af07a827b9 Refactor PinBufferForBlock() to remove checks about persistence.
There are checks in PinBufferForBlock() function to set persistence of
the relation.  This function is called for each block in the relation.
Instead, set persistence of the relation before PinBufferForBlock().

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
2024-07-20 04:22:12 -07:00
e00c45f685 Remove "smgr_persistence == 0" dead code.
Reaching that code would have required multiple processes performing
relation extension during recovery, which does not happen.  That caller
has the persistence available, so pass it.  This was dead code as soon
as commit 210622c60e1a9db2e2730140b8106ab57d259d15 added it.

Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
2024-07-20 04:22:12 -07:00
98347b5a3a Lift limitation that PGPROC->links must be the first field
Since commit 5764f611e1, we've been using the ilist.h functions for
handling the linked list. There's no need for 'links' to be the first
element of the struct anymore, except for one call in InitProcess
where we used a straight cast from the 'dlist_node *' to PGPROC *,
without the dlist_container() macro. That was just an oversight in
commit 5764f611e1, fix it.

There no imminent need to move 'links' from being the first field, but
let's be tidy.

Reviewed-by: Aleksander Alekseev, Andres Freund
Discussion: https://www.postgresql.org/message-id/22aa749e-cc1a-424a-b455-21325473a794@iki.fi
2024-07-05 11:21:46 +03:00
8f8bcb8883 Improve some global variable declarations
We have in launch_backend.c:

    /*
     * The following need to be available to the save/restore_backend_variables
     * functions.  They are marked NON_EXEC_STATIC in their home modules.
     */
    extern slock_t *ShmemLock;
    extern slock_t *ProcStructLock;
    extern PGPROC *AuxiliaryProcs;
    extern PMSignalData *PMSignalState;
    extern pg_time_t first_syslogger_file_time;
    extern struct bkend *ShmemBackendArray;
    extern bool redirection_done;

That comment is not completely true: ShmemLock, ShmemBackendArray, and
redirection_done are not in fact NON_EXEC_STATIC.  ShmemLock once was,
but was then needed elsewhere.  ShmemBackendArray was static inside
postmaster.c before launch_backend.c was created.  redirection_done
was never static.

This patch moves the declaration of ShmemLock and redirection_done to
a header file.

ShmemBackendArray gets a NON_EXEC_STATIC.  This doesn't make a
difference, since it only exists if EXEC_BACKEND anyway, but it makes
it consistent.

After that, the comment is now correct.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
2024-07-02 07:26:22 +02:00
edadeb0710 Remove support for HPPA (a/k/a PA-RISC) architecture.
This old CPU architecture hasn't been produced in decades, and
whatever instances might still survive are surely too underpowered
for anyone to consider running Postgres on in production.  We'd
nonetheless continued to carry code support for it (largely at my
insistence), because its unique implementation of spinlocks seemed
like a good edge case for our spinlock infrastructure.  However,
our last buildfarm animal of this type was retired last year, and
it seems quite unlikely that another will emerge.  Without the ability
to run tests, the argument that this is useful test code fails to
hold water.  Furthermore, carrying code support for an untestable
architecture has costs not to be ignored.  So, remove HPPA-specific
code, in the same vein as commits 718aa43a4 and 92d70b77e.

Discussion: https://postgr.es/m/3351991.1697728588@sss.pgh.pa.us
2024-07-01 13:55:52 -04:00
0cecc908e9 Lock before setting relhassubclass on RELKIND_PARTITIONED_INDEX.
Commit 5b562644fec696977df4a82790064e8287927891 added a comment that
SetRelationHasSubclass() callers must hold this lock.  When commit
17f206fbc824d2b4b14480199ca9ff7dea417eda extended use of this column to
partitioned indexes, it didn't take the lock.  As the latter commit
message mentioned, we currently never reset a partitioned index to
relhassubclass=f.  That largely avoids harm from the lock omission.  The
cause for fixing this now is to unblock introducing a rule about locks
required to heap_update() a pg_class row.  This might cause more
deadlocks.  It gives minor user-visible benefits:

- If an ALTER INDEX SET TABLESPACE runs concurrently with ALTER TABLE
  ATTACH PARTITION or CREATE PARTITION OF, one transaction blocks
  instead of failing with "tuple concurrently updated".  (Many cases of
  DDL concurrency still fail that way.)

- Match ALTER INDEX ATTACH PARTITION in choosing to lock the index.

While not user-visible today, we'll need this if we ever make something
set the flag to false for a partitioned index, like ANALYZE does today
for tables.  Back-patch to v12 (all supported versions), the plan for
the commit relying on the new rule.  In back branches, add
LockOrStrongerHeldByMe() instead of adding a LockHeldByMe() parameter.

Reviewed (in an earlier version) by Robert Haas.

Discussion: https://postgr.es/m/20240611024525.9f.nmisch@google.com
2024-06-27 19:21:05 -07:00
bb93640a68 Add wait event type "InjectionPoint", a custom type like "Extension".
Both injection points and customization of type "Extension" are new in
v17, so this just changes a detail of an unreleased feature.

Reported by Robert Haas.  Reviewed by Michael Paquier.

Discussion: https://postgr.es/m/CA+TgmobfMU5pdXP36D5iAwxV5WKE_vuDLtp_1QyH+H5jMMt21g@mail.gmail.com
2024-06-27 19:21:05 -07:00
cbfbda7841 Fix MVCC bug with prepared xact with subxacts on standby
We did not recover the subtransaction IDs of prepared transactions
when starting a hot standby from a shutdown checkpoint. As a result,
such subtransactions were considered as aborted, rather than
in-progress. That would lead to hint bits being set incorrectly, and
the subtransactions suddenly becoming visible to old snapshots when
the prepared transaction was committed.

To fix, update pg_subtrans with prepared transactions's subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check the pg_subtrans.

Backport to all supported versions.

Discussion: https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0@iki.fi
2024-06-27 21:09:58 +03:00
6207f08f70 Harmonize function parameter names for Postgres 17.
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places.  These
inconsistencies were all introduced during Postgres 17 development.

pg_bsd_indent still has a couple of similar inconsistencies, which I
(pgeoghegan) have left untouched for now.

This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
2024-06-12 17:01:51 -04:00
be5942aee7 Remove unused typedefs
There were a few typedefs that were never used to define a variable or
field.  This had the effect that the buildfarm's typedefs.list would
not include them, and so they would need to be re-added manually to
keep the overall pgindent result perfectly clean.

We can easily get rid of these typedefs to avoid the issue.  In a few
cases, we just let the enum or struct stand on its own without a
typedef around it.  In one case, an enum was abused to define flag
bits; that's better done with macro definitions.

This fixes all the remaining issues with missing entries in the
buildfarm's typedefs.list.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/1919000.1715815925@sss.pgh.pa.us
2024-05-17 07:36:12 +02:00
950d4a2cb1 Fix typos and duplicate words
This fixes various typos, duplicated words, and tiny bits of whitespace
mainly in code comments but also in docs.

Author: Daniel Gustafsson <daniel@yesql.se>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
2024-04-18 21:28:07 +02:00
772faafca1 Revert: Implement pg_wal_replay_wait() stored procedure
This commit reverts 06c418e163, e37662f221, bf1e650806, 25f42429e2,
ee79928441, and 74eaf66f98 per review by Heikki Linnakangas.

Discussion: https://postgr.es/m/b155606b-e744-4218-bda5-29379779da1a%40iki.fi
2024-04-11 17:28:15 +03:00
13453eedd3 Add pg_buffercache_evict() function for testing.
When testing buffer pool logic, it is useful to be able to evict
arbitrary blocks.  This function can be used in SQL queries over the
pg_buffercache view to set up a wide range of buffer pool states.  Of
course, buffer mappings might change concurrently so you might evict a
block other than the one you had in mind, and another session might
bring it back in at any time.  That's OK for the intended purpose of
setting up developer testing scenarios, and more complicated interlocking
schemes to give stronger guararantees about that would likely be less
flexible for actual testing work anyway.  Superuser-only.

Author: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Author: Thomas Munro <thomas.munro@gmail.com> (docs, small tweaks)
Reviewed-by: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: Cédric Villemain <cedric.villemain+pgsql@abcsql.com>
Reviewed-by: Jim Nasby <jim.nasby@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CALfch19pW48ZwWzUoRSpsaV9hqt0UPyaBPC4bOZ4W+c7FF566A@mail.gmail.com
2024-04-08 16:23:40 +12:00
25f42429e2 Use an LWLock instead of a spinlock in waitlsn.c
This should prevent busy-waiting when number of waiting processes is high.

Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
Author: Alvaro Herrera
2024-04-07 00:49:53 +03:00
3bd8439ed6 Allow BufferAccessStrategy to limit pin count.
While pinning extra buffers to look ahead, users of strategies are in
danger of using too many buffers.  For some strategies, that means
"escaping" from the ring, and in others it means forcing dirty data to
disk very frequently with associated WAL flushing.  Since external code
has no insight into any of that, allow individual strategy types to
expose a clamp that should be applied when deciding how many buffers to
pin at once.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
2024-04-06 23:11:45 +13:00
6faca9ae28 Avoid deadlock during orphan temp table removal.
If temp tables have dependencies (such as sequences) then it's
possible for autovacuum's cleanup of orphan temp tables to deadlock
against an incoming backend that's trying to clean out the temp
namespace for its own use.  That can happen because RemoveTempRelations'
performDeletion call can visit objects within the namespace in
an order different from the order in which a per-table deletion
will visit them.

To fix, observe that performDeletion will begin by taking an exclusive
lock on the temp namespace (even though it won't actually delete it).
So, if we can get a shared lock on the namespace, we can be sure we're
not running concurrently with RemoveTempRelations, while also not
conflicting with ordinary use of the namespace.  This requires
introducing a conditional version of LockDatabaseObject, but that's no
big deal.  (It's surprising we've got along without that this long.)

Report and patch by Mikhail Zhilin.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/c43ce028-2bc2-4865-9b89-3f706246eed5@postgrespro.ru
2024-04-02 14:59:32 -04:00
b5a9b18cd0 Provide API for streaming relation data.
Introduce an abstraction allowing relation data to be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number it wants
next, and then consumes individual buffers one at a time from the
stream.  This division puts read_stream.c in control of how far ahead it
can see and allows it to read clusters of neighboring blocks with
StartReadBuffers().  It also issues POSIX_FADV_WILLNEED advice ahead of
time when random access is detected.

Other variants of I/O stream will be proposed in future work (for
example to support recovery, whose LsnReadQueue device in
xlogprefetcher.c is a distant cousin of this code and should eventually
be replaced by this), but this basic API is sufficient for many common
executor usage patterns involving predictable access to a single fork of
a single relation.

Several patches using this API are proposed separately.

This stream concept is loosely based on ideas from Andres Freund on how
we should pave the way for later work on asynchronous I/O.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
2024-04-03 00:49:46 +13:00
210622c60e Provide vectored variant of ReadBuffer().
Break ReadBuffer() up into two steps.  StartReadBuffers() and
WaitReadBuffers() give us two main advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the
kernel ahead of time.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.

A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are made into public APIs.  Those are provided
for use by callers of StartReadBuffers(), when deciding how many buffers
to read at once.  The following commit will add a higher level mechanism
for doing that automatically with a practical interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (some optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
2024-04-03 00:23:20 +13:00
667e65aac3 Use TidStore for dead tuple TIDs storage during lazy vacuum.
Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
  branch prediction behavior.

This commit replaces that array with the TID store from commit
30e144287a.

Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.

Bump catalog version.

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
2024-04-02 10:15:37 +09:00
da952b415f Rework lwlocknames.txt to become lwlocklist.h
This way, we can fold the list of lock names to occur in
BuiltinTrancheNames instead of having its own separate array.  This
saves two lines of code in GetLWTrancheName and some space in
BuiltinTrancheNames, as foreseen in commit 74a730631065, as well as
removing the need for a separate lwlocknames.c file.

We still have to build lwlocknames.h using Perl code, which initially I
wanted to avoid, but it gives us the chance to cross-check
wait_event_names.txt.

Discussion: https://postgr.es/m/202401231025.gbv4nnte5fmm@alvherre.pgsql
2024-03-20 11:55:20 +01:00
2346df6fc3 Allow a no-wait lock acquisition to succeed in more cases.
We don't determine the position at which a process waiting for a lock
should insert itself into the wait queue until we reach ProcSleep(),
and we may at that point discover that we must insert ourselves ahead
of everyone who wants a conflicting lock, in which case we obtain the
lock immediately. Up until now, a no-wait lock acquisition would fail
in such cases, erroneously claiming that the lock couldn't be obtained
immediately.  Fix that by trying ProcSleep even in the no-wait case.

No back-patch for now, because I'm treating this as an improvement to
the existing no-wait feature. It could instead be argued that it's a
bug fix, on the theory that there should never be any case whatsoever
where no-wait fails to obtain a lock that would have been obtained
immediately without no-wait, but I'm reluctant to interpret the
semantics of no-wait that strictly.

Robert Haas and Jingxian Li

Discussion: http://postgr.es/m/CA+TgmobCH-kMXGVpb0BB-iNMdtcNkTvcZ4JBxDJows3kYM+GDg@mail.gmail.com
2024-03-14 08:56:06 -04:00
e85662df44 Fix false reports in pg_visibility
Currently, pg_visibility computes its xid horizon using the
GetOldestNonRemovableTransactionId().  The problem is that this horizon can
sometimes go backward.  That can lead to reporting false errors.

In order to fix that, this commit implements a new function
GetStrictOldestNonRemovableTransactionId().  This function computes the xid
horizon, which would be guaranteed to be newer or equal to any xid horizon
computed before.

We have to do the following to achieve this.

1. Ignore processes xmin's, because they consider connection to other databases
   that were ignored before.
2. Ignore KnownAssignedXids, because they are not database-aware. At the same
   time, the primary could compute its horizons database-aware.
3. Ignore walsender xmin, because it could go backward if some replication
   connections don't use replication slots.

As a result, we're using only currently running xids to compute the horizon.
Surely these would significantly sacrifice accuracy.  But we have to do so to
avoid reporting false errors.

Inspired by earlier patch by Daniel Shelepanov and the following discussion
with Robert Haas and Tom Lane.

Discussion: https://postgr.es/m/1649062270.289865713%40f403.i.mail.ru
Reviewed-by: Alexander Lakhin, Dmitry Koval
2024-03-14 13:12:05 +02:00
024c521117 Replace BackendIds with 0-based ProcNumbers
Now that BackendId was just another index into the proc array, it was
redundant with the 0-based proc numbers used in other places. Replace
all usage of backend IDs with proc numbers.

The only place where the term "backend id" remains is in a few pgstat
functions that expose backend IDs at the SQL level. Those IDs are now
in fact 0-based ProcNumbers too, but the documentation still calls
them "backend ids". That term still seems appropriate to describe what
the numbers are, so I let it be.

One user-visible effect is that pg_temp_0 is now a valid temp schema
name, for backend with ProcNumber 0.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-03-03 19:38:22 +02:00
ab355e3a88 Redefine backend ID to be an index into the proc array
Previously, backend ID was an index into the ProcState array, in the
shared cache invalidation manager (sinvaladt.c). The entry in the
ProcState array was reserved at backend startup by scanning the array
for a free entry, and that was also when the backend got its backend
ID. Things become slightly simpler if we redefine backend ID to be the
index into the PGPROC array, and directly use it also as an index to
the ProcState array. This uses a little more memory, as we reserve a
few extra slots in the ProcState array for aux processes that don't
need them, but the simplicity is worth it.

Aux processes now also have a backend ID. This simplifies the
reservation of BackendStatusArray and ProcSignal slots.

You can now convert a backend ID into an index into the PGPROC array
simply by subtracting 1. We still use 0-based "pgprocnos" in various
places, for indexes into the PGPROC array, but the only difference now
is that backend IDs start at 1 while pgprocnos start at 0. (The next
commmit will get rid of the term "backend ID" altogether and make
everything 0-based.)

There is still a 'backendId' field in PGPROC, now part of 'vxid' which
encapsulates the backend ID and local transaction ID together. It's
needed for prepared xacts. For regular backends, the backendId is
always equal to pgprocno + 1, but for prepared xact PGPROC entries,
it's the ID of the original backend that processed the transaction.

Reviewed-by: Andres Freund, Reid Thompson
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-03-03 19:37:28 +02:00
653b55b570 Return ssize_t in fd.c I/O functions.
In the past, FileRead() and FileWrite() used types based on the Unix
read() and write() functions from before C and POSIX standardization,
though not exactly (we had int for amount instead of unsigned).  In
commit 2d4f1ba6 we changed to the appropriate standard C types, just
like the modern POSIX functions they wrap, but again not exactly: the
return type stayed as int.  In theory, a ssize_t value could be returned
by the underlying call that is too large for an int.

That wasn't really a live bug, because we don't expect PostgreSQL code
to perform reads or writes of gigabytes, and OSes probably apply
internal caps smaller than that anyway.  This change is done on the
principle that the return might as well follow the standard interfaces
consistently.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1672202.1703441340%40sss.pgh.pa.us
2024-03-02 12:09:28 +13:00
53c2a97a92 Improve performance of subsystems on top of SLRU
More precisely, what we do here is make the SLRU cache sizes
configurable with new GUCs, so that sites with high concurrency and big
ranges of transactions in flight (resp. multixacts/subtransactions) can
benefit from bigger caches.  In order for this to work with good
performance, two additional changes are made:

1. the cache is divided in "banks" (to borrow terminology from CPU
   caches), and algorithms such as eviction buffer search only affect
   one specific bank.  This forestalls the problem that linear searching
   for a specific buffer across the whole cache takes too long: we only
   have to search the specific bank, whose size is small.  This work is
   authored by Andrey Borodin.

2. Change the locking regime for the SLRU banks, so that each bank uses
   a separate LWLock.  This allows for increased scalability.  This work
   is authored by Dilip Kumar.  (A part of this was previously committed as
   d172b717c6f4.)

Special care is taken so that the algorithms that can potentially
traverse more than one bank release one bank's lock before acquiring the
next.  This should happen rarely, but particularly clog.c's group commit
feature needed code adjustment to cope with this.  I (Álvaro) also added
lots of comments to make sure the design is sound.

The new GUCs match the names introduced by bcdfa5f2e2f2 in the
pg_stat_slru view.

The default values for these parameters are similar to the previous
sizes of each SLRU.  commit_ts, clog and subtrans accept value 0, which
means to adjust by dividing shared_buffers by 512 (so 2MB for every 1GB
of shared_buffers), with a cap of 8MB.  (A new slru.c function
SimpleLruAutotuneBuffers() was added to support this.)  The cap was
previously 1MB for clog, so for sites with more than 512MB of shared
memory the total memory used increases, which is likely a good tradeoff.
However, other SLRUs (notably multixact ones) retain smaller sizes and
don't support a configured value of 0.  These values based on
shared_buffers may need to be revisited, but that's an easy change.

There was some resistance to adding these new GUCs: it would be better
to adjust to memory pressure automatically somehow, for example by
stealing memory from shared_buffers (where the caches can grow and
shrink naturally).  However, doing that seems to be a much larger
project and one which has made virtually no progress in several years,
and because this is such a pain point for so many users, here we take
the pragmatic approach.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amul Sul, Gilles Darold, Anastasia Lubennikova,
	Ivan Lazarev, Robert Haas, Thomas Munro, Tomas Vondra,
	Yura Sokolov, Васильев Дмитрий (Dmitry Vasiliev).
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86@yandex-team.ru
Discussion: https://postgr.es/m/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
2024-02-28 17:05:31 +01:00
0b16bb8776 Remove AIX support
There isn't a lot of user demand for AIX support, we have a bunch of
hacks to work around AIX-specific compiler bugs and idiosyncrasies,
and no one has stepped up to the plate to properly maintain it.
Remove support for AIX to get rid of that maintenance overhead. It's
still supported for stable versions.

The acute issue that triggered this decision was that after commit
8af2565248, the AIX buildfarm members have been hitting this
assertion:

    TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 2949728

Apperently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
for values larger than PG_IO_ALIGN_SIZE, for a static const variable.
That could be worked around, but we decided to just drop the AIX support
instead.

Discussion: https://www.postgresql.org/message-id/20240224172345.32@rfd.leadboat.com
Reviewed-by: Andres Freund, Noah Misch, Thomas Munro
2024-02-28 15:17:23 +04:00