Commit Graph

11495 Commits

Author SHA1 Message Date
2ea5d8bece Mark some new location fields as ParseLoc
Some new code probably didn't see 605721f819f and continued to use
type int for parse location fields.  Fix those.
2024-04-16 20:22:41 +02:00
ec07d0d7fa Clean up more indent breakage from 6377e12a5.
Per buildfarm member koel.
2024-04-16 13:00:40 -04:00
f698704155 Fix nbtree posting list comment.
Oversight in commit 0d861bbb70.
2024-04-16 11:20:41 -04:00
aa1def44c3 Fix nbtree page recycling comment.
Oversight in commit e5d8a99903.
2024-04-16 11:14:36 -04:00
6377e12a5a revert: Generalize relation analyze in table AM interface
This commit reverts 27bc1772fc and dd1f6b0c17.  Per review by Andres Freund.

Discussion: https://postgr.es/m/20240415201057.khoyxbwwxfgzomeo%40awork3.anarazel.de
2024-04-16 13:14:20 +03:00
768ceeeaa1 Add missing PGDLLIMPORT markings
All backend-side variables should be marked with PGDLLIMPORT, as per
policy introduced in 8ec569479f.  aafc05de1bf5 has forgotten
MyClientSocket, and 05c3980e7f47 LoadedSSL.

These can be spotted with a command like this one (be careful of not
switching __pg_log_level):
src/tools/mark_pgdllimport.pl $(git ls-files src/include/)

Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/ZhzkRzrkKhbeQMRm@paquier.xyz
2024-04-16 09:38:46 +09:00
cee8db3f68 ATTACH PARTITION: Don't match a PK with a UNIQUE constraint
When matching constraints in AttachPartitionEnsureIndexes() we weren't
testing the constraint type, which could make a UNIQUE key lacking a
not-null constraint incorrectly satisfy a primary key requirement.  Fix
this by testing that the constraint types match.  (Other possible
mismatches are verified by comparing index properties.)

Discussion: https://postgr.es/m/202402051447.wimb4xmtiiyb@alvherre.pgsql
2024-04-15 15:07:47 +02:00
661ab4e185 Fix some memory leaks associated with parsing json and manifests
Coverity complained about not freeing some memory associated with
incrementally parsing backup manifests. To fix that, provide and use a new
shutdown function for the JsonManifestParseIncrementalState object, in
line with a suggestion from Tom Lane.

While analysing the problem, I noticed a buglet in freeing memory for
incremental json lexers. To fix that remove a bogus condition on
freeing the memory allocated for them.
2024-04-12 10:32:30 -04:00
3af7040985 Fix IS [NOT] NULL qual optimization for inheritance tables
b262ad440 added code to have the planner remove redundant IS NOT NULL
quals and eliminate needless scans for IS NULL quals on tables where the
qual's column has a NOT NULL constraint.

That commit failed to consider that an inheritance parent table could
have differing NOT NULL constraints between the parent and the child.
This caused issues as if we eliminated a qual on the parent, when
applying the quals to child tables in apply_child_basequals(), the qual
might not have been added to the parent's baserestrictinfo.

Here we fix this by not applying the optimization to remove redundant
quals to RelOptInfos belonging to inheritance parents and applying the
optimization again in apply_child_basequals().  Effectively, this means
that the parent and child are considered independently as the parent has
both an inh=true and inh=false RTE and we still apply the optimization
to the RelOptInfo corresponding to the inh=false RTE.

We're able to still apply the optimization in add_base_clause_to_rel()
for partitioned tables as the NULLability of partitions must match that
of their parent.  And, if we ever expand restriction_is_always_false()
and restriction_is_always_true() to handle partition constraints then we
can apply the same logic as, even in multi-level partitioned tables,
there's no way to route values to a partition when the qual does not
match the partition qual of the partitioned table's parent partition.
The same is true for CHECK constraints as those must also match between
arent partitioned tables and their partitions.

Author: Richard Guo, David Rowley
Discussion: https://postgr.es/m/CAMbWs4930gQSZmjR7aANzEapdy61gCg6z8dT-kAEYD0sYWKPdQ@mail.gmail.com
2024-04-12 20:07:53 +12:00
772faafca1 Revert: Implement pg_wal_replay_wait() stored procedure
This commit reverts 06c418e163, e37662f221, bf1e650806, 25f42429e2,
ee79928441, and 74eaf66f98 per review by Heikki Linnakangas.

Discussion: https://postgr.es/m/b155606b-e744-4218-bda5-29379779da1a%40iki.fi
2024-04-11 17:28:15 +03:00
922c4c461d Revert: Allow table AM to store complex data structures in rd_amcache
This commit reverts 02eb07ea89 per review by Andres Freund.

Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
2024-04-11 16:02:49 +03:00
8dd0bb84da Revert: Allow table AM tuple_insert() method to return the different slot
This commit reverts c35a3fb5e0 per review by Andres Freund.

Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
2024-04-11 16:02:45 +03:00
193e6d18e5 Revert: Allow locking updated tuples in tuple_update() and tuple_delete()
This commit reverts 87985cc925 and 818861eb57 per review by Andres Freund.

Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
2024-04-11 16:01:34 +03:00
da841aa4dc Revert: Let table AM insertion methods control index insertion
This commit reverts b1484a3f19 per review by Andres Freund.

Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
2024-04-11 16:01:30 +03:00
bc1e2092eb Revert: Custom reloptions for table AM
This commit reverts 9bd99f4c26 and 422041542f per review by Andres Freund.

Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
2024-04-11 15:46:35 +03:00
810f64a015 Revert indexed and enlargable binary heap implementation.
This reverts commit b840508644 and bcb14f4abc. These commits were made
for commit 5bec1d6bc5 (Improve eviction algorithm in ReorderBuffer
using max-heap for many subtransactions). However, per discussion,
commit efb8acc0d0 replaced binary heap + index with pairing heap, and
made these commits unnecessary.

Reported-by: Jeff Davis
Discussion: https://postgr.es/m/12747c15811d94efcc5cda72d6b35c80d7bf3443.camel%40j-davis.com
2024-04-11 17:18:05 +09:00
efb8acc0d0 Replace binaryheap + index with pairingheap in reorderbuffer.c
A pairing heap can perform the same operations as the binary heap +
index, with as good or better algorithmic complexity, and that's an
existing data structure so that we don't need to invent anything new
compared to v16. This commit makes the new binaryheap functionality
that was added in commits b840508644 and bcb14f4abc unnecessary, but
they will be reverted separately.

Remove the optimization to only build and maintain the heap when the
amount of memory used is close to the limit, becuase the bookkeeping
overhead with the pairing heap seems to be small enough that it
doesn't matter in practice.

Reported-by: Jeff Davis
Author: Heikki Linnakangas
Reviewed-by: Michael Paquier, Hayato Kuroda, Masahiko Sawada
Discussion: https://postgr.es/m/12747c15811d94efcc5cda72d6b35c80d7bf3443.camel%40j-davis.com
2024-04-11 17:04:38 +09:00
ff9f72c68f revert: Transform OR clauses to ANY expression
This commit reverts 72bd38cc99 due to implementation and design issues.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/3604469.1712628736%40sss.pgh.pa.us
2024-04-10 02:28:09 +03:00
0fe5f64367 Teach radix tree to embed values at runtime
Previously, the decision to store values in leaves or within the child
pointer was made at compile time, with variable length values using
leaves by necessity. This commit allows introspecting the length of
variable length values at runtime for that decision. This requires
the ability to tell whether the last-level child pointer is actually
a value, so we use a pointer tag in the lowest level bit.

Use this in TID store. This entails adding a byte to the header to
reserve space for the tag. Commit f35bd9bf3 stores up to three offsets
within the header with no bitmap, and now the header can be embedded
as above. This reduces worst-case memory usage when TIDs are sparse.

Reviewed (in an earlier version) by Masahiko Sawada

Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com
Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com
2024-04-08 18:54:35 +07:00
dd1f6b0c17 Provide a way block-level table AMs could re-use acquire_sample_rows()
While keeping API the same, this commit provides a way for block-level table
AMs to re-use existing acquire_sample_rows() by providing custom callbacks
for getting the next block and the next tuple.

Reported-by: Andres Freund
Discussion: https://postgr.es/m/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de
Reviewed-by: Pavel Borisov
2024-04-08 14:39:48 +03:00
df64c81ca9 Fix some grammer errors from error messages and codes comments
Discussion: https://postgr.es/m/CAHewXNkGMPU50QG7V6Q60JGFORfo8LfYO1_GCkCa0VWbmB-fEw%40mail.gmail.com
Author: Tender Wang
2024-04-08 14:39:41 +03:00
422041542f Fill CommonRdOptions with default values in extract_autovac_opts()
Reported-by: Thomas Munro
Reported-by: Pavel Borisov
Discussion: https://postgr.es/m/CA%2BhUKGLZzLR50RBvuqOO3MZ%3DF54ETz-rTp1PDX9uDGP_GqyYqA%40mail.gmail.com
2024-04-08 12:18:23 +03:00
9bd99f4c26 Custom reloptions for table AM
Let table AM define custom reloptions for its tables. This allows specifying
AM-specific parameters by the WITH clause when creating a table.

The reloptions, which could be used outside of table AM, are now extracted
into the CommonRdOptions data structure.  These options could be by decision
of table AM directly specified by a user or calculated in some way.

The new test module test_tam_options evaluates the ability to set up custom
reloptions and calculate fields of CommonRdOptions on their base.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
2024-04-08 11:23:28 +03:00
8a1b31e6e5 Use bump context for TID bitmaps stored by vacuum
Vacuum does not pfree individual entries, and only frees the entire
storage space when finished with it. This allows using a bump context,
eliminating the chunk header in each leaf allocation. Most leaf
allocations will be 16 to 32 bytes, so that's a significant savings.
TidStoreCreateLocal gets a boolean parameter to indicate that the
created store is insert-only.

This requires a separate tree context for iteration, since we free
the iteration state after iteration completes.

Discussion: https://postgr.es/m/CANWCAZac%3DpBePg3rhX8nXkUuaLoiAJJLtmnCfZsPEAS4EtJ%3Dkg%40mail.gmail.com
Discussion: https://postgr.es/m/CANWCAZZQFfxvzO8yZHFWtQV+Z2gAMv1ku16Vu7KWmb5kZQyd1w@mail.gmail.com
2024-04-08 14:39:49 +07:00
bb766cde63 JSON_TABLE: Add support for NESTED paths and columns
A NESTED path allows to extract data from nested levels of JSON
objects given by the parent path expression, which are projected as
columns specified using a nested COLUMNS clause, just like the parent
COLUMNS clause.  Rows comprised from a NESTED columns are "joined"
to the row comprised from the parent columns.  If a particular NESTED
path evaluates to 0 rows, then the nested COLUMNS will emit NULLs,
making it an OUTER join.

NESTED columns themselves may include NESTED paths to allow
extracting data from arbitrary nesting levels, which are likewise
joined against the rows at the parent level.

Multiple NESTED paths at a given level are called "sibling" paths
and their rows are combined by UNIONing them, that is, after being
joined against the parent row as described above.

Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Jian He <jian.universality@gmail.com>

Reviewers have included (in no particular order):

Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He

Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
2024-04-08 16:14:13 +09:00
13453eedd3 Add pg_buffercache_evict() function for testing.
When testing buffer pool logic, it is useful to be able to evict
arbitrary blocks.  This function can be used in SQL queries over the
pg_buffercache view to set up a wide range of buffer pool states.  Of
course, buffer mappings might change concurrently so you might evict a
block other than the one you had in mind, and another session might
bring it back in at any time.  That's OK for the intended purpose of
setting up developer testing scenarios, and more complicated interlocking
schemes to give stronger guararantees about that would likely be less
flexible for actual testing work anyway.  Superuser-only.

Author: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Author: Thomas Munro <thomas.munro@gmail.com> (docs, small tweaks)
Reviewed-by: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: Cédric Villemain <cedric.villemain+pgsql@abcsql.com>
Reviewed-by: Jim Nasby <jim.nasby@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CALfch19pW48ZwWzUoRSpsaV9hqt0UPyaBPC4bOZ4W+c7FF566A@mail.gmail.com
2024-04-08 16:23:40 +12:00
af7e90a277 simplehash: Free collisions array in SH_STAT
While SH_STAT() is only used for debugging, the allocated array can be large,
and therefore should be freed.

It's unclear why coverity started warning now.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Coverity
Discussion: https://postgr.es/m/3005248.1712538233@sss.pgh.pa.us
Backpatch: 12-
2024-04-07 19:08:41 -07:00
91044ae4ba Send ALPN in TLS handshake, require it in direct SSL connections
libpq now always tries to send ALPN. With the traditional negotiated
SSL connections, the server accepts the ALPN, and refuses the
connection if it's not what we expect, but connecting without ALPN is
still OK. With the new direct SSL connections, ALPN is mandatory.

NOTE: This uses "TBD-pgsql" as the protocol ID. We must register a
proper one with IANA before the release!

Author: Greg Stark, Heikki Linnakangas
Reviewed-by: Matthias van de Meent, Jacob Champion
2024-04-08 04:24:51 +03:00
d39a49c1e4 Support TLS handshake directly without SSLRequest negotiation
By skipping SSLRequest, you can eliminate one round-trip when
establishing a TLS connection. It is also more friendly to generic TLS
proxies that don't understand the PostgreSQL protocol.

This is disabled by default in libpq, because the direct TLS handshake
will fail with old server versions. It can be enabled with the
sslnegotation=direct option. It will still fall back to the negotiated
TLS handshake if the server rejects the direct attempt, either because
it is an older version or the server doesn't support TLS at all, but
the fallback can be disabled with the sslnegotiation=requiredirect
option.

Author: Greg Stark, Heikki Linnakangas
Reviewed-by: Matthias van de Meent, Jacob Champion
2024-04-08 04:24:49 +03:00
041b96802e Use streaming I/O in ANALYZE.
The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
2024-04-08 13:16:28 +12:00
72bd38cc99 Transform OR clauses to ANY expression
Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
on the preliminary stage of optimization when we are still working with the
expression tree.

Here Cn is a n-th constant expression, 'expr' is non-constant expression, 'op'
is an operator which returns boolean result and has a commuter (for the case
of reverse order of constant and non-constant parts of the expression,
like 'Cn op expr').

Sometimes it can lead to not optimal plan.  This is why there is a
or_to_any_transform_limit GUC.  It specifies a threshold value of length of
arguments in an OR expression that triggers the OR-to-ANY transformation.
Generally, more groupable OR arguments mean that transformation will be more
likely to win than to lose.

Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru
Author: Alena Rybakina <lena.ribackina@yandex.ru>
Author: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
2024-04-08 01:27:52 +03:00
b7b0f3f272 Use streaming I/O in sequential scans.
Instead of calling ReadBuffer() for each block, heap sequential scans
and TID range scans now use the streaming API introduced in b5a9b18cd0.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
2024-04-08 01:53:57 +12:00
6ed83d5fa5 Use bump memory context for tuplesorts
29f6a959c added a bump allocator type for efficient compact allocations.
Here we make use of this for non-bounded tuplesorts to store tuples.
This is very space efficient when storing narrow tuples due to bump.c
not having chunk headers.  This means we can fit more tuples in work_mem
before spilling to disk, or perform an in-memory sort touching fewer
cacheline.

Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
2024-04-08 00:32:26 +12:00
f3ff7bf83b Add XLogCtl->logInsertResult
This tracks the position of WAL that's been fully copied into WAL
buffers by all processes emitting WAL.  (For some reason we call that
"WAL insertion").  This is updated using atomic monotonic advance during
WaitXLogInsertionsToFinish, which is not when the insertions actually
occur, but it's the only place where we know where have all the
insertions have completed.

This value is useful in WALReadFromBuffers, which can verify that
callers don't try to read past what has been inserted.  (However, more
infrastructure is needed in order to actually use WAL after the flush
point, since it could be lost.)

The value is also useful in WaitXLogInsertionsToFinish() itself, since
we can now exit quickly when all WAL has been already inserted, without
even having to take any locks.
2024-04-07 14:06:30 +02:00
29f6a959cf Introduce a bump memory allocator
This introduces a bump MemoryContext type.  The bump context is best
suited for short-lived memory contexts which require only allocations
of memory and never a pfree or repalloc, which are unsupported.

Memory palloc'd into a bump context has no chunk header.  This makes
bump a useful context type when lots of small allocations need to be
done without any need to pfree those allocations.  Allocation sizes are
rounded up to the next MAXALIGN boundary, so with this and no chunk
header, allocations are very compact indeed.

Allocations are also very fast as bump does not check any freelists to
try and make use of previously free'd chunks.  It just checks if there
is enough room on the current block, and if so it bumps the freeptr
beyond this chunk and returns the value that the freeptr was previously
pointing to.  Simple and fast.  A new block is malloc'd when there's not
enough space in the current block.

Code using the bump allocator must take care never to call any functions
which could try to call realloc() (or variants), pfree(),
GetMemoryChunkContext() or GetMemoryChunkSpace() on a bump allocated
chunk.  Due to lack of chunk headers, these operations are unsupported.
To increase the chances of catching such issues, when compiled with
MEMORY_CONTEXT_CHECKING, bump allocated chunks are given a header and
any attempt to perform an unsupported operation will result in an ERROR.
Without MEMORY_CONTEXT_CHECKING, code attempting an unsupported
operation could result in a segfault.

A follow-on commit will implement the first user of bump.

Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
2024-04-08 00:02:43 +12:00
0ba8b75e7e Enlarge bit-space for MemoryContextMethodID
Reserve 4 bits for MemoryContextMethodID rather than 3.  3 bits did
technically allow a maximum of 8 memory context types, however, we've
opted to reserve some bit patterns which left us with only 4 slots, all
of which were used.

Here we add another bit which frees up 8 slots for future memory context
types.

In passing, adjust the enum names in MemoryContextMethodID to make it
more clear which ones can be used and which ones are reserved.

Author: Matthias van de Meent, David Rowley
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
2024-04-07 23:32:00 +12:00
41c51f0c68 Optimize visibilitymap_count() with AVX-512 instructions.
Commit 792752af4e added infrastructure for using AVX-512 intrinsic
functions, and this commit uses that infrastructure to optimize
visibilitymap_count().  Specificially, a new pg_popcount_masked()
function is introduced that applies a bitmask to every byte in the
buffer prior to calculating the population count, which is used to
filter out the all-visible or all-frozen bits as needed.  Platforms
without AVX-512 support should also see a nice speedup due to the
reduced number of calls to a function pointer.

Co-authored-by: Ants Aasma
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-06 22:58:23 -05:00
792752af4e Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible.  Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers.  This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions.  If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().

Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16.  The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.

Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
2024-04-06 21:56:23 -05:00
04e72ed617 BitmapHeapScan: Push skip_fetch optimization into table AM
Commit 7c70996ebf0949b142 introduced an optimization to allow bitmap
scans to operate like index-only scans by not fetching a block from the
heap if none of the underlying data is needed and the block is marked
all visible in the visibility map.

With the introduction of table AMs, a FIXME was added to this code
indicating that the skip_fetch logic should be pushed into the table
AM-specific code, as not all table AMs may use a visibility map in the
same way.

This commit resolves this FIXME for the current block. The layering
violation is still present in BitmapHeapScans's prefetching code, which
uses the visibility map to decide whether or not to prefetch a block.
However, this can be addressed independently.

Author: Melanie Plageman
Reviewed-by: Andres Freund, Heikki Linnakangas, Tomas Vondra, Mark Dilger
Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
2024-04-07 00:24:14 +02:00
87c21bb941 Implement ALTER TABLE ... SPLIT PARTITION ... command
This new DDL command splits a single partition into several parititions.
Just like ALTER TABLE ... MERGE PARTITIONS ... command, new patitions are
created using createPartitionTable() function with parent partition as the
template.

This commit comprises quite naive implementation which works in single process
and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the
operations including the tuple routing.  This is why this new DDL command
can't be recommended for large partitioned tables under a high load.  However,
this implementation come in handy in certain cases even as is.
Also, it could be used as a foundation for future implementations with lesser
locking and possibly parallel.

Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires
2024-04-07 01:18:44 +03:00
1adf16b8fb Implement ALTER TABLE ... MERGE PARTITIONS ... command
This new DDL command merges several partitions into the one partition of the
target table.  The target partition is created using new
createPartitionTable() function with parent partition as the template.

This commit comprises quite naive implementation which works in single process
and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the
operations including the tuple routing.  This is why this new DDL command
can't be recommended for large partitioned tables under a high load.  However,
this implementation come in handy in certain cases even as is.
Also, it could be used as a foundation for future implementations with lesser
locking and possibly parallel.

Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires
2024-04-07 01:18:43 +03:00
ee79928441 Clarify what is protected by WaitLSNLock
Not just WaitLSNState.waitersHeap, but also WaitLSNState.procInfos and
updating of WaitLSNState.minWaitedLSN is protected by WaitLSNLock.  There
is one now documented exclusion on fast-path checking of
WaitLSNProcInfo.inHeap flag.

Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
2024-04-07 00:49:53 +03:00
25f42429e2 Use an LWLock instead of a spinlock in waitlsn.c
This should prevent busy-waiting when number of waiting processes is high.

Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
Author: Alvaro Herrera
2024-04-07 00:49:53 +03:00
5bf748b86b Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively.  This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).

Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient.  All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.

The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value.  Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan.  Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.

Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.

Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute.  These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).

Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes.  They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem).  They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).

There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions.  Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.

Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
2024-04-06 11:47:10 -04:00
a365d9e2e8 Speed up tail processing when hashing aligned C strings, take two
After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated
the terminator's position and re-loaded the remaining bytes from
the input string. This was slower than the unaligned case for very
short strings. We already have all the data we need in a register,
so let's just mask off the bytes we need and hash them immediately.

In addition to endianness issues, the previous attempt upset valgrind
in the way it computed the mask. Whether by accident or by wisdom,
the author's proposed method passes locally with valgrind 3.22.

Ants Aasma, with cosmetic adjustments by me

Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com
2024-04-06 17:14:28 +07:00
0c25fee359 Teach fasthash_accum to use platform endianness for bytewise loads
This function previously used a mix of word-wise loads and bytewise
loads. The bytewise loads happened to be little-endian regardless of
platform. This in itself is not a problem. However, a future commit
will require the same result whether A) the input is loaded as a
word with the relevent bytes masked-off, or B) the input is loaded
one byte at a time.

While at it, improve debuggability of the internal hash state.

Discussion: https://postgr.es/m/CANWCAZZpuV1mES1mtSpAq8tWJewbrv4gEz6R_k4gzNG8GZ5gag%40mail.gmail.com
2024-04-06 17:14:28 +07:00
3bd8439ed6 Allow BufferAccessStrategy to limit pin count.
While pinning extra buffers to look ahead, users of strategies are in
danger of using too many buffers.  For some strategies, that means
"escaping" from the ring, and in others it means forcing dirty data to
disk very frequently with associated WAL flushing.  Since external code
has no insight into any of that, allow individual strategy types to
expose a clamp that should be applied when deciding how many buffers to
pin at once.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
2024-04-06 23:11:45 +13:00
f956ecd035 Convert uses of hash_string_pointer to fasthash equivalent
Remove duplicate hash_string_pointer() function definitions by creating
a new inline function hash_string() for this purpose.

This has the added advantage of avoiding strlen() calls when doing hash
lookup. It's not clear how many of these are perfomance-sensitive
enough to benefit from that, but the simplification is worth it on
its own.

Reviewed by Jeff Davis

Discussion: https://postgr.es/m/CANWCAZbg_XeSeY0a_PqWmWqeRATvzTzUNYRLeT%2Bbzs%2BYQdC92g%40mail.gmail.com
2024-04-06 12:20:40 +07:00
db17594ad7 Add macro to disable address safety instrumentation
fasthash_accum_cstring_aligned() uses a technique, found in various
strlen() implementations, to detect a string's NUL terminator by
reading a word at at time. That triggers failures when testing with
"-fsanitize=address", at least with frontend code. To enable using
this function anywhere, add a function attribute macro to disable
such testing.

Reviewed by Jeff Davis

Discussion: https://postgr.es/m/CANWCAZbwvp7oUEkbw-xP4L0_S_WNKq-J-ucP4RCNDPJnrakUPw%40mail.gmail.com
2024-04-06 12:20:40 +07:00
4b968e2027 Fix incorrect return type
fasthash32() calculates a 32-bit hashcode, but the return
type was uint64. Change to uint32.

Noted by Jeff Davis

Discussion: https://postgr.es/m/b16c93e6c736a422d4de668343515375664eb05d.camel%40j-davis.com
2024-04-06 12:20:40 +07:00