wal_buffers_full was introduced in pg_stat_wal in 8d9a935965f to
provide a metric for tuning the GUC wal_buffers. WalUsage was
introduced before that, in df3b181499. Moving this field to WalUsage
proves beneficial for several reasons:
- This information can now be made available at more layers than just
pg_stat_wal, with per-query granularity: EXPLAIN, pgss and the
VACUUM/ANALYZE logs.
- A patch is under discussion to provide WAL statistics at backend
level, and this move slightly simplifies the handling of pending
statistics. The remaining data in PgStat_PendingWalStats now relates
to write/sync counters and times, with equivalents present in
pg_stat_io that backend statistics are already able to track. So this
should cut all the dependencies between PgStat_PendingWalStats and WAL
stats at backend level.
As of this change, wal_buffers_full still shows up only in pg_stat_wal.
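For reference, the counter can be consulted at the cluster level with:

    SELECT wal_buffers_full FROM pg_stat_wal;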
Author: Bertrand Drouvot
Reviewed-by: Ilia Evdokimov
Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal
Allow multiple backends to initialize WAL buffers concurrently. This way
`MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without
taking a single LWLock in exclusive mode.
The new algorithm works as follows:
* reserve a page for initialization using XLogCtl->InitializeReserved,
* ensure the page is written out,
* once the page is initialized, try to advance XLogCtl->InitializedUpTo and
signal to waiters using XLogCtl->InitializedUpToCondVar condition
variable,
* repeat previous steps until we reserve initialization up to the target
WAL position,
* wait until concurrent initialization finishes using
XLogCtl->InitializedUpToCondVar.
Now, multiple backends can, in parallel, concurrently reserve pages,
initialize them, and advance XLogCtl->InitializedUpTo to point to the latest
initialized page.
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Many GUCs accept special values like -1 or an empty string to
disable the feature, use a system default, etc. While the
documentation consistently lists these special values, the GUC
descriptions are not so consistent. Many of them fail to mention the
special values, and those that do vary in phrasing and placement.
This commit aims to bring some consistency to this area by applying
the following rules:
* Special values should be listed at the end of the long
description.
* Descriptions should use numerals (e.g., "0") instead of words
(e.g., "zero").
* Special value mentions should be concise and direct (e.g., "0
disables the timeout.", "An empty string means use the operating
system setting.").
* Multiple special values should be listed in ascending order.
Of course, there are exceptions, such as
max_pred_locks_per_relation and search_path, whose special values
are too complex to include. And there are cases like
listen_addresses, where the meaning of an empty string is arguably
too obvious to include. In those cases, I've refrained from adding
special value information to the GUC description.
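For illustration, the reworded descriptions can be inspected through
pg_settings, whose extra_desc column carries the long description
(statement_timeout is just an arbitrary example here):

    SELECT name, short_desc, extra_desc
    FROM pg_settings
    WHERE name = 'statement_timeout';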
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/Z6aIy4aywxUZHAo6%40nathan
The RBTXN_PREPARE flag and the rbtxn_prepared macro could be
misinterpreted as indicating either the transaction type (e.g. a
prepared transaction or a normal transaction) or its current state
(e.g. skipped, or its prepare message already sent), especially after
commit 072ee847ad4 introduced the RBTXN_SENT_PREPARE flag and the
rbtxn_sent_prepare macro.
The RBTXN_PREPARE flag (and its corresponding macro) has been renamed
to RBTXN_IS_PREPARE to explicitly indicate the transaction type.
Accordingly, this commit also adds the RBTXN_IS_PREPARE flag to
transactions that are prepared but have been skipped, which previously
carried only the RBTXN_SKIPPED_PREPARE flag.
Reviewed-by: Amit Kapila, Peter Smith
Discussion: https://postgr.es/m/CAA4eK1KgNmBsG%3D155E7QQ6TX9RoWnM4z5Z20SvsbwxSe_QXYsg%40mail.gmail.com
Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.
This commit adds an additional CLOG lookup to check the transaction
status, allowing logical decoding to skip changes even when the
transaction does not touch system catalogs, if it is already aborted.
This optimization enhances logical decoding performance, especially
for large transactions that have already been rolled back, as it
avoids unnecessary disk or network I/O.
To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction. Performance benchmark results
showed no noticeable regression due to the CLOG lookups.
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian
Reviewed-by: Dilip Kumar, Andres Freund
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
Remove some (char *) casts related to uses of the pg_checksum_page()
function. These casts are useless, because everything involved
already has the right type. Moreover, these casts actually silently
discarded a const qualifier. The declaration of a higher-level
function needs to be adjusted to fix that.
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
This commit adds the amount of time spent sleeping due to
cost-based delay to the pg_stat_progress_vacuum and
pg_stat_progress_analyze system views. A new configuration
parameter named track_cost_delay_timing, which is off by default,
controls whether this information is gathered. For vacuum, the
reported value includes the sleep time of any associated parallel
workers. However, parallel workers only report their sleep time
once per second to avoid overloading the leader process.
Bumps catversion.
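A usage sketch, assuming the new counter is exposed as a delay_time
column in both progress views:

    SET track_cost_delay_timing = on;
    SELECT pid, relid, delay_time FROM pg_stat_progress_vacuum;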
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
Aggressive vacuums must scan every unfrozen tuple in order to advance
the relfrozenxid/relminmxid. Because data is often vacuumed before it is
old enough to require freezing, relations may build up a large backlog
of pages that are set all-visible but not all-frozen in the visibility
map. When an aggressive vacuum is triggered, all of these pages must be
scanned. These pages have often been evicted from shared buffers and
even from the kernel buffer cache. Thus, aggressive vacuums often incur
large amounts of extra I/O at the expense of foreground workloads.
To amortize the cost of aggressive vacuums, eagerly scan some
all-visible but not all-frozen pages during normal vacuums.
All-visible pages that are eagerly scanned and set all-frozen in the
visibility map are counted as successful eager freezes and those not
frozen are counted as failed eager freezes.
If too many eager scans fail in a row, eager scanning is temporarily
suspended until a later portion of the relation. The number of failures
tolerated is configurable globally and per table.
To effectively amortize aggressive vacuums, we cap the number of
successes as well. Capping eager freeze successes also limits the amount
of potentially wasted work if these pages are modified again before the
next aggressive vacuum. Once we reach the maximum number of blocks
successfully eager frozen, eager scanning is disabled for the remainder
of the vacuum of the relation.
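A per-table tuning sketch, assuming the failure cap is exposed as a
vacuum_max_eager_freeze_failure_rate GUC with a matching reloption
(the name here is illustrative):

    -- tolerate fewer eager-freeze failures before eager scanning
    -- is temporarily suspended for this table
    ALTER TABLE measurements
        SET (vacuum_max_eager_freeze_failure_rate = 0.01);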
Original design idea from Robert Haas, with enhancements from
Andres Freund, Tomas Vondra, and me
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com
"I/O" seems more descriptive than "Asynchronous Behavior", given that some of
the GUCs in the section don't relate to anything asynchronous.
Most other abbreviations in the config sections are un-abbreviated, but
"Input/Output" seems less likely to be helpful than just IO or I/O.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/x3tlw2jk5gm3r3mv47hwrshffyw7halpczkfbk3peksxds7bvc@lguk43z3bsyq
Commit 1dc5ebc90 allowed PL/pgSQL to perform in-place updates
of expanded-object variables that are being updated with
assignments like "x := f(x, ...)". However this was allowed
only for a hard-wired list of functions f(), since we need to
be sure that f() will not modify the variable if it fails.
It was always envisioned that we should make that extensible,
but at the time we didn't have a good way to do so. Since
then we've invented the idea of "support functions" to allow
attaching specialized optimization knowledge to functions,
and that is a perfect mechanism for doing this.
Hence, adjust PL/pgSQL to use a support function request instead
of hard-wired logic to decide if in-place update is safe.
Preserve the previous optimizations by creating support functions
for the three functions that were previously hard-wired.
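The pattern in question, with array_append() being one of the three
previously hard-wired functions:

    DO $$
    DECLARE
      x int[] := '{}';
    BEGIN
      FOR i IN 1..1000 LOOP
        x := array_append(x, i);  -- updates the expanded array in place
      END LOOP;
    END $$;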
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Discussion: https://postgr.es/m/CACxu=vJaKFNsYxooSnW1wEgsAO5u_v1XYBacfVJ14wgJV_PYeg@mail.gmail.com
If a read/write expanded variable is declared locally to the
assignment statement that is updating it, and it is referenced
exactly once in the assignment RHS, then we can optimize the
operation as a direct update of the expanded value, whether
or not the function(s) operating on it can be trusted not to
modify the value before throwing an error. This works because
if an error does get thrown, we no longer care what value the
variable has.
In cases where that doesn't work, fall back to the previous
rule that checks for safety of the top-level function.
In any case, postpone determination of whether these optimizations
are feasible until we are executing a Param referencing the target
variable and that variable holds an R/W expanded object. While the
previous incarnation of exec_check_rw_parameter was pretty cheap,
this is a bit less so, and our plan to invoke support functions
will make it even less so. So avoiding the check for variables
where it couldn't be useful should be a win.
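A sketch of the newly optimizable case: "x" is declared local to the
block containing the assignment and appears exactly once on the RHS,
so the update can be done in place even though array_remove() is not
known to be safe against mid-execution errors:

    DO $$
    DECLARE
      x numeric[] := array[1, 2, 3];
    BEGIN
      x := array_remove(x, 2);
    END $$;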
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Discussion: https://postgr.es/m/CACxu=vJaKFNsYxooSnW1wEgsAO5u_v1XYBacfVJ14wgJV_PYeg@mail.gmail.com
This commit adds fmtIdEnc() and fmtQualifiedIdEnc(), which allow the
encoding to be specified as an explicit argument. Additionally,
setFmtEncoding() is provided, which defines the encoding used when no
explicit encoding is provided, to avoid breaking all code using
fmtId().
All users of fmtId()/fmtQualifiedId() are either converted to the
explicit version or have had a call to setFmtEncoding() added.
This commit does not yet utilize the now well-defined encoding; that
will happen in a subsequent commit.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 13
Security: CVE-2025-1094
There are cases where we cannot, or do not want to, error out for
invalidly encoded input. In such cases it can be useful to replace
e.g. an incomplete multi-byte character with bytes that will trigger
an error when validated as part of a larger string.
Unfortunately, until now, no such sequence existed for some encodings.
For those encodings this commit removes one previously accepted input
combination - we consider that to be OK, as the chosen bytes are
outside of the valid ranges for the encodings; we just previously
failed to detect that.
As we cannot add a new field to pg_wchar_table without breaking ABI,
this is implemented "in-line" in the newly added function.
Author: Noah Misch <noah@leadboat.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Backpatch-through: 13
Security: CVE-2025-1094
This is similar to d575051b9af9, but this time for the comments in
plannodes.h, to avoid long lines, which is useful when adding
per-field annotations with pg_node_attr() to these planner structures.
Some patches under discussion would add such properties to planner
fields, which may or may not happen; this change makes future
proposals easier to work on and review, while being more consistent in
style with the parse nodes.
Author: Sami Imseih
Discussion: https://postgr.es/m/Z5xTb5iBHVGns35R@paquier.xyz
This adds a new variant of generated columns that are computed on read
(like a view, unlike the existing stored generated columns, which are
computed on write, like a materialized view).
The syntax for the column definition is
... GENERATED ALWAYS AS (...) VIRTUAL
where the VIRTUAL keyword is optional. VIRTUAL is the default rather
than STORED, to match various other SQL products. (The SQL standard
makes no specification about this, but it also doesn't know about
VIRTUAL or STORED.) (Also, virtual views are the default, rather than
materialized views.)
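For example:

    CREATE TABLE t (
        a int,
        b int GENERATED ALWAYS AS (a * 2) VIRTUAL
    );

    INSERT INTO t (a) VALUES (10);
    SELECT b FROM t;  -- computed on read, returns 20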
Virtual generated columns are stored in tuples as null values. (A
very early version of this patch had the ambition to not store them at
all. But so much stuff breaks or gets confused if you have tuples
where a column in the middle is completely missing. This is a
compromise, and it still saves space over being forced to use stored
generated columns. If we ever find a way to improve this, a bit of
pg_upgrade cleverness could allow for upgrades to a newer scheme.)
The capabilities and restrictions of virtual generated columns are
mostly the same as for stored generated columns. In some cases, this
patch keeps virtual generated columns more restricted than they might
technically need to be, to keep the two kinds consistent. Some of
that could maybe be relaxed later after separate careful
considerations.
Some functionality that is currently not supported, but could possibly
be added as incremental features, some easier than others:
- index on or using a virtual column
- hence also no unique constraints on virtual columns
- extended statistics on virtual columns
- foreign-key constraints on virtual columns
- not-null constraints on virtual columns (check constraints are supported)
- ALTER TABLE / DROP EXPRESSION
- virtual column cannot have domain type
- virtual columns are not supported in logical replication
The tests in generated_virtual.sql have been copied over from
generated_stored.sql with the keyword replaced. This way we can make
sure the behavior is mostly aligned, and the differences can be
visible. Some tests for currently not supported features are
currently commented out.
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Tested-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b0a6@eisentraut.org
This commit introduces changes to track unpruned relations explicitly,
making it possible for top-level plan nodes, such as ModifyTable and
LockRows, to avoid processing partitions pruned during initial
pruning. Scan-level nodes, such as Append and MergeAppend, already
avoid the unnecessary processing by accessing partition pruning
results directly via part_prune_index. In contrast, top-level nodes
cannot access pruning results directly and need to determine which
partitions remain unpruned.
To address this, this commit introduces a new bitmapset field,
es_unpruned_relids, which the executor uses to track the set of
unpruned relations. This field is referenced during plan
initialization to skip initializing certain nodes for pruned
partitions. It is initialized with PlannedStmt.unprunableRelids,
a new field that the planner populates with RT indexes of relations
that cannot be pruned during runtime pruning. These include relations
not subject to partition pruning and those required for execution
regardless of pruning.
PlannedStmt.unprunableRelids is computed during set_plan_refs() by
removing the RT indexes of runtime-prunable relations, identified
from PartitionPruneInfos, from the full set of relation RT indexes.
ExecDoInitialPruning() then updates es_unpruned_relids by adding
partitions that survive initial pruning.
To support this, PartitionedRelPruneInfo and PartitionedRelPruningData
now include a leafpart_rti_map[] array that maps partition indexes to
their corresponding RT indexes. The former is used in set_plan_refs()
when constructing unprunableRelids, while the latter is used in
ExecDoInitialPruning() to convert partition indexes returned by
get_matching_partitions() into RT indexes, which are then added to
es_unpruned_relids.
These changes make it possible for ModifyTable and LockRows nodes to
process only relations that remain unpruned after initial pruning.
ExecInitModifyTable() trims lists, such as resultRelations,
withCheckOptionLists, returningLists, and updateColnosLists, to
consider only unpruned partitions. It also creates ResultRelInfo
structs only for these partitions. Similarly, child RowMarks for
pruned relations are skipped.
By avoiding unnecessary initialization of structures for pruned
partitions, these changes improve the performance of updates and
deletes on partitioned tables during initial runtime pruning.
Due to ExecInitModifyTable() changes as described above, EXPLAIN on a
plan for UPDATE and DELETE that uses runtime initial pruning no longer
lists partitions pruned during initial pruning.
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions)
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com
One way autovacuum chooses tables to vacuum is by comparing the
number of updated or deleted tuples with a value calculated using
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
The threshold specifies the base value for comparison, and the
scale factor specifies the fraction of the table size to add to it.
This strategy ensures that smaller tables are vacuumed after fewer
updates/deletes than larger tables, which is reasonable in many
cases but can result in infrequent vacuums on very large tables.
This is undesirable for a couple of reasons, such as very large
tables incurring a huge amount of bloat between vacuums.
This new parameter provides a way to set a limit on the value
calculated with autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor so that very large tables are
vacuumed more frequently. By default, it is set to 100,000,000
tuples, but it can be disabled by setting it to -1. It can also be
adjusted for individual tables by changing storage parameters.
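A sketch of adjusting the cap, assuming the new parameter is named
autovacuum_vacuum_max_threshold for both the GUC and the storage
parameter:

    -- cluster-wide
    ALTER SYSTEM SET autovacuum_vacuum_max_threshold = 50000000;
    -- per table
    ALTER TABLE events SET (autovacuum_vacuum_max_threshold = 10000000);
    -- disable the cap for one table
    ALTER TABLE archive SET (autovacuum_vacuum_max_threshold = -1);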
Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Frédéric Yhuel <frederic.yhuel@dalibo.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Reviewed-by: Joe Conway <mail@joeconway.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Vinícius Abrahão <vinnix.bsd@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Alena Rybakina <a.rybakina@postgrespro.ru>
Discussion: https://postgr.es/m/956435f8-3b2f-47a6-8756-8c54ded61802%40dalibo.com
If exactly one relation type is requested in a command of the \dtisv
family, say "tables", "indexes", etc. instead of "relations". This
should cover the majority of actual uses, without creating a huge
number of new translatable strings. The error messages for no
matching relations are adjusted as well.
In passing, invent "pg_log_error_internal()" to be used for frontend
error messages that don't seem to need translation, analogously to
errmsg_internal() in the backend. The implementation is a bit cheesy,
being just a macro to prevent xgettext from recognizing a trigger
keyword. This won't avoid a useless gettext lookup cycle at runtime
--- but surely we don't care about an extra microsecond or two in
what's supposed to be a can't-happen case. I (tgl) also made
"pg_fatal_internal()", though it's not used in this patch.
Author: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAKAnmm+7o93fQV-RFkGaN1QnP-0D4d3JTykD+cLueqjDMKdfag@mail.gmail.com
It is possible for the inactive_since value of an invalid replication
slot to be updated multiple times, for example during the release of
the slot or at the time of restart, which is unexpected behavior. This
is harmless because invalid slots are not allowed to be accessed, but
it is not prudent to update invalid slots. We are planning to
invalidate slots for other reasons, such as idle time, and it would
look odd for a slot's inactive_since to display a recent time after
the slot has been invalidated due to idle time. So, this patch ensures
that the inactive_since field is not updated for invalid slots.
In passing, ensure that the same inactive_since time is used for all
slots restored from disk at restart.
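The effect is visible in pg_replication_slots, e.g.:

    SELECT slot_name, invalidation_reason, inactive_since
    FROM pg_replication_slots;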
Author: Nisha Moond <nisha.moond412@gmail.com>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CABdArM7QdifQ_MHmMA=Cc4v8+MeckkwKncm2Nn6tX9wSCQ-+iw@mail.gmail.com
This commit adds WAL IO stats to both pg_stat_io view and per-backend IO
statistics (pg_stat_get_backend_io()). This change is possible since
f92c854cf406, as WAL IO is not counted in blocks in some code paths
where its stats data is measured (like WAL read in xlogreader.c).
IOContext gains IOCONTEXT_INIT and IOObject IOOBJECT_WAL, with the
following combinations allowed:
- IOOBJECT_WAL/IOCONTEXT_NORMAL is used to track I/O operations done on
already-created WAL segments.
- IOOBJECT_WAL/IOCONTEXT_INIT is used for tracking I/O operations done
when initializing WAL segments.
The core changes are done in pg_stat_io.c; backend statistics inherit
them. Backend statistics and pg_stat_io are now available for the WAL
writer, the WAL receiver and the WAL summarizer processes.
I/O timing data is controlled by the GUC track_io_timing, like the
existing data of pg_stat_io for consistency. The timings related to
IOOBJECT_WAL show up if the GUC is enabled (disabled by default).
Bump pgstats file version, due to the additions in IOObject and
IOContext, impacting the amount of data written for the fixed-numbered
IO stats kind in the pgstats file.
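For example, assuming IOOBJECT_WAL and IOCONTEXT_INIT show up in the
view as object = 'wal' and context = 'init':

    SELECT backend_type, object, context, reads, writes, fsyncs
    FROM pg_stat_io
    WHERE object = 'wal';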
Author: Nazir Bilal Yavuz
Reviewed-by: Bertrand Drouvot, Nitin Jadhav, Amit Kapila, Michael
Paquier, Melanie Plageman, Bharath Rupireddy
Discussion: https://postgr.es/m/CAN55FZ3AiQ+ZMxUuXnBpd0Rrh1YhwJ5FudkHg=JU0P+-W8T4Vg@mail.gmail.com
This turns GistTranslateCompareType() into a callback function of the
gist index AM instead of a standalone function. The existing callers
are changed to use IndexAmTranslateCompareType(), so that the code is
no longer hardcoded for gist.
This means in particular that the temporal keys code is now
independent of gist. Also, this generalizes commit 74edabce7a3, so
index access methods other than the previously hardcoded ones can now
work as REPLICA IDENTITY in a logical replication subscriber.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
For each Index AM, provide a mapping between operator strategies and
the system-wide generic concept of a comparison type. For example,
for btree, BTLessStrategyNumber maps to and from COMPARE_LT. Numerous
places in the planner and executor think directly in terms of btree
strategy numbers (and a few in terms of hash strategy numbers). These
should be converted in subsequent commits to think in terms of
CompareType instead. (This commit doesn't make any use of this API
yet.)
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
Consistently use "Size" (or size_t, or in some places int64 or double)
as the type for variables holding memory allocation sizes. In most
places variables' data types were fine already, but we had an ancient
habit of computing bytes from kilobytes-units GUCs with code like
"work_mem * 1024L". That risks overflow on Win64, where "long" is not
as wide as "size_t". We worked around that by restricting
such GUCs' ranges, so you couldn't set work_mem et al higher than 2GB
on Win64. This patch removes that restriction, after replacing such
calculations with "work_mem * (Size) 1024" or variants of that.
It should be noted that this patch was constructed by searching
outwards from the GUCs that have MAX_KILOBYTES as upper limit.
So I can't positively guarantee there are no other places doing
memory-size arithmetic in int or long variables. I do however feel
pretty confident that increasing MAX_KILOBYTES on Win64 is safe now.
Also, nothing in our code should be dealing in multiple-gigabyte
allocations without authorization from a relevant GUC, so it seems
pretty likely that this search caught everything that could be at
risk of overflow.
Author: Vladlen Popolitov <v.popolitov@postgrespro.ru>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217
This commit builds on the prior change that moved PartitionPruneInfos
out of individual plan nodes into a list in PlannedStmt, making it
possible to initialize PartitionPruneStates without traversing the
plan tree and perform runtime initial pruning before ExecInitNode()
initializes the plan trees. These tasks are now handled in a new
routine, ExecDoInitialPruning(), which is called by InitPlan()
before calling ExecInitNode() on various plan trees.
ExecDoInitialPruning() performs the initial pruning and saves the
result -- a Bitmapset of indexes for surviving child subnodes -- in
es_part_prune_results, a list in EState.
PartitionPruneStates created for initial pruning are stored in
es_part_prune_states, another list in EState, for later use during
exec pruning. Both lists are parallel to es_part_prune_infos, which
holds the PartitionPruneInfos from PlannedStmt, enabling shared
indexing.
PartitionPruneStates initialized in ExecDoInitialPruning() now
include only the PartitionPruneContexts for initial pruning steps.
Exec pruning contexts are initialized later in
ExecInitPartitionExecPruning() when the parent plan node is
initialized, as the exec pruning step expressions depend on the parent
node's PlanState.
The existing function PartitionPruneFixSubPlanMap() has been
repurposed for this initialization, to avoid duplicating a similar
loop structure for finding the PartitionedRelPruningData whose exec
pruning contexts are to be initialized. It has been renamed to
InitExecPruningContexts() to reflect its new primary responsibility.
The original logic to "fix subplan maps" remains intact but is now
encapsulated within the renamed function.
This commit removes two obsolete Asserts in partkey_datum_from_expr().
The ExprContext used for pruning expression evaluation is now
independent of the parent PlanState, making these Asserts unnecessary.
By centralizing pruning logic and decoupling it from the plan
initialization step (ExecInitNode()), this change sets the stage for
future patches that will use the result of initial pruning to
save the overhead of redundant processing for pruned partitions.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com
Once a replication slot is invalidated, it cannot be altered or used to
fetch changes. However, a process could still acquire an invalid slot and
fail later.
For example, if a process acquires a logical slot that was invalidated due
to wal_removed, it will eventually fail in CreateDecodingContext() when
attempting to access the removed WAL. Similarly, for physical replication
slots, even if the slot is invalidated and invalidation_reason is set to
wal_removed, the walsender does not currently check for invalidation when
starting physical replication. Instead, replication starts, and an error
is only reported later while trying to access WAL. Similarly, we prohibit
modifying properties of invalid slots, but the error is reported only
after the slot has been acquired.
This patch improves error handling by detecting invalid slots earlier,
at the time of slot acquisition, which is the first step. This also
helps unify the different ERROR messages used in different places into
a consistent message for invalid slots. This means that the message
for invalid slots changes to a generic one.
This will also be helpful for future patches where we plan to
invalidate slots for more reasons, such as idle_timeout, because we
won't have to modify multiple places and risk missing one.
Author: Nisha Moond <nisha.moond412@gmail.com>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CABdArM6pBL5hPnSQ+5nEVMANcF4FCH7LQmgskXyiLY75TMnKpw@mail.gmail.com
This allows users of the cumulative statistics to drop entries in the
shared stats hash table, as well as deleting local references.
Callers of this function can optionally define a callback able to
filter which entries to drop, similarly to
pgstat_reset_matching_entries() with its callback do_reset().
pgstat_drop_all_entries() is refactored so that it uses this new
function.
Author: Lukas Fitti
Discussion: https://postgr.es/m/CAP53PkwuFbo3NkwZgxwNRMjMfqPEqidD-SggaoQ4ijotBVLJAA@mail.gmail.com
This change adapts these functions to the machine's address width
without depending on "long" to be the right size. (It isn't on
Win64, for example.) While it seems unlikely anyone would care
to run with a stack depth limit exceeding 2GB, this is part of a
general push to avoid using type "long" to represent memory sizes.
It's convenient to use ssize_t rather than the perhaps-more-obvious
choice of size_t/Size, because the code involved depends on working
with a signed data type. Our MAX_KILOBYTES limit already ensures
that ssize_t will be sufficient to represent the maximum value of
max_stack_depth.
Extracted from a larger patch by Vladlen, plus additional hackery
by me.
Author: Vladlen Popolitov <v.popolitov@postgrespro.ru>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217
Instead of deciding at runtime whether to read from casetest.value
or caseValue_datum, split EEOP_CASE_TESTVAL into two opcodes and
make the decision during expression compilation. Similarly for
EEOP_DOMAIN_TESTVAL. This actually results in net less code,
mainly because llvmjit_expr.c's code for handling these opcodes
gets shorter. The performance gain is doubtless negligible, but
this seems worth changing anyway on grounds of simplicity and
understandability.
Author: Andreas Karlsson <andreas@proxel.se>
Co-authored-by: Xing Guo <higuoxing@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACpMh+AiBYAWn+D1aU7Rsy-V1tox06Cbc0H3qA7rwL5zdJ=anQ@mail.gmail.com
This moves PartitionPruneInfo from plan nodes to PlannedStmt,
simplifying traversal by centralizing all PartitionPruneInfo
structures, for the main query and its subqueries, in a single list
there. Instead of plan nodes (Append or
MergeAppend) storing PartitionPruneInfo pointers, they now reference
an index in this list.
A bitmapset field is added to PartitionPruneInfo to store the RT
indexes corresponding to the apprelids field in Append or MergeAppend.
This allows execution pruning logic to verify that it operates on the
correct plan node, mainly to facilitate debugging.
Duplicated code in set_append_references() and
set_mergeappend_references() is refactored into a new function,
register_pruneinfo(). This updates RT indexes by applying rtoffset
and adds the PartitionPruneInfo to the global list in PlannerGlobal.
By allowing pruning to be performed without traversing the plan tree,
this change lays the groundwork for runtime initial pruning to occur
independently of plan tree initialization.
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> (earlier version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com
If a column is omitted in an INSERT, and there's no column default,
the code in preptlist.c generates a NULL Const to be inserted.
Furthermore, if the column is of a domain type, we wrap the Const
in CoerceToDomain, so as to throw a run-time error if the domain
has a NOT NULL constraint. That's fine as far as it goes, but
there are two problems:
1. We're being sloppy about the type/typmod that the Const is
labeled with. It really should have the domain's base type/typmod,
since it's the input to CoerceToDomain not the output. This can
result in coerce_to_domain inserting a useless length-coercion
function (useless because it's being applied to a null). The
coercion would typically get const-folded away later, but it'd
be better not to create it in the first place.
2. We're not applying expression preprocessing (specifically,
eval_const_expressions) to the resulting expression tree.
The planner's primary expression-preprocessing pass already happened,
so that means the length coercion step and CoerceToDomain node miss
preprocessing altogether.
This is at the least inefficient, since it means the length coercion
and CoerceToDomain will actually be executed for each inserted row,
though they could be const-folded away in most cases. Worse, it
seems possible that missing preprocessing for the length coercion
could result in an invalid plan (for example, due to failing to
perform default-function-argument insertion). I'm not aware of
any live bug of that sort with core datatypes, and it might be
unreachable for extension types as well because of restrictions of
CREATE CAST, but I'm not entirely convinced that it's unreachable.
Hence, it seems worth back-patching the fix (although I only went
back to v14, as the patch doesn't apply cleanly at all in v13).
There are several places in the rewriter that are building null
domain constants the same way as preptlist.c. While those are
before the planner and hence don't have any reachable bug, they're
still applying a length coercion that will be const-folded away
later, uselessly wasting cycles. Hence, make a utility routine
that all of these places can call to do it right.
Making this code more careful about the typmod assigned to the
generated NULL constant has visible but cosmetic effects on some
of the plans shown in contrib/postgres_fdw's regression tests.
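For context, the pattern in question arises from an INSERT that omits
a domain-typed column with no default, for example:

    CREATE DOMAIN short_text AS varchar(10) NOT NULL;
    CREATE TABLE t (a int, b short_text);
    INSERT INTO t (a) VALUES (1);  -- the implicit NULL for b is wrapped
                                   -- in CoerceToDomain and fails the
                                   -- domain's NOT NULL check at runtime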
Discussion: https://postgr.es/m/1865579.1738113656@sss.pgh.pa.us
Backpatch-through: 14
When running on Windows, canonicalize_path() converts '\' to '/'
to prevent confusing the Windows command processor. It was
doing that in a non-encoding-aware fashion; but in SJIS there
are valid two-byte characters whose second byte matches '\'.
So encoding corruption ensues if such a character is used in
the path.
We can fairly easily fix this if we know which encoding is
in use, but a lot of our utilities don't have much of a clue
about that. After some discussion we decided we'd settle for
fixing this only in psql, and assuming that its value of
client_encoding matches what the user is typing.
It seems hopeless to get the server to deal with the problematic
characters in database path names, so we'll just declare that
case to be unsupported. That means nothing need be done in
the server, nor in utility programs whose only contact with
file path names is for database paths. But psql frequently
deals with client-side file paths, so it'd be good if it
didn't mess those up.
Bug: #18735
Reported-by: Koichi Suzuki <koichi.suzuki@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Koichi Suzuki <koichi.suzuki@enterprisedb.com>
Discussion: https://postgr.es/m/18735-4acdb3998bb9f2b1@postgresql.org
Backpatch-through: 13
This commit adds four fields to the statistics of relations, aggregating
the amount of time spent for each operation on a relation:
- total_vacuum_time, for manual vacuum.
- total_autovacuum_time, for vacuum done by the autovacuum daemon.
- total_analyze_time, for manual analyze.
- total_autoanalyze_time, for analyze done by the autovacuum daemon.
This gives users the option to derive the average time spent for these
operations with the help of the related "count" fields.
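For example, assuming the new fields use milliseconds like other
pg_stat timing data:

    SELECT relname,
           total_vacuum_time / NULLIF(vacuum_count, 0) AS avg_vacuum_ms,
           total_autovacuum_time / NULLIF(autovacuum_count, 0)
               AS avg_autovacuum_ms
    FROM pg_stat_all_tables;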
Bump catalog version (for the catalog changes) and PGSTAT_FILE_FORMAT_ID
(for the additions in PgStat_StatTabEntry).
Author: Sami Imseih
Reviewed-by: Bertrand Drouvot, Michael Paquier
Discussion: https://postgr.es/m/CAA5RZ0uVOGBYmPEeGF2d1B_67tgNjKx_bKDuL+oUftuoz+=Y1g@mail.gmail.com
The right mix of DDL and VACUUM could corrupt a catalog page header such
that PageIsVerified() durably fails, requiring a restore from backup.
This affects only catalogs that both have a syscache and have DDL code
that uses syscache tuples to construct updates. One of the test
permutations shows a variant not yet fixed.
This makes !TransactionIdIsValid(TM_FailureData.xmax) possible with
TM_Deleted. I think core and PGXN are indifferent to that.
Per bug #17821 from Alexander Lakhin. Back-patch to v13 (all supported
versions). The test case is v17+, since it uses INJECTION_POINT.
Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. Serializing stats already
happens in checkpointer, even though walsenders can be active longer.
The only reason the current shutdown sequence does not actively cause problems
is that walsender currently does not generate any stats. However, there is an
upcoming patch changing that.
Another need for this change originates in the AIO patchset, where IO
workers (which, in some edge cases, can emit stats of their own) need to run
while the shutdown checkpoint is being written.
This commit changes the shutdown sequence so that checkpointer is
signalled (via SIGINT) to trigger writing the shutdown checkpoint
without also causing checkpointer to exit. Once checkpointer has
written the shutdown checkpoint, it notifies postmaster via
PMSIGNAL_XLOG_IS_SHUTDOWN and waits for the termination signal
(SIGUSR2, as before). Checkpointer is now terminated only after all
other children, except dead-end children and the logger, have been
terminated, as tracked using the new PM_WAIT_CHECKPOINTER PMState.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
Useful for caseless matching. Similar to LOWER(), but avoids the
edge-case problems of using LOWER() for that purpose.
For collations that support it, CASEFOLD() handles characters with
more than two case variations or multi-character case variations. Some
characters may fold to uppercase. The results of case folding are also
more stable across Unicode versions than LOWER() or UPPER().
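A small illustration (results depend on the collation's case-folding
support; shown here under full case folding):

    SELECT casefold('ẞ');  -- 'ss', a multi-character fold,
                           -- whereas lower('ẞ') yields 'ß'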
Discussion: https://postgr.es/m/a1886ddfcd8f60cb3e905c93009b646b4cfb74c5.camel%40j-davis.com
Reviewed-by: Ian Lawrence Barwick