If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct. But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are. This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh. We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)
In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index. Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.
Back-patch to all supported branches. Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.
Patch by me, with documentation wording suggested by Dean Rasheed
Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
Discussion of commit 3e2f3c2e4 exposed a problem that is of longer
standing: since we don't detoast data while sticking it into a portal's
holdStore for PORTAL_ONE_RETURNING and PORTAL_UTIL_SELECT queries, and we
release the query's snapshot as soon as we're done loading the holdStore,
later readout of the holdStore can do TOAST fetches against data that can
no longer be seen by any of the session's live snapshots. This means that
a concurrent VACUUM could remove the TOAST data before we can fetch it.
Commit 3e2f3c2e4 exposed the problem by showing that sometimes we had *no*
live snapshots while fetching TOAST data, but we'd be at risk anyway.
I believe this code was all right when written, because our management of a
session's exposed xmin was such that the TOAST references were safe until
end of transaction. But that's no longer true now that we can advance or
clear our PGXACT.xmin intra-transaction.
To fix, copy the query's snapshot during FillPortalStore() and save it in
the Portal; release it only when the portal is dropped. This essentially
implements a policy that we must hold a relevant snapshot whenever we
access potentially-toasted data. We had already come to that conclusion
in other places, cf commits 08e261cbc94ce9a7 and ec543db77b6b72f2.
I'd have liked to add a regression test case for this, but I didn't see
a way to make one that's not unreasonably bloated; it seems to require
returning a toasted value to the client, and those will be big.
In passing, improve PortalRunUtility() so that it positively verifies
that its ending PopActiveSnapshot() call will pop the expected snapshot,
removing a rather shaky assumption about which utility commands might
do their own PopActiveSnapshot(). There's no known bug here, but now
that we're actively referencing the snapshot it's almost free to make
this code a bit more bulletproof.
We might want to consider back-patching something like this into older
branches, but it would be prudent to let it prove itself more in HEAD
beforehand.
Discussion: <87vazemeda.fsf@credativ.de>
tqual.h is included in some front-end compiles, and a static inline
breaks on buildfarm member castoroides. Since the macro is never
referenced, it should dodge that problem, although this doesn't
seem like the cleanest way of hiding things from front-end compiles.
Report and review by Tom Lane; patch by me.
Previously, we tested for MVCC snapshots to see whether they were too
old, but not TOAST snapshots, which can lead to complaints about missing
TOAST chunks if those chunks are subject to early pruning. Ideally,
the threshold lsn and timestamp for a TOAST snapshot would be that of
the corresponding MVCC snapshot, but since we have no way of deciding
which MVCC snapshot was used to fetch the TOAST pointer, use the oldest
active or registered snapshot instead.
Reported by Andres Freund, who also sketched out what the fix should
look like. Patch by me, reviewed by Amit Kapila.
Previously, if an INSERT with multiple rows of VALUES had indirection
(array subscripting or field selection) in its target-columns list, the
parser handled that by applying transformAssignedExpr() to each element
of each VALUES row independently. This led to having ArrayRef assignment
nodes or FieldStore nodes in each row of the VALUES RTE. That works for
simple cases, but in bug #14265 Nuri Boardman points out that it fails
if there are multiple assignments to elements/fields of the same target
column. For such cases to work, rewriteTargetListIU() has to nest the
ArrayRefs or FieldStores together to produce a single expression to be
assigned to the column. But it failed to find them in the top-level
targetlist and issued an error about "multiple assignments to same column".
We could possibly fix this by teaching the rewriter to apply
rewriteTargetListIU to each VALUES row separately, but that would be messy
(it would change the output rowtype of the VALUES RTE, for example) and
inefficient. Instead, let's fix the parser so that the VALUES RTE outputs
are just the user-specified values, cast to the right type if necessary,
and then the ArrayRefs or FieldStores are applied in the top-level
targetlist to Vars representing the RTE's outputs. This is the same
parsetree representation already used for similar cases with INSERT/SELECT
syntax, so it allows simplifications in ruleutils.c, which no longer needs
to treat INSERT-from-multiple-VALUES as its own special case.
This implementation works by applying transformAssignedExpr to the VALUES
entries as before, and then stripping off any ArrayRefs or FieldStores it
adds. With lots of VALUES rows it would be noticeably more efficient to
not add those nodes in the first place. But that's just an optimization
not a bug fix, and there doesn't seem to be any good way to do it without
significant refactoring. (A non-invasive answer would be to apply
transformAssignedExpr + stripping to just the first VALUES row, and then
just forcibly cast remaining rows to the same data types exposed in the
first row. But this way would lead to different, not-INSERT-specific
errors being reported in casting failure cases, so it doesn't seem very
nice.) So leave that for later; this patch at least isn't making the
per-row parsing work worse, and it does make the finished parsetree
smaller, saving rewriter and planner work.
Catversion bump because stored rules containing such INSERTs would need
to change. Because of that, no back-patch, even though this is a very
long-standing bug.
Report: <20160727005725.7438.26021@wrigleys.postgresql.org>
Discussion: <9578.1469645245@sss.pgh.pa.us>
ParallelMessagePending *must* be marked volatile, because it's set
by a signal handler. On the other hand, it's pointless for
HandleParallelMessageInterrupt to save/restore errno; that must be,
and is, done at the outer level of the SIGUSR1 signal handler.
Calling CHECK_FOR_INTERRUPTS() inside HandleParallelMessages, which itself
is called from CHECK_FOR_INTERRUPTS(), seems both useless and hazardous.
The comment claiming that this is needed to handle the error queue going
away is certainly misguided, in any case.
Improve a couple of error message texts, and use
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE to report loss of parallel worker
connection, since that's what's used in e.g. tqueue.c. (Maybe it would be
worth inventing a dedicated ERRCODE for this type of failure? But I do not
think ERRCODE_INTERNAL_ERROR is appropriate.)
Minor stylistic cleanups.
This coding pattern creates a race condition, because if an interesting
interrupt happens after we've checked InterruptPending but before we reset
our latch, the latch-setting done by the signal handler would get lost,
and then we might block at WaitLatch in the next iteration without ever
noticing the interrupt condition. You can put the CHECK_FOR_INTERRUPTS
before WaitLatch or after ResetLatch, but not between them.
Aside from fixing the bugs, add some explanatory comments to latch.h
to perhaps forestall the next person from making the same mistake.
In HEAD, also replace gather_readnext's direct call of
HandleParallelMessages with CHECK_FOR_INTERRUPTS. It does not seem clean
or useful for this one caller to bypass ProcessInterrupts and go straight
to HandleParallelMessages; not least because that fails to consider the
InterruptPending flag, resulting in useless work both here
(if InterruptPending isn't set) and in the next CHECK_FOR_INTERRUPTS call
(if it is).
This thinko seems to have been introduced in the initial coding of
storage/ipc/shm_mq.c (commit ec9037df2), and then blindly copied into all
the subsequent parallel-query support logic. Back-patch relevant hunks
to 9.4 to extirpate the error everywhere.
Discussion: <1661.1469996911@sss.pgh.pa.us>
The document specifies that pg_replication_origin_xact_reset function
doesn't have any argument variables. But previously it was actually
defined so as to have two argument variables, though they were not
used at all. That is, the pg_proc entry for that function was incorrect.
This patch fixes the pg_proc entry and removes those two arguments
from the function definition.
No back-patch because this change needs a catalog version bump
although the issue exists in 9.5 as well. Instead, a note about those
unused argument variables will be added to 9.5 document later.
Catalog version bumped due to the change of pg_proc.
TupleQueueReaderNext() leaks like a sieve if it has to do any tuple
disassembly/reconstruction. While we could try to clean up its allocations
piecemeal, it seems like a better idea just to insist that it should be run
in a short-lived memory context, so that any transient space goes away
automatically. I chose to have nodeGather.c switch into its existing
per-tuple context before the call, rather than inventing a separate
context inside tqueue.c.
This is sufficient to stop all leakage in the simple case I exhibited
earlier today (see link below), but it does not deal with leaks induced
in more complex cases by tqueue.c's insistence on using TopMemoryContext
for data that it's not actually trying hard to keep track of. That issue
is intertwined with another major source of inefficiency, namely failure
to cache lookup results across calls, so it seems best to deal with it
separately.
In passing, improve some comments, and modify gather_readnext's method for
deciding when it's visited all the readers so that it's more obviously
correct. (I'm not actually convinced that the previous code *is*
correct in the case of a reader deletion; it certainly seems fragile.)
Discussion: <32763.1469821037@sss.pgh.pa.us>
An evident copy-and-pasteo in commit 2bd9e412f broke the non-blocking
aspect of pq_putmessage_noblock(), causing it to behave identically to
pq_putmessage(). That function is nowadays used only in walsender.c,
so that the net effect was to cause walsenders to hang up waiting for
the receiver in situations where they should not.
Kyotaro Horiguchi
Patch: <20160728.185228.58375982.horiguchi.kyotaro@lab.ntt.co.jp>
Now that we've nailed down the principle that NullTest with !argisrow
is fully equivalent to SQL's IS [NOT] DISTINCT FROM NULL, let's teach
the parser about it. This produces a slightly more compact parse tree
and is much more amenable to optimization than a DistinctExpr, since
the planner knows a good deal about NullTest and next to nothing about
DistinctExpr.
I'm not sure that there are all that many queries in the wild that could
be improved by this, but at least one source of such cases is the patch
just made to postgres_fdw to emit IS [NOT] DISTINCT FROM NULL when
IS [NOT] NULL isn't semantically correct.
No back-patch, since to the extent that this does affect planning results,
it might be considered undesirable plan destabilization.
Commits 4452000f3 et al established semantics for NullTest.argisrow that
are a bit different from its initial conception: rather than being merely
a cache of whether we've determined the input to have composite type,
the flag now has the further meaning that we should apply field-by-field
testing as per the standard's definition of IS [NOT] NULL. If argisrow
is false and yet the input has composite type, the construct instead has
the semantics of IS [NOT] DISTINCT FROM NULL. Update the comments in
primnodes.h to clarify this, and fix ruleutils.c and deparse.c to print
such cases correctly. In the case of ruleutils.c, this merely results in
cosmetic changes in EXPLAIN output, since the case can't currently arise
in stored rules. However, it represents a live bug for deparse.c, which
would formerly have sent a remote query that had semantics different
from the local behavior. (From the user's standpoint, this means that
testing a remote nested-composite column for null-ness could have had
unexpected recursive behavior much like that fixed in 4452000f3.)
In a related but somewhat independent fix, make plancat.c set argisrow
to false in all NullTest expressions constructed to represent "attnotnull"
constructs. Since attnotnull is actually enforced as a simple null-value
check, this is a more accurate representation of the semantics; we were
previously overpromising what it meant for composite columns, which might
possibly lead to incorrect planner optimizations. (It seems that what the
SQL spec expects a NOT NULL constraint to mean is an IS NOT NULL test, so
arguably we are violating the spec and should fix attnotnull to do the
other thing. If we ever do, this part should get reverted.)
Back-patch, same as the previous commit.
Discussion: <10682.1469566308@sss.pgh.pa.us>
These functions were added in commits fbe5a3fb7 and a104a017f,
but commit 45639a052 removed their only callers. Put the related
code in foreign.c back to the way it was in 9.5, to avoid pointless
cross-version diffs.
Etsuro Fujita
Patch: <d674a3f1-6b63-519c-ef3f-f3188ed6a178@lab.ntt.co.jp>
Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag. Tuple locking related code has not done so.
To allow selectively resetting all-frozen - to avoid pessimizing
heap_lock_tuple - allow to selectively reset the all-frozen with
visibilitymap_clear(). To avoid having to use
visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical
section, have visibilitymap_clear() return whether any bits have been
reset.
There's a remaining issue (denoted by XXX): After the PageIsAllVisible()
check in heap_lock_tuple() and heap_lock_updated_tuple_rec() the page
status could theoretically change. Practically that currently seems
impossible, because updaters will hold a page level pin already. Due to
the next beta coming up, it seems better to get the required WAL magic
bump done before resolving this issue.
The added flags field fields to xl_heap_lock and xl_heap_lock_updated
require bumping the WAL magic. Since there's already been a catversion
bump since the last beta, that's not an issue.
Reviewed-By: Robert Haas, Amit Kapila and Andres Freund
Author: Masahiko Sawada, heavily revised by Andres Freund
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
This should have been done in commit 73c986adde5d73a5 which added several
new fields to pg_control, and again in commit 5028f22f6eb05798 which
changed the CRC algorithm, but it wasn't. It's far too late to fix it in
the 9.5 branch, but let's do so in 9.6, so that if a 9.6 postmaster is
started against a 9.4-era pg_control it will complain about a versioning
problem rather than a CRC failure. We already forced initdb/pg_upgrade
for beta3, so there's no downside to doing this now.
Discussion: <7615.1468598094@sss.pgh.pa.us>
We must not push down a foreign join when the foreign tables involved
should be accessed under different user mappings. Previously we tried
to enforce that rule literally during planning, but that meant that the
resulting plans were dependent on the current contents of the
pg_user_mapping catalog, and we had to blow away all cached plans
containing any remote join when anything at all changed in pg_user_mapping.
This could have been improved somewhat, but the fact that a syscache inval
callback has very limited info about what changed made it hard to do better
within that design. Instead, let's change the planner to not consider user
mappings per se, but to allow a foreign join if both RTEs have the same
checkAsUser value. If they do, then they necessarily will use the same
user mapping at runtime, and we don't need to know specifically which one
that is. Post-plan-time changes in pg_user_mapping no longer require any
plan invalidation.
This rule does give up some optimization ability, to wit where two foreign
table references come from views with different owners or one's from a view
and one's directly in the query, but nonetheless the same user mapping
would have applied. We'll sacrifice the first case, but to not regress
more than we have to in the second case, allow a foreign join involving
both zero and nonzero checkAsUser values if the nonzero one is the same as
the prevailing effective userID. In that case, mark the plan as only
runnable by that userID.
The plancache code already had a notion of plans being userID-specific,
in order to support RLS. It was a little confused though, in particular
lacking clarity of thought as to whether it was the rewritten query or just
the finished plan that's dependent on the userID. Rearrange that code so
that it's clearer what depends on which, and so that the same logic applies
to both RLS-injected role dependency and foreign-join-injected role
dependency.
Note that this patch doesn't remove the other issue mentioned in the
original complaint, which is that while we'll reliably stop using a foreign
join if it's disallowed in a new context, we might fail to start using a
foreign join if it's now allowed, but we previously created a generic
cached plan that didn't use one. It was agreed that the chance of winning
that way was not high enough to justify the much larger number of plan
invalidations that would have to occur if we tried to cause it to happen.
In passing, clean up randomly-varying spelling of EXPLAIN commands in
postgres_fdw.sql, and fix a COSTS ON example that had been allowed to
leak into the committed tests.
This reverts most of commits fbe5a3fb7 and 5d4171d1c, which were the
previous attempt at ensuring we wouldn't push down foreign joins that
span permissions contexts.
Etsuro Fujita and Tom Lane
Discussion: <d49c1e5b-f059-20f4-c132-e9752ee0113e@lab.ntt.co.jp>
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values. This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs. The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons. Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs. And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.
To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN. Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.
I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data. It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.
Per bug #14238 from Guang-Dih Lei. Back-patch to 9.2; the code used before
commit 7f3bd86843e5aad8 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.
Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
We have, for a very long time, allowed the same subplan (same member of the
PlannedStmt.subplans list) to be referenced by more than one SubPlan node;
this avoids problems for cases such as subplans within an IndexScan's
indxqual and indxqualorig fields. However, EXPLAIN had not gotten the memo
and would print each reference as though it were an independent identical
subplan. To fix, track plan_ids of subplans we've printed and don't print
the same plan_id twice. Per report from Pavel Stehule.
BTW: the particular case of IndexScan didn't cause visible duplication
in a plain EXPLAIN, only EXPLAIN ANALYZE, because in the former case we
short-circuit executor startup before the indxqual field is processed by
ExecInitExpr. That seems like it could easily lead to other EXPLAIN
problems in future, but it's not clear how to avoid it without breaking
the "EXPLAIN a plan using hypothetical indexes" use-case. For now I've
left that issue alone.
Although this is a longstanding bug, it's purely cosmetic (no great harm
is done by the repeat printout) and we haven't had field complaints before.
So I'm hesitant to back-patch it, especially since there is some small risk
of ABI problems due to the need to add a new field to ExplainState.
In passing, rearrange order of fields in ExplainState to be less random,
and update some obsolete comments about when/where to initialize them.
Report: <CAFj8pRAimq+NK-menjt+3J4-LFoodDD8Or6=Lc_stcFD+eD4DA@mail.gmail.com>
Per discussion on pgsql-hackers, conninfo is better as the column name
because it's more commonly used in PostgreSQL.
Catalog version bumped due to the change of pg_proc.
Author: Michael Paquier
In the previous design, the GetForeignUpperPaths FDW callback hook was
called before we got around to labeling upper relations with the proper
consider_parallel flag; this meant that any upper paths created by an FDW
would be marked not-parallel-safe. While that's probably just as well
right now, we aren't going to want it to be true forever. Hence, abandon
the idea that FDWs should be allowed to inject upper paths before the core
code has gotten around to creating the relevant upper relation. (Well,
actually they still can, but it's on their own heads how well it works.)
Instead, adopt the same API already designed for create_upper_paths_hook:
we call GetForeignUpperPaths after each upperrel has been created and
populated with the paths the core planner knows how to make.
Commit 3fc6e2d7f5b652b417fa6937c34de2438d60fa9f introduced new "upper"
RelOptInfo structures but didn't set consider_parallel for them
correctly, a point I completely missed when reviewing it. Later,
commit e06a38965b3bcdaa881e7e06892d4d8ab6c2c980 made the situation
worse by doing it incorrectly for the grouping relation. Try to
straighten all of that out. Along the way, get rid of the annoying
wholePlanParallelSafe flag, which was only necessarily because of
the fact that upper planning stages didn't use paths at the time
that code was written.
The most important immediate impact of these changes is that
force_parallel_mode will provide useful test coverage in quite a few
more scenarios than it did previously, but it's also necessary
preparation for fixing some problems related to subqueries.
Patch by me, reviewed by Tom Lane.
Previously, workers sent data to the leader using the client encoding.
That mostly worked, but the leader the converted the data back to the
server encoding. Since not all encoding conversions are reversible,
that could provoke failures. Fix by using the database encoding for
all communication between worker and leader.
Also, while temporary changes to GUC settings, as from the SET clause
of a function, are in general OK for parallel query, changing
client_encoding this way inside of a parallel worker is not OK.
Previously, that would have confused the leader; with these changes,
it would not confuse the leader, but it wouldn't do anything either.
So refuse such changes in parallel workers.
Also, the previous code naively assumed that when it received a
NotifyResonse from the worker, it could pass that directly back to the
user. But now that worker-to-leader communication always uses the
database encoding, that's clearly no longer correct - though,
actually, the old way was always broken for V2 clients. So
disassemble and reconstitute the message instead.
Issues reported by Peter Eisentraut. Patch by me, reviewed by
Peter Eisentraut.
In non-text output formats, parallelized aggregates were reporting
"Partial" or "Finalize" as a field named "Operation", which might be all
right in the absence of any context --- but other plan node types use that
field to report SQL-visible semantics, such as Select/Insert/Update/Delete.
So that naming choice didn't seem good to me. I changed it to "Partial
Mode".
Also, the field did not appear at all for a non-parallelized Agg plan node,
which is contrary to expectation in non-text formats. We're notionally
producing objects that conform to a schema, so the set of fields for a
given node type and EXPLAIN mode should be well-defined. I set it up to
fill in "Simple" in such cases.
Other fields that were added for parallel query, namely "Parallel Aware"
and Gather's "Single Copy", had not gotten the word on that point either.
Make them appear always in non-text output.
Also, the latter two fields were nominally producing boolean output, but
were getting it wrong, because bool values shouldn't be quoted in JSON or
YAML. Somehow we'd not needed an ExplainPropertyBool formatting subroutine
before 9.6; but now we do, so invent it.
Discussion: <16002.1466972724@sss.pgh.pa.us>
Commit b1a9bad9e744 introduced a stats view to provide insight into the
running WAL receiver, but neglected to include the connection string in
it, as reported by Michaël Paquier. This commit fixes that omission.
(Any security-sensitive information is not disclosed).
While at it, close the mild security hole that we were exposing the
password in the connection string in shared memory. This isn't
user-accessible, but it still looks like a good idea to avoid having the
cleartext password in memory.
Author: Michaël Paquier, Álvaro Herrera
Review by: Vik Fearing
Discussion: https://www.postgresql.org/message-id/CAB7nPqStg4M561obo7ryZ5G+fUydG4v1Ajs1xZT1ujtu+woRag@mail.gmail.com
Previously, these commands always planned the given query and went through
executor startup before deciding not to actually run the query if WITH NO
DATA is specified. This behavior is problematic for pg_dump because it
may cause errors to be raised that we would rather not see before a
REFRESH MATERIALIZED VIEW command is issued. See for example bug #13907
from Marian Krucina. This change is not sufficient to fix that particular
bug, because we also need to tweak pg_dump to issue the REFRESH later,
but it's a necessary step on the way.
A user-visible side effect of doing things this way is that the returned
command tag for WITH NO DATA cases will now be "CREATE MATERIALIZED VIEW"
or "CREATE TABLE AS", not "SELECT 0". We could preserve the old behavior
but it would take more code, and arguably that was just an implementation
artifact not intended behavior anyhow.
In 9.5 and HEAD, also get rid of the static variable CreateAsReladdr, which
was trouble waiting to happen; there is not any prohibition on nested
CREATE commands.
Back-patch to 9.3 where CREATE MATERIALIZED VIEW was introduced.
Michael Paquier and Tom Lane
Report: <20160202161407.2778.24659@wrigleys.postgresql.org>
<-> operator now have higher predecence than & (AND) operator. This change
was motivated by unexpected difference of similar queries:
'a & b <-> c'::tsquery and 'b <-> c & a'. Before first query means
(a & b) <-> c and second one - '(b <-> c) & a', now phrase operator evaluates
first.
Per suggestion from Tom Lane 32260.1465402409@sss.pgh.pa.us
If there is no positional information of lexemes then phrase operator will not
fallback to AND operator. This change makes needing to modify TS_execute()
interface, because somewhere (in indexes, for example) positional information
is unaccesible and in this cases we need to force fallback to AND.
Per discussion c19fcfec308e6ccd952cdde9e648b505@mail.gmail.com
It's rather silly to make a separate pass over the tlist + HAVING qual,
and a separate set of visits to the syscache, when get_agg_clause_costs
already has all the required information in hand. This nets out as less
code as well as fewer cycles.
The original coding had three separate booleans representing partial
aggregation behavior, which was confusing, unreadable, and error-prone,
not least because the booleans weren't always listed in the same order.
It was also inadequate for the allegedly-desirable future extension to
support intermediate partial aggregation, because we'd need separate
markers for serialization and deserialization in such a case.
Merge these bools into an enum "AggSplit" to provide symbolic names for
the supported operating modes (and document what those are). By assigning
the values of the enum constants carefully, we can treat AggSplit values
as options bitmasks so that tests of what to do aren't noticeably more
expensive than before.
While at it, get rid of Aggref.aggoutputtype. That's not needed since
commit 59a3795c2 got rid of setrefs.c's special-purpose Aggref comparison
code, and it likewise seemed more confusing than helpful.
Assorted comment cleanup as well (there's still more that I want to do
in that line).
catversion bump for change in Aggref node contents. Should be the last
one for partial-aggregation changes.
Discussion: <29309.1466699160@sss.pgh.pa.us>
Commit e06a38965's original coding for constructing the execution-time
expression tree for a combining aggregate was rather messy, involving
duplicating quite a lot of code in setrefs.c so that it could inject
a nonstandard matching rule for Aggrefs. Get rid of that in favor of
explicitly constructing a combining Aggref with a partial Aggref as input,
then allowing setref's normal matching logic to match the partial Aggref
to the output of the lower plan node and hence replace it with a Var.
In passing, rename and redocument make_partialgroup_input_target to have
some connection to what it actually does.
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore. In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
ERROR: MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails. Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.
It's possible to get from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.
To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding. This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean. Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.
To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact. (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).
This was broken all along, but the error-facing message appeared first
because of commit 8e9a16ab8f7f and was partially fixed in a25c2b7c4db3.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4db3 but should cover all cases.
Bug analysis by Andrew Gierth and Álvaro Herrera.
A number of public reports match this bug:
https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.nethttps://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.comhttps://www.postgresql.org/message-id/556439CF.7070109@pscs.co.ukhttps://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.comhttps://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
When the index is predicted to need more than NBuffers buckets,
CREATE INDEX attempts to sort the index entries by hash key before
insertion, so as to reduce thrashing. This code path got broken by
commit 9f03ca915196dfc8, which overlooked that _hash_form_tuple() is not
just an alias for index_form_tuple(). The index got built anyway, but
with garbage data, so that searches for pre-existing tuples always
failed. Fix by refactoring to separate construction of the indexable
data from calling index_form_tuple().
Per bug #14210 from Daniel Newman. Back-patch to 9.5 where the
bug was introduced.
Report: <20160623162507.17237.39471@wrigleys.postgresql.org>
The original specification for this called for the deserialization function
to have signature "deserialize(serialtype) returns transtype", which is a
security violation if transtype is INTERNAL (which it always would be in
practice) and serialtype is not (which ditto). The patch blithely overrode
the opr_sanity check for that, which was sloppy-enough work in itself,
but the indisputable reason this cannot be allowed to stand is that CREATE
FUNCTION will reject such a signature and thus it'd be impossible for
extensions to create parallelizable aggregates.
The minimum fix to make the signature type-safe is to add a second, dummy
argument of type INTERNAL. But to lock it down a bit more and make misuse
of INTERNAL-accepting functions less likely, let's get rid of the ability
to specify a "serialtype" for an aggregate and just say that the only
useful serialtype is BYTEA --- which, in practice, is the only interesting
value anyway, due to the usefulness of the send/recv infrastructure for
this purpose. That means we only have to allow "serialize(internal)
returns bytea" and "deserialize(bytea, internal) returns internal" as
the signatures for these support functions.
In passing fix bogus signature of int4_avg_combine, which I found thanks
to adding an opr_sanity check on combinefunc signatures.
catversion bump due to removing pg_aggregate.aggserialtype and adjusting
signatures of assorted built-in functions.
David Rowley and Tom Lane
Discussion: <27247.1466185504@sss.pgh.pa.us>
The original upper-planner-pathification design (commit 3fc6e2d7f5b652b4)
assumed that we could always determine during Path formation whether or not
we would need a Result plan node to perform projection of a targetlist.
That turns out not to work very well, though, because createplan.c still
has some responsibilities for choosing the specific target list associated
with sorting/grouping nodes (in particular it might choose to add resjunk
columns for sorting). We might not ever refactor that --- doing so would
push more work into Path formation, which isn't attractive --- and we
certainly won't do so for 9.6. So, while create_projection_path and
apply_projection_to_path can tell for sure what will happen if the subpath
is projection-capable, they can't tell for sure when it isn't. This is at
least a latent bug in apply_projection_to_path, which might think it can
apply a target to a non-projecting node when the node will end up computing
something different.
Also, I'd tied the creation of a ProjectionPath node to whether or not a
Result is needed, but it turns out that we sometimes need a ProjectionPath
node anyway to avoid modifying a possibly-shared subpath node. Callers had
to use create_projection_path for such cases, and we added code to them
that knew about the potential omission of a Result node and attempted to
adjust the cost estimates for that. That was uncertainly correct and
definitely ugly/unmaintainable.
To fix, have create_projection_path explicitly check whether a Result
is needed and adjust its cost estimate accordingly, though it creates
a ProjectionPath in either case. apply_projection_to_path is now mostly
just an optimized version that can avoid creating an extra Path node when
the input is known to not be shared with any other live path. (There
is one case that create_projection_path doesn't handle, which is pushing
parallel-safe expressions below a Gather node. We could make it do that
by duplicating the GatherPath, but there seems no need as yet.)
create_projection_plan still has to recheck the tlist-match condition,
which means that if the matching situation does get changed by createplan.c
then we'll have made a slightly incorrect cost estimate. But there seems
no help for that in the near term, and I doubt it occurs often enough,
let alone would change planning decisions often enough, to be worth
stressing about.
I added a "dummypp" field to ProjectionPath to track whether
create_projection_path thinks a Result is needed. This is not really
necessary as-committed because create_projection_plan doesn't look at the
flag; but it seems like a good idea to remember what we thought when
forming the cost estimate, if only for debugging purposes.
In passing, get rid of the target_parallel parameter added to
apply_projection_to_path by commit 54f5c5150. I don't think that's a good
idea because it involves callers in what should be an internal decision,
and opens us up to missing optimization opportunities if callers think they
don't need to provide a valid flag, as most don't. For the moment, this
just costs us an extra has_parallel_hazard call when planning a Gather.
If that starts to look expensive, I think a better solution would be to
teach PathTarget to carry/cache knowledge of parallel-safety of its
contents.
This patch provides a new implementation of the logic added by commit
137805f89 and later removed by 77ba61080. It differs from the original
primarily in expending much less effort per joinrel in large queries,
which it accomplishes by doing most of the matching work once per query not
once per joinrel. Hopefully, it's also less buggy and better commented.
The never-documented enable_fkey_estimates GUC remains gone.
There remains work to be done to make the selectivity estimates account
for nulls in FK referencing columns; but that was true of the original
patch as well. We may be able to address this point later in beta.
In the meantime, any error should be in the direction of overestimating
rather than underestimating joinrel sizes, which seems like the direction
we want to err in.
Tomas Vondra and Tom Lane
Discussion: <31041.1465069446@sss.pgh.pa.us>
When doing partial aggregation, the args list of the upper (combining)
Aggref node is replaced by a Var representing the output of the partial
aggregation steps, which has either the aggregate's transition data type
or a serialized representation of that. However, nodeAgg.c blindly
continued to use the args list as an indication of the user-level argument
types. This broke resolution of polymorphic transition datatypes at
executor startup (though it accidentally failed to fail for the ANYARRAY
case, which is likely the only one anyone had tested). Moreover, the
constructed FuncExpr passed to the finalfunc contained completely wrong
information, which would have led to bogus answers or crashes for any case
where the finalfunc examined that information (which is only likely to be
with polymorphic aggregates using a non-polymorphic transition type).
As an independent bug, apply_partialaggref_adjustment neglected to resolve
a polymorphic transition datatype before assigning it as the output type
of the lower-level Aggref node. This again accidentally failed to fail
for ANYARRAY but would be unlikely to work in other cases.
To fix the first problem, record the user-level argument types in a
separate OID-list field of Aggref, and look to that rather than the args
list when asking what the argument types were. (It turns out to be
convenient to include any "direct" arguments in this list too, although
those are not currently subject to being overwritten.)
Rather than adding yet another resolve_aggregate_transtype() call to fix
the second problem, add an aggtranstype field to Aggref, and store the
resolved transition type OID there when the planner first computes it.
(By doing this in the planner and not the parser, we can allow the
aggregate's transition type to change from time to time, although no DDL
support yet exists for that.) This saves nothing of consequence for
simple non-polymorphic aggregates, but for polymorphic transition types
we save a catalog lookup during executor startup as well as several
planner lookups that are new in 9.6 due to parallel query planning.
In passing, fix an error that was introduced into count_agg_clauses_walker
some time ago: it was applying exprTypmod() to something that wasn't an
expression node at all, but a TargetEntry. exprTypmod silently returned
-1 so that there was not an obvious failure, but this broke the intended
sensitivity of aggregate space consumption estimates to the typmod of
varchar and similar data types. This part needs to be back-patched.
Catversion bump due to change of stored Aggref nodes.
Discussion: <8229.1466109074@sss.pgh.pa.us>
This requires some core changes as well so that we can properly
WAL-log the truncation. Specifically, it changes the format of the
XLOG_SMGR_TRUNCATE WAL record, so bump XLOG_PAGE_MAGIC.
Patch by me, reviewed but not fully endorsed by Andres Freund.
Commit 04ae11f62e643e07c411c4935ea6af46cb112aa9 removed some broken
code to apply the scan/join target to partial paths, but its theory
that this processing step is totally unnecessary turns out to be wrong.
Put similar code back again, but this time, check for parallel-safety
and avoid in-place modifications to paths that may already have been
used as part of some other path.
(This is not an entirely elegant solution to this problem; it might
be better, for example, to postpone generate_gather_paths for the
topmost scan/join rel until after the scan/join target has been
applied. But this is not the time for such redesign work.)
Amit Kapila and Robert Haas
If you really want to vacuum every single page in the relation,
regardless of apparent visibility status or anything else, you can use
this option. In previous releases, this behavior could be achieved
using VACUUM (FREEZE), but because we can now recognize all-frozen
pages as not needing to be frozen again, that no longer works. There
should be no need for routine use of this option, but maybe bugs or
disaster recovery will necessitate its use.
Patch by me, reviewed by Andres Freund.
The main point of doing this is to allow the cutoff to be set very small,
even zero, to allow parallel-query behavior to be tested on relatively
small tables such as we typically use in the regression tests. But it
might be of use to users too. The number-of-workers scaling behavior in
create_plain_partial_paths() is pretty ad-hoc and subject to change, so
we won't expose anything about that, but the notion of not considering
parallel query at all for tables below size X seems reasonably stable.
Amit Kapila, per a suggestion from me
Discussion: <17170.1465830165@sss.pgh.pa.us>
Commit a892234f830e832110f63fc0a2afce2fb21d1584 added a new bit per
page to the visibility map fork indicating whether the page is
all-frozen, but incorrectly assumed that if lazy_scan_heap chose to
freeze a tuple then that tuple would not need to later be frozen
again. This turns out to be false, because xmin and xmax (and
conceivably xvac, if dealing with tuples from very old releases) could
be frozen at separate times.
Thanks to Andres Freund for help in uncovering and tracking down this
issue.
currtid() and currtid2() call GetLatestSnapshot(), which fails in
parallel mode. pg_export_snapshot() calls ExportSnapshot() which
attempts to assign an XID for the current transaction if it does not
already have one; that, too, will fail in parallel mode.
Andreas Seltenreich
The struct definition for PathTarget specifies that a NULL sortgrouprefs
pointer means no sortgroupref labels. While it's likely that there
should always be at least one labeled column in the places that were
unconditionally fetching through the pointer, it seems wiser to adhere to
the data structure specification and test first. Add a macro to make this
convenient. Per experimentation with running the regression tests with a
very small parallelization threshold --- the crash I observed may well
represent a bug elsewhere, but still this coding was not very robust.
Report: <20756.1465834072@sss.pgh.pa.us>
While beneficial, both for throughput and average/worst case latency, in
a significant number of workloads, there are other workloads in which
backend_flush_after can cause significant performance regressions in
comparison to < 9.6 releases. The regression is most likely when the hot
data set is bigger than shared buffers, but significantly smaller than
the operating system's page cache.
I personally think that the benefit of enabling backend flush control is
considerably bigger than the potential downsides, but a fair argument
can be made that not regressing is more important than improving
performance/latency. As the latter is the consensus, change the default
to 0.
The other settings introduced in 428b1d6b2 do not have the same
potential for regressions, so leave them enabled.
Benchmarks leading up to changing the default have been performed by
Mithun Cy, Ashutosh Sharma and Robert Haas.
Discussion: CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com
Commit b12fd41c6 added a "reltarget_has_non_vars" field to RelOptInfo,
but failed to maintain it accurately. Since its only purpose was to skip
calls to has_parallel_hazard() in the simple case where a rel's targetlist
is all Vars, and that call is really pretty cheap in that case anyway, it
seems like this is just a case of premature optimization. Let's drop the
flag and do the calls unconditionally until it's proven that we need more
smarts here.