Commit Graph

9583 Commits

Author SHA1 Message Date
867e2c91a0 Implement "distributed" checkpoints in which the checkpoint I/O is spread
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.

Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.

Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
2007-06-28 00:02:40 +00:00
80f3b5ad2e Remove unused "caller" argument from stringToQualifiedNameList. 2007-06-26 16:48:09 +00:00
a03e8ad266 Remove unused BAD_LOCATION definition. 2007-06-25 17:12:07 +00:00
bae0b56880 Improve autovacuum launcher's ability to detect a problem in worker startup,
by having the postmaster signal it when certain failures occur.  This requires
the postmaster setting a flag in shared memory, but should be as safe as the
pmsignal.c code is.

Also make sure the launcher honor's a postgresql.conf change turning it off
on SIGHUP.
2007-06-25 16:09:03 +00:00
46379d6e60 Separate parse-analysis for utility commands out of parser/analyze.c
(which now deals only in optimizable statements), and put that code
into a new file parser/parse_utilcmd.c.  This helps clarify and enforce
the design rule that utility statements shouldn't be processed during
the regular parse analysis phase; all interpretation of their meaning
should happen after they are given to ProcessUtility to execute.
(We need this because we don't retain any locks for a utility statement
that's in a plan cache, nor have any way to detect that it's stale.)

We are also able to simplify the API for parse_analyze() and related
routines, because they will now always return exactly one Query structure.

In passing, fix bug #3403 concerning trying to add a serial column to
an existing temp table (this is largely Heikki's work, but we needed
all that restructuring to make it safe).
2007-06-23 22:12:52 +00:00
6e07228728 Code review for log_lock_waits patch. Don't try to issue log messages from
within a signal handler (this might be safe given the relatively narrow code
range in which the interrupt is enabled, but it seems awfully risky); do issue
more informative log messages that tell what is being waited for and the exact
length of the wait; minor other code cleanup.  Greg Stark and Tom Lane
2007-06-19 20:13:22 +00:00
4c310eca2e Arrange for quote_identifier() and pg_dump to not quote keywords that are
unreserved according to the grammar.  The list of unreserved words has gotten
extensive enough that the unnecessary quoting is becoming a bit of an eyesore.
To do this, add knowledge of the keyword category to keywords.c's table.
(Someday we might be able to generate keywords.c's table and the keyword lists
in gram.y from a common source.)  For the moment, lie about WITH's status in
the table so it will still get quoted --- this is because of the expectation
that WITH will become reserved when the SQL recursive-queries patch gets done.

I didn't force initdb because this affects nothing on-disk; but note that a
few regression tests have changed expected output.
2007-06-18 21:40:58 +00:00
23347231a5 Tweak the API for per-datatype typmodin functions so that they are passed
an array of strings rather than an array of integers, and allow any simple
constant or identifier to be used in typmods; for example
	create table foo (f1 widget(42,'23skidoo',point));
Of course the typmodin function has still got to pack this info into a
non-negative int32 for storage, but it's still a useful improvement in
flexibility, especially considering that you can do nearly anything if you
are willing to keep the info in a side table.  We can get away with this
change since we have not yet released a version providing user-definable
typmods.  Per discussion.
2007-06-15 20:56:52 +00:00
bd2cb9aaa5 Implement a chunking protocol for writes to the syslogger pipe, with messages
reassembled in the syslogger before writing to the log file. This prevents
partial messages from being written, which mucks up log rotation, and
messages from different backends being interleaved, which causes garbled
logs. Backport as far as 8.0, where the syslogger was introduced.

Tom Lane and Andrew Dunstan
2007-06-14 01:48:51 +00:00
8be9b50ab4 Minor comment fixes. 2007-06-12 16:01:31 +00:00
a9545b3aef Improve UPDATE/DELETE WHERE CURRENT OF so that they can be used from plpgsql
with a plpgsql-defined cursor.  The underlying mechanism for this is that the
main SQL engine will now take "WHERE CURRENT OF $n" where $n is a refcursor
parameter.  Not sure if we should document that fact or consider it an
implementation detail.  Per discussion with Pavel Stehule.
2007-06-11 22:22:42 +00:00
6808f1b1de Support UPDATE/DELETE WHERE CURRENT OF cursor_name, per SQL standard.
Along the way, allow FOR UPDATE in non-WITH-HOLD cursors; there may once
have been a reason to disallow that, but it seems to work now, and it's
really rather necessary if you want to select a row via a cursor and then
update it in a concurrent-safe fashion.

Original patch by Arul Shaji, rather heavily editorialized by Tom Lane.
2007-06-11 01:16:30 +00:00
85d72f0516 Teach heapam code to know the difference between a real seqscan and the
pseudo HeapScanDesc created for a bitmap heap scan.  This avoids some useless
overhead during a bitmap scan startup, in particular invoking the syncscan
code.  (We might someday want to do that, but right now it's merely useless
contention for shared memory, to say nothing of possibly pushing useful
entries out of syncscan's small LRU list.)  This also allows elimination of
ugly pgstat_discount_heap_scan() kluge.
2007-06-09 18:49:55 +00:00
a04a423599 Arrange for large sequential scans to synchronize with each other, so that
when multiple backends are scanning the same relation concurrently, each page
is (ideally) read only once.

Jeff Davis, with review by Heikki and Tom.
2007-06-08 18:23:53 +00:00
24ee8af573 Rework temp_tablespaces patch so that temp tablespaces are assigned separately
for each temp file, rather than once per sort or hashjoin; this allows
spreading the data of a large sort or join across multiple tablespaces.
(I remain dubious that this will make any difference in practice, but certain
people insisted.)  Arrange to cache the results of parsing the GUC variable
instead of recomputing from scratch on every demand, and push usage of the
cache down to the bottommost fd.c level.
2007-06-07 19:19:57 +00:00
2d4db3675f Fix up text concatenation so that it accepts all the reasonable cases that
were accepted by prior Postgres releases.  This takes care of the loose end
left by the preceding patch to downgrade implicit casts-to-text.  To avoid
breaking desirable behavior for array concatenation, introduce a new
polymorphic pseudo-type "anynonarray" --- the added concatenation operators
are actually text || anynonarray and anynonarray || text.
2007-06-06 23:00:50 +00:00
31edbadf4a Downgrade implicit casts to text to be assignment-only, except for the ones
from the other string-category types; this eliminates a lot of surprising
interpretations that the parser could formerly make when there was no directly
applicable operator.

Create a general mechanism that supports casts to and from the standard string
types (text,varchar,bpchar) for *every* datatype, by invoking the datatype's
I/O functions.  These new casts are assignment-only in the to-string direction,
explicit-only in the other, and therefore should create no surprising behavior.
Remove a bunch of thereby-obsoleted datatype-specific casting functions.

The "general mechanism" is a new expression node type CoerceViaIO that can
actually convert between *any* two datatypes if their external text
representations are compatible.  This is more general than needed for the
immediate feature, but might be useful in plpgsql or other places in future.

This commit does nothing about the issue that applying the concatenation
operator || to non-text types will now fail, often with strange error messages
due to misinterpreting the operator as array concatenation.  Since it often
(not always) worked before, we should either make it succeed or at least give
a more user-friendly error; but details are still under debate.

Peter Eisentraut and Tom Lane
2007-06-05 21:31:09 +00:00
1120b99445 The session_replication_role actually can be changed at will during
a session regardless of the existence of cached plans. The plancache
only needs to be invalidated so that rules affected by the new setting
will be reflected in the new query plans.

Jan
2007-06-05 20:00:41 +00:00
acfce502ba Create a GUC parameter temp_tablespaces that allows selection of the
tablespace(s) in which to store temp tables and temporary files.  This is a
list to allow spreading the load across multiple tablespaces (a random list
element is chosen each time a temp object is to be created).  Temp files are
not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace
directories.

Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.
2007-06-03 17:08:34 +00:00
f086be3d39 Allow leading and trailing whitespace in the input to the boolean
type. Also, add explicit casts between boolean and text/varchar. Both
of these changes are for conformance with SQL:2003.

Update the regression tests, bump the catversion.
2007-06-01 23:40:19 +00:00
bd0a260928 Make CREATE/DROP/RENAME DATABASE wait a little bit to see if other backends
will exit before failing because of conflicting DB usage.  Per discussion,
this seems a good idea to help mask the fact that backend exit takes nonzero
time.  Remove a couple of thereby-obsoleted sleeps in contrib and PL
regression test sequences.
2007-06-01 19:38:07 +00:00
bd2c980b22 Buy back some of the cycles spent in more-expensive hash functions by
selecting power-of-2, rather than prime, numbers of buckets in hash joins.
If the hash functions are doing their jobs properly by making all hash bits
equally random, this is good enough, and it saves expensive integer division
and modulus operations.
2007-06-01 17:38:44 +00:00
1f559b7d3a Fix several hash functions that were taking chintzy shortcuts instead of
delivering a well-randomized hash value.  I got religion on this after
observing that performance of multi-batch hash join degrades terribly if the
higher-order bits of hash values aren't random, as indeed was true for say
hashes of small integer values.  It's now expected and documented that hash
functions should use hash_any or some comparable method to ensure that all
bits of their output are about equally random.

initdb forced because this change invalidates existing hash indexes.  For the
same reason, this isn't back-patchable; the hash join performance problem
will get a band-aid fix in the back branches.
2007-06-01 15:33:19 +00:00
10f719af33 Change build_index_pathkeys() so that the expressions it builds to represent
index key columns always have the type expected by the index's associated
operators, ie, we add RelabelType nodes when dealing with binary-compatible
index opclasses.  This is needed to get varchar indexes to play nicely with
the new EquivalenceClass machinery, as per recent gripe from Josh Berkus that
CVS HEAD was failing to match a varchar index column to a constant restriction
in the query.

It seems likely that this change will allow removal of a lot of ugly ad-hoc
RelabelType-stripping that the planner has traditionally done while matching
expressions to other expressions, but I'll worry about that some other day.
2007-05-31 16:57:34 +00:00
d526575f89 Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena.  Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer.  Those flushes will now occur only once per
ring-ful.  The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done.  The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it.  To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-30 20:12:03 +00:00
14c4d3dea9 Fix trivial misspelling in comment. 2007-05-30 16:16:32 +00:00
6af04882de Fix a bug in input processing for the "interval" type. Previously,
"microsecond" and "millisecond" units were not considered valid input
by themselves, which caused inputs like "1 millisecond" to be rejected
erroneously.

Update the docs, add regression tests, and backport to 8.2 and 8.1
2007-05-29 04:58:43 +00:00
97d12b434f Ooops, I was too busy worrying about getting the transactional infrastructure
right to think carefully about how insert and delete counts map to
n_live_tuples.  Of course a deletion should reduce n_live_tuples.
2007-05-27 17:28:36 +00:00
8d675c85c5 pgstat's on-proc-exit hook has to execute after the last transaction commit
or abort within a backend; rearrange InitPostgres processing to make it so.
Revealed by just-added Asserts along with ECPG regression tests (hm, I wonder
why the core regression tests didn't expose it?).  This possibly is another
reason for missing stats updates ...
2007-05-27 05:37:50 +00:00
77947c51c0 Fix up pgstats counting of live and dead tuples to recognize that committed
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.

Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures.  And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
2007-05-27 03:50:39 +00:00
604ffd280b Create hooks to let a loadable plugin monitor (or even replace) the planner
and/or create plans for hypothetical situations; in particular, investigate
plans that would be generated using hypothetical indexes.  This is a
heavily-rewritten version of the hooks proposed by Gurjeet Singh for his
Index Advisor project.  In this formulation, the index advisor can be
entirely a loadable module instead of requiring a significant part to be
in the core backend, and plans can be generated for hypothetical indexes
without requiring the creation and rolling-back of system catalog entries.

The index advisor patch as-submitted is not compatible with these hooks,
but it needs significant work anyway due to other 8.2-to-8.3 planner
changes.  With these hooks in the core backend, development of the advisor
can proceed as a pgfoundry project.
2007-05-25 17:54:25 +00:00
11086f2f2b Repair planner bug introduced in 8.2 by ability to rearrange outer joins:
in cases where a sub-SELECT inserts a WHERE clause between two outer joins,
that clause may prevent us from re-ordering the two outer joins.  The code
was considering only the joins' own ON-conditions in determining reordering
safety, which is not good enough.  Add a "delay_upper_joins" flag to
OuterJoinInfo to flag that we have detected such a clause and higher-level
outer joins shouldn't be permitted to commute with this one.  (This might
seem overly coarse, but given the current rules for OJ reordering, it's
sufficient AFAICT.)

The failure case is actually pretty narrow: it needs a WHERE clause within
the RHS of a left join that checks the RHS of a lower left join, but is not
strict for that RHS (else we'd have simplified the lower join to a plain
join).  Even then no failure will be manifest unless the planner chooses to
rearrange the join order.

Per bug report from Adam Terrey.
2007-05-22 23:23:58 +00:00
d7153c5fad Fix best_inner_indexscan to return both the cheapest-total-cost and
cheapest-startup-cost innerjoin indexscans, and make joinpath.c consider
both of these (when different) as the inside of a nestloop join.  The
original design was based on the assumption that indexscan paths always
have negligible startup cost, and so total cost is the only important
figure of merit; an assumption that's obviously broken by bitmap
indexscans.  This oversight could lead to choosing poor plans in cases
where fast-start behavior is more important than total cost, such as
LIMIT and IN queries.  8.1-vintage brain fade exposed by an example from
Chuck D.
2007-05-22 01:40:33 +00:00
2415ad9831 Teach tuplestore.c to throw away data before the "mark" point when the caller
is using mark/restore but not rewind or backward-scan capability.  Insert a
materialize plan node between a mergejoin and its inner child if the inner
child is a sort that is expected to spill to disk.  The materialize shields
the sort from the need to do mark/restore and thereby allows it to perform
its final merge pass on-the-fly; while the materialize itself is normally
cheap since it won't spill to disk unless the number of tuples with equal
key values exceeds work_mem.

Greg Stark, with some kibitzing from Tom Lane.
2007-05-21 17:57:35 +00:00
3963574d13 XPath fixes:
- Function renamed to "xpath".
 - Function is now strict, per discussion.
 - Return empty array in case when XPath expression detects nothing
   (previously, NULL was returned in such case), per discussion.
 - (bugfix) Work with fragments with prologue: select xpath('/a',
   '<?xml version="1.0"?><a /><b />'); // now XML datum is always wrapped
   with dummy <x>...</x>, XML prologue simply goes away (if any).
 - Some cleanup.

Nikolay Samokhvalov

Some code cleanup and documentation work by myself.
2007-05-21 17:10:29 +00:00
a8d539f124 To support external compression of archived WAL data, add a flag bit to
WAL records that shows whether it is safe to remove full-page images
(ie, whether or not an on-line backup was in progress when the WAL entry
was made).  Also make provision for an XLOG_NOOP record type that can be
used to fill in the extra space when decompressing the data for restore.

This is the portion of Koichi Suzuki's "full page writes" patch that
has to go into the core database.  The remainder of that work is two
external compression and decompression programs, which for the time being
will undergo separate development on pgfoundry.  Per discussion.

Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be
possible to compress them (the previous coding caused essential info
to be omitted).  The other commonly-used record types seem OK already,
with the possible exception of GIN and GIST WAL records, which I don't
understand well enough to opine on.
2007-05-20 21:08:19 +00:00
b40776d221 Have CLUSTER advance the table's relfrozenxid. The new frozen point is the
FreezeXid introduced in a recent commit, so there isn't any data loss in this
approach.

Doing it causes ALTER TABLE (or rather, the forms of it that cause a full table
rewrite) to be affected as well.  In this case, the frozen point is RecentXmin,
because after the rewrite all the tuples are relabeled with the rewriting
transaction's Xid.

TOAST tables are fixed automatically as well, as fallout of the way they were
already being handled in the respective code paths.

With this patch, there is no longer need to VACUUM tables for Xid wraparound
purposes that have been cleaned up via TRUNCATE or CLUSTER.
2007-05-18 23:19:42 +00:00
dbb769352d Temporary fix for the problem that pg_stat_activity, inet_client_addr(),
and inet_server_addr() fail if the client connected over a "scoped" IPv6
address.  In this case getnameinfo() will return a string ending with
a poorly-standardized "%something" zone specifier, which these functions
try to feed to network_in(), which won't take it.  So that we don't lose
functionality altogether, suppress the zone specifier before giving the
string to network_in().  Per report from Brian Hirt.

TODO: probably someday the inet type should support scoped IPv6 addresses,
and then this patch should be reverted.

Backpatch to 8.2 ... is it worth going further?
2007-05-17 23:31:49 +00:00
b11123b675 Fix parameter recalculation for Limit nodes: during a ReScan call we must
recompute the limit/offset immediately, so that the updated values are
available when the child's ReScan function is invoked.  Add a regression
test for this, too.  Bug is new in HEAD (due to the bounded-sorting patch)
so no need for back-patch.

I did not do anything about merging this signaling with chgParam processing,
but if we were to do that we'd still need to compute the updated values
at this point rather than during the first ProcNode call.

Per observation and test case from Greg Stark, though I didn't use his patch.
2007-05-17 19:35:08 +00:00
3b0347b36e Move the tuple freezing point in CLUSTER to a point further back in the past,
to avoid losing useful Xid information in not-so-old tuples.  This makes
CLUSTER behave the same as VACUUM as far a tuple-freezing behavior goes
(though CLUSTER does not yet advance the table's relfrozenxid).

While at it, move the actual freezing operation in rewriteheap.c to a more
appropriate place, and document it thoroughly.  This part of the patch from
Tom Lane.
2007-05-17 15:28:29 +00:00
90cbc63fd1 Have TRUNCATE advance the affected table's relfrozenxid to RecentXmin, to
avoid a later needless VACUUM for Xid-wraparound purposes.  We can do this
since the table is known to be left empty, so no Xid remains on it.

Per discussion.
2007-05-16 17:28:20 +00:00
178214d2ae Update comments for PG_DETOAST_PACKED and VARDATA_ANY on a structures
that require alignment.

Add a paragraph to the "User-Defined Types" chapter on using these
macros since it seems like they're a hit.

Gregory Stark
2007-05-15 17:39:54 +00:00
9aa3c782c9 Fix the problem that creating a user-defined type named _foo, followed by one
named foo, would work but the other ordering would not.  If a user-specified
type or table name collides with an existing auto-generated array name, just
rename the array type out of the way by prepending more underscores.  This
should not create any backward-compatibility issues, since the cases in which
this will happen would have failed outright in prior releases.

Also fix an oversight in the arrays-of-composites patch: ALTER TABLE RENAME
renamed the table's rowtype but not its array type.
2007-05-12 00:55:00 +00:00
d8326119c8 Fix my oversight in enabling domains-of-domains: ALTER DOMAIN ADD CONSTRAINT
needs to check the new constraint against columns of derived domains too.

Also, make it error out if the domain to be modified is used within any
composite-type columns.  Eventually we should support that case, but it seems
a bit painful, and not suitable for a back-patch.  For the moment just let the
user know we can't do it.

Backpatch to 8.2, which is the only released version that allows nested
domains.  Possibly the other part should be back-patched further.
2007-05-11 20:17:15 +00:00
bc8036fc66 Support arrays of composite types, including the rowtypes of regular tables
and views (but not system catalogs, nor sequences or toast tables).  Get rid
of the hardwired convention that a type's array type is named exactly "_type",
instead using a new column pg_type.typarray to provide the linkage.  (It still
will be named "_type", though, except in odd corner cases such as
maximum-length type names.)

Along the way, make tracking of owner and schema dependencies for types more
uniform: a type directly created by the user has these dependencies, while a
table rowtype or auto-generated array type does not have them, but depends on
its parent object instead.

David Fetter, Andrew Dunstan, Tom Lane
2007-05-11 17:57:14 +00:00
5b7cf08d33 Reserve some pg_statistic "kind" codes for use by the ESRI ST_Geometry
datatype project.  Per request from Ale Raza (araza at esri.com).
2007-05-08 19:13:52 +00:00
ade493e02d Add a hash function for "numeric". Mark the equality operator for
numerics as "oprcanhash", and make the corresponding system catalog
updates. As a result, hash indexes, hashed aggregation, and hash
joins can now be used with the numeric type. Bump the catversion.

The only tricky aspect to doing this is writing a correct hash
function: it's possible for two Numerics to be equal according to
their equality operator, but have different in-memory bit patterns.
To cope with this, the hash function doesn't consider the Numeric's
"scale" or "sign", and explictly skips any leading or trailing
zeros in the Numeric's digit buffer (the current implementation
should suppress any such zeros, but it seems unwise to rely upon
this). See discussion on pgsql-patches for more details.
2007-05-08 18:56:48 +00:00
d2a4a4069f Add a line to the EXPLAIN ANALYZE output for a Sort node, showing the
actual sort strategy and amount of space used.  By popular demand.
2007-05-04 21:29:53 +00:00
c7464720a3 tas() support for Renesas' M32R processor. Kazuhiro Inaoka 2007-05-04 15:20:52 +00:00
79ca7ffeb6 A few fixups in error handling: mark pg_re_throw() as noreturn for gcc,
and for other compilers, insert a dummy exit() call so that they understand
PG_RE_THROW() doesn't return.  Insert fflush(stderr) in ExceptionalCondition,
per recent buildfarm evidence that that might not happen automatically on some
platforms.  And const-ify ExceptionalCondition's declaration while at it.
2007-05-04 02:01:02 +00:00