Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.

We still require AccessExclusiveLock on the partition itself, because otherwise an insert that violates the newly-imposed partition constraint could be in progress at the same time that we're changing that constraint; only the lock level on the parent relation is weakened.

To make this safe, we have to cope with (at least) three separate problems. First, relevant DDL might commit while we're in the process of building a PartitionDesc. If so, find_inheritance_children() might see a new partition while the RELOID system cache still has the old partition bound cached, and even before invalidation messages have been queued. To fix that, if we see that the pg_class tuple seems to be missing or to have a null relpartbound, refetch the value directly from the table. We can't get the wrong value, because DETACH PARTITION still requires AccessExclusiveLock throughout; if we ever want to change that, this will need more thought. In testing, I found it quite difficult to hit even the null-relpartbound case; the race condition is extremely tight, but the theoretical risk is there.

Second, successive calls to RelationGetPartitionDesc might not return the same answer. The query planner will get confused if looking up the PartitionDesc for a particular relation does not return a consistent answer for the entire duration of query planning. Likewise, query execution will get confused if the same relation seems to have a different PartitionDesc at different times. Invent a new PartitionDirectory concept and use it to ensure consistency (sketched in the example below). This ensures that a single invocation of either the planner or the executor sees the same view of the PartitionDesc from beginning to end, but it does not guarantee that the planner and the executor see the same view. Since this allows pointers to old PartitionDesc entries to survive even after a relcache rebuild, also postpone removing the old PartitionDesc entry until we're certain no one is using it.

For the most part, it seems to be OK for the planner and executor to have different views of the PartitionDesc, because the executor will just ignore any concurrently added partitions which were unknown at plan time; those partitions won't be part of the inheritance expansion, but invalidation messages will trigger replanning at some point. Normally, this happens by the time the very next command is executed, but if the next command acquires no locks and executes a prepared query, it can manage not to notice until a new transaction is started. We might want to tighten that up, but it's material for a separate patch. There would still be a small window where a query that started just after an ATTACH PARTITION command committed might fail to notice its results -- but only if the command starts before the commit has been acknowledged to the user. All in all, the warts here around serializability seem small enough to be worth accepting for the considerable advantage of being able to add partitions without a full table lock.

Although in general the consequences of new partitions showing up between planning and execution are limited to the query not noticing the new partitions, run-time partition pruning will get confused in that case, so that's the third problem that this patch fixes. Run-time partition pruning assumes that indexes into the PartitionDesc are stable between planning and execution.
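The PartitionDirectory mechanism just described can be made concrete with a short, hedged sketch. It relies only on the CreatePartitionDirectory / PartitionDirectoryLookup / DestroyPartitionDirectory functions declared in src/include/partitioning/partdesc.h; the example function itself is invented for illustration:

#include "postgres.h"
#include "partitioning/partdesc.h"
#include "utils/rel.h"

/*
 * Invented example: a single planner or executor invocation routes all
 * PartitionDesc lookups through one PartitionDirectory, so repeated
 * lookups for the same relation return the same descriptor even if a
 * concurrent ATTACH PARTITION commits in between.
 */
static void
stable_partdesc_example(Relation partitioned_rel)
{
    PartitionDirectory pdir;
    PartitionDesc pdesc1;
    PartitionDesc pdesc2;

    /* One directory per planner or executor invocation. */
    pdir = CreatePartitionDirectory(CurrentMemoryContext);

    pdesc1 = PartitionDirectoryLookup(pdir, partitioned_rel);
    /* ... a concurrent ATTACH PARTITION may commit here ... */
    pdesc2 = PartitionDirectoryLookup(pdir, partitioned_rel);

    /* The directory pins the first answer it saw for this relation. */
    Assert(pdesc1 == pdesc2);

    /* Torn down only once nothing can still be using the PartitionDescs. */
    DestroyPartitionDirectory(pdir);
}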
So, add code so that if new partitions are added between plan time and execution time, the indexes stored in the subplan_map[] and subpart_map[] arrays within the plan's PartitionedRelPruneInfo get adjusted accordingly. There does not seem to be a simple way to generalize this scheme to cope with partitions that are removed, mostly because they could then get added back again with different bounds, but it works OK for added partitions.

This code does not try to ensure that every backend participating in a parallel query sees the same view of the PartitionDesc. That currently doesn't matter, because we never pass PartitionDesc indexes between backends. Each backend will ignore the concurrently added partitions which it notices, and it doesn't matter if different backends are ignoring different sets of concurrently added partitions. If in the future that matters, for example because we allow writes in parallel query and want all participants to do tuple routing to the same set of partitions, the PartitionDirectory concept could be improved to share PartitionDescs across backends. There is a draft patch to serialize and restore PartitionDescs on the thread where this patch was discussed, which may be a useful place to start.

Patch by me. Thanks to Alvaro Herrera, David Rowley, Simon Riggs, Amit Langote, and Michael Paquier for discussion, and to Alvaro Herrera for some review.

Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
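The subplan_map[] adjustment can likewise be sketched. This is a deliberately simplified illustration, not the actual execPartition.c code, and all names below are invented: plan-time partitions are matched by OID against the executor-time PartitionDesc (both listed in partition bound order), and executor-time partitions with no plan-time match get -1, meaning "no subplan; concurrently added, so ignore":

#include "postgres.h"

/*
 * Invented example: merge plan-time subplan indexes onto the
 * executor-time PartitionDesc.  Because DETACH PARTITION still requires
 * AccessExclusiveLock, every plan-time partition must still exist, so
 * the plan-time OID list is a subsequence of the executor-time one and
 * a single merge pass suffices.
 */
static void
remap_subplan_indexes(const Oid *plan_oids, const int *plan_subplan_map,
                      int nparts_plan,
                      const Oid *exec_oids, int nparts_exec,
                      int *exec_subplan_map)
{
    int         pp = 0;         /* cursor over plan-time partitions */
    int         ep;

    for (ep = 0; ep < nparts_exec; ep++)
    {
        if (pp < nparts_plan && plan_oids[pp] == exec_oids[ep])
            exec_subplan_map[ep] = plan_subplan_map[pp++];
        else
            exec_subplan_map[ep] = -1;  /* added after planning: ignore */
    }

    /* Every partition known at plan time must have been matched. */
    Assert(pp == nparts_plan);
}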
155 lines
5.8 KiB
C
/*--------------------------------------------------------------------
 * execPartition.h
 *		POSTGRES partitioning executor interface
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *		src/include/executor/execPartition.h
 *--------------------------------------------------------------------
 */

#ifndef EXECPARTITION_H
#define EXECPARTITION_H

#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "nodes/plannodes.h"
#include "partitioning/partprune.h"

/* See execPartition.c for the definitions. */
typedef struct PartitionDispatchData *PartitionDispatch;
typedef struct PartitionTupleRouting PartitionTupleRouting;

/*
 * PartitionRoutingInfo
 *
 * Additional result relation information specific to routing tuples to a
 * table partition.
 */
typedef struct PartitionRoutingInfo
{
	/*
	 * Map for converting tuples in root partitioned table format into
	 * partition format, or NULL if no conversion is required.
	 */
	TupleConversionMap *pi_RootToPartitionMap;

	/*
	 * Map for converting tuples in partition format into the root
	 * partitioned table format, or NULL if no conversion is required.
	 */
	TupleConversionMap *pi_PartitionToRootMap;

	/*
	 * Slot to store tuples in partition format, or NULL when no translation
	 * is required between root and partition.
	 */
	TupleTableSlot *pi_PartitionTupleSlot;
} PartitionRoutingInfo;
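
/*
 * Editor's illustrative sketch, not part of this header: one plausible way
 * tuple-routing code could apply a PartitionRoutingInfo.  The real logic
 * lives in execPartition.c; the helper name is invented, and the example
 * assumes access/tupconvert.h for execute_attr_map_slot().
 */
#ifdef EXECPARTITION_EXAMPLES
static TupleTableSlot *
convert_root_tuple_for_partition(PartitionRoutingInfo *pinfo,
								 TupleTableSlot *rootslot)
{
	/* No map means the partition's tuple layout matches the root's. */
	if (pinfo->pi_RootToPartitionMap == NULL)
		return rootslot;

	/* Rearrange attributes into partition order, using pinfo's own slot. */
	return execute_attr_map_slot(pinfo->pi_RootToPartitionMap->attrMap,
								 rootslot,
								 pinfo->pi_PartitionTupleSlot);
}
#endif							/* EXECPARTITION_EXAMPLES */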

/*
 * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
 * of partitions.  For a multilevel partitioned table, we have one of these
 * for the topmost partition plus one for each non-leaf child partition.
 *
 * subplan_map[] and subpart_map[] have the same definitions as in
 * PartitionedRelPruneInfo (see plannodes.h); though note that here,
 * subpart_map contains indexes into PartitionPruningData.partrelprunedata[].
 *
 * subplan_map			Subplan index by partition index, or -1.
 * subpart_map			Subpart index by partition index, or -1.
 * present_parts		A Bitmapset of the partition indexes that we have
 *						subplans or subparts for.
 * context				Contains the context details required to call the
 *						partition pruning code.
 * pruning_steps		List of PartitionPruneSteps used to perform the
 *						actual pruning.
 * do_initial_prune		true if pruning should be performed during executor
 *						startup (for this partitioning level).
 * do_exec_prune		true if pruning should be performed during executor
 *						run (for this partitioning level).
 */
typedef struct PartitionedRelPruningData
{
	int		   *subplan_map;
	int		   *subpart_map;
	Bitmapset  *present_parts;
	PartitionPruneContext context;
	List	   *pruning_steps;
	bool		do_initial_prune;
	bool		do_exec_prune;
} PartitionedRelPruningData;

/*
 * PartitionPruningData - Holds all the run-time pruning information for
 * a single partitioning hierarchy containing one or more partitions.
 * partrelprunedata[] is an array ordered such that parents appear before
 * their children; in particular, the first entry is the topmost partition,
 * which was actually named in the SQL query.
 */
typedef struct PartitionPruningData
{
	int			num_partrelprunedata;	/* number of array entries */
	PartitionedRelPruningData partrelprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruningData;

/*
 * PartitionPruneState - State object required for plan nodes to perform
 * run-time partition pruning.
 *
 * This struct can be attached to plan types which support arbitrary Lists of
 * subplans containing partitions, to allow subplans to be eliminated due to
 * the clauses being unable to match to any tuple that the subplan could
 * possibly produce.
 *
 * execparamids		Contains paramids of PARAM_EXEC Params found within any
 *					of the partprunedata structs.  Pruning must be done again
 *					each time the value of one of these parameters changes.
 * other_subplans	Contains indexes of subplans that don't belong to any
 *					"partprunedata", e.g. UNION ALL children that are not
 *					partitioned tables, or a partitioned table that the
 *					planner deemed run-time pruning to be useless for.
 *					These must not be pruned.
 * prune_context	A short-lived memory context in which to execute the
 *					partition pruning functions.
 * do_initial_prune	true if pruning should be performed during executor
 *					startup (at any hierarchy level).
 * do_exec_prune	true if pruning should be performed during executor
 *					run (at any hierarchy level).
 * num_partprunedata	Number of items in "partprunedata" array.
 * partprunedata	Array of PartitionPruningData pointers for the plan's
 *					partitioned relation(s), one for each partitioning
 *					hierarchy that requires run-time pruning.
 */
typedef struct PartitionPruneState
{
	Bitmapset  *execparamids;
	Bitmapset  *other_subplans;
	MemoryContext prune_context;
	bool		do_initial_prune;
	bool		do_exec_prune;
	int			num_partprunedata;
	PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruneState;

extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(EState *estate,
							 ModifyTableState *mtstate,
							 Relation rel);
extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
				  ResultRelInfo *rootResultRelInfo,
				  PartitionTupleRouting *proute,
				  TupleTableSlot *slot,
				  EState *estate);
extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
						PartitionTupleRouting *proute);
extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
							  PartitionPruneInfo *partitionpruneinfo);
extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate);
extern Bitmapset *ExecFindInitialMatchingSubPlans(PartitionPruneState *prunestate,
								int nsubplans);

#endif							/* EXECPARTITION_H */
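To show how the declarations above fit together, here is a minimal, hedged sketch of the call pattern an executor node is expected to follow, modeled loosely on the Append node; the wrapper function names are invented for illustration:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "executor/execPartition.h"

/*
 * Invented example: set up run-time pruning for a node with 'nsubplans'
 * subplans and compute the set of subplans that survive startup pruning.
 */
static Bitmapset *
example_startup_pruning(PlanState *planstate,
						PartitionPruneInfo *pruneinfo,
						int nsubplans,
						PartitionPruneState **pprunestate)
{
	PartitionPruneState *prunestate;

	prunestate = ExecCreatePartitionPruneState(planstate, pruneinfo);
	*pprunestate = prunestate;

	/* Prune with values that are already known at executor startup. */
	if (prunestate->do_initial_prune)
		return ExecFindInitialMatchingSubPlans(prunestate, nsubplans);

	/* Nothing prunable at startup: every subplan remains valid. */
	return bms_add_range(NULL, 0, nsubplans - 1);
}

/*
 * Invented example: re-prune during execution whenever the value of a
 * parameter listed in prunestate->execparamids changes.
 */
static Bitmapset *
example_exec_pruning(PartitionPruneState *prunestate)
{
	return ExecFindMatchingSubPlans(prunestate);
}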