From ecfc320dd6cfcf8ebab7f5715079c95dea6e442d Mon Sep 17 00:00:00 2001
From: Chunling Wang
Date: Fri, 30 Sep 2022 16:34:36 +0800
Subject: [PATCH] issue#I5UDM6 LogAccessExclusiveLock() when primary node acquires AccessExclusiveLock for systable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Vacuum can trigger a truncate on a table (for example, the system table
pg_database after a large number of DROP DATABASE operations). While the
standby replays the truncate, a monitoring program may query the same
table. xlog_block_smgr_redo_truncate locks the table's relfilenode with
LockRelFileNode, but a scan locks the relation's OID with
LockRelationOid, so the truncate replay and the SELECT run concurrently.
Buffers invalidated by the truncate are then loaded into the buffer pool
again, and an error occurs later when new data is inserted into a page
that should already have been invalid.

In detail: the query obtains nblocks through smgrnblocks in initscan. If
the truncate replay performs InvalidateBuffer first and only afterwards
CacheInvalidateSmgr and smgr_truncate, the scan can load the invalidated
buffers back into the buffer pool, which causes the error.

Looking back at the PostgreSQL code: when the primary takes an
AccessExclusiveLock on a relation, it writes an XLOG_STANDBY_LOCK record,
and the standby takes the AccessExclusiveLock when it replays that
record. In openGauss, because of leftover distributed-mode code, the
record was written only for user tables, so system tables hit the
problem described above.
---
 src/gausskernel/storage/lmgr/lock.cpp | 26 ++------------------------
 1 file changed, 2 insertions(+), 24 deletions(-)

diff --git a/src/gausskernel/storage/lmgr/lock.cpp b/src/gausskernel/storage/lmgr/lock.cpp
index b0dc2b148..66cc00bf8 100644
--- a/src/gausskernel/storage/lmgr/lock.cpp
+++ b/src/gausskernel/storage/lmgr/lock.cpp
@@ -754,30 +754,8 @@ static LockAcquireResult LockAcquireExtendedXC(const LOCKTAG *locktag, LOCKMODE
      */
     if (lockmode >= AccessExclusiveLock && locktag->locktag_type == LOCKTAG_RELATION && !RecoveryInProgress() &&
         XLogStandbyInfoActive()) {
-        /*
-         * In a scenario like:
-         *
-         * 1, openGauss run vacuum full or autovacuum pg_class, insert AccessExclusiveLock xlog.
-         * 2, datanode crash, vacuum full abort.
-         * 3, datanode restart in pending mode, start recovery.
-         * 4, startup thread acquire pg_class's AccessExclusiveLock.
-         * 5, startup thread complete recovery and wait for notify.
-         * 6, cm agent connect datanode, need to init relcache file.
-         * 7, cm agent connect want to acquire pg_class's AccessShareLock,
-         *    but the AccessExclusiveLock lock is hold by startup thread.
-         * 8, dead lock, datanode hang.
-         *
-         * Other system tables like pg_attribute/pg_type.. also have such problem.
-         *
-         * To solve this problem, we don't insert AccessExclusiveLock xlog for system tables.
-         * This change may cause exception when primary run vacuum full system table while
-         * standby access the system table at the same time.
-         *
-         */
-        if (locktag->locktag_field2 > FirstNormalObjectId) {
-            LogAccessExclusiveLockPrepare();
-            log_lock = true;
-        }
+        LogAccessExclusiveLockPrepare();
+        log_lock = true;
     }

     /*