PostgreSQL Source Code Walkthrough 46: The Visibility Map (VM)
Introduction
To implement multi-version concurrency control, Postgres does not physically remove a tuple when a transaction deletes or updates it; the tuple is only deleted logically (by setting flag bits such as xmax and infomask in the tuple header). As a workload accumulates, the table bloats more and more, which interferes with plan generation and the choice of the optimal path. The remedy is to run VACUUM to clean out these dead tuples. But a table may consist of many pages, and quickly locating the pages that actually contain dead tuples matters greatly under high concurrency. Fortunately, PostgreSQL maintains an auxiliary file for each table, the visibility map (VM), to speed up deciding whether a heap block contains dead tuples.
VM file structure
The VM stores two bits for each heap page (all-visible and all-frozen), recording respectively whether the page contains dead tuples and whether all tuples on the page are frozen.
If the all-visible bit is set, every tuple on the page is visible to all current and future transactions, so the page needs no vacuuming;
If the all-frozen bit is set, every tuple on the page has been frozen, so the page can be skipped even by a whole-table vacuum.
NOTE: the all-frozen bit may only be set on a page whose all-visible bit has already been set.
A quick look at how the bits are written and updated. The meaning of each bit value:
all-visible bit: 0 ==> the page may contain dead tuples; 1 ==> all tuples are visible, no dead tuples
all-frozen bit: 0 ==> the page contains unfrozen tuples; 1 ==> all tuples are frozen (and visible)
For illustration, take the first byte of a VM page. It covers the first four heap pages (blocks 0 through 3); block 0 occupies the least significant bit pair, and within each pair the low bit is all-visible (0x01) and the high bit is all-frozen (0x02).
The byte in binary (most significant bit first): 01 00 00 00
Reading from the least significant pair upward: the first through third heap pages (pairs 00) may contain dead tuples, while the fourth page (pair 01) is all-visible and contains none.
Scenario: VACUUM runs on the heap table; the dead tuples in the first page (block 0) are removed, so its all-visible bit must be set, and meanwhile every tuple in the fourth page (block 3) becomes frozen.
Reads are performed a byte at a time, so the code obtains the start of the map contents through a char *map pointer and locates a block's all-visible and all-frozen bits via the offset macros shown later.
1 Block 0's bit pair is 00; setting all-visible turns it into 01, giving 01 00 00 01;
2 Block 3's bit pair is 01; setting all-frozen turns it into 11, giving 11 00 00 01.
The same two updates are shown as a runnable sketch right after this list.
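A minimal, self-contained sketch of exactly these two updates. The flag values mirror VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN; the macro names and the example byte are illustrative, not from the source:

#include <stdio.h>
#include <stdint.h>

#define ALL_VISIBLE 0x01    /* low bit of each pair, as VISIBILITYMAP_ALL_VISIBLE */
#define ALL_FROZEN  0x02    /* high bit of each pair, as VISIBILITYMAP_ALL_FROZEN */

int main(void)
{
    /*
     * One VM byte covering heap blocks 0..3; block 0 occupies the least
     * significant bit pair.  Initially only block 3 (the fourth page) is
     * all-visible: binary 01 00 00 00.
     */
    uint8_t vmbyte = ALL_VISIBLE << (3 * 2);

    /* VACUUM cleaned block 0, so set its all-visible bit: 01 00 00 01. */
    vmbyte |= ALL_VISIBLE << (0 * 2);

    /* Every tuple in block 3 is now frozen, so set all-frozen: 11 00 00 01. */
    vmbyte |= ALL_FROZEN << (3 * 2);

    /* Prints: block 3 status = 0x3 (all-visible | all-frozen) */
    printf("block 3 status = 0x%x\n", (vmbyte >> (3 * 2)) & 0x03);
    return 0;
}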
Macros and data structures
/* Number of bits for one heap page */
#define BITS_PER_HEAPBLOCK 2                                  // 2 bits per heap block

/* Flags for bit map */
#define VISIBILITYMAP_ALL_VISIBLE   0x01                      // all_visible
#define VISIBILITYMAP_ALL_FROZEN    0x02                      // all_frozen
#define VISIBILITYMAP_VALID_BITS    0x03    /* OR of all valid visibilitymap
                                             * flags bits */

/*
 * Size of the bitmap on each visibility map page, in bytes. There's no
 * extra headers, so the whole page minus the standard page header is
 * used for the bitmap.
 */
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))     // usable map size within one VM page

/* Number of heap blocks we can represent in one byte */
#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)   // one byte covers 4 heap blocks

/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)   // heap blocks covered by one map page

/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

/* Masks for counting subsets of bits in the visibility map. */
#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555)    /* The lower bit of each
                                                           * bit pair */
#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa)     /* The upper bit of each
                                                           * bit pair */
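These masks let visibilitymap_count (not quoted here) tally all-visible and all-frozen pages with one popcount per 64-bit word instead of examining bit pairs one at a time. A standalone sketch of the idea, using the GCC/Clang popcount builtin as a stand-in for the source's pg_popcount64 and an invented two-word bitmap:

#include <stdint.h>
#include <stdio.h>

#define VISIBLE_MASK64 UINT64_C(0x5555555555555555)  /* low bit of each pair  */
#define FROZEN_MASK64  UINT64_C(0xaaaaaaaaaaaaaaaa)  /* high bit of each pair */

int main(void)
{
    /* Pretend these 64-bit words are a slice of a VM page's bitmap. */
    uint64_t map[2] = { UINT64_C(0xC1),      /* blocks 0 and 3 visible, 3 frozen */
                        UINT64_C(0x5555) };  /* eight all-visible blocks */
    uint64_t nvisible = 0, nfrozen = 0;

    for (int i = 0; i < 2; i++)
    {
        /* One popcount per word counts 32 bit pairs at once. */
        nvisible += __builtin_popcountll(map[i] & VISIBLE_MASK64);
        nfrozen  += __builtin_popcountll(map[i] & FROZEN_MASK64);
    }

    /* Prints: all-visible: 10, all-frozen: 1 */
    printf("all-visible: %llu, all-frozen: %llu\n",
           (unsigned long long) nvisible, (unsigned long long) nfrozen);
    return 0;
}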
// Accessor for file pages that contain no line pointers, which is exactly the case for VM pages
/*
 * PageGetContents
 *		To be used in cases where the page does not contain line pointers.
 *
 * Note: prior to 8.3 this was not guaranteed to yield a MAXALIGN'd result.
 * Now it is. Beware of old code that might think the offset to the contents
 * is just SizeOfPageHeaderData rather than MAXALIGN(SizeOfPageHeaderData).
 */
#define PageGetContents(page) \
    ((char *) (page) + MAXALIGN(SizeOfPageHeaderData))
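Putting the addressing macros together: a quick sketch of where a given heap block's bits live. The concrete numbers assume BLCKSZ = 8192 and MAXALIGN(SizeOfPageHeaderData) = 24, which are typical but build-dependent:

#include <stdio.h>

/* Assumed values: BLCKSZ = 8192 and MAXALIGN(SizeOfPageHeaderData) = 24. */
#define MAPSIZE             (8192 - 24)                      /* 8168 bytes of bitmap */
#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE (8 / BITS_PER_HEAPBLOCK)         /* 4 blocks per byte */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)  /* 32672 blocks per VM page */

int main(void)
{
    unsigned heapBlk = 100000;   /* an arbitrary heap block number */

    unsigned mapBlock  = heapBlk / HEAPBLOCKS_PER_PAGE;
    unsigned mapByte   = (heapBlk % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
    unsigned mapOffset = (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

    /* Prints: VM block 3, byte 496, bit offset 0 */
    printf("VM block %u, byte %u, bit offset %u\n", mapBlock, mapByte, mapOffset);
    return 0;
}

With these numbers a single VM page describes 32672 heap blocks, roughly 255 MB of heap, which is why the VM fork stays tiny relative to the table itself.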
Interface functions
1 visibilitymap_set
The main job of this function is to set visibility bits. Its flow is:
1) Perform sanity checks: verify that the heap buffer and VM buffer passed in are valid and actually contain the expected pages (elog ERROR otherwise);
2) Obtain the address of the VM page contents (PageGetContents skips past the PageHeaderData) and take BUFFER_LOCK_EXCLUSIVE on vmBuf;
3) If the requested bits are not already set, then:
(1) enter a critical section, OR the flags into the right bit pair, and mark vmBuf dirty;
(2) if the relation is WAL-logged and we are not in recovery, write a WAL record; when wal_log_hints (or data checksums) is enabled, the record's LSN must also be stamped on the heap page; finally stamp the LSN on the VM page and leave the critical section;
4) Release the exclusive lock on vmBuf (the pin is kept for the caller). A hedged caller-side sketch follows this list.
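Before the source, a sketch of the protocol the comments below describe: pin the right VM page first, set PD_ALL_VISIBLE on the heap page, then set the map bits. It is loosely modeled on lazy vacuum; the helper name and the all_visible / visibility_cutoff_xid parameters are illustrative, not a verbatim call site:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Illustrative caller; not taken verbatim from the source. */
static void
mark_page_all_visible(Relation rel, BlockNumber blkno, Buffer heapBuf,
                      bool all_visible, TransactionId visibility_cutoff_xid)
{
    Buffer vmbuffer = InvalidBuffer;

    /* Pin the covering VM page up front: visibilitymap_set does no I/O. */
    visibilitymap_pin(rel, blkno, &vmbuffer);

    LockBuffer(heapBuf, BUFFER_LOCK_EXCLUSIVE);
    if (all_visible)
    {
        /* The caller must set PD_ALL_VISIBLE on the heap page first. */
        PageSetAllVisible(BufferGetPage(heapBuf));
        MarkBufferDirty(heapBuf);

        /* InvalidXLogRecPtr: we are in normal running, not recovery. */
        visibilitymap_set(rel, blkno, heapBuf, InvalidXLogRecPtr,
                          vmbuffer, visibility_cutoff_xid,
                          VISIBILITYMAP_ALL_VISIBLE);
    }
    LockBuffer(heapBuf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(vmbuffer);
}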
/*
* visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
* one provided; in normal running, we generate a new XLOG record and set the
* page LSN to that value. cutoff_xid is the largest xmin on the page being
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples. It can also be set
* to InvalidTransactionId when a page that is already all-visible is being
* marked all-frozen.
*
* Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
* this function. Except in recovery, caller should also pass the heap
* buffer. When checksums are enabled and we're not in recovery, we must add
* the heap buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
* any I/O.
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
                  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
                  uint8 flags)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
    Page        page;
    uint8      *map;

#ifdef TRACE_VISIBILITYMAP
    elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
#endif

    Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
    Assert(InRecovery || BufferIsValid(heapBuf));
    Assert(flags & VISIBILITYMAP_VALID_BITS);

    /* Check that we have the right heap page pinned, if present */
    if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
        elog(ERROR, "wrong heap buffer passed to visibilitymap_set");

    /* Check that we have the right VM page pinned */
    if (!BufferIsValid(vmBuf) || BufferGetBlockNumber(vmBuf) != mapBlock)
        elog(ERROR, "wrong VM buffer passed to visibilitymap_set");

    page = BufferGetPage(vmBuf);
    map = (uint8 *) PageGetContents(page);
    LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);

    if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
    {
        START_CRIT_SECTION();

        map[mapByte] |= (flags << mapOffset);
        MarkBufferDirty(vmBuf);

        if (RelationNeedsWAL(rel))
        {
            if (XLogRecPtrIsInvalid(recptr))
            {
                Assert(!InRecovery);
                recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
                                          cutoff_xid, flags);

                /*
                 * If data checksums are enabled (or wal_log_hints=on), we
                 * need to protect the heap page from being torn.
                 */
                if (XLogHintBitIsNeeded())
                {
                    Page    heapPage = BufferGetPage(heapBuf);

                    /* caller is expected to set PD_ALL_VISIBLE first */
                    Assert(PageIsAllVisible(heapPage));
                    PageSetLSN(heapPage, recptr);
                }
            }
            PageSetLSN(page, recptr);
        }

        END_CRIT_SECTION();
    }

    LockBuffer(vmBuf, BUFFER_LOCK_UNLOCK);
}
2 visibilitymap_get_status
This function returns the visibility status bits of a heap block. Its flow is:
1) Check whether *buf is valid; if it is, verify that it holds the map page covering the heap block, and release the pin when it does not;
2) If no valid buffer remains, call vm_readbuf (without extending) to load the VM page; if that returns InvalidBuffer the map page does not exist, so return false, i.e. no bits set;
3) Otherwise read the start of the map contents and extract the status bits at the computed byte and bit offset.
Note that only a pin is needed here; BUFFER_LOCK_SHARE is not taken. A hedged usage sketch follows this list.
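As a usage sketch: VM_ALL_VISIBLE, a convenience macro in visibilitymap.h, wraps visibilitymap_get_status and is how index-only scans consult the map. The counting helper below is invented for illustration; only the macro and the buffer handling follow the real API:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/* Illustrative helper: count the all-visible pages among the first nblocks. */
static BlockNumber
count_all_visible(Relation rel, BlockNumber nblocks)
{
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber nvisible = 0;

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        /* The pinned VM page is reused across consecutive heap blocks. */
        if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
            nvisible++;
    }
    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);
    return nvisible;
}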
/*
* visibilitymap_get_status - get status of bits
*
* Are all tuples on heapBlk visible to all or are marked frozen, according
* to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
* releasing *buf after it's done testing and setting bits.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
* since we don't lock the visibility map page either, it's even possible that
* someone else could have changed the bit just before we look at it, but yet
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
uint8
visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
    char       *map;
    uint8       result;

#ifdef TRACE_VISIBILITYMAP
    elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif

    /* Reuse the old pinned buffer if possible */
    if (BufferIsValid(*buf))
    {
        if (BufferGetBlockNumber(*buf) != mapBlock)
        {
            ReleaseBuffer(*buf);
            *buf = InvalidBuffer;
        }
    }

    if (!BufferIsValid(*buf))
    {
        *buf = vm_readbuf(rel, mapBlock, false);
        if (!BufferIsValid(*buf))
            return false;
    }

    map = PageGetContents(BufferGetPage(*buf));

    /*
     * A single byte read is atomic. There could be memory-ordering effects
     * here, but for performance reasons we make it the caller's job to worry
     * about that.
     */
    result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
    return result;
}
3 vm_readbuf
vm_readbuf loads the requested VM page into the buffer pool; when asked to, it extends the fork with newly created, initialized pages. Its usual caller is sketched below, followed by the function itself.
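vm_readbuf is static; outside code reaches it through visibilitymap_pin (referenced earlier but not quoted in this article) with extend = true, while visibilitymap_get_status passes extend = false. A simplified rendering of the pin function, reconstructed from the source, so treat the details as approximate:

/* Simplified sketch of visibilitymap_pin (see visibilitymap.c for the original). */
void
visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);

    /* Reuse the previously pinned map page if it is the right one. */
    if (BufferIsValid(*buf))
    {
        if (BufferGetBlockNumber(*buf) == mapBlock)
            return;
        ReleaseBuffer(*buf);
    }
    /* extend = true: create the map page if it does not exist yet. */
    *buf = vm_readbuf(rel, mapBlock, true);
}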
/*
* Read a visibility map page.
*
* If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
* true, the visibility map file is extended.
*/
static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
    Buffer      buf;
    SMgrRelation reln;

    /*
     * Caution: re-using this smgr pointer could fail if the relcache entry
     * gets closed. It's safe as long as we only do smgr-level operations
     * between here and the last use of the pointer.
     */
    reln = RelationGetSmgr(rel);

    /*
     * If we haven't cached the size of the visibility map fork yet, check it
     * first.
     */
    // First check whether the size of the VM fork is already cached
    if (reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
    {
        if (smgrexists(reln, VISIBILITYMAP_FORKNUM))    // if the fork exists, smgrnblocks caches its size
            smgrnblocks(reln, VISIBILITYMAP_FORKNUM);
        else
            reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
    }

    /* Handle requests beyond EOF */
    // The requested block lies beyond the fork's current end: extend via
    // vm_extend when asked to, otherwise return InvalidBuffer
    if (blkno >= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
    {
        if (extend)
            vm_extend(rel, blkno + 1);
        else
            return InvalidBuffer;
    }

    /*
     * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
     * always safe to clear bits, so it's better to clear corrupt pages than
     * error out.
     *
     * The initialize-the-page part is trickier than it looks, because of the
     * possibility of multiple backends doing this concurrently, and our
     * desire to not uselessly take the buffer lock in the normal path where
     * the page is OK. We must take the lock to initialize the page, so
     * recheck page newness after we have the lock, in case someone else
     * already did it. Also, because we initially check PageIsNew with no
     * lock, it's possible to fall through and return the buffer while someone
     * else is still initializing the page (i.e., we might see pd_upper as set
     * but other page header fields are still zeroes). This is harmless for
     * callers that will take a buffer lock themselves, but some callers
     * inspect the page without any lock at all. The latter is OK only so
     * long as it doesn't depend on the page header having correct contents.
     * Current usage is safe because PageGetContents() does not require that.
     */
    // Normal path: read the VM page into a shared buffer. If the page looks
    // NEW, take BUFFER_LOCK_EXCLUSIVE and re-check PageIsNew, because another
    // backend may have initialized the page while we were waiting for the lock
    buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
                             RBM_ZERO_ON_ERROR, NULL);
    if (PageIsNew(BufferGetPage(buf)))
    {
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        if (PageIsNew(BufferGetPage(buf)))
            PageInit(BufferGetPage(buf), BLCKSZ, 0);
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }
    return buf;
}
4 vm_extend
When the requested VM page does not yet exist in the file, vm_extend is called to extend the fork with new pages and initialize them. Its flow is:
1) Initialize a blank page image, filling in the PageHeader fields (pd_lower, pd_upper and the flags);
2) Take the relation extension lock, so no other backend performs the same extension concurrently;
3) If the VM fork does not exist yet, create it with smgrcreate; otherwise continue with step 4);
4) Read the current number of VM blocks; while it is still below the requested vm_nblocks, append zero-initialized pages one at a time with smgrextend;
5) Send a shared-invalidation message that forces other backends to drop their smgr references to this relation; this is an optimization that spares them from repeatedly checking for creation or extension of the file, which happens only infrequently;
6) Finally release the extension lock.
/*
* Ensure that the visibility map fork is at least vm_nblocks long, extending
* it if necessary with zeroed pages.
*/
static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
    BlockNumber vm_nblocks_now;
    PGAlignedBlock pg;
    SMgrRelation reln;

    PageInit((Page) pg.data, BLCKSZ, 0);

    /*
     * We use the relation extension lock to lock out other backends trying to
     * extend the visibility map at the same time. It also locks out extension
     * of the main fork, unnecessarily, but extending the visibility map
     * happens seldom enough that it doesn't seem worthwhile to have a
     * separate lock tag type for it.
     *
     * Note that another backend might have extended or created the relation
     * by the time we get the lock.
     */
    LockRelationForExtension(rel, ExclusiveLock);

    /*
     * Caution: re-using this smgr pointer could fail if the relcache entry
     * gets closed. It's safe as long as we only do smgr-level operations
     * between here and the last use of the pointer.
     */
    reln = RelationGetSmgr(rel);

    /*
     * Create the file first if it doesn't exist. If smgr_vm_nblocks is
     * positive then it must exist, no need for an smgrexists call.
     */
    if ((reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
         reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
        !smgrexists(reln, VISIBILITYMAP_FORKNUM))
        smgrcreate(reln, VISIBILITYMAP_FORKNUM, false);

    /* Invalidate cache so that smgrnblocks() asks the kernel. */
    reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;

    vm_nblocks_now = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);

    /* Now extend the file */
    while (vm_nblocks_now < vm_nblocks)
    {
        PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);

        smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
        vm_nblocks_now++;
    }

    /*
     * Send a shared-inval message to force other backends to close any smgr
     * references they may have for this rel, which we are about to change.
     * This is a useful optimization because it means that backends don't have
     * to keep checking for creation or extension of the file, which happens
     * infrequently.
     */
    CacheInvalidateSmgr(reln->smgr_rnode);

    UnlockRelationForExtension(rel, ExclusiveLock);
}