PostgreSQL Source Code Walkthrough 46: The Visibility Map (VM)
Introduction
To implement multi-version concurrency control, Postgres does not physically remove a tuple when a transaction deletes or updates it; the tuple is only deleted logically (by setting flag bits such as xmax and infomask in the tuple header). As a workload accumulates, the table bloats more and more, which interferes with plan generation and the choice of the optimal path. The remedy is to run VACUUM to clean out these dead tuples. But a table may consist of many pages, and quickly locating the pages that actually contain dead tuples matters greatly under high concurrency. Fortunately, PostgreSQL maintains an auxiliary file for each table, the visibility map (VM), to speed up deciding whether a heap block contains dead tuples.
VM file structure
The VM stores two bits for each heap page (all-visible and all-frozen), recording respectively whether the page contains dead tuples and whether all tuples on the page are frozen.
If the all-visible bit is set, every tuple on the page is visible to all current and future transactions, so the page needs no vacuuming;
If the all-frozen bit is set, every tuple on the page has been frozen, so the page can be skipped even by a whole-table vacuum.
NOTE: the all-frozen bit may only be set on a page whose all-visible bit has already been set.
A quick look at how the bits are written and updated. The meaning of each bit value:
all-visible bit: 0 ==> the page may contain dead tuples; 1 ==> all tuples are visible, no dead tuples
all-frozen bit: 0 ==> the page contains unfrozen tuples; 1 ==> all tuples are frozen (and visible)
For illustration, take the first byte of a VM page. It covers the first four heap pages (blocks 0 through 3); block 0 occupies the least significant bit pair, and within each pair the low bit is all-visible (0x01) and the high bit is all-frozen (0x02).
The byte in binary (most significant bit first): 01 00 00 00
Reading from the least significant pair upward: the first through third heap pages (pairs 00) may contain dead tuples, while the fourth page (pair 01) is all-visible and contains none.
Scenario: VACUUM runs on the heap table; the dead tuples in the first page (block 0) are removed, so its all-visible bit must be set, and meanwhile every tuple in the fourth page (block 3) becomes frozen.
Reads are performed a byte at a time, so the code obtains the start of the map contents through a char *map pointer and locates a block's all-visible and all-frozen bits via the offset macros shown later.
1 Block 0's bit pair is 00; setting all-visible turns it into 01, giving 01 00 00 01;
2 Block 3's bit pair is 01; setting all-frozen turns it into 11, giving 11 00 00 01.
The same two updates are shown as a runnable sketch right after this list.
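A minimal, self-contained sketch of exactly these two updates. The flag values mirror VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN; the macro names and the example byte are illustrative, not from the source:

#include <stdio.h>
#include <stdint.h>

#define ALL_VISIBLE 0x01    /* low bit of each pair, as VISIBILITYMAP_ALL_VISIBLE */
#define ALL_FROZEN  0x02    /* high bit of each pair, as VISIBILITYMAP_ALL_FROZEN */

int main(void)
{
    /*
     * One VM byte covering heap blocks 0..3; block 0 occupies the least
     * significant bit pair.  Initially only block 3 (the fourth page) is
     * all-visible: binary 01 00 00 00.
     */
    uint8_t vmbyte = ALL_VISIBLE << (3 * 2);

    /* VACUUM cleaned block 0, so set its all-visible bit: 01 00 00 01. */
    vmbyte |= ALL_VISIBLE << (0 * 2);

    /* Every tuple in block 3 is now frozen, so set all-frozen: 11 00 00 01. */
    vmbyte |= ALL_FROZEN << (3 * 2);

    /* Prints: block 3 status = 0x3 (all-visible | all-frozen) */
    printf("block 3 status = 0x%x\n", (vmbyte >> (3 * 2)) & 0x03);
    return 0;
}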
Macros and data structures
/* Number of bits for one heap page */
#define BITS_PER_HEAPBLOCK 2                                  // 2 bits per heap block

/* Flags for bit map */
#define VISIBILITYMAP_ALL_VISIBLE   0x01                      // all_visible
#define VISIBILITYMAP_ALL_FROZEN    0x02                      // all_frozen
#define VISIBILITYMAP_VALID_BITS    0x03    /* OR of all valid visibilitymap
                                             * flags bits */

/*
 * Size of the bitmap on each visibility map page, in bytes. There's no
 * extra headers, so the whole page minus the standard page header is
 * used for the bitmap.
 */
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))     // usable map size within one VM page

/* Number of heap blocks we can represent in one byte */
#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)   // one byte covers 4 heap blocks

/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)   // heap blocks covered by one map page

/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

/* Masks for counting subsets of bits in the visibility map. */
#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555)    /* The lower bit of each
                                                           * bit pair */
#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa)     /* The upper bit of each
                                                           * bit pair */
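These masks let visibilitymap_count (not quoted here) tally all-visible and all-frozen pages with one popcount per 64-bit word instead of examining bit pairs one at a time. A standalone sketch of the idea, using the GCC/Clang popcount builtin as a stand-in for the source's pg_popcount64 and an invented two-word bitmap:

#include <stdint.h>
#include <stdio.h>

#define VISIBLE_MASK64 UINT64_C(0x5555555555555555)  /* low bit of each pair  */
#define FROZEN_MASK64  UINT64_C(0xaaaaaaaaaaaaaaaa)  /* high bit of each pair */

int main(void)
{
    /* Pretend these 64-bit words are a slice of a VM page's bitmap. */
    uint64_t map[2] = { UINT64_C(0xC1),      /* blocks 0 and 3 visible, 3 frozen */
                        UINT64_C(0x5555) };  /* eight all-visible blocks */
    uint64_t nvisible = 0, nfrozen = 0;

    for (int i = 0; i < 2; i++)
    {
        /* One popcount per word counts 32 bit pairs at once. */
        nvisible += __builtin_popcountll(map[i] & VISIBLE_MASK64);
        nfrozen  += __builtin_popcountll(map[i] & FROZEN_MASK64);
    }

    /* Prints: all-visible: 10, all-frozen: 1 */
    printf("all-visible: %llu, all-frozen: %llu\n",
           (unsigned long long) nvisible, (unsigned long long) nfrozen);
    return 0;
}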
// Accessor for file pages that contain no line pointers, which is exactly the case for VM pages
/*
 * PageGetContents
 *		To be used in cases where the page does not contain line pointers.
 *
 * Note: prior to 8.3 this was not guaranteed to yield a MAXALIGN'd result.
 * Now it is. Beware of old code that might think the offset to the contents
 * is just SizeOfPageHeaderData rather than MAXALIGN(SizeOfPageHeaderData).
 */
#define PageGetContents(page) \
    ((char *) (page) + MAXALIGN(SizeOfPageHeaderData))
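Putting the addressing macros together: a quick sketch of where a given heap block's bits live. The concrete numbers assume BLCKSZ = 8192 and MAXALIGN(SizeOfPageHeaderData) = 24, which are typical but build-dependent:

#include <stdio.h>

/* Assumed values: BLCKSZ = 8192 and MAXALIGN(SizeOfPageHeaderData) = 24. */
#define MAPSIZE             (8192 - 24)                      /* 8168 bytes of bitmap */
#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE (8 / BITS_PER_HEAPBLOCK)         /* 4 blocks per byte */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)  /* 32672 blocks per VM page */

int main(void)
{
    unsigned heapBlk = 100000;   /* an arbitrary heap block number */

    unsigned mapBlock  = heapBlk / HEAPBLOCKS_PER_PAGE;
    unsigned mapByte   = (heapBlk % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
    unsigned mapOffset = (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

    /* Prints: VM block 3, byte 496, bit offset 0 */
    printf("VM block %u, byte %u, bit offset %u\n", mapBlock, mapByte, mapOffset);
    return 0;
}

With these numbers a single VM page describes 32672 heap blocks, roughly 255 MB of heap, which is why the VM fork stays tiny relative to the table itself.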
Interface functions
1 visibilitymap_set
The main job of this function is to set visibility bits. Its flow is:
1) Perform sanity checks: verify that the heap buffer and VM buffer passed in are valid and actually contain the expected pages (elog ERROR otherwise);
2) Obtain the address of the VM page contents (PageGetContents skips past the PageHeaderData) and take BUFFER_LOCK_EXCLUSIVE on vmBuf;
3) If the requested bits are not already set, then:
(1) enter a critical section, OR the flags into the right bit pair, and mark vmBuf dirty;
(2) if the relation is WAL-logged and we are not in recovery, write a WAL record; when wal_log_hints (or data checksums) is enabled, the record's LSN must also be stamped on the heap page; finally stamp the LSN on the VM page and leave the critical section;
4) Release the exclusive lock on vmBuf (the pin is kept for the caller). A hedged caller-side sketch follows this list.
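Before the source, a sketch of the protocol the comments below describe: pin the right VM page first, set PD_ALL_VISIBLE on the heap page, then set the map bits. It is loosely modeled on lazy vacuum; the helper name and the all_visible / visibility_cutoff_xid parameters are illustrative, not a verbatim call site:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Illustrative caller; not taken verbatim from the source. */
static void
mark_page_all_visible(Relation rel, BlockNumber blkno, Buffer heapBuf,
                      bool all_visible, TransactionId visibility_cutoff_xid)
{
    Buffer vmbuffer = InvalidBuffer;

    /* Pin the covering VM page up front: visibilitymap_set does no I/O. */
    visibilitymap_pin(rel, blkno, &vmbuffer);

    LockBuffer(heapBuf, BUFFER_LOCK_EXCLUSIVE);
    if (all_visible)
    {
        /* The caller must set PD_ALL_VISIBLE on the heap page first. */
        PageSetAllVisible(BufferGetPage(heapBuf));
        MarkBufferDirty(heapBuf);

        /* InvalidXLogRecPtr: we are in normal running, not recovery. */
        visibilitymap_set(rel, blkno, heapBuf, InvalidXLogRecPtr,
                          vmbuffer, visibility_cutoff_xid,
                          VISIBILITYMAP_ALL_VISIBLE);
    }
    LockBuffer(heapBuf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(vmbuffer);
}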
/*
* visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
* one provided; in normal running, we generate a new XLOG record and set the
* page LSN to that value. cutoff_xid is the largest xmin on the page being
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples. It can also be set
* to InvalidTransactionId when a page that is already all-visible is being
* marked all-frozen.
*
* Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
* this function. Except in recovery, caller should also pass the heap
* buffer. When checksums are enabled and we're not in recovery, we must add
* the heap buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
* any I/O.
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
                  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
                  uint8 flags)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
    Page        page;
    uint8      *map;

#ifdef TRACE_VISIBILITYMAP
    elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
#endif

    Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
    Assert(InRecovery || BufferIsValid(heapBuf));
    Assert(flags & VISIBILITYMAP_VALID_BITS);

    /* Check that we have the right heap page pinned, if present */
    if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
        elog(ERROR, "wrong heap buffer passed to visibilitymap_set");

    /* Check that we have the right VM page pinned */
    if (!BufferIsValid(vmBuf) || BufferGetBlockNumber(vmBuf) != mapBlock)
        elog(ERROR, "wrong VM buffer passed to visibilitymap_set");

    page = BufferGetPage(vmBuf);
    map = (uint8 *) PageGetContents(page);
    LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);

    if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
    {
        START_CRIT_SECTION();

        map[mapByte] |= (flags << mapOffset);
        MarkBufferDirty(vmBuf);

        if (RelationNeedsWAL(rel))
        {
            if (XLogRecPtrIsInvalid(recptr))
            {
                Assert(!InRecovery);
                recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
                                          cutoff_xid, flags);

                /*
                 * If data checksums are enabled (or wal_log_hints=on), we
                 * need to protect the heap page from being torn.
                 */
                if (XLogHintBitIsNeeded())
                {
                    Page    heapPage = BufferGetPage(heapBuf);

                    /* caller is expected to set PD_ALL_VISIBLE first */
                    Assert(PageIsAllVisible(heapPage));
                    PageSetLSN(heapPage, recptr);
                }
            }
            PageSetLSN(page, recptr);
        }

        END_CRIT_SECTION();
    }

    LockBuffer(vmBuf, BUFFER_LOCK_UNLOCK);
}
2 visibilitymap_get_status
This function returns the visibility status bits of a heap block. Its flow is:
1) Check whether *buf is valid; if it is, verify that it holds the map page covering the heap block, and release the pin when it does not;
2) If no valid buffer remains, call vm_readbuf (without extending) to load the VM page; if that returns InvalidBuffer the map page does not exist, so return false, i.e. no bits set;
3) Otherwise read the start of the map contents and extract the status bits at the computed byte and bit offset.
Note that only a pin is needed here; BUFFER_LOCK_SHARE is not taken. A hedged usage sketch follows this list.
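As a usage sketch: VM_ALL_VISIBLE, a convenience macro in visibilitymap.h, wraps visibilitymap_get_status and is how index-only scans consult the map. The counting helper below is invented for illustration; only the macro and the buffer handling follow the real API:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/* Illustrative helper: count the all-visible pages among the first nblocks. */
static BlockNumber
count_all_visible(Relation rel, BlockNumber nblocks)
{
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber nvisible = 0;

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        /* The pinned VM page is reused across consecutive heap blocks. */
        if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
            nvisible++;
    }
    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);
    return nvisible;
}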
/*
* visibilitymap_get_status - get status of bits
*
* Are all tuples on heapBlk visible to all or are marked frozen, according
* to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
* releasing *buf after it's done testing and setting bits.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
* since we don't lock the visibility map page either, it's even possible that
* someone else could have changed the bit just before we look at it, but yet
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
uint8
visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
    char       *map;
    uint8       result;

#ifdef TRACE_VISIBILITYMAP
    elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif

    /* Reuse the old pinned buffer if possible */
    if (BufferIsValid(*buf))
    {
        if (BufferGetBlockNumber(*buf) != mapBlock)
        {
            ReleaseBuffer(*buf);
            *buf = InvalidBuffer;
        }
    }

    if (!BufferIsValid(*buf))
    {
        *buf = vm_readbuf(rel, mapBlock, false);
        if (!BufferIsValid(*buf))
            return false;
    }

    map = PageGetContents(BufferGetPage(*buf));

    /*
     * A single byte read is atomic. There could be memory-ordering effects
     * here, but for performance reasons we make it the caller's job to worry
     * about that.
     */
    result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
    return result;
}
3 vm_readbuf
vm_readbuf loads the requested VM page into the buffer pool; when asked to, it extends the fork with newly created, initialized pages. Its usual caller is sketched below, followed by the function itself.
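vm_readbuf is static; outside code reaches it through visibilitymap_pin (referenced earlier but not quoted in this article) with extend = true, while visibilitymap_get_status passes extend = false. A simplified rendering of the pin function, reconstructed from the source, so treat the details as approximate:

/* Simplified sketch of visibilitymap_pin (see visibilitymap.c for the original). */
void
visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);

    /* Reuse the previously pinned map page if it is the right one. */
    if (BufferIsValid(*buf))
    {
        if (BufferGetBlockNumber(*buf) == mapBlock)
            return;
        ReleaseBuffer(*buf);
    }
    /* extend = true: create the map page if it does not exist yet. */
    *buf = vm_readbuf(rel, mapBlock, true);
}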
/*
* Read a visibility map page.
*
* If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
* true, the visibility map file is extended.
*/
static Buffer
vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
{
    Buffer      buf;
    SMgrRelation reln;

    /*
     * Caution: re-using this smgr pointer could fail if the relcache entry
     * gets closed. It's safe as long as we only do smgr-level operations
     * between here and the last use of the pointer.
     */
    reln = RelationGetSmgr(rel);

    /*
     * If we haven't cached the size of the visibility map fork yet, check it
     * first.
     */
    // First check whether the size of the VM fork is already cached
    if (reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber)
    {
        if (smgrexists(reln, VISIBILITYMAP_FORKNUM))    // if the fork exists, smgrnblocks caches its size
            smgrnblocks(reln, VISIBILITYMAP_FORKNUM);
        else
            reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
    }

    /* Handle requests beyond EOF */
    // The requested block lies beyond the fork's current end: extend via
    // vm_extend when asked to, otherwise return InvalidBuffer
    if (blkno >= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
    {
        if (extend)
            vm_extend(rel, blkno + 1);
        else
            return InvalidBuffer;
    }

    /*
     * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
     * always safe to clear bits, so it's better to clear corrupt pages than
     * error out.
     *
     * The initialize-the-page part is trickier than it looks, because of the
     * possibility of multiple backends doing this concurrently, and our
     * desire to not uselessly take the buffer lock in the normal path where
     * the page is OK. We must take the lock to initialize the page, so
     * recheck page newness after we have the lock, in case someone else
     * already did it. Also, because we initially check PageIsNew with no
     * lock, it's possible to fall through and return the buffer while someone
     * else is still initializing the page (i.e., we might see pd_upper as set
     * but other page header fields are still zeroes). This is harmless for
     * callers that will take a buffer lock themselves, but some callers
     * inspect the page without any lock at all. The latter is OK only so
     * long as it doesn't depend on the page header having correct contents.
     * Current usage is safe because PageGetContents() does not require that.
     */
    // Normal path: read the VM page into a shared buffer. If the page looks
    // NEW, take BUFFER_LOCK_EXCLUSIVE and re-check PageIsNew, because another
    // backend may have initialized the page while we were waiting for the lock
    buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
                             RBM_ZERO_ON_ERROR, NULL);
    if (PageIsNew(BufferGetPage(buf)))
    {
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        if (PageIsNew(BufferGetPage(buf)))
            PageInit(BufferGetPage(buf), BLCKSZ, 0);
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }
    return buf;
}
4 vm_extend
When the requested VM page does not yet exist in the file, vm_extend is called to extend the fork with new pages and initialize them. Its flow is:
1) Initialize a blank page image, filling in the PageHeader fields (pd_lower, pd_upper and the flags);
2) Take the relation extension lock, so no other backend performs the same extension concurrently;
3) If the VM fork does not exist yet, create it with smgrcreate; otherwise continue with step 4);
4) Read the current number of VM blocks; while it is still below the requested vm_nblocks, append zero-initialized pages one at a time with smgrextend;
5) Send a shared-invalidation message that forces other backends to drop their smgr references to this relation; this is an optimization that spares them from repeatedly checking for creation or extension of the file, which happens only infrequently;
6) Finally release the extension lock.
/*
* Ensure that the visibility map fork is at least vm_nblocks long, extending
* it if necessary with zeroed pages.
*/
static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
    BlockNumber vm_nblocks_now;
    PGAlignedBlock pg;
    SMgrRelation reln;

    PageInit((Page) pg.data, BLCKSZ, 0);

    /*
     * We use the relation extension lock to lock out other backends trying to
     * extend the visibility map at the same time. It also locks out extension
     * of the main fork, unnecessarily, but extending the visibility map
     * happens seldom enough that it doesn't seem worthwhile to have a
     * separate lock tag type for it.
     *
     * Note that another backend might have extended or created the relation
     * by the time we get the lock.
     */
    LockRelationForExtension(rel, ExclusiveLock);

    /*
     * Caution: re-using this smgr pointer could fail if the relcache entry
     * gets closed. It's safe as long as we only do smgr-level operations
     * between here and the last use of the pointer.
     */
    reln = RelationGetSmgr(rel);

    /*
     * Create the file first if it doesn't exist. If smgr_vm_nblocks is
     * positive then it must exist, no need for an smgrexists call.
     */
    if ((reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
         reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
        !smgrexists(reln, VISIBILITYMAP_FORKNUM))
        smgrcreate(reln, VISIBILITYMAP_FORKNUM, false);

    /* Invalidate cache so that smgrnblocks() asks the kernel. */
    reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;

    vm_nblocks_now = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);

    /* Now extend the file */
    while (vm_nblocks_now < vm_nblocks)
    {
        PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);

        smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
        vm_nblocks_now++;
    }

    /*
     * Send a shared-inval message to force other backends to close any smgr
     * references they may have for this rel, which we are about to change.
     * This is a useful optimization because it means that backends don't have
     * to keep checking for creation or extension of the file, which happens
     * infrequently.
     */
    CacheInvalidateSmgr(reln->smgr_rnode);

    UnlockRelationForExtension(rel, ExclusiveLock);
}