Skip to content

oldoldman/f2fs-design

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 

Repository files navigation

Summary

this repo is notes of Linux f2fs file system in my preparation of porting f2fs to Windows.

Table of Contents

  1. F2FS
    1. disk layout
    2. checkpoint
    3. NAT/SIT/SSA
    4. node
    5. node config, file size, etc.
  2. Linux Implementation
    1. Locks
    2. Node Manager
    3. Segment Manager
    4. Main Processes
      1. nid allocation
      2. block allocation
      3. checking point
      4. victim selection
      5. segment selection
      6. gc
  3. Porting Gap
    1. Gap1

F2FS

disk layout

figuredescription
  1. meta data
    • Super Block
    • Checkpoint
    • SIT, segment information table
    • NAT, node address table
    • SSA, segment summary area
  2. frequently accessed meta datas(Super Block / Checkpoint / SIT / NAT, except SSA) are versioned(there are 2 versions)
    • the purpose of meta data versioning is to balance the wirte of meta area
    • version switching happened at checking point time
  3. Super Block, has following important information
    • magic number, is used to recognize f2fs volume
    • major and minor version of f2fs
    • sector size, block size
    • sectors per block, blocks per segment (512), segments per section (usually 1, settable by mkfs.f2fs), sections per zone (usually 1)
  4. SIT is basicly an array of SIT entries , indexed by Segment No
  5. SIT entry has following information
    • allocated block count of the segment
    • bitmap of allocated blocks
    • segment temperature
    • the average access time of the segment
    • refer to nat/sit/ssa for detail
  6. NAT is basicly an array of NAT entries , indexed by Node ID
  7. NAT entry has following information
    • version
    • I Node identifier
    • block address
    • refer to nat/sit/ssa for detail
  8. SSA is basicly an array of SSA entry, indexed by Segment No. the size of SSA entry is 4K
  9. SSA entry has following information
    • summary entries
    • journal
    • segment type
    • check sum
    • refer to nat/sit/ssa for detail

checkpoint

figuredescription
  1. Header and Footer, they are indentical if this is a valid checkpoint, have following information
    • size (in unit of block) of checkpoint : from Header to Footer
    • current segments : segment No, next block offset, allocation type etc.
    • NAT version bitmap
    • SIT version bitmap
  2. Payload, if Header can not accommodate the SIT version bitmap, it will be stored in Payload, or else this area is empty
  3. Orphan inode, if any, or else this area is empty
  4. DataSummary, summaries of the current data segments, has 2 formats
    • compact format : if size of NAT journal + SIT journal + SSA summary entries is less than 3 blocks
      1. NAT journal
      2. SIT journal
      3. SSA summary entries for hot data
      4. SSA summary entries for warm data
      5. SSA summary entries for cold data
    • normal format
      1. SSA entry for hot data
      2. SSA entry for warm data
      3. SSA entry for cold data
  5. NodeSummary : summaries of the current node segments if checking point reason is CP_UMOUNT or CP_FASTBOOT, or else this area is empty
    1. SSA entry for hot node
    2. SSA entry for warm node
    3. SSA entry for cold node
  6. NATBits, if enabled, or else this area is empty

NAT/SIT/SSA

figuredescription
  1. NAT is organized in 4K sized blocks, an NAT block accommodates 455 NAT entries
  2. Version, every time the BlkAddr is changed from non-NULL_ADDR to NULL_ADDR , Version will increase by 1
  3. INO
  4. BlkAddr
    • NULL_ADDR, the NAT entry is free for allocating
    • NEW_ADDR, the NAT entry is allocated but node is not allocated
    • COMPRESS_ADDR
    • Used (none of the above), the NAT entry is allocated and node is allocated, BlkAddr is the address of the node
  1. SIT is organized in 4K sized blocks, a SIT block accommodates 55 SIT entries
  2. Blocks, is consist of the following components
    • bits 0-9, is the allocated block count in the segment
    • bits 10-15, is the segment temperature : hot/warm/cold
  3. BlockBitmap, every allocated block in this segment has its bit set in this bitmap
  4. Mtime, average access time of the segment
    • when a block in this segment is allocated or freed, the access time is calculated and averaged with Mtime
    • is used in victim segment selection
  1. SSA summary entry is the summary of a block in a segment, there are 512 such entries
    • NID
    • Version
      • copy of NAT entry version of Direct Node if this is a data block
      • 0 if this is a node block
    • Offset
      • offset in Direct Node if this is a data block
      • 0 if this is a node block
  2. Footer
    • Type: data or node. f2fs does not mix data block and node block in the same segment

node

figuredescription
  1. NID is the node identifier
  2. INO is the I Node identifier the node belong to. If NID==INO this is an I Node
  3. Flag, is consist of the following components
    • bit 0 : cold flag
    • bit 1 : fsync flag
    • bit 2 : dentry flag
    • bit 3-31 : offset, refer to node config for detail
  4. CkpVer
  5. NextBlkAddr
  1. InlineDataAddrs, this is an array of 923 entries , each entry is 4 bytes
    • optionally, this array can be used to store extra attributes inline (ExtraAttr)
    • optionally, this array can also be used to store extension attributes inline (InlineXAttr)
    • the other entries (InlineDataAddrs) are used to store addresses of inlined data or inlined data
  2. NID, if file size can not fit in InlineDataAddrs completely, f2fs will allocate additional nodes, refer to node config
  1. a direct node has 1018 block address entries , it will cover 1018*4K bytes of data
  1. an indirect node has 1018 nid entries, it will cover 1018 direct nodes or 1018 indirect nodes. in the former an indirect node will cover 1018*1018*4K bytes of data, in the latter, an indirect node will cover 1018*1018*1018*4K bytes of data

node config

figuredescription
  1. f2fs use 5 configurations to accommodate different file sizes, this makes up a tree structure
    • blue circle is I Node (file)
    • light green circle is Indirect Node (I)
    • green circle is Direct Node (D)
  2. file size and configuration(Isz is the size that can be inlined, N==1018)
  3. file sizenode config (cumulative)NID Index
    <=IszI Node-
    <=Isz+N*4KD0
    <=Isz+N*4K*2D1
    <=Isz+N*N*4KI+D2
    <=Isz+N*N*4K*2I+D3
    <=Isz+N*N*N*4KI+I+D4
  4. for example, if file size is between Isz and Isz+N*4K, f2fs will allocate the first Direct Node. portion of the file size equal to Isz is covered by I Node, portion greater than Isz is covered by the first Direct Node.
  5. if files size is between Isz+N*4K and Isz+N*4K*2, f2fs will allocate the first and second Direct Node, etc.
  6. there are 2 kind of offsets in a file : node offset and data offset
    • node offset, is the numbering of the tree structure from top to down and left to right : the offset of I Node is 0, offset of the first and second Direct Node are 1 and 2, offset of the first and second Indirect Node are 3 and 4+1018, and so on...
    • data offset

Linux Implementation

Locks

nametypedescription
cp_lockspindescription
cp_global_semRWdescription
cp_rwsemRWwrite, when checking point
read, when fs operation
locks on Node Manager-Node Manager
locks on Segment Manager-Segment Manager

Node Manager

figuredescription
  1. locks
  2. nametypeprotecting
    nid_list_lockspinFreeNID Cache
    FreeNIDBitmaps
    nat_list_lockspinNatE Cache entry LRU
    nat_tree_lockRWNatE Cache
    build_lockmutexNATBlockBitmap
  3. data flows
    • green arrow is the data flow of FreeNID Cache building
    • red arrow is the data flow of checkpoiting
    • light blue arrow is the data flow of NatE Cache loading
  4. FreeNIDBitmaps, an array of bitmap, indexed by NAT block No, each bitmap is indexed by offset of nid in NAT block
    • is guarded by NATBlockBitmap : to update bit in bitmap of an NAT block, bit of which in NATBlockBitmap must be set
    • updated in NAT scanning (the light green arrow)
    • updated from NATBits (the blue arrow)
    • updated at checking point time
    • updated at pre-allocation time (clear)
    • updated in fail function call (set)
    • used in run time FreeNID Cache building
  5. NATBits, is consist of 2 bitmaps : FullBitmap and EmptyBitmap, both indexed by NAT block No
    • is enabled when CP_NAT_BITS_FLAG is set
    • if an NAT block is empty (all NAT entries are free, NULL_ADDR), the bit in EmptyBitmap will be set
    • if an NAT block is full (all NAT entries are used, non-NULL_ADDR), the bit in FullBitmap will be set
    • a dirty NAT block will has its bit set neither in EmptyBitmap nor in FullBitmap
    • updated from FreeNIDBitmaps (the purple arrow) the first time NATBits is enabled
    • updated at checking point time
  6. NATBlockBitmap, indexed by NAT block No
    • used as a guard for updating FreeNIDBitmaps
    • used as NAT scanning hint : NAT block set in this bitmap will be skipped
    • updated in process of scanning NAT (orange arrow)
    • updated from NATBits (orange arrow)
  7. FreeNID Cache
    • updated at mount time : f2fs will scan the NAT and NAT Journal
    • updated at run time : when there is not enough free nid, f2fs will scan the FreeNIDBitmaps, NAT Jounal and NAT
    • updated at checking point time : if an NatE Cache entry becomes free
    • updated when f2fs decide that there are too many FreeNID Cache, portion of the entries will be deleted, but the bits of which in FreeNIDBitmaps will keep set, in case of there is not enough FreeNID Cache, the deleted entries can be brought back quickly by scanning FreeNIDBitmaps
  8. NatE Cache
    • an NatE Cache entry is added on-demand
      • added in the code path of f2fs_get_node_info
      • added in the code path of set_node_addr
    • an entry becomes dirty when its address changed
  1. FreeNID Cache entries are organized in a balanced tree with node id as its key
  2. at the same time, these entries are linked into a list, entry will be allocated from the head of the list. the allocation is divided into 2 stages
    1. the first stage is pre-allocation stage : entry is deleted from the list (1)
    2. the second stage is succeeding/failing stage
      • if f2fs decide that the pre-allocation is succeeded, it will call the done function, node manager will delete the entry from the tree (3)
      • if f2fs decide that the pre-allocation is failed, it will call the fail function, node manager will delete the entry from the tree(3) or append it back to the list (2)

Segment Manager

figuredescription
  1. locks
  2. nametypeprotect
    segmap_lockspinFreeSegBitmap
    FreeSecBitmap
    sentry_lockRWSitE Cache
    journal_rwsemRWNAT/SIT journal in CurSegs[n]
    seglist_lockmutexDirtySegBitmaps
    DirtySecBitmap
    VictimSecBitmap
    curseg_mutexmutexCurSegs[n]
  3. data flows
    • red arrow is the data flow of checkpointing
    • green arrow is the data flow of SitE Cache loadinig
    • light blue arrow is the data flow of segment to section bitmap rollup / dirty SitE bitmap update
    • blue arrow is the data flow of segment switching
    • orange arrow is the data flow of dirty segment update
  4. DirtySegBitmaps is consist of 8 bitmaps, all are indexed by segment No
    • the DIRTY bitmap, is the union of the following 6 sub-DIRTY bitmaps, any 2 sub-DIRTY bitmaps have no intersection
      • hot data dirty bitmap
      • warm data dirty bitmap
      • cold data dirty bitmap
      • hot node dirty bitmap
      • warm node dirty bitmap
      • cold node dirty bitmap
    • the PRE bitmap, the prefreed segment has its bit set in this bitmap. dirty segment may become free in the process of block address release or GC. these kind of segment is recorded in this PRE bitmap. at checking point time, the prefreed segment(s) is moved to free segment(s) (PRE DirtySegBitmap->FreeSegBitmap)
  5. CurSegs, an array of current segments, f2fs allocate node or data from current segments, there are 8 types of current segment
    • hot data
    • warm data
    • cold data
    • hot node, from which Direct Node is allocated
    • warm node
    • cold node, from which Indriect Node is allocated
    • pinned cold data
    • atgc

main processes

nid allocation

block allocation

checking point

victim selection

segment selection

when there is no free block for a current segment, f2fs will select a new segment. if there is need for SSR(Slack Space Recycling), f2fs will select from dirty segments, or else f2fs will select from free segments.

free segment selection

dirty segment selection

gc

gc is the process of cleaning dirty segment, blocks are moved out from dirty segment to other segments, which makes more free segments.

Porting Gap

Gap1

  • f2fs use page cache to read node into memory , and will lock the page cache from concurrent access. it has the exclusive semantics.
  • for example, in f2fs_get_dnode_of_data, f2fs will read I Node , if needed, f2fs will allocate a Direct Node, and fill its nid into I Node. if the page cache is not locked, 2 thread might overwrite each other.
  • Windows lacks similar facility.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published