Multi-File-Container-Spec-V5/specification_V5.md
2025-12-13 13:50:28 +08:00

MFPK-ENC-V5 Format Specification

Encrypted Multi-File Container Format (Binary V5)

1. Introduction

MFPK-ENC-V5 is a compact binary container format for storing files and directories with authenticated encryption. The format features:

  • Argon2id key derivation for modern, memory-hard security.
  • Compact binary structure optimized for streaming operations.
  • Authenticated encryption using AES-256-GCM.
  • Strong resistance against GPU/ASIC-based attacks.
  • Encrypted metadata including file paths and timestamps.

2. Notational Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document indicate requirement levels.

All multi-byte integers are network byte order (big-endian).

3. Cryptographic Primitives

  • Key Derivation: Argon2id
    • Salt length: 32 bytes
    • Output length: 32 bytes (256 bits)
    • Iterations: 3
    • Memory cost: 64 MiB (65536 KiB)
    • Lanes: 4
    • Password encoding: UTF-8
  • AEAD: AES-256-GCM
    • IV (nonce) length: 12 bytes
    • Tag length: 16 bytes
    • Associated Data (AAD): none in this version
  • Randomness: IVs and salts MUST be generated using a CSPRNG.
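
As a minimal sketch of the randomness requirement, Python's standard `secrets` module draws from the operating system's CSPRNG. The helper names `new_salt` and `new_iv` are illustrative and not part of the format.

```python
import secrets

SALT_SIZE = 32  # bytes, Argon2id salt stored in the global header
IV_SIZE = 12    # bytes, AES-GCM IV (nonce)


def new_salt() -> bytes:
    """Return a fresh random salt for a newly created container."""
    return secrets.token_bytes(SALT_SIZE)


def new_iv() -> bytes:
    """Return a fresh random IV; call once per encrypted field or chunk."""
    return secrets.token_bytes(IV_SIZE)
```

An IV returned by `new_iv` should never be cached or reused for a second encryption under the same key.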

4. Container Overview

A container is a binary file consisting of:

  • A fixed global header containing magic, version, salt, and a password verification record.
  • A sequence of entries (files or directories). Each entry begins with a 4-byte binary sync word allowing scanning and recovery.

The format is append-only. Removal and password change are performed by rewriting to a new file.

5. Global Header (fixed size: 72 bytes)

Offset  Size  Description
0       4     MAGIC_VERSION = 0x89 'M' 'F' 0x05 (bytes: 89 4D 46 05)
4       32    SALT (32 bytes)
36      12    PWV_IV (AES-GCM 12-byte IV for password verification)
48      24    PWV_CT (AES-GCM ciphertext+tag of an 8-byte marker)

Notes:

  • The password verification plaintext marker is exactly 8 bytes: PWV_MARKER = 50 57 56 35 4D 41 52 4B (ASCII "PWV5MARK"). This marker is NEVER stored in plaintext; only the IV and the ciphertext+tag are stored.
  • Total header size = 4 + 32 + 12 + 24 = 72 bytes.
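
The header layout maps directly onto a fixed-size binary structure. The following is a sketch of a parser using Python's `struct` module; `GlobalHeader` and `parse_global_header` are hypothetical names chosen for illustration.

```python
import struct
from typing import NamedTuple

# magic(4) | salt(32) | PWV_IV(12) | PWV_CT(24) = 72 bytes
HEADER_STRUCT = struct.Struct(">4s32s12s24s")
MAGIC_VERSION = bytes.fromhex("894D4605")


class GlobalHeader(NamedTuple):
    salt: bytes
    pwv_iv: bytes
    pwv_ct: bytes  # ciphertext (8 bytes) followed by the GCM tag (16 bytes)


def parse_global_header(data: bytes) -> GlobalHeader:
    """Split the first 72 bytes of a container into its header fields."""
    if len(data) < HEADER_STRUCT.size:
        raise ValueError("truncated global header")
    magic, salt, pwv_iv, pwv_ct = HEADER_STRUCT.unpack_from(data, 0)
    if magic != MAGIC_VERSION:
        raise ValueError("not an MFPK-ENC-V5 container")
    return GlobalHeader(salt, pwv_iv, pwv_ct)
```

Parsing the header needs no key material; the password is only required once PWV_CT is decrypted.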

6. Entry Record

Each entry begins with a 4-byte sync word followed by a compact entry header describing the lengths of encrypted fields and the logical file size. All path strings are encrypted; no plaintext path appears in the container.

Constants:

  • SYNC_WORD = A4 45 4E 54 (bytes: 0xA4, 'E', 'N', 'T')
  • ENTRY_TYPE: 1 byte
    • 0x00 = file
    • 0x01 = directory
  • CHUNK_SIZE = 1,048,576 bytes (1 MiB)
  • IV_SIZE = 12 bytes, TAG_SIZE = 16 bytes

Entry header layout (all big-endian):

Offset    Size  Field
0         4     SYNC_WORD (A4 45 4E 54)
4         1     ENTRY_TYPE (0x00 file, 0x01 dir)
5         3     RESERVED (MUST be zero)
8         4     FULLPATH_LEN (uint32) - length of ENCRYPTED_FULL_PATH
12        8     SIZE (uint64) - logical plaintext file size in bytes. For directories, SIZE MUST be 0
20        4     BASEPATH_LEN (uint32) - length of ENCRYPTED_BASE_PATH
24        2     TS_LEN (uint16) - length of ENCRYPTED_TIMESTAMP (0 for dir)
26        6     RESERVED (MUST be zero)
32        N1    ENCRYPTED_FULL_PATH (N1 bytes)
32+N1     N2    ENCRYPTED_BASE_PATH (N2 bytes)
32+N1+N2  N3    ENCRYPTED_TIMESTAMP (N3 bytes, N3 == 0 for directories)
...       ...   CONTENT (files only): sequence of chunks

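Encrypted fields use AES-GCM and are stored as IV || CIPHERTEXT+TAG, where the IV is 12 bytes and the tag is 16 bytes. The ciphertext has the same length as the plaintext, so each stored field occupies plaintext_len + 28 bytes; this stored length is the value recorded in FULLPATH_LEN, BASEPATH_LEN, and TS_LEN.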
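
The fixed 32-byte part of an entry can be decoded with a single `struct` format, and each encrypted field can then be split into its parts. This is a sketch only: the names `EntryHeader`, `parse_entry_header`, and `split_encrypted_field` are illustrative, and the code assumes the tag is the trailing 16 bytes of a field, as the IV || CIPHERTEXT+TAG notation suggests.

```python
import struct
from typing import NamedTuple, Tuple

SYNC_WORD = bytes.fromhex("A4454E54")
# sync(4) type(1) reserved(3) FULLPATH_LEN(u32) SIZE(u64) BASEPATH_LEN(u32) TS_LEN(u16) reserved(6)
ENTRY_HEADER = struct.Struct(">4sB3sIQIH6s")
IV_SIZE, TAG_SIZE = 12, 16


class EntryHeader(NamedTuple):
    entry_type: int
    fullpath_len: int
    size: int
    basepath_len: int
    ts_len: int


def parse_entry_header(buf: bytes, offset: int = 0) -> EntryHeader:
    """Decode the 32 fixed bytes of an entry starting at `offset`."""
    sync, etype, rsv1, n1, size, n2, n3, rsv2 = ENTRY_HEADER.unpack_from(buf, offset)
    if sync != SYNC_WORD:
        raise ValueError("missing sync word")
    if etype not in (0x00, 0x01):
        raise ValueError("unknown entry type")
    if any(rsv1) or any(rsv2):
        raise ValueError("reserved bytes must be zero")
    return EntryHeader(etype, n1, size, n2, n3)


def split_encrypted_field(field: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split a stored field into (iv, ciphertext, tag); assumes the tag comes last."""
    if len(field) < IV_SIZE + TAG_SIZE:
        raise ValueError("encrypted field too short")
    return field[:IV_SIZE], field[IV_SIZE:-TAG_SIZE], field[-TAG_SIZE:]
```

For example, the one-character path "/" is stored as a 29-byte field: 12 bytes of IV, 1 byte of ciphertext, and 16 bytes of tag.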

7. File Content Chunking

  • Files are stored as a sequence of chunks after the entry header.
  • For each plaintext chunk of up to CHUNK_SIZE bytes:
    • Generate a fresh, random 12-byte IV.
    • Store IV || AES-GCM(key, chunk), where the AES-GCM output (ciphertext followed by the 16-byte tag) is chunk_len + TAG_SIZE bytes long.
  • To skip content without decryption:
    num_chunks = ceil(SIZE / CHUNK_SIZE)
    encrypted_size = SIZE + num_chunks * (IV_SIZE + TAG_SIZE)
    
    and advance the file pointer by encrypted_size bytes.
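
The skip calculation can be written with integer arithmetic only, which avoids floating-point rounding for large SIZE values. The function name `encrypted_content_size` is an illustrative choice.

```python
CHUNK_SIZE = 1_048_576  # 1 MiB of plaintext per chunk
IV_SIZE = 12
TAG_SIZE = 16


def encrypted_content_size(size: int) -> int:
    """Bytes occupied on disk by the content of a file with plaintext SIZE `size`."""
    if size < 0:
        raise ValueError("size must be non-negative")
    num_chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE  # integer ceil(size / CHUNK_SIZE)
    return size + num_chunks * (IV_SIZE + TAG_SIZE)
```

Under this formula a file with SIZE = 0 has zero chunks and therefore no content bytes, while a file of exactly 1 MiB occupies 1,048,576 + 28 bytes and one extra plaintext byte adds a second chunk with its own 28 bytes of overhead.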

8. Paths and Timestamps

  • FULL_PATH format: POSIX-like absolute path, e.g., "/dir/a.txt".
  • BASE_PATH is the parent directory path of FULL_PATH.
  • All paths are UTF-8 strings and are stored only in encrypted form.
  • TIMESTAMP (files only): the Unix mtime encoded as a big-endian IEEE-754 float64 (8 bytes), then encrypted as a single AES-GCM blob of TS_LEN bytes (12 + 8 + 16 = 36 bytes).
  • Directories MUST have TS_LEN = 0 and no content.
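
The plaintext forms of these fields, before encryption, can be sketched as follows. The helper names are illustrative, and the code assumes that FULL_PATH carries no trailing slash except for the root itself.

```python
import posixpath
import struct


def base_path_of(full_path: str) -> str:
    """Parent directory of a POSIX-like absolute path ("/" is its own parent)."""
    if not full_path.startswith("/"):
        raise ValueError("FULL_PATH must be absolute")
    return posixpath.dirname(full_path)


def encode_timestamp(mtime: float) -> bytes:
    """Unix mtime as an 8-byte big-endian IEEE-754 float64 (plaintext form)."""
    return struct.pack(">d", mtime)


def decode_timestamp(plain: bytes) -> float:
    """Inverse of encode_timestamp, applied after decryption."""
    (mtime,) = struct.unpack(">d", plain)
    return mtime
```

Paths are encoded with `full_path.encode("utf-8")` before encryption, so the length of a stored path field depends on the encoded byte count rather than the character count.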

9. Root Directory Entry

A well-formed container SHOULD include an explicit root directory entry, written when the container is initialized:

  • ENTRY_TYPE = 0x01 (directory)
  • FULL_PATH = "/"
  • SIZE = 0
  • BASE_PATH = "/"
  • TS_LEN = 0 (no timestamp)
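
Serializing the root entry might look like the sketch below. It takes the two already-encrypted path fields as input, because encryption itself is outside the scope of this example; `build_root_entry` is a hypothetical helper name.

```python
import struct

SYNC_WORD = bytes.fromhex("A4454E54")
ENTRY_TYPE_DIRECTORY = 0x01
ENTRY_HEADER = struct.Struct(">4sB3sIQIH6s")  # 32-byte fixed entry header


def build_root_entry(enc_full_path: bytes, enc_base_path: bytes) -> bytes:
    """Serialize the root directory entry from two encrypted "/" path fields."""
    header = ENTRY_HEADER.pack(
        SYNC_WORD,
        ENTRY_TYPE_DIRECTORY,
        bytes(3),             # RESERVED, must be zero
        len(enc_full_path),   # FULLPATH_LEN
        0,                    # SIZE is 0 for directories
        len(enc_base_path),   # BASEPATH_LEN
        0,                    # TS_LEN is 0 for directories
        bytes(6),             # RESERVED, must be zero
    )
    return header + enc_full_path + enc_base_path
```

Since "/" is a single UTF-8 byte, each encrypted path field is 12 + 1 + 16 = 29 bytes, which makes the complete root entry 32 + 29 + 29 = 90 bytes long.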

10. Indexing and Scanning

  • Readers MAY scan sequentially starting after the 72-byte header:
    • Read 4 bytes; if they are not SYNC_WORD, resynchronize by scanning forward to the next occurrence of SYNC_WORD.
    • Read ENTRY_TYPE and header fields (FULLPATH_LEN, SIZE, BASEPATH_LEN, TS_LEN) and then the corresponding encrypted byte ranges.
    • Decrypt FULL_PATH and BASE_PATH to reconstruct the directory tree.
    • For files, decrypt TIMESTAMP to obtain mtime.
    • Compute content start position immediately after ENCRYPTED_TIMESTAMP.
    • For files, skip content using the SIZE-based calculation above.
  • Decryption is REQUIRED only to recover plaintext paths and (for files) timestamps. Content decryption is not required for indexing.
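
The scanning procedure can be sketched without any decryption, since locating entries only needs the plaintext header fields. The example below works on an in-memory buffer for brevity (a real reader would more likely seek within a file), and `IndexedEntry` and `scan_entries` are illustrative names.

```python
import struct
from typing import Iterator, NamedTuple

HEADER_SIZE = 72
SYNC_WORD = bytes.fromhex("A4454E54")
ENTRY_HEADER = struct.Struct(">4sB3sIQIH6s")
CHUNK_SIZE = 1_048_576
IV_SIZE, TAG_SIZE = 12, 16


class IndexedEntry(NamedTuple):
    offset: int           # position of the entry's sync word
    entry_type: int       # 0x00 file, 0x01 directory
    size: int             # plaintext SIZE from the header
    enc_full_path: bytes  # still encrypted
    enc_base_path: bytes  # still encrypted
    enc_timestamp: bytes  # still encrypted, empty for directories
    content_offset: int   # where the chunk sequence starts


def scan_entries(data: bytes) -> Iterator[IndexedEntry]:
    """Yield every well-formed entry, resynchronizing past damaged regions."""
    pos = HEADER_SIZE
    while True:
        # A no-op when already aligned on a sync word; a forward resync otherwise.
        pos = data.find(SYNC_WORD, pos)
        if pos < 0 or pos + ENTRY_HEADER.size > len(data):
            return  # no further entry, or a truncated trailing header
        _, etype, rsv1, n1, size, n2, n3, rsv2 = ENTRY_HEADER.unpack_from(data, pos)
        p1 = pos + ENTRY_HEADER.size
        p2, p3 = p1 + n1, p1 + n1 + n2
        content = p3 + n3
        num_chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
        end = content + size + num_chunks * (IV_SIZE + TAG_SIZE)
        well_formed = (
            etype in (0x00, 0x01)
            and not any(rsv1) and not any(rsv2)
            and (etype == 0x00 or (size == 0 and n3 == 0))
            and end <= len(data)  # all declared lengths must be present
        )
        if not well_formed:
            pos += 1  # skip this candidate and keep looking
            continue
        yield IndexedEntry(pos, etype, size, data[p1:p2], data[p2:p3],
                           data[p3:content], content)
        pos = end
```

Because encrypted bytes are effectively random, a resynchronizing scan can land on a byte sequence that only looks like a sync word; the header sanity checks reduce that risk but do not eliminate it, and a failed tag check on the path fields is the final arbiter.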

11. Password Verification

  • To verify a password:
    • Read SALT from the header.
    • Derive a key via Argon2id with parameters:
      • iterations = 3
      • memory_cost = 65536 (64 MiB)
      • lanes = 4
      • salt = SALT
      • length = 32 bytes
    • Read PWV_IV and PWV_CT from the header and attempt decryption.
    • Successful decryption yielding the 8-byte PWV_MARKER indicates the password is correct.
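
The verification flow can be sketched independently of any particular cryptography library by passing the two primitives in as callables. Both parameters are assumptions of this example: `derive_key` is expected to run Argon2id with the parameters listed above, and `aead_decrypt` is expected to perform AES-256-GCM decryption with no AAD, returning None when the tag does not verify. In practice these might wrap `hash_secret_raw` from argon2-cffi and `AESGCM` from the `cryptography` package.

```python
import hmac
from typing import Callable, Optional

PWV_MARKER = b"PWV5MARK"
HEADER_SIZE = 72

# (password_utf8, salt) -> 32-byte key
DeriveKey = Callable[[bytes, bytes], bytes]
# (key, iv, ciphertext_and_tag) -> plaintext, or None on authentication failure
AeadDecrypt = Callable[[bytes, bytes, bytes], Optional[bytes]]


def verify_password(header: bytes, password: str,
                    derive_key: DeriveKey,
                    aead_decrypt: AeadDecrypt) -> Optional[bytes]:
    """Return the derived key if the password is correct, otherwise None."""
    if len(header) < HEADER_SIZE:
        raise ValueError("truncated global header")
    salt = header[4:36]      # SALT
    pwv_iv = header[36:48]   # PWV_IV
    pwv_ct = header[48:72]   # PWV_CT: 8-byte ciphertext + 16-byte tag
    key = derive_key(password.encode("utf-8"), salt)
    plaintext = aead_decrypt(key, pwv_iv, pwv_ct)
    if plaintext is not None and hmac.compare_digest(plaintext, PWV_MARKER):
        return key
    return None
```

Returning the key on success lets the caller reuse it for the entries without running the memory-hard derivation a second time.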

12. Error Handling and Robustness

  • Readers SHOULD tolerate trailing or partial entries by:
    • Resynchronizing on SYNC_WORD when an entry parse fails.
    • Validating that all declared lengths are present.
    • Treating incomplete or malformed entries as end-of-container or skipping them safely.
  • Writers SHOULD flush after chunk writes to reduce risk of partial loss on interruption.
  • Removal operations SHOULD rewrite to a temporary file and replace on success to maintain atomicity.
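
One way to realize the rewrite-and-replace recommendation is to write the new container to a temporary file in the same directory and then rename it over the original. The sketch below uses only the Python standard library; `rewrite_atomically` and its `write_body` callback are hypothetical names.

```python
import contextlib
import os
import tempfile
from typing import BinaryIO, Callable


def rewrite_atomically(path: str, write_body: Callable[[BinaryIO], object]) -> None:
    """Write a new container via `write_body`, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            write_body(tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # replaces the destination in one step
    except BaseException:
        # On any failure the original container is left untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

If the process is interrupted before `os.replace` runs, the original file is still intact and only a stray temporary file may remain.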

13. Security Considerations

  • AES-GCM requires unique IVs per key; this format uses a fresh random IV per encrypted field and per content chunk. Implementations MUST use a CSPRNG.
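  • Full and base paths are always encrypted to avoid leaking names or structure. The entry header fields (ENTRY_TYPE, FULLPATH_LEN, SIZE, BASEPATH_LEN, TS_LEN) are stored in plaintext, so the number of entries, the file sizes, and the lengths of the encrypted paths remain visible to anyone who can read the container.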
  • The password verification marker confirms correct key derivation without exposing the key or file contents. The marker value is not a secret; secrecy is provided by encryption.
  • Argon2id parameters are chosen to provide strong resistance against both CPU and GPU/ASIC-based attacks:
    • iterations = 3 sets the time cost to three passes over the memory
    • memory_cost = 65536 (64 MiB) makes GPU attacks expensive
    • lanes = 4 allows efficient use of modern multi-core systems
  • The 32-byte random salt provides 256 bits of entropy, which makes salt collisions between containers negligible and is sufficient for all practical purposes.
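
The IV-uniqueness requirement can be put into numbers. With random 96-bit IVs, the chance that any two of n encryptions under one key share an IV is at most n(n-1)/2 divided by 2^96 (a birthday-style union bound). The sketch below evaluates that bound; the function names are illustrative, and the per-file count assumes three encrypted fields (full path, base path, timestamp) plus one encryption per chunk.

```python
from fractions import Fraction

IV_BITS = 96           # 12-byte random IV
CHUNK_SIZE = 1_048_576


def iv_collision_bound(num_encryptions: int) -> Fraction:
    """Upper bound on the probability that two random IVs coincide."""
    n = num_encryptions
    return Fraction(n * (n - 1), 2 * 2**IV_BITS)


def encryptions_for_file(size: int) -> int:
    """AES-GCM invocations for one file entry: 3 fields plus one per chunk."""
    return 3 + (size + CHUNK_SIZE - 1) // CHUNK_SIZE
```

At 2^32 encryptions under one key the bound is still below 2^-32, which matches the limit NIST SP 800-38D sets for randomly generated IVs; with 1 MiB chunks that corresponds to roughly 4 PiB of content per password and salt.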

14. File Format

  • File extension: ".mfpk" (conventional).

15. Versioning

  • MAGIC_VERSION embeds the format version in its fourth byte (offset 3), which is 0x05 for V5.
  • Backward compatibility with previous versions is NOT provided by V5.
  • V5 containers are incompatible with V4 and earlier implementations.
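
A reader can reject containers of other versions before attempting any key derivation. The sketch below assumes that other format versions share the 3-byte prefix 89 4D 46 and differ only in the version byte; this document does not confirm that, so treat the prefix check as an assumption. The function names are illustrative.

```python
MAGIC_PREFIX = bytes.fromhex("894D46")  # assumed common to all versions
SUPPORTED_VERSION = 0x05


def container_version(header: bytes) -> int:
    """Return the version byte at offset 3 of the global header."""
    if len(header) < 4 or header[:3] != MAGIC_PREFIX:
        raise ValueError("not an MFPK container")
    return header[3]


def require_v5(header: bytes) -> None:
    """Raise if the container is not a V5 container."""
    version = container_version(header)
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported container version {version}; expected 5")
```

Checking the version first gives a clear error message for older containers instead of a misleading "wrong password" result.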

16. Constants Summary

  • MAGIC_VERSION: 89 4D 46 05
  • HEADER_SIZE: 72 bytes
  • SALT_SIZE: 32
  • PWV_MARKER (plaintext, encrypted in header): "PWV5MARK" (8 bytes)
  • SYNC_WORD: A4 45 4E 54
  • ENTRY_TYPE_FILE: 0x00
  • ENTRY_TYPE_DIRECTORY: 0x01
  • IV_SIZE: 12
  • TAG_SIZE: 16
  • CHUNK_SIZE: 1,048,576
  • Argon2id parameters:
    • iterations: 3
    • memory_cost: 65536 (64 MiB)
    • lanes: 4
    • length: 32 bytes

17. Example Layout (File Entry, Schematic)

[GLOBAL HEADER 72B]
A4 45 4E 54 | 00 | 00 00 00 | [N1=fullpath_len u32] |
[SIZE u64] | [N2=basepath_len u32] | [N3=ts_len u16] | 00..00 (6B)
ENC(full_path) [N1] | ENC(base_path) [N2] | ENC(timestamp) [N3] |
(IV || ENC(chunk1)) ... (IV || ENC(chunkN))