Multi-File-Container-Spec-V5/specification_V5.md
2025-12-13 13:50:28 +08:00


# MFPK-ENC-V5 Format Specification
## Encrypted Multi-File Container Format (Binary V5)
## 1. Introduction
MFPK-ENC-V5 is a compact binary container format for storing files and directories with authenticated encryption. The format features:
- Argon2id key derivation for modern, memory-hard security.
- Compact binary structure optimized for streaming operations.
- Authenticated encryption using AES-256-GCM.
- Strong resistance against GPU/ASIC-based attacks.
- Encrypted metadata including file paths and timestamps.
## 2. Notational Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document indicate requirement levels.
All multi-byte integers are stored in network byte order (big-endian).
## 3. Cryptographic Primitives
- **Key Derivation**: Argon2id
  - Salt length: 32 bytes
  - Output length: 32 bytes (256 bits)
  - Iterations: 3
  - Memory cost: 64 MiB (65536 KiB)
  - Lanes: 4
  - Password encoding: UTF-8
- **AEAD**: AES-256-GCM
  - IV (nonce) length: 12 bytes
  - Tag length: 16 bytes
  - Associated Data (AAD): none in this version
- **Randomness**: IVs and salts MUST be generated using a CSPRNG.
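
As a minimal sketch of the randomness requirement, Python's standard `secrets` module draws bytes from the operating system's CSPRNG. The helper names `new_salt` and `new_iv` are illustrative and not part of the format.

```python
import secrets

SALT_SIZE = 32  # bytes, Argon2id salt length from this section
IV_SIZE = 12    # bytes, AES-256-GCM IV length from this section


def new_salt() -> bytes:
    """Return a fresh random salt for a new container (hypothetical helper)."""
    return secrets.token_bytes(SALT_SIZE)


def new_iv() -> bytes:
    """Return a fresh random IV for one encrypted field or chunk (hypothetical helper)."""
    return secrets.token_bytes(IV_SIZE)
```

A writer following this sketch would call `new_iv()` once per encrypted field and once per content chunk, and never reuse a returned value.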
## 4. Container Overview
A container is a binary file consisting of:
- A fixed global header containing magic, version, salt, and a password verification record.
- A sequence of entries (files or directories). Each entry begins with a 4-byte binary sync word allowing scanning and recovery.
The format is append-only. Removal and password change are performed by rewriting to a new file.
## 5. Global Header (fixed size: 72 bytes)
| Offset | Size | Description |
|--------|------|-------------|
| 0 | 4 | MAGIC_VERSION = 0x89 'M' 'F' 0x05 (bytes: 89 4D 46 05) |
| 4 | 32 | SALT (32 bytes) |
| 36 | 12 | PWV_IV (AES-GCM 12-byte IV for password verification) |
| 48 | 24 | PWV_CT (AES-GCM ciphertext+tag of an 8-byte marker) |

**Notes:**
- The password verification plaintext marker is exactly 8 bytes: PWV_MARKER = 50 57 56 35 4D 41 52 4B (ASCII "PWV5MARK"). This marker is NEVER stored in plaintext; only the IV and the ciphertext+tag are stored.
- Total header size = 4 + 32 + 12 + 24 = 72 bytes.
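
The header table can be read with a single fixed-layout unpack. The following is a sketch only; the function name `parse_global_header` and the returned dictionary keys are assumptions made for illustration.

```python
import struct

HEADER_SIZE = 72
MAGIC_VERSION = bytes([0x89, 0x4D, 0x46, 0x05])
# 4-byte magic, 32-byte salt, 12-byte IV, 24-byte ciphertext+tag
HEADER_STRUCT = struct.Struct(">4s32s12s24s")


def parse_global_header(data: bytes) -> dict:
    """Split the 72-byte global header into its fields (hypothetical helper)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("truncated global header")
    magic, salt, pwv_iv, pwv_ct = HEADER_STRUCT.unpack_from(data, 0)
    if magic != MAGIC_VERSION:
        raise ValueError("not an MFPK-ENC-V5 container")
    return {"salt": salt, "pwv_iv": pwv_iv, "pwv_ct": pwv_ct}
```

`HEADER_STRUCT.size` evaluates to 72, matching the total given in the notes.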
## 6. Entry Record
Each entry begins with a 4-byte sync word followed by a compact entry header describing the lengths of encrypted fields and the logical file size. All path strings are encrypted; no plaintext path appears in the container.
**Constants:**
- SYNC_WORD = A4 45 4E 54 (bytes: 0xA4, 'E', 'N', 'T')
- ENTRY_TYPE: 1 byte
  - 0x00 = file
  - 0x01 = directory
- CHUNK_SIZE = 1,048,576 bytes (1 MiB)
- IV_SIZE = 12 bytes, TAG_SIZE = 16 bytes
**Entry header layout (all big-endian):**
| Offset | Size | Field |
|--------|------|---------------------------------------------|
| 0 | 4 | SYNC_WORD (A4 45 4E 54) |
| 4 | 1 | ENTRY_TYPE (0x00 file, 0x01 dir) |
| 5 | 3 | RESERVED (MUST be zero) |
| 8 | 4 | FULLPATH_LEN (uint32) - length of ENCRYPTED_FULL_PATH |
| 12 | 8 | SIZE (uint64) - logical plaintext file size in bytes. For directories, SIZE MUST be 0 |
| 20 | 4 | BASEPATH_LEN (uint32) - length of ENCRYPTED_BASE_PATH |
| 24 | 2 | TS_LEN (uint16) - length of ENCRYPTED_TIMESTAMP (0 for dir) |
| 26 | 6 | RESERVED (MUST be zero) |
| 32 | N1 | ENCRYPTED_FULL_PATH (N1 bytes) |
| 32+N1 | N2 | ENCRYPTED_BASE_PATH (N2 bytes) |
| 32+N1+N2 | N3 | ENCRYPTED_TIMESTAMP (N3 bytes, N3 == 0 for directories) |
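| ... | ... | CONTENT (files only): sequence of chunks |

Encrypted fields use AES-GCM and are stored as IV || CIPHERTEXT+TAG, where the IV is 12 bytes and the TAG is 16 bytes. The ciphertext length equals the plaintext length, so each non-empty encrypted field occupies its plaintext length plus 28 bytes, and the declared lengths (N1, N2, N3) include that overhead.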
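
The fixed 32-byte part of the entry header maps directly onto a big-endian `struct` layout. This sketch assumes a strict reader that rejects non-zero reserved bytes, which the specification does not mandate; `EntryHeader`, `parse_entry_header`, and `split_encrypted_field` are illustrative names.

```python
import struct
from typing import NamedTuple

SYNC_WORD = bytes([0xA4, 0x45, 0x4E, 0x54])
# sync(4) type(1) reserved(3) N1(u32) SIZE(u64) N2(u32) N3(u16) reserved(6)
ENTRY_HEADER = struct.Struct(">4sB3sIQIH6s")
IV_SIZE, TAG_SIZE = 12, 16


class EntryHeader(NamedTuple):
    entry_type: int
    fullpath_len: int
    size: int
    basepath_len: int
    ts_len: int


def parse_entry_header(buf: bytes, offset: int = 0) -> EntryHeader:
    """Decode the fixed 32-byte entry header starting at offset."""
    sync, etype, rsv1, n1, size, n2, n3, rsv2 = ENTRY_HEADER.unpack_from(buf, offset)
    if sync != SYNC_WORD:
        raise ValueError("missing SYNC_WORD")
    if etype not in (0x00, 0x01):
        raise ValueError("unknown ENTRY_TYPE")
    if any(rsv1) or any(rsv2):
        # Assumption: a strict reader treats non-zero reserved bytes as an error.
        raise ValueError("reserved bytes must be zero")
    return EntryHeader(etype, n1, size, n2, n3)


def split_encrypted_field(blob: bytes):
    """Split a stored field into (iv, ciphertext, tag)."""
    if len(blob) < IV_SIZE + TAG_SIZE:
        raise ValueError("encrypted field too short")
    return blob[:IV_SIZE], blob[IV_SIZE:-TAG_SIZE], blob[-TAG_SIZE:]
```

The variable-length fields then start at `offset + ENTRY_HEADER.size` and are read in the order full path, base path, timestamp.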
## 7. File Content Chunking
- Files are stored as a sequence of chunks after the entry header.
- For each plaintext chunk of up to CHUNK_SIZE bytes:
  - Generate a fresh, random 12-byte IV.
  - Store IV || AES-GCM(key, chunk) where ciphertext length is chunk_len + TAG_SIZE.
- To skip content without decryption:
```
num_chunks = ceil(SIZE / CHUNK_SIZE)
encrypted_size = SIZE + num_chunks * (IV_SIZE + TAG_SIZE)
```
and advance the file pointer by encrypted_size bytes.
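
The skip calculation can be written with integer arithmetic only. The sketch below assumes that every chunk except the last holds exactly CHUNK_SIZE plaintext bytes, which is what the formula implies; the function names are illustrative.

```python
CHUNK_SIZE = 1_048_576
IV_SIZE = 12
TAG_SIZE = 16


def encrypted_content_size(size: int) -> int:
    """Bytes occupied on disk by the content of a file of plaintext length size."""
    num_chunks = -(-size // CHUNK_SIZE)  # integer ceiling division
    return size + num_chunks * (IV_SIZE + TAG_SIZE)


def chunk_layout(size: int):
    """Yield (relative_offset, stored_length) for each stored chunk."""
    offset = 0
    remaining = size
    while remaining > 0:
        plain_len = min(remaining, CHUNK_SIZE)
        stored_len = IV_SIZE + plain_len + TAG_SIZE
        yield offset, stored_len
        offset += stored_len
        remaining -= plain_len
```

For example, a file of 1,048,577 bytes needs two chunks and occupies 1,048,577 + 2 × 28 = 1,048,633 bytes, while an empty file has no chunks and occupies 0 bytes.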
## 8. Paths and Timestamps
- FULL_PATH format: POSIX-like absolute path, e.g., "/dir/a.txt".
- BASE_PATH is the parent directory path of FULL_PATH.
- All paths are UTF-8 strings and are stored only in encrypted form.
- TIMESTAMP (files only): big-endian IEEE-754 float64 of Unix mtime, then encrypted as a single AES-GCM blob (length TS_LEN bytes).
- Directories MUST have TS_LEN = 0 and no content.
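
A short sketch of the plaintext encodings described above, before encryption. It assumes that `posixpath.dirname` matches the intended meaning of "parent directory path"; the helper names are illustrative.

```python
import posixpath
import struct

IV_SIZE, TAG_SIZE = 12, 16
# An encrypted timestamp is IV + 8-byte float64 + tag.
EXPECTED_TS_LEN = IV_SIZE + 8 + TAG_SIZE


def base_path_of(full_path: str) -> str:
    """Parent directory of a POSIX-like absolute path."""
    return posixpath.dirname(full_path)


def encode_timestamp(mtime: float) -> bytes:
    """Unix mtime as a big-endian IEEE-754 float64 (8 bytes)."""
    return struct.pack(">d", mtime)


def decode_timestamp(plaintext: bytes) -> float:
    """Inverse of encode_timestamp, applied to the decrypted 8 bytes."""
    (mtime,) = struct.unpack(">d", plaintext)
    return mtime
```

Under these assumptions `base_path_of("/dir/a.txt")` is `"/dir"`, `base_path_of("/")` is `"/"`, and a file entry's TS_LEN is 36.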
## 9. Root Directory Entry
A well-formed container SHOULD include an explicit root directory entry during initialization:
- ENTRY_TYPE = 0x01 (directory)
- FULL_PATH = "/"
- SIZE = 0
- BASE_PATH = "/"
- TS_LEN = 0 (no timestamp)
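
As an illustration, the fixed header bytes of a root directory entry can be produced as follows. Encryption of the two path fields is out of scope here; only their stored lengths are computed. `pack_entry_header` is a hypothetical helper.

```python
import struct

SYNC_WORD = bytes([0xA4, 0x45, 0x4E, 0x54])
ENTRY_TYPE_DIRECTORY = 0x01
IV_SIZE, TAG_SIZE = 12, 16


def pack_entry_header(entry_type: int, fullpath_len: int, size: int,
                      basepath_len: int, ts_len: int) -> bytes:
    """Build the fixed 32-byte entry header with zeroed reserved bytes."""
    return struct.pack(
        ">4sB3sIQIH6s",
        SYNC_WORD, entry_type, b"\x00" * 3,
        fullpath_len, size, basepath_len, ts_len, b"\x00" * 6,
    )


# "/" is one UTF-8 byte, so each encrypted path field is 12 + 1 + 16 = 29 bytes.
root_path_len = IV_SIZE + len("/".encode("utf-8")) + TAG_SIZE
root_header = pack_entry_header(ENTRY_TYPE_DIRECTORY, root_path_len, 0, root_path_len, 0)
```

The complete root entry is `root_header` followed by the two 29-byte encrypted path fields, for 90 bytes in total.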
## 10. Indexing and Scanning
- Readers MAY scan sequentially starting after the 72-byte header:
  - Read 4 bytes; if they are not SYNC_WORD, resynchronize by scanning forward for the next SYNC_WORD.
  - Read ENTRY_TYPE and the header fields (FULLPATH_LEN, SIZE, BASEPATH_LEN, TS_LEN), then the corresponding encrypted byte ranges.
  - Decrypt FULL_PATH and BASE_PATH to reconstruct the directory tree.
  - For files, decrypt TIMESTAMP to obtain mtime.
  - Compute the content start position, which immediately follows ENCRYPTED_TIMESTAMP.
  - For files, skip content using the SIZE-based calculation in Section 7.
- Decryption is REQUIRED only to recover plaintext paths and (for files) timestamps. Content decryption is not required for indexing.
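
The scan can be sketched over an in-memory buffer without any decryption. This is one possible reader, not a reference implementation: `scan_entries` and its dictionary keys are assumptions, and the resynchronization policy (advance one byte after an implausible header) is a heuristic choice.

```python
import struct

HEADER_SIZE = 72
SYNC_WORD = bytes([0xA4, 0x45, 0x4E, 0x54])
ENTRY_HEADER = struct.Struct(">4sB3sIQIH6s")  # fixed 32-byte entry header
CHUNK_SIZE = 1_048_576
IV_SIZE, TAG_SIZE = 12, 16


def scan_entries(buf: bytes):
    """Yield one dict of byte ranges per complete entry found in buf."""
    pos = HEADER_SIZE
    while True:
        pos = buf.find(SYNC_WORD, pos)  # also resynchronizes past junk
        if pos < 0 or pos + ENTRY_HEADER.size > len(buf):
            return
        _, etype, _, n1, size, n2, n3, _ = ENTRY_HEADER.unpack_from(buf, pos)
        fields = pos + ENTRY_HEADER.size
        content = fields + n1 + n2 + n3
        num_chunks = -(-size // CHUNK_SIZE)
        end = content + size + num_chunks * (IV_SIZE + TAG_SIZE)
        if etype not in (0x00, 0x01) or end > len(buf):
            pos += 1  # false sync match or truncated entry: keep scanning
            continue
        yield {
            "offset": pos,
            "entry_type": etype,
            "size": size,
            "full_path": (fields, n1),            # (start, length) of encrypted field
            "base_path": (fields + n1, n2),
            "timestamp": (fields + n1 + n2, n3),
            "content_start": content,
            "next_offset": end,
        }
        pos = end
```

Because SYNC_WORD can also occur by chance inside ciphertext, a reader that resynchronizes should treat a candidate entry as confirmed only after its path fields decrypt and authenticate successfully.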
## 11. Password Verification
- To verify a password:
  - Read SALT from the header.
  - Derive a key via Argon2id with parameters:
    - iterations = 3
    - memory_cost = 65536 (64 MiB)
    - lanes = 4
    - salt = SALT
    - length = 32 bytes
  - Read PWV_IV and PWV_CT from the header and attempt decryption.
  - Successful decryption yielding the 8-byte PWV_MARKER indicates the password is correct.
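
The control flow of these steps can be sketched independently of any particular cryptography library. In the sketch below, `derive_key` and `aead_decrypt` are hypothetical callables that the caller binds to an Argon2id implementation and an AES-256-GCM implementation; their signatures are assumptions made for this example.

```python
import hmac
from typing import Callable, Optional

PWV_MARKER = b"PWV5MARK"
ARGON2_PARAMS = {"iterations": 3, "memory_cost": 65536, "lanes": 4, "length": 32}


def verify_password(
    header: bytes,
    password: str,
    derive_key: Callable[..., bytes],                      # assumed: (password, salt, **params) -> key
    aead_decrypt: Callable[[bytes, bytes, bytes], bytes],  # assumed: (key, iv, ct_and_tag) -> plaintext
) -> Optional[bytes]:
    """Return the derived key if the password is correct, otherwise None."""
    salt, pwv_iv, pwv_ct = header[4:36], header[36:48], header[48:72]
    key = derive_key(password.encode("utf-8"), salt, **ARGON2_PARAMS)
    try:
        marker = aead_decrypt(key, pwv_iv, pwv_ct)
    except Exception:
        # Assumption: the AEAD callable raises when tag verification fails.
        return None
    return key if hmac.compare_digest(marker, PWV_MARKER) else None
```

Returning the key on success lets the caller reuse it for the entries that follow, so the memory-hard derivation runs only once per container.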
## 12. Error Handling and Robustness
- Readers SHOULD tolerate trailing or partial entries by:
  - Resynchronizing on SYNC_WORD when an entry parse fails.
  - Validating that all declared lengths are present.
  - Treating incomplete or malformed entries as end-of-container or skipping them safely.
- Writers SHOULD flush after chunk writes to reduce risk of partial loss on interruption.
- Removal operations SHOULD rewrite to a temporary file and replace on success to maintain atomicity.
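
A rewrite-then-replace operation can be sketched with the Python standard library. The helper `atomic_rewrite` and its `write_body` callback are illustrative names; the sketch assumes the temporary file lives in the same directory as the container so that the final rename stays on one filesystem.

```python
import contextlib
import os
import tempfile


def atomic_rewrite(path: str, write_body) -> None:
    """Write a new container via write_body(file) and swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            write_body(tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push data to disk before the rename
        os.replace(tmp_path, path)  # replaces the old file in one step
    except BaseException:
        # On any failure, leave the original untouched and drop the temp file.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

If `write_body` raises partway through, the original container is left unchanged, which is the property the removal guideline asks for.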
## 13. Security Considerations
- AES-GCM requires unique IVs per key; this format uses a fresh random IV per encrypted field and per content chunk. Implementations MUST use a CSPRNG.
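- Full and base paths are always encrypted to avoid leaking names. The fixed entry header is stored in plaintext, so ENTRY_TYPE, SIZE, and the encrypted field lengths (and therefore path lengths, file sizes, and the number of entries) remain visible to anyone who can read the container.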
- The password verification marker confirms correct key derivation without exposing the key or file contents. The marker value is not a secret; secrecy is provided by encryption.
- Argon2id parameters are chosen to resist both CPU and GPU/ASIC-based attacks:
  - iterations = 3 sets the number of passes over memory (the time cost).
  - memory_cost = 65536 KiB (64 MiB) forces each password guess to use 64 MiB of memory, which makes massively parallel GPU/ASIC attacks expensive.
  - lanes = 4 allows efficient use of modern multi-core systems.
- The salt length of 32 bytes provides 256 bits of entropy, which is sufficient for all practical purposes.
## 14. File Format
- File extension: ".mfpk" (conventional).
## 15. Versioning
- MAGIC_VERSION embeds the format version in its fourth byte (offset 3), which is 0x05 for V5.
- Backward compatibility with previous versions is NOT provided by V5.
- V5 containers are incompatible with V4 and earlier implementations.
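
A reader can separate "not a container" from "unsupported version" by checking the first three bytes and the version byte separately. This sketch assumes that other versions share the same three-byte prefix, which this document does not state; the function names are illustrative.

```python
MAGIC_PREFIX = bytes([0x89, 0x4D, 0x46])  # assumed common to all versions
SUPPORTED_VERSION = 0x05


def container_version(first_bytes: bytes) -> int:
    """Return the version byte at offset 3 after checking the prefix."""
    if len(first_bytes) < 4 or first_bytes[:3] != MAGIC_PREFIX:
        raise ValueError("not an MFPK container")
    return first_bytes[3]


def require_v5(first_bytes: bytes) -> None:
    """Raise unless the container declares version 5."""
    version = container_version(first_bytes)
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported MFPK version {version}")
```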
## 16. Constants Summary
- MAGIC_VERSION: 89 4D 46 05
- HEADER_SIZE: 72 bytes
- SALT_SIZE: 32
- PWV_MARKER (plaintext, encrypted in header): "PWV5MARK" (8 bytes)
- SYNC_WORD: A4 45 4E 54
- ENTRY_TYPE_FILE: 0x00
- ENTRY_TYPE_DIRECTORY: 0x01
- IV_SIZE: 12
- TAG_SIZE: 16
- CHUNK_SIZE: 1,048,576
- Argon2id parameters:
  - iterations: 3
  - memory_cost: 65536 (64 MiB)
  - lanes: 4
  - length: 32 bytes
## 17. Example Layout (File Entry, Schematic)
```
[GLOBAL HEADER 72B]
A4 45 4E 54 | 00 | 00 00 00 | [N1=fullpath_len u32] |
[SIZE u64] | [N2=basepath_len u32] | [N3=ts_len u16] | 00..00 (6B)
ENC(full_path) [N1] | ENC(base_path) [N2] | ENC(timestamp) [N3] |
(IV || ENC(chunk1)) ... (IV || ENC(chunkN))
```
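
To make the schematic concrete, the total on-disk size of one entry can be computed from its plaintext inputs. This is a worked sketch under the field-length rules of Sections 6 and 7; `entry_total_size` is a hypothetical helper.

```python
ENTRY_HEADER_SIZE = 32
CHUNK_SIZE = 1_048_576
IV_SIZE, TAG_SIZE = 12, 16
OVERHEAD = IV_SIZE + TAG_SIZE  # 28 bytes per encrypted field or chunk


def entry_total_size(full_path: str, base_path: str, size: int,
                     is_dir: bool = False) -> int:
    """On-disk bytes used by one entry, from SYNC_WORD to the last chunk."""
    n1 = len(full_path.encode("utf-8")) + OVERHEAD
    n2 = len(base_path.encode("utf-8")) + OVERHEAD
    n3 = 0 if is_dir else 8 + OVERHEAD  # float64 timestamp, files only
    num_chunks = -(-size // CHUNK_SIZE)
    return ENTRY_HEADER_SIZE + n1 + n2 + n3 + size + num_chunks * OVERHEAD
```

For a 2,500,000-byte file at "/dir/a.txt" this gives 32 + 38 + 32 + 36 + 2,500,000 + 3 × 28 = 2,500,222 bytes; the root directory entry takes 32 + 29 + 29 = 90 bytes.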