Usage¶
Overview¶
The yamanifest package provides utilities for creating and managing YAML-formatted file manifests with multiple checksums (hashes). This allows you to:
Store multiple hash algorithms for each file in a single manifest
Perform cheap checksum operations first (e.g., fast xxHash)
Cascade to more expensive hashes only when needed
Verify file integrity with flexible hash matching
Installation¶
Via Conda¶
conda install -c access-nri yamanifest
Via pip¶
pip install yamanifest
Quick Start¶
Creating a Manifest¶
The easiest way to create a manifest is using the command-line tool yamf:
yamf add -n manifest.yaml file1.txt file2.txt
This creates a manifest.yaml file with hashes for the specified files. By default, two hash algorithms are used: binhash and md5.
You can specify custom hash algorithms:
yamf add -n manifest.yaml -s binhash-xxh -s sha1 -s md5 file1.txt file2.txt
Checking a Manifest¶
Verify that files match the hashes stored in your manifest:
yamf check -n manifest.yaml
This checks all files in the manifest. If all hashes match, it outputs:
manifest.yaml :: hashes are correct
You can also check specific files:
yamf check -n manifest.yaml file1.txt
To pass the check if any hash matches (instead of requiring all hashes to match):
yamf check -n manifest.yaml --any
Programmatic Usage¶
Python API¶
Import the Manifest class:
from yamanifest import Manifest
Creating a Manifest Programmatically¶
# Create a new manifest
manifest = Manifest('manifest.yaml')
# Add files with default hashes (binhash and md5)
manifest.add(['file1.txt', 'file2.txt'])
# Save to YAML file
manifest.dump()
You can specify custom hash functions:
# Add files with specific hashes
manifest.add('file1.txt', hashfn=['binhash-xxh', 'md5', 'sha1'])
# Add to all existing files in manifest
manifest.add(hashfn='sha256')
# Force overwrite existing hashes
manifest.add('file1.txt', hashfn='md5', force=True)
Loading an Existing Manifest¶
# Create and load a manifest in one step (method chaining)
manifest = Manifest('manifest.yaml').load()
# Or load separately
manifest = Manifest('manifest.yaml')
manifest.load()
Checking Files¶
# Check all files with all hashes
if manifest.check():
print("All files are valid")
# Check with specific hash functions
if manifest.check(hashfn=['md5', 'sha1']):
print("MD5 and SHA1 hashes are valid")
# Capture detailed hash results
hashvals = {}
if manifest.check(hashvals=hashvals):
print("Files are valid")
print(hashvals) # Contains computed hash values
# Use 'any' condition (pass if any hash matches, default is all)
if manifest.check(condition=any):
print("At least one hash matched for each file")
Working with Individual Files¶
# Check if a file is in the manifest
if manifest.contains('file1.txt'):
print("file1.txt is in the manifest")
# Get a specific hash value
md5_hash = manifest.get('file1.txt', 'md5')
if md5_hash:
print(f"MD5: {md5_hash}")
# Delete a file from manifest
manifest.delete('file1.txt')
# Iterate over all files in manifest
for filepath in manifest:
print(filepath)
# Get manifest size
print(f"Total files: {len(manifest)}")
Comparing Manifests¶
manifest1 = Manifest('manifest1.yaml').load()
manifest2 = Manifest('manifest2.yaml').load()
if manifest1.equals(manifest2):
print("Manifests are identical")
else:
print("Manifests differ")
Supported Hash Algorithms¶
The following hash algorithms are supported:
binhash- Change detection hash only. Not suitable for file verification across filesystems.binhash-xxh- xxHash version of binhash.md5- MD5 (default)sha1- SHA-1sha256- SHA-256sha512- SHA-512
You can check available hashes programmatically:
from yamanifest import supported_hashes
print(supported_hashes())
Direct Hash Computation¶
You can also use the hashing module directly:
from yamanifest import hash
# Compute a hash for a file
md5_value = hash('file1.txt', 'md5')
xxh_value = hash('file2.txt', 'binhash-xxh')
Manifest YAML Format¶
The manifest file is a YAML document with two sections:
Header Section (metadata):
format: yamanifest
version: 1.0
Data Section (file entries):
file1.txt:
fullpath: /absolute/path/to/file1.txt
hashes:
binhash: abc123...
md5: d41d8cd98f00b204e9800998ecf8427e
file2.txt:
fullpath: /absolute/path/to/file2.txt
hashes:
binhash: xyz789...
md5: 5d41402abc4b2a76b9719d911017c592
Example Workflow¶
Complete Example¶
from yamanifest import Manifest
# Step 1: Create a manifest for data files
manifest = Manifest('data_manifest.yaml')
manifest.add(['data/file1.csv', 'data/file2.csv'],
hashfn=['binhash-xxh', 'md5'])
manifest.dump()
# Step 2: Later, verify the data hasn't changed
manifest = Manifest('data_manifest.yaml').load()
if manifest.check():
print("Data files are unchanged")
else:
print("Data has been modified!")
# Step 3: Add stronger verification with SHA256
manifest.add(hashfn='sha256')
manifest.dump()
# Step 4: Verify with strong hash
if manifest.check(hashfn='sha256'):
print("SHA256 hash verification passed")
Command-Line Example¶
# Initialize manifest for project files
yamf add -n project.yaml -s binhash-xxh -s sha1 \
src/main.py src/utils.py tests/test_main.py
# Check files before deployment
yamf check -n project.yaml
# Verify specific files
yamf check -n project.yaml src/main.py
# Check with lenient matching (any hash OK)
yamf check -n project.yaml --any
Advanced Features¶
Method Chaining¶
Many methods support chaining:
manifest = Manifest('manifest.yaml').load() # Returns self after load
Multiprocessing¶
The Manifest class automatically uses multiprocessing when computing hashes. It detects the number of CPU cores available and distributes work accordingly.
Custom File Paths¶
When adding files to a manifest, you can specify custom fullpaths:
manifest.add(['relative/path.txt'],
fullpaths=['/absolute/path/to/file.txt'])
Shortcircuit Hashing¶
Use the shortcircuit parameter to stop after the first successful hash:
manifest.add('file.txt',
hashfn=['binhash-xxh', 'md5'],
shortcircuit=True)
Error Handling¶
The package raises exceptions for common errors:
from yamanifest import Manifest, HashExists, FilePathNonexistent, HashNonexistent
try:
manifest = Manifest('manifest.yaml').load()
except FileNotFoundError:
print("Manifest file does not exist")
except ValueError as e:
print(f"Invalid manifest format: {e}")
try:
manifest.get('nonexistent.txt', 'md5')
except FilePathNonexistent:
print("File not in manifest")
Tips and Best Practices¶
Use xxHash for Speed: When speed is important, use
binhash-xxhas your first hash:yamf add -n manifest.yaml -s binhash-xxh -s md5 large_file.bin
Cascade Hashes: Store fast hashes for quick checks and slow hashes for thorough verification
Multiple Algorithms: Store multiple hashes to detect collisions and provide redundancy
Version Control: Commit manifest files to version control alongside your data
Parallel Processing: The library automatically uses all available CPU cores
For More Information¶
Repository: https://github.com/ACCESS-NRI/yamanifest
PyPI Package: https://pypi.org/project/yamanifest/