BULK ARchive Format
Thierry Technologies
pierre@nothos.net
binary
This specification describes a BULK format to pack together independent pieces of data and
metadata about them.
There are plenty of archive formats currently in use, from widely used and repurposed
formats like ZIP (used for generic file archives as well as Java deployment, ebooks and
office documents) to legacy formats like ARC or Z, through moderately used formats enjoying
a stable niche, like tar, RAR or StuffIt.
Some archive formats reuse existing ones. Many archive formats developed nowadays reuse ZIP
without modification and merely dictate the tree structure inside the ZIP file. The Unix
world has a long tradition of separation of concerns, and thus uses different formats for
archiving (ar or tar) and compression (gzip, bzip2, lzma or now xz), with compressed archives
named after the combination (foo.ar.gz, bar.tar.bz2, etc.). Debian packages are actually ar
files containing a little uncompressed metadata and a couple of compressed tar files.
But the problem remains that these binary formats all define completely ad hoc syntaxes,
sometimes incredibly optimized but narrowly tailored to their specific requirements. Many
leave little room for future extension, or allow it only in a contrived way (many formats
are actually extended by abusing an unused metadata field and cramming a new ad hoc format
into it).
Some of these formats have a few fixed- or limited-length fields that became or will
become obsolete in time. The ar format, for example, suffers from the Year 2038 problem
and cannot store long file names. Various implementations have used different incompatible
extensions to store long file names.
So we propose yet another archive format, one that uses an efficient but extensible syntax,
so that the format can always be extended or modified for new use cases or constraints.
A BARF file is basically a set of metadata fields followed by data entries. Each entry
consists of a set of metadata fields followed by its content. The interesting property of
using BULK is that any portion of that structure is dynamic: there are no fixed metadata
fields, and an entry without metadata is serialized as its bare content, because in BULK an
entry and its content cannot be confused with each other. Anything can also be enclosed in
a BULK structure to add features.
Metadata fields are just BULK expressions, which means that any ad hoc or standard BULK
vocabulary can be used in an efficient way as metadata. Mutually incompatible metadata
vocabularies could even be stored alongside each other for legacy support, if need be.
The archive file can be compressed or encrypted by an outside tool (producing a
foo.barf.gz or bar.pgp file, for example), but so can any individual BULK expression. The
entire archive, internally to the file, can be a BULK compression or encryption form, as
well as any metadata set, metadata field or entry. Almost any extension or optimization
can be retrofitted into this structure in a backward-compatible way, such as checksums,
digital signatures or access offsets for random access.
This extends the use cases of BARF beyond archives of multiple files. An
extensible image format could be based on a BARF structure, allowing seamless transition
from a simple format to a full-featured one, whereas existing formats usually add complex
extensions that fail to be widely adopted (to add support for layers, transparency,
different compression or metadata). Although BARF would probably be ill-suited for
playable audio and video, it would still provide a perfect fit for the storage of raw
audio and video for editing programs.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD
NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as
described in RFC 2119.
Literal numerical values are provided in decimal or hexadecimal as appropriate.
Hexadecimal literals are prefixed with 0x to distinguish them from decimal literals.
BULK byte sequences and expressions are described with the same conventions as those used
in the BULK 1.0 specification.
This specification defines the notion of Guaranteed Backward Compatibility (GBC). It applies
to forms that carry a main payload with additional metadata. A form that obeys the rules of
GBC has the type GBCForm.
A GBCForm has the shape ( Ref {arguments} {next}:Expr ). If the
payload of a GBCForm is readable without knowledge of that form, then {next} MUST be that
payload. Otherwise, {next} MUST be nil.
For example, a GBC-compliant checksum form could have the shape ( crc32c
{crc}:Word32 {payload} ), where {crc} is the checksum of the byte sequence
{payload}. On the other hand, a GBC-compliant encryption form, where obviously the payload
is unreadable without proper knowledge of the form, could have the shape ( encrypt {payload} nil ).
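The GBC rule can be illustrated with a minimal Python sketch, which is not part of this
specification. It assumes a simplified in-memory model rather than the BULK byte encoding:
a form is a tuple whose first element is a string standing in for the Ref, nil is None, and
an Array is a bytes object. The helper name gbc_payload is hypothetical, and the sketch
further assumes that every form the reader does not know is a GBCForm.

```python
def gbc_payload(expr, known_forms=frozenset()):
    """Return the innermost payload reachable through unknown GBC forms.

    Forms are modeled as tuples ("name", argument, ..., next). The result is
    None (standing in for nil) when an unknown form declares that its payload
    cannot be read without knowledge of that form.
    """
    while isinstance(expr, tuple) and expr and expr[0] not in known_forms:
        if len(expr) < 2:
            return None  # a form holding only its Ref carries no {next}
        expr = expr[-1]  # GBC: {next} is the last expression of the form
    return expr
```

With this model, gbc_payload(("crc32c", 0x12345678, b"hello")) returns b"hello" even though
the reader knows nothing about crc32c, while gbc_payload(("encrypt", b"\x8f\x02", None))
returns None.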
The archive namespace (mnemonic: barf) is an official namespace
identified by the UUID (BULK,
"Stack 'em. Pack 'em. And rack 'em."). It provides a standard way to pack one or more data
elements together with metadata.
0x1 (mnemonic: pack )
( pack {metadata}:Expr {entries} )
This packs archive entries together as a form. {metadata} holds metadata about the whole
pack. In the context of {metadata}, rdf:this-resource
designates the whole pack.
0x2 (mnemonic: stack )
( stack {metadata}:Expr {entries-metadata} ) {entries}
This stacks archive entries together as a sequence, for the cases where it is not
appropriate for entries to belong to a single expression. {metadata} holds metadata about
the whole stack. In the context of {metadata}, rdf:this-resource designates the whole stack.
{entries-metadata} MUST be a sequence of expressions whose length is less than or equal to
the number of expressions in {entries}. Each expression in {entries-metadata} holds metadata
about a single entry of the stack. In the context of such a metadata expression,
rdf:this-resource designates the described stack entry. By default, expression number N in
{entries-metadata} describes expression number N in {entries}.
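The default pairing between {entries-metadata} and {entries} can be sketched in Python as
follows. This is only an illustration under assumptions: both sequences are modeled as
Python lists, an entry without a metadata expression is reported with None, and the function
name pair_stack_entries is hypothetical. Only the default positional rule is covered.

```python
from itertools import zip_longest


def pair_stack_entries(entries_metadata, entries):
    """Pair metadata expression N with entry N, following the default rule.

    Returns a list of (metadata, entry) tuples. Entries beyond the end of
    entries_metadata are paired with None, meaning "no metadata".
    """
    if len(entries_metadata) > len(entries):
        # The specification requires len({entries-metadata}) <= len({entries}).
        raise ValueError("{entries-metadata} is longer than {entries}")
    return list(zip_longest(entries_metadata, entries, fillvalue=None))
```

Because trailing entries may lack metadata, appending a metadata-less entry to the stack
does not require touching the stack form itself in this model.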
When the stack form is in the abstract yield, this has the property that if the last entry
is an Array, the actual payload constitutes the end of the BULK stream. This can make it
possible for BULK-unaware programs to read and/or write that payload easily.
Stacking also makes the addition of a metadata-carrying entry or a metadata-less entry an
append-only operation.
0x3 (mnemonic: describe )
( describe {metadata} {payload}:Expr )
This form associates arbitrary metadata with an arbitrary payload. It is intended to
constitute most entries in BARF archives. In the context of {metadata}, rdf:this-resource designates the payload.
Type: GBCForm
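As an illustration, the following Python sketch walks the entries of a pack form and
separates metadata from payload for entries wrapped in describe. It relies on assumptions
that are not part of this specification: forms are modeled as tuples whose first element is
the mnemonic as a string, metadata expressions are stood in for by plain dictionaries, and
the function name list_entries is hypothetical.

```python
def list_entries(pack):
    """Return a list of (metadata, payload) tuples for the entries of a pack.

    The pack is modeled as ("pack", metadata, entry, entry, ...). An entry that
    is not a describe form is treated as bare content without metadata.
    """
    name, _pack_metadata, *entries = pack
    if name != "pack":
        raise ValueError("not a pack form")
    result = []
    for entry in entries:
        if isinstance(entry, tuple) and len(entry) == 3 and entry[0] == "describe":
            _, entry_metadata, payload = entry
            result.append((entry_metadata, payload))
        else:
            result.append((None, entry))  # entry serialized as its content
    return result
```

A pack mixing both kinds of entries, such as ("pack", {}, ("describe", {"name": "a.txt"},
b"hello"), b"raw"), yields one pair carrying metadata and one pair whose metadata is None.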
0x4 (mnemonic: bulk-stream )
( bulk-stream {payload} )
This form makes it possible to include a complete BULK stream without modification, as
{payload}.
Type: GBCForm
0x5 (mnemonic: compressed )
( compressed {method}:Expr {payload}:Array nil )
This form encapsulates a compressed payload. This specification doesn't define names to
express a compression method.
Type: GBCForm
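A minimal Python sketch of producing and consuming such a form is given below, under stated
assumptions. Since this specification defines no compression method names, the method name
"zlib" is purely hypothetical, as are the function names. The form is modeled as a tuple,
the Array as a bytes object and nil as None; the BULK byte encoding is not shown.

```python
import zlib


def make_compressed(content: bytes):
    """Build a model of ( compressed {method} {payload}:Array nil )."""
    # "zlib" is an illustrative method name, not one defined by the specification.
    return ("compressed", "zlib", zlib.compress(content), None)


def read_compressed(form):
    """Return the decompressed content of a modeled compressed form."""
    name, method, payload, next_expr = form
    if name != "compressed" or next_expr is not None:
        raise ValueError("not a compressed form ending with nil")
    if method != "zlib":
        raise ValueError("unsupported compression method: %r" % (method,))
    return zlib.decompress(payload)
```

The trailing nil follows the GBC rule: the payload cannot be read without knowledge of the
form, so a reader unaware of it learns only that no readable payload is available.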
0x6 (mnemonic: encrypted )
( encrypted {method}:Expr {payload}:Array nil )
This form encapsulates an encrypted payload. This specification doesn't define names to
express an encryption method.
Type: GBCForm
Key words for use in RFCs to Indicate Requirement Levels
Harvard University
sob@harvard.edu
Binary Uniform Language Kit 1.0
Thierry Technologies
pierre@nothos.net
ISO 8601:2004 Data elements and interchange formats -- Information interchange -- Representation
of dates and times