Design - 5 min read - 2026
The +1 byte no-expansion floor - and why every codec should have one
By Francesco Pedulli
Compress a small or already-compressed file with gzip and you can get back something bigger than the input - sometimes a few bytes, sometimes several hundred. The same is true of brotli, xz, zstd and 7-Zip: on incompressible input they still add headers, block framing and checksums around data they could not shrink. For most uses that is fine - if a file doesn't shrink, you keep the original.
But it is an annoying corner case for systems that compress everything unconditionally: object-storage backends, log shippers, telemetry collectors, backup systems. Every byte of overhead on incompressible data is a tax, and that tax shows up at scale.
Here are our own measurements on 1 MiB of random bytes (entropy near the maximum - nothing can compress it). The competitor numbers below are what we measured with default settings on this one input; they are illustrative, not a ranking, and are reproducible on request.
| Codec | Output size | Overhead |
|---|---|---|
| gzip -9 | 1,048,917 | +341 B |
| brotli -q 11 | 1,048,581 | +5 B |
| xz -9 -e | 1,049,000 | +424 B |
| zstd --ultra -22 | 1,048,591 | +15 B |
| 7-Zip "ultra" | 1,049,234 | +658 B |
| Pedulli | 1,048,577 | +1 B |
Pedulli's overhead is one byte: the engine tag 0x00, which tells the decoder "no engine ran, raw bytes follow." This is a guarantee about Pedulli's own output: it is never larger than input + 1 byte. The Orchestrator races xz, zstd, brotli and your data's SRD math and keeps the smallest verified output, so its result is never more than one byte (the tag) larger than the best standard codec's. It wins outright on structured data and matches the best codec on already-optimal data, where the racer simply selects that codec's output. On random data the identity candidate wins instead, as in the last row of the table.
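The expansion effect itself is easy to observe with Node's built-in zlib module. The sketch below covers only gzip and brotli, since those are the two codecs Node ships with. It is not the harness behind the table, and the exact byte counts may differ from the command-line tools, which use their own framing and metadata.

```javascript
// Measure how much Node's built-in codecs expand incompressible input.
const crypto = require("node:crypto");
const zlib = require("node:zlib");

const input = crypto.randomBytes(1024 * 1024); // 1 MiB of random bytes

// Compressed sizes in bytes, at each codec's maximum setting.
const sizes = {
  gzip: zlib.gzipSync(input, { level: 9 }).length,
  brotli: zlib.brotliCompressSync(input, {
    params: { [zlib.constants.BROTLI_PARAM_QUALITY]: 11 },
  }).length,
};

const overhead = {};
for (const [name, size] of Object.entries(sizes)) {
  overhead[name] = size - input.length; // bytes added on top of the input
  console.log(`${name}: ${size} bytes (+${overhead[name]} B)`);
}
```

Both codecs should report a positive overhead, because random bytes leave them nothing to exploit and their container framing still has to be written.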
Why does Pedulli get away with 1 byte?
Two design choices:
- No magic header. Pedulli files don't start with a multi-byte format signature, version field, flags field or compressed-length field. The first byte is the engine tag, full stop; the rest is the engine's native payload.
- Race-and-pick-smallest with an identity candidate. The compressor tries every relevant engine plus an identity-fallback candidate. If no engine's output is strictly smaller than the input itself, identity wins and the file is input + 1 byte. No expansion beyond that tag byte, by construction.
In code (paraphrased):
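// Identity goes first, so an engine must be strictly smaller to displace it.
const candidates = [
  { engine: 0, output: input }, // identity fallback always in the race
  ...engineRace(input),         // each entry: { engine, output }
];
// Keep the smallest output; on a tie the earlier candidate (identity) stays.
const winner = candidates.reduce((best, c) =>
  c.output.length < best.output.length ? c : best
);
// One tag byte, then the winner's payload.
return [winner.engine, ...winner.output];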
Engine 0 is the identity fallback; its "output" is just the input bytes, and the 1-byte tag makes the total input + 1. Any engine producing strictly fewer bytes wins; otherwise raw wins. The contract is mechanical, not heuristic.
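For readers who want to run the idea end to end, here is a minimal sketch of the same contract with a matching decoder. The engine table is an assumption made for illustration: tags 1 and 2 are raw DEFLATE and brotli from Node's zlib, standing in for Pedulli's real engines, and the names `ENGINES`, `compress` and `decompress` are hypothetical.

```javascript
const crypto = require("node:crypto");
const zlib = require("node:zlib");

// Hypothetical engine table. Tag 0 is reserved for identity (raw bytes).
const ENGINES = {
  1: {
    encode: (buf) => zlib.deflateRawSync(buf, { level: 9 }),
    decode: (buf) => zlib.inflateRawSync(buf),
  },
  2: {
    encode: (buf) => zlib.brotliCompressSync(buf),
    decode: (buf) => zlib.brotliDecompressSync(buf),
  },
};

function compress(input) {
  // Identity is the starting best; an engine must be strictly smaller to replace it.
  let best = { tag: 0, output: input };
  for (const [tag, engine] of Object.entries(ENGINES)) {
    const output = engine.encode(input);
    if (output.length < best.output.length) {
      best = { tag: Number(tag), output };
    }
  }
  // One tag byte, then the winning payload: never more than input + 1 byte.
  return Buffer.concat([Buffer.from([best.tag]), best.output]);
}

function decompress(framed) {
  const tag = framed[0];
  const payload = framed.subarray(1);
  if (tag === 0) return Buffer.from(payload); // no engine ran, raw bytes follow
  return ENGINES[tag].decode(payload);
}

const random = crypto.randomBytes(4096);
const text = Buffer.from("abc".repeat(1000));
console.log(compress(random).length - random.length); // overhead on random bytes
console.log(compress(text).length, "bytes for", text.length, "bytes of text");
```

Starting from identity rather than from the first engine is what makes the bound mechanical: whatever the engines return, the framed result cannot exceed the input length plus the tag.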
Why this matters in practice
1. Backups of encrypted source data
Compress a directory containing encrypted containers or password-protected archives and the codec can't shrink them. Standard codecs add a handful to a few hundred bytes of overhead per file. Across 100,000 small encrypted files that is megabytes of pure overhead - a measurable storage line item. Pedulli's is 100,000 x 1 byte.
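The arithmetic behind that claim, as a quick sketch. The per-file figures of 5 and 300 bytes are assumptions chosen to bracket "a handful to a few hundred bytes"; they are not measurements.

```javascript
// Aggregate framing overhead for a backup of many incompressible files.
const fileCount = 100_000;

// Assumed per-file overheads in bytes (illustrative, not measured).
const perFileOverhead = { low: 5, high: 300, oneByteTag: 1 };

const totals = Object.fromEntries(
  Object.entries(perFileOverhead).map(([name, bytes]) => [name, bytes * fileCount])
);

for (const [name, bytes] of Object.entries(totals)) {
  console.log(`${name}: ${(bytes / 1e6).toFixed(1)} MB`);
}
// low: 0.5 MB
// high: 30.0 MB
// oneByteTag: 0.1 MB
```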
2. Object storage with "always compress on write"
Systems that run compression on every object regardless of compressibility pay overhead on every incompressible object (already-JPEG'd images, video, encrypted blobs). Pedulli's 1-byte tag makes enabling compression nearly free for objects that don't compress, while keeping the wins on those that do.
3. Telemetry / sensor batching
A batch often mixes a few KB of structured data with a fixed-size opaque field (a calibration blob, a signature, an encrypted token). The structured part compresses well; the opaque part not at all. Generic codecs add overhead per field; Pedulli adds 1 byte.
The honest caveat
The 1-byte tag covers the case where no engine wins. When an engine does win, its payload carries its own internal framing - but the dispatcher only selects an engine if its total output is strictly smaller than identity, so the result is still <= input + 1 byte. The "+1 byte on truly incompressible" floor is as close to the information-theoretic minimum as a byte-aligned format gets: you cannot signal "this data is incompressible" in less than one bit, and the implementation rounds up to a byte. This is entirely within Shannon's framework, not beyond it.
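The "at least one bit" part follows from the standard counting (pigeonhole) argument, sketched here for bit strings of length n. It is a general fact about lossless coding, not something specific to Pedulli.

```latex
\[
\#\{\text{inputs of length } n\} = 2^{n},
\qquad
\#\{\text{outputs shorter than } n \text{ bits}\} = \sum_{k=0}^{n-1} 2^{k} = 2^{n} - 1 < 2^{n}.
\]
```

A lossless encoder must map distinct inputs to distinct outputs, so it cannot shorten every input of length n: there are not enough shorter strings to go round. Any codec that shrinks some inputs must therefore grow others by at least one bit, and a format that works in whole bytes pays eight.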
Should you switch?
If your storage is dominated by structured data (logs, telemetry, periodic archives), the structure-aware wins are the main reason, and the no-expansion floor is a small bonus. If it is dominated by incompressible data (mostly video / encrypted / already-JPEG'd), the floor itself saves you the per-object framing overhead. On small generic English text, brotli's large static dictionary - or xz on long text - is already among the racers, so the Orchestrator simply selects whichever is smallest and ties it at its own size; you are never larger than the best standard codec.
See the +1 byte floor in action
Drop an incompressible file in the trial - you'll see +1 byte, byte-exact. Or check API pricing.