Design - 5 min read - 2026

The +1 byte no-expansion floor - and why every codec should have one

By Francesco Pedulli

Compress a small or already-compressed file with gzip and you can get back something bigger than the input - sometimes a few bytes, sometimes several hundred. The same is true of brotli, xz, zstd and 7-Zip: on incompressible input they still add their own headers, block framing and, in some formats, checksums. For most uses that is fine - if a file doesn't shrink, you keep the original.

But it is an annoying corner case for systems that compress everything unconditionally: object-storage backends, log shippers, telemetry collectors, backup systems. Every byte of overhead on incompressible data is a tax, and that tax shows up at scale.

Here are our own measurements on 1 MiB of random bytes (entropy near the maximum - nothing can compress it). The competitor numbers below are what we measured with default settings on this one input; they are illustrative, not a ranking, and are reproducible on request.

Codec             Output size (bytes)   Overhead
gzip -9           1,048,917             +341 B
brotli -q 11      1,048,581             +5 B
xz -9 -e          1,049,000             +424 B
zstd --ultra -22  1,048,591             +15 B
7-Zip "ultra"     1,049,234             +658 B
Pedulli           1,048,577             +1 B
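
For readers who want to reproduce the shape of this measurement without installing anything, here is a minimal sketch using Node.js's built-in zlib module. It covers only gzip and brotli (the two codecs from the table that Node ships), and the helper name measureOverhead is made up for illustration. Library output will not match the command-line byte counts above exactly, because the CLI tools write slightly different headers, so treat the result as a sanity check rather than a replication.

```javascript
const zlib = require("node:zlib");
const crypto = require("node:crypto");

// Hypothetical helper: compress `input` with each codec Node ships and
// report how many bytes larger (or smaller) each output is than the input.
function measureOverhead(input) {
  const outputs = {
    "gzip -9": zlib.gzipSync(input, { level: 9 }),
    "brotli -q 11": zlib.brotliCompressSync(input, {
      params: { [zlib.constants.BROTLI_PARAM_QUALITY]: 11 },
    }),
  };
  const report = {};
  for (const [name, out] of Object.entries(outputs)) {
    report[name] = { size: out.length, overhead: out.length - input.length };
  }
  return report;
}

// 1 MiB of cryptographically random bytes: effectively incompressible.
const input = crypto.randomBytes(1 << 20);
console.table(measureOverhead(input));
```

A positive number in the overhead column means the codec expanded the input, which is the effect the table illustrates.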

Pedulli's overhead is one byte: the engine tag 0x00 that tells the decoder "no engine ran, raw bytes follow." This is a guarantee about Pedulli's own output: never worse than input + 1 byte. The Orchestrator races xz, zstd, brotli and your data's SRD math and keeps the smallest verified output, so it is never larger than the best standard codec (worst case +1 byte for the tag). On structured data it wins outright; on data where a standard codec is already optimal, the racer simply selects that codec's output; on random data, as in the table, the raw fallback wins at input + 1.

Why does Pedulli get away with 1 byte?

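Two design choices: the only framing Pedulli itself adds is a one-byte engine tag, and the raw input always competes as a candidate, so the winning payload is never larger than the input.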

In code (paraphrased):

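function encode(input) {
  // Identity (engine 0) goes first, so an engine must be strictly smaller to displace it.
  const candidates = [
    { engine: 0, output: input },   // identity fallback always in the race
    ...engineRace(input),
  ];
  const winner = candidates.reduce((best, c) =>
    c.output.length < best.output.length ? c : best
  );
  // Frame = 1-byte engine tag followed by the winning payload.
  return [winner.engine, ...winner.output];
}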

Engine 0 is the identity fallback; its "output" is just the input bytes, and the 1-byte tag makes the total input + 1. Any engine producing strictly fewer bytes wins; otherwise raw wins. The contract is mechanical, not heuristic.
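
To make the contract concrete, here is a minimal, self-contained sketch of a tag-plus-payload frame with a matching decoder. It is not Pedulli's implementation: the ENGINES table, the tag values 1 and 2, the function names, and the use of Node's built-in deflate and brotli as stand-in engines are all assumptions chosen so the example runs without dependencies. Only the tag 0x00 for "raw bytes follow" comes from the text above.

```javascript
const zlib = require("node:zlib");
const crypto = require("node:crypto");

// Stand-in engines (assumption): tag -> { compress, decompress }.
// Tag 0 is reserved for the identity fallback, as described in the post.
const ENGINES = {
  1: {
    compress: (b) => zlib.deflateRawSync(b, { level: 9 }),
    decompress: (b) => zlib.inflateRawSync(b),
  },
  2: {
    compress: (b) => zlib.brotliCompressSync(b),
    decompress: (b) => zlib.brotliDecompressSync(b),
  },
};

function encodeFrame(input) {
  // Start from identity; an engine must be strictly smaller to replace it.
  let winner = { tag: 0, payload: input };
  for (const [tag, engine] of Object.entries(ENGINES)) {
    const payload = engine.compress(input);
    if (payload.length < winner.payload.length) {
      winner = { tag: Number(tag), payload };
    }
  }
  // Frame layout: [1-byte tag][payload]
  return Buffer.concat([Buffer.from([winner.tag]), winner.payload]);
}

function decodeFrame(frame) {
  const tag = frame[0];
  const payload = frame.subarray(1);
  if (tag === 0) return Buffer.from(payload); // raw bytes follow
  const engine = ENGINES[tag];
  if (!engine) throw new Error(`unknown engine tag ${tag}`);
  return engine.decompress(payload);
}

// One incompressible input and one highly repetitive input.
const randomInput = crypto.randomBytes(4096);
const textInput = Buffer.from("timestamp=1 level=info msg=ok\n".repeat(200));
for (const sampleInput of [randomInput, textInput]) {
  const frame = encodeFrame(sampleInput);
  console.log(`tag=${frame[0]} in=${sampleInput.length} out=${frame.length}`);
}
```

Because identity is the starting winner and the comparison is strict, the frame can never exceed the input by more than the tag byte, whichever engines are plugged in.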

Why this matters in practice

1. Backups of encrypted source data

Compress a directory containing encrypted containers or password-protected archives and the codec can't shrink them. Standard codecs add a handful to a few hundred bytes of overhead per file. Across 100,000 small encrypted files, at tens to hundreds of bytes each, that is megabytes of pure overhead - a measurable storage line item. Pedulli's is 100,000 x 1 byte, about 100 KB.

2. Object storage with "always compress on write"

Systems that run compression on every object regardless of compressibility pay overhead on every incompressible object (already-JPEG'd images, video, encrypted blobs). Pedulli's 1-byte tag makes enabling compression nearly free for objects that don't compress, while keeping the wins on those that do.

3. Telemetry / sensor batching

A batch often mixes a few KB of structured data with a fixed-size opaque field (a calibration blob, a signature, an encrypted token). The structured part compresses well; the opaque part not at all. Generic codecs add overhead per field; Pedulli adds 1 byte.

The honest caveat

The 1-byte tag covers the case where no engine wins. When an engine does win, its payload carries its own internal framing - but the dispatcher only selects an engine if its total output is strictly smaller than identity, so the result is still <= input + 1 byte. The "+1 byte on truly incompressible" floor sits right at the information-theoretic minimum: you cannot signal "this data is incompressible" in less than one bit, and a byte-aligned format rounds that up to a byte. This is entirely within Shannon's framework, not beyond it.
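
The reason some expansion is unavoidable is the standard counting argument, sketched here for bit strings. The symbols n and C are introduced only for this illustration and do not appear elsewhere in the post.

```latex
% Counting argument: no lossless codec can shrink every input.
\begin{align*}
\#\{\text{bit strings of length exactly } n\} &= 2^{n} \\
\#\{\text{bit strings of length} < n\} &= \sum_{k=0}^{n-1} 2^{k} = 2^{n} - 1 < 2^{n}
\end{align*}
% A lossless codec C must be injective, so it cannot map all 2^n inputs of
% length n into the 2^n - 1 strictly shorter strings: at least one input of
% every length fails to shrink.
```

A format that wants to shrink some inputs therefore needs a way to mark the ones it leaves alone. One flag bit is the least that marker can cost, and a byte-aligned container pays 8 bits for it, which is the +1 byte in the table.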

Should you switch?

If your storage is dominated by structured data (logs, telemetry, periodic archives), the structure-aware wins are the main reason, and the no-expansion floor is a small bonus. If it is dominated by incompressible data (mostly video / encrypted / already-JPEG'd), the floor itself saves you the per-object framing overhead. On small generic English text, brotli's large static dictionary - or xz on long text - is already among the racers, so the Orchestrator simply selects whichever is smallest and ties it at its own size (plus the tag byte at worst); you are never larger than the best standard codec by more than that.

See the +1 byte floor in action

Drop an incompressible file in the trial - you'll see +1 byte, byte-exact. Or check API pricing.

Built in Forli, Italy. EU-sovereign, GDPR by design.