Compression - 6 min read - 2026
Structure-aware compression: byte-exact, never larger than the best of xz/zstd/brotli
By Francesco Pedulli
In the standard mental model of compression, you have a file, you run it through gzip / brotli / zstd / xz, and you get back something smaller. How much smaller depends on how much redundancy the codec can find. Information theory gives a floor: the Shannon entropy of the source. General-purpose, entropy-coding compressors model the source as a stochastic process, estimate symbol probabilities, and code close to that floor.
Pedulli works differently on a narrower set of inputs. A structure-aware codec doesn't just count symbol frequencies - it looks for explicit structure: a period, a domain-specific closed form, a cross-reference graph. When that structure exists and is cheap to describe, the description can be much smaller than a frequency model would suggest, because the entropy of the source given that structure is low. This is fully consistent with information theory - it is entropy-aware, not "past Shannon". When the structure isn't there, the Orchestrator simply keeps the smallest output from the standard codecs it races, so it is never larger than the best of them (see below).
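To make the gap between counting symbol frequencies and describing structure concrete, here is a minimal Python sketch. The helper name `order0_entropy_bits` and the sample buffer are ours, chosen for illustration; they are not part of Pedulli. It measures the empirical order-0 (frequency-only) entropy of a buffer that repeats the 256 byte values in order. A pure frequency model sees 8 bits per byte, yet the buffer is fully described by one 256-byte period plus a length.

```python
import math
from collections import Counter

def order0_entropy_bits(data: bytes) -> float:
    """Empirical order-0 Shannon entropy in bits per byte (frequencies only)."""
    n = len(data)
    if n == 0:
        return 0.0
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

# Illustrative buffer: the 256 byte values in order, repeated 4096 times (1 MiB).
period = bytes(range(256))
buf = period * 4096

print(order0_entropy_bits(buf))          # 8.0: every byte value is equally frequent
print(order0_entropy_bits(bytes(1024)))  # 0.0: constant data, a single symbol
```

General-purpose codecs are not order-0 models, and LZ-style matching also finds this kind of repetition. The sketch only shows that the cost of a description depends on the model used to describe the data.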
Concrete example: a buffer of 1 MiB of zeros. Pedulli detects the period (here trivially period 1) and emits a small payload of the form:
0x01 01 0x00 0x00 0x00 0x10 0x00 0x00 0x00 0x00 0x00 0x00 0x00
^tag ^ ^seed ^------length------^ ^---expansion seed---^
period-1 (1 MiB = 0x100000)
The decoder reads the payload, reconstructs the 1 MiB of 0x00, and the result is byte-exact - the SHA-256 of the restored output matches the original. This is a constant-data case, the easiest possible input for any structure detector; we show it because it is fully reproducible, not because it is representative. The exact byte counts on this and the inputs below are reproducible on request.
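The roundtrip contract can be sketched in a few lines of Python. The layout used here (a 1-byte tag, a 4-byte little-endian period, an 8-byte little-endian length, then the seed bytes) is a hypothetical format chosen for readability; it is not Pedulli's wire format, and the names `encode_periodic`, `decode_periodic`, `expand` and `TAG_PERIODIC` are ours. The sketch only shows the idea: store one period plus the total length, expand on decode, and check SHA-256.

```python
import hashlib
import struct

TAG_PERIODIC = 0x01             # hypothetical tag value
HEADER = struct.Struct("<BIQ")  # tag, period, total length (13 bytes, no padding)

def expand(seed: bytes, length: int) -> bytes:
    """Repeat the seed and cut the result to exactly `length` bytes."""
    reps = length // len(seed) + 1
    return (seed * reps)[:length]

def encode_periodic(data: bytes, period: int) -> bytes:
    """Encode `data` as header + one period, or raise if it is not periodic."""
    if not 0 < period <= len(data):
        raise ValueError("period must be between 1 and len(data)")
    seed = data[:period]
    if expand(seed, len(data)) != data:
        raise ValueError("data does not repeat with this period")
    return HEADER.pack(TAG_PERIODIC, period, len(data)) + seed

def decode_periodic(payload: bytes) -> bytes:
    """Rebuild the original buffer from header + seed."""
    tag, period, length = HEADER.unpack_from(payload)
    if tag != TAG_PERIODIC:
        raise ValueError("unknown tag")
    seed = payload[HEADER.size:HEADER.size + period]
    return expand(seed, length)

original = bytes(1 << 20)                      # 1 MiB of 0x00
payload = encode_periodic(original, period=1)
restored = decode_periodic(payload)

# Byte-exact check via SHA-256, as described in the text.
assert hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
print(len(original), "->", len(payload), "bytes")
```

Under this assumed layout the 1 MiB buffer becomes a 14-byte payload (13 header bytes plus a 1-byte seed). The real payload size depends on Pedulli's own format.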
"But zeros are a degenerate case"
Correct - constant data is the easy extreme. The same machinery applies to inputs with real, discoverable structure. The numbers below are our own measurements on specific files; they are not published third-party results and not customer outcomes. We will share the exact samples and command lines on request.
- Apache / nginx access logs - highly repetitive line shapes (IP, timestamp, a handful of path prefixes). On our log sample the structure-aware engine's output was smaller than both xz -9 -e and brotli-11; the result is sample-specific and reproducible on request. On standard text corpora (enwik, Calgary) it is NOT smaller than xz; the Orchestrator ties xz by selecting it.
- Periodic time-series / IoT telemetry - regular polling intervals create exploitable periodicity, so long windows reduce to small payloads on our samples.
- Container-structured media (e.g. MP4) - even h.264-compressed video carries container-level structure (atom headers, sample tables) that generic codecs treat as noise; on our sealed MP4 sample Pedulli's output was a few tens of KB smaller than xz -9 -e. The sample and its SHA-256 are available for verification.
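For the periodic case in the list above, one textbook way to test a buffer for exact periodicity is the prefix function from Knuth-Morris-Pratt string matching. This is a generic sketch, not a description of how Pedulli detects periods, and the function name `smallest_period` is ours. It only handles exact repetition; real telemetry usually has timestamps or counters that change per record, which calls for more than this.

```python
def smallest_period(data: bytes) -> int:
    """Smallest p >= 1 such that data[i] == data[i - p] for every i >= p.

    Returns len(data) when the buffer has no shorter exact period,
    and 0 for an empty buffer.
    """
    n = len(data)
    if n == 0:
        return 0
    # fail[i] = length of the longest proper prefix of data[:i + 1]
    # that is also a suffix of it (the KMP prefix function).
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k and data[i] != data[k]:
            k = fail[k - 1]
        if data[i] == data[k]:
            k += 1
        fail[i] = k
    # The longest border of the whole buffer determines its smallest period.
    return n - fail[-1]

record = b"\x10\x20\x30\x40"             # illustrative fixed-size sample record
print(smallest_period(record * 1000))    # 4
print(smallest_period(bytes(1000)))      # 1: constant data
print(smallest_period(b"abcd"))          # 4: no period shorter than the buffer
```

The pure-Python loop is linear in the buffer length but slow for large inputs; it is meant to show the idea, not to be used on multi-gigabyte archives.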
In all cases the contract is the same: lossless, byte-exact roundtrip, verified by SHA-256. The Orchestrator is a best-of-N racer - it races xz, zstd, brotli and your data's SRD math and keeps the smallest verified output, so it is never larger than the best standard codec (worst case: the raw input plus a 1-byte tag). It can win outright where the data has structure the native path can exploit; on already-optimal or random data it ties the best codec by selecting it.
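The best-of-N idea can be sketched with Python's standard library alone. This is an illustration, not the Orchestrator: zstd and brotli are not assumed to be installed, so `zlib`, `bz2` and `lzma` (xz) stand in as candidates, there is no structure-aware path, and the tag values and the names `race` and `restore` are hypothetical.

```python
import bz2
import hashlib
import lzma
import os
import zlib

TAG_RAW = 0  # hypothetical tag: "raw, no engine ran"
# Hypothetical tag -> (compress, decompress) table of stand-in codecs.
CODECS = {
    1: (lambda b: zlib.compress(b, 9), zlib.decompress),
    2: (lambda b: bz2.compress(b, 9), bz2.decompress),
    3: (lzma.compress, lzma.decompress),  # xz container, default preset
}

def race(data: bytes) -> bytes:
    """Return tag byte + smallest candidate whose roundtrip is verified."""
    want = hashlib.sha256(data).digest()
    best = bytes([TAG_RAW]) + data  # fallback: raw input plus one tag byte
    for tag, (compress, decompress) in CODECS.items():
        out = compress(data)
        # Accept a candidate only if it is smaller AND restores byte-exactly.
        if len(out) + 1 < len(best) and hashlib.sha256(decompress(out)).digest() == want:
            best = bytes([tag]) + out
    return best

def restore(blob: bytes) -> bytes:
    """Invert race(): dispatch on the leading tag byte."""
    tag, body = blob[0], blob[1:]
    if tag == TAG_RAW:
        return body
    return CODECS[tag][1](body)

for name, sample in [("zeros", bytes(1 << 20)), ("random", os.urandom(1 << 16))]:
    blob = race(sample)
    assert restore(blob) == sample
    print(name, len(sample), "->", len(blob), "bytes, tag", blob[0])
```

By construction the output of this sketch is never more than one byte larger than the input, because the raw fallback is always a candidate. Note that here the tag costs one byte on every output, including compressed ones; how Pedulli accounts for its own tag is not specified in this article.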
When there's no extra structure to exploit
Honest answer: when there's no structure to find, the structure-aware path adds nothing - but the Orchestrator still races the standard codecs and keeps the smallest, so it ties the best of them rather than losing. The cases below are where the win comes from selecting an existing codec, not from a native transform.
- Truly random data (encrypted output, /dev/urandom): nothing compresses it. Pedulli returns input + 1 byte; the tag tells the decoder "raw, no engine ran". That is the +1 byte never-worse floor - a guarantee about Pedulli's own output, not a claim about other codecs on every input.
- Already-deeply-compressed media (JPEG, PNG, h.264): most redundancy is already gone, so the structure-aware path adds little and the Orchestrator ties the best codec by selecting it.
- Generic small English text (< ~100 KB): brotli-11's large static dictionary and xz both code this class tightly, so here the Orchestrator selects whichever is smallest and matches it byte-for-byte - a tie at the best codec's size, never larger.
The mental switch
If you're sizing a storage budget for data with regular structure - sensor / telemetry / IoT, application logs, time-series database blocks, scientific simulation outputs, genomics dumps, medical imaging archives, or cold archives of anything periodic - it is worth measuring rather than assuming your current codec is optimal.
Run a real sample through the free trial and read the actual number for your data. Because the racer includes the standard codecs as candidates, it is never larger than your current codec - worst case it ties at the same size. Where your data has structure, you've found a line item.
Try it on your own data
Drop a file in the browser trial for a byte-exact result, or check the API pricing.
Built in Forli, Italy. EU-sovereign, GDPR by design.