Someone posted their results for compressing the blockchain in the telegram group, this is what they were able to do:
- 165 GB - Blockchain size uncompressed
- 130 GB - tar.gz compressions
- 113 GB - 7zip ultra compression
Note, bitcoin by its nature is poorly compressible, as it contains a lot of incompressible data, such as public keys, addresses, and signatures. However, there's also a lot of redundant information in there, e.g. the transaction version, and it's usually the same opcodes, locktime, sequence number etc. over and over again.
I was curious and thought, how much could we actually compress the blockchain? This is actually very relevant: As I established in my previous post about the costs of a 1GB full node, the storage and bandwidth costs seem to be one of the biggest bottlenecks, and that CPU computation costs are actually the cheapest part, as were able almost to get away with ten year old CPUs.
Let's have a quick look at the transaction format and see what we can do. I'll have a TL;DR at the end if you don't care about how I came up with those numbers.
Before we just in, don't forget that I'll be streaming today again building a SPV node, as I've already posted about here. Last time we made some big progress, I think! Check it out here https://dlive.tv/TobiOnTheRoad. It'll start at around 15:00 UTC!
Version (32 bits)
There's currently two transaction types. Unless we add new ones, we can compress it to 1 bit (0 = version 1; and 1 = version 2).
Input/output count (8 to 72 bits)
This is the number of inputs the transaction has (see section 9 of the whitepaper). If the number of inputs is below 253, it will take 1 byte, and otherwise 2 to 8 bytes. This nice chart shows that, currently, 90% of Bitcoin transactions only have 2 inputs, sometimes 3.
A byte can represent 256 different numbers. Having this as the lowest granularity for input count seems quite wasteful! Also, 0 inputs is never allowed in Bitcoin Cash. If we represent one input with 00₂, two inputs with 01₂, three inputs with 10₂ and everything else with 11₂ + current format, we get away with only 2 bits more than 90% of the time.
Outputs are slightly higher, 3 or less 90% of the time, but the same encoding works fine.
Input (>320 bits)
There can be multiple of those. It has the following format:
- Previous output (288 bits): This specifies which output is being spent by this input. It's the transaction hash (32 bytes), plus the output index within that transaction (4 bytes).The transaction hash is usually a random number. How can we compress that? Well, we already know this transaction and its hash. If it's part of the blockchain, we can uniquely identify it by its block (via blockheight) and its position in that block. We'll list all transaction outputs of the block by whatever order is present for that block (TTOR/CTOR), and then encode the position of the output in that list. For the blockheight, we can use 4 bytes, which will allow 4 billion blocks (83,781 years) and for the transaction output we can either use 4 bytes (max. 430GB big blocks) or more, but we can use 1 bit as a flag once blocks get really large.So in total, this will be 8 bytes, or 64 bits. If transactions aren't in the blockchain yet but in the mempool, we can use something similar Xthinner uses by encoding only the prefix required to uniquely identify the transaction. But I haven't thought that through yet.
- Signature script (≥ 8 bits): As most scripts have a standardized format, we don't have to put the whole script into the input every time. Say there's seven standard formats, then we'd use 3 bits for the format. If all bits are set, the transaction will continue to be serialized in a standard way. The most common transaction type is Pay-To-Public-Key-Hash, which requires a signature and a public key. The signature currently uses the DER format, which can currently be up to 73 bytes, but just actually stores two 32 byte numbers, which can actually be represented with 512 bits.
The public key can be compressed to 257 bits, plus 1 to specify if it's compressed in the original transaction. Makes 258 bits. - Sequence (32 bits): For almost all programs, this is FF FF FF FF. We can use 1 bit to determine if it's that number (0), and encode the sequence number ordinarily if not (1).
- So in total that 838 bits per P2PKH input, or about 105 bytes.
Output (≥72 bits)
There can be multiple of those. They have the following format:
- Value in satoshis (64 bits): Currently, the first 13 bits of this amount will always be 0 (as there's a maximum of 21M BCH), but there's efforts to allow sub-satoshi values in Bitcoin, so we can't rely on that. Also, most of the time, transactions will send much less than 21,000,000 BCH, or even 21 BCH. Therefore, if we encode the sent value separately, we can use integer compression schemes, as outlined in this paper, which achieves between 6 and 16 bits per 32 bit integer (see Table 4, p. 21). I think it's save to assume we can get away with 16 bits if we implement that scheme here.
- Public key script (≥8 bits): Here, we can use the same trick as with the input, by having 3 bits for the format. For Pay-To-Public-Hash transactions, we only need 20 bytes or 160 bits for the address.
- So in total that is 179 bits per P2PKH output.
Lock time (32 bits)
This is FF FF FF FF most of the time and only occasionally transactions will be time-locked, and only change the meaning if a sequence number for an input is not FF FF FF FF. We can do the same trick as with the sequence number, such that most of the time, this will be just 1 bit.
Total
So, in summary, we have:
- Version: 1 bit
- Input/output count: two times 2 bits
- Input: 838 bits per input
- Output: 179 bits per output
- Lock time: 1 bit
Nice table:
| No. of inputs | No. of outputs | Uncompressed size | Compressed size | Ratio |
|---|---|---|---|---|
| 1 | 1 | 191 bytes (1528 bits) | 128 bytes (1023 bits) | 67.0% |
| 1 | 2 | 226 bytes (1808 bits) | 151 bytes (1202 bits) | 66.5% |
| 2 | 1 | 339 bytes (2712 bits) | 233 bytes (1861 bits) | 68.6% |
| 2 | 2 | 374 bytes (2992 bits) | 255 bytes (2040 bits) | 68.2% |
| 2 | 3 | 408 bytes (3264 bits) | 278 bytes (2219 bits) | 68.0% |
| 3 | 2 | 520 bytes (4160 bits) | 360 bytes (2878 bits) | 69.2% |
| 3 | 3 | 553 bytes (4424 bits) | 383 bytes (3057 bits) | 69.1% |
Interestingly, if we take a compression of 69%, if we were to compress the 165 GB blockchain, we'd get 113.8GB. Which is (almost) exactly the amount which 7zip was able to give us given ultra compression!
I think there's not a lot we can do to compress the transaction further, even if we only transmit public keys, signatures and addresses, we'd at minimum have 930 bits, which would still only be at 61% compression ratio (and missing outpoint and value). 7zip is probably also able to utilize re-using of addresses/public keys if someone sends to/from the same address multiple times, which we haven't explored here; but it's generally discouraged to send to the same address multiple times anyway so I didn't explore that. We'd still have signatures clocking in at 512 bits.
Note that the compression scheme I outlined here operates on a per transaction or per block basis (if we compress transacted satoshis per block), unlike 7zip, which compresses per blockchain.
I hope this was an interesting read. I expected the compression ratio to be higher, but still, if it takes 3 weeks to sync uncompressed, it'll take just 2 weeks compressed. Which can mean a lot for a business, actually.
I'll be streaming again today!
As I've already posted about here, I will stream about building an SPV node in Python again. It'll start at 15:00 UTC. Last time we made some big progress, I think! We were able to connect to my Bitcoin ABC node and send/receive our first version message. I'll do a nice recap of what we've done in that time, as there haven't been many present last time. And then we'll receive our first headers and then transactions! Check it out here: https://dlive.tv/TobiOnTheRoad.
Written by: eyeofpython
Source: http://bit.ly/2Ix93X4
No comments:
Post a Comment