I've been thinking about some methods of storing a large sequence of primes in a compact manner, while requiring trivial computation to "extract" them.

For example, say I want to store all primes up to 2^32. The simplest way would be a file of 32-bit words, which would take up ~775.5 MiB:

Code:

? primepi(2^32)*4/1024/1024.
%5 = 775.45250320434570312500000000000000000

Now I'm just exploring some ideas using PARI to get some ballpark estimates. Some of the cases below would technically require some additional metadata, like a count of how many primes exist at each bit length, but this should be mostly negligible, so I'm ignoring that small added cost for simplicity.

So, one simple way to compact this would be to split it into separate chunks by byte range: store all primes < 2^8 as one byte each, then primes < 2^16 as 2 bytes each, etc.:

Code:

\\PARI
? (primepi(2^8)+2*(primepi(2^16)-primepi(2^8))+3*(primepi(2^24)-primepi(2^16))+4*(primepi(2^32)-primepi(2^24)))/1024/1024.
%8 = 774.41827487945556640625000000000000000

Ok, so we saved one MiB or 0.133% lol, not great.
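To make the byte-chunked layout concrete, here's a minimal Python sketch (function names are mine; the per-section counts are exactly the metadata mentioned earlier) that packs a sorted prime list into 1/2/3/4-byte sections and reads it back:

```python
def pack_chunked(primes):
    # Split primes into sections by byte width: 1, 2, 3, or 4 bytes each.
    sections = {w: bytearray() for w in (1, 2, 3, 4)}
    counts = {w: 0 for w in (1, 2, 3, 4)}
    for p in primes:
        w = (p.bit_length() + 7) // 8
        sections[w] += p.to_bytes(w, "little")
        counts[w] += 1
    return counts, b"".join(bytes(sections[w]) for w in (1, 2, 3, 4))

def unpack_chunked(counts, data):
    primes, off = [], 0
    for w in (1, 2, 3, 4):
        for _ in range(counts[w]):
            primes.append(int.from_bytes(data[off:off + w], "little"))
            off += w
    return primes

primes = [2, 3, 5, 7, 251, 257, 65521, 65537, 4294967291]
counts, blob = pack_chunked(primes)
assert unpack_chunked(counts, blob) == primes
assert len(blob) == 16   # vs. 9 * 4 = 36 bytes as plain 32-bit words
```

This relies on the input being sorted, so each section preserves the overall prime order.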

What if we take it further to bit-packing and splitting our primes among every power of 2?

Code:

? sum(n=1,32,n*(primepi(2^n)-primepi(2^(n-1))))/8/1024/1024.
%10 = 749.43927836418151855468750000000000000

Now we save about 26MiB, or 3.35% over the original size. Marginally better.

Since all primes > 2 will be odd, let's say we save that last bit and just store our values as x, where p = 2x + 1:

Code:

? sum(n=1,32,(n-1)*(primepi(2^n)-primepi(2^(n-1))))/8/1024/1024.
%11 = 725.20638763904571533203125000000000000

Now we save about 50MiB or 6.48%. Still not particularly compelling.
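Here's a hedged Python sketch of that bit-packed x = (p - 1)/2 layout (names are mine; primes assumed sorted, with 2 re-inserted by the reader as a special case). Each n-bit prime costs n - 1 bits:

```python
def pack_odd(primes):
    bits = []       # '0'/'1' strings, for clarity over raw bit twiddling
    counts = {}     # n -> number of n-bit primes stored (the metadata)
    for p in primes:
        if p == 2:
            continue                 # implicit; the reader re-inserts it
        n = p.bit_length()
        x = (p - 1) // 2             # x < 2^(n-1), so it fits in n-1 bits
        bits.append(format(x, "0{}b".format(n - 1)))
        counts[n] = counts.get(n, 0) + 1
    return counts, "".join(bits)

def unpack_odd(counts, bitstream):
    primes, pos = [2], 0
    for n in sorted(counts):
        for _ in range(counts[n]):
            x = int(bitstream[pos:pos + n - 1], 2)
            pos += n - 1
            primes.append(2 * x + 1)
    return primes

primes = [2, 3, 5, 7, 11, 13, 251, 257, 65521]
counts, stream = pack_odd(primes)
assert unpack_odd(counts, stream) == primes
```

A real implementation would pack the bits into bytes, but the accounting is the same.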

So, what if we just store the gap between adjacent primes? According to Wikipedia, the largest gap for p < 2^32 is 336. This can be held within 9 bits.

Code:

? 9*primepi(2^32)/8/1024/1024.
%14 = 218.09601652622222900390625000000000000

With this we would compress the size by 71.9%.
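A quick sketch of what the fixed 9-bit gap stream could look like (Python, names hypothetical): the first prime is stored in full, then each gap in 9 bits, and decoding is just masks, shifts, and adds:

```python
GAP_BITS = 9  # 336 < 2^9, so 9 bits per gap suffice below 2^32

def encode_gaps(primes):
    # Pack each gap into GAP_BITS bits of one big integer, LSB-first.
    stream, shift = 0, 0
    for a, b in zip(primes, primes[1:]):
        stream |= (b - a) << shift
        shift += GAP_BITS
    return primes[0], len(primes) - 1, stream

def decode_gaps(first, ngaps, stream):
    primes, p = [first], first
    mask = (1 << GAP_BITS) - 1
    for _ in range(ngaps):
        p += stream & mask           # add the next gap
        stream >>= GAP_BITS
        primes.append(p)
    return primes

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
assert decode_gaps(*encode_gaps(primes)) == primes
```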

And again, since all prime gaps after (2, 3) are multiples of 2, we can actually shave off another bit per stored gap, assuming the extracting program handles the initial pair 2, 3 specially:

Code:

? 8*primepi(2^32)/8/1024/1024.
%15 = 193.86312580108642578125000000000000000

This is exactly one byte per gap, bringing us down to 25% of the original size, a 75% saving. Not bad!
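Since the halved gaps below 2^32 fit in exactly one byte, the stream can literally be one byte per gap. A minimal Python sketch (names mine), with the initial 2, 3 pair special-cased as described:

```python
def encode_half_gaps(primes):
    # primes must start [2, 3, ...]; store (gap // 2) for each gap after 3.
    assert primes[:2] == [2, 3]
    return bytes((b - a) // 2 for a, b in zip(primes[1:], primes[2:]))

def decode_half_gaps(data):
    primes, p = [2, 3], 3
    for half_gap in data:
        p += half_gap << 1           # gap = 2 * half_gap; a shift and an add
        primes.append(p)
    return primes

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
assert decode_half_gaps(encode_half_gaps(primes)) == primes
```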

We could extend this further by splitting the primes into ranges based on the bitsize of the maximum gaps in the range. I haven't yet attempted to estimate the size for such a scheme, but it might save a few more percent in the total size.

One limitation of this is that extraction is now restricted to sequential access, whereas before we could have random access if needed. That is, unless we implement some sort of "keyframe" concept, where we periodically store the full prime value in the file; this costs additional metadata, but could be useful in some scenarios.
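To illustrate, here's a hypothetical sketch of the keyframe idea in Python: every K-th odd prime is stored in full in a small side index, so random access to the n-th prime only has to sum at most K - 1 halved gaps from the nearest keyframe (K = 4 here just for the demo; a real file would use a much larger interval):

```python
K = 4  # keyframe interval; much larger in a real file

def build_index(primes):
    # half_gaps[i] connects odd prime i to odd prime i + 1 (odd prime 0 is 3)
    odd = primes[1:]
    half_gaps = bytes((b - a) // 2 for a, b in zip(odd, odd[1:]))
    keyframes = odd[::K]             # every K-th odd prime stored in full
    return half_gaps, keyframes

def nth_prime(n, half_gaps, keyframes):
    # n is 0-based; nth_prime(0) == 2 is the special case
    if n == 0:
        return 2
    m = n - 1                        # index among the odd primes
    j = m // K                       # nearest keyframe at or before m
    p = keyframes[j]
    for i in range(j * K, m):
        p += half_gaps[i] << 1       # walk forward gap by gap
    return p

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41]
hg, kf = build_index(primes)
assert [nth_prime(n, hg, kf) for n in range(len(primes))] == primes
```

The keyframe array adds roughly one full prime per K gaps of metadata, so the overhead shrinks as K grows, at the cost of a longer walk per lookup.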

I am wondering if any sort of trial-factoring application, or something like that, would have a use for this kind of purely sequential representation that can be quickly streamed out? Also, since I'm guessing I'm not the first person in the world to think about such things, are there any names or references for these sorts of domain-specific prime compression concepts?

Any other thoughts on ways to further compress this information, while still limiting the extraction complexity to simple, fast operations? (All of the above ideas should be doable with essentially just binary shifts and additions.) Of course, LZ compression or something like that could probably compact things even further, but then it becomes a question of whether you are even saving any computation over sieving it yourself.

I used the limit of 2^32 as a simple example that's easy to calculate, but of course I'd be interested in potentially storing more. Some exercises, in case anyone else finds this interesting:

Using your best imagined storage scheme, how compactly could all p < 2^64 be stored?

Or alternatively, what's the largest n where all p < n could be stored within 1GB, or even say 1TB, etc?