It also strikes me that one could encode the deltas as the index into a spiral centered on the current non-zero. That way you only get a single 64-bit value that is likely small, instead of *two* small 32-bit values. Even more optimal would be to omit already-encoded non-zeroes from the enumeration in that spiral, but I think that would be too computationally costly.

So like you have an array of

9 8 3 0 0 0 0 75

You'd only need the position pair {4,7} to compress the nonzero data, so instead of mapping all the nonzeros, we'd only need to map the zeros that come before or after nonzeros.

Sorry if this is something you've already thought of, but it's an idea I had after reading this.

If non-zeroes are somewhat clustered then we expect the distances between them to be small. This amounts to solving a relaxed version of the Traveling Salesman Problem: we don't need an optimal solution, just a not-terrible one.

Once we have a decent path we store it in a particular way on disk, compressed using some off-the-shelf fast compressor like lz4.

Partition the matrix into submatrices no larger than 2^32 × 2^32. Let each entry in RAM be struct { _Float16 value; int32_t delta_i, delta_j; } packed, or 10 B. If we guess the number of non-zeroes per column to be *q = 160*, then this is no more than 3.2 TiB, which fits comfortably in 12 TiB of RAM. If the matrix is still too big then subdivide it further. A matrix of, say, 200,000,000,000 × 100,000,000,000 is then split up into 5000 such submatrices.

Transform the deltas to an unsigned form for storage: 0, -1, 1, -2, 2 → 0, 1, 2, 3, 4 (so-called zigzag encoding). This way the MSBs are almost always zero.

The final piece of this proposal is to "transpose" the data, storing each byte (or bit) of the struct in its own file. This operation is sometimes called *swizzling*. Files that would be all zeroes could be omitted.

Because the values are normalized we expect lots of redundancy in the significant bits (except sign), probably compressing to somewhere around 12 bits on average. For the deltas it's harder to know. If they're small enough that most deltas are in the range {-128, -128} .. {127, 127} then perhaps 2×8 bits. Total: 28 bits.

Matrix-vector multiply in this case involves unpacking the data, which for formats like lz4 *can be faster than memcpy()*.

@casperadmin Interesting paper. It implements dense matrix-vector multiply and achieves results accurate to about 1%, which is what I'd expect and is perfectly adequate for planning. Unfortunately we want to do sparse matrix-vector multiply, which doesn't benefit from such hardware.

Ah, I see. In that case I did some more research and found these papers, which suggest using a cross-point resistive memory array that appears to work well for sparse matrices. I'd be curious what your thoughts are.

https://ieeexplore.ieee.org/abstract/document/9095220

https://ieeexplore.ieee.org/abstract/document/8510634

On the AI accelerator end you have things like Nvidia's tensor cores. But again, the problem is that you need to actually get the data into and out of the cards. The computations are relatively fast compared to shuffling the data around, because each non-zero is only used once in each matrix-vector multiply, and you can't keep the whole matrix in memory, nor do you really benefit from caching. Besides, your biggest speedups are going to come from things like better preconditioners.

https://ieeexplore.ieee.org/document/9252918

@joe 1 TiB is small enough to keep entirely in RAM :) but in general you would just mmap() the file(s) in which the matrix is stored. Solving linear programs boils down to lots of matrix-vector multiplications, so storage and bandwidth will be the limiting factor. Either you have lots of fast disks like M.2 drives, or you use the best InfiniBand cards available, lots of them, and stream the matrix instead.


Can't you do matrix multiplication with analog electronic systems? Maybe there's some room for optimizing hardware there, as is happening in AI fields.

A socialist economy would likely find it beneficial to build a machine that does nothing but matrix-vector multiply, fed with oodles of fiber optics. InfiniBand can already do something similar if I'm not mistaken: basically RDMA with the ability to do linear algebra on the numbers coming in over the cable without involving the CPU. Think a network card with lots of RAM that can do multiply+add on incoming data, so MPI_Reduce() but in hardware. Feed that from a huge database.
