Just adding here what I said in another thread, namely that both modern Intel x86 CPUs (Alder Lake, 2021) and GPUs have support for half-precision (16-bit) floats. gcc supports this via the _Float16 primitive type.
@thardin how much slower would an algo be if it only kept part of the data in RAM as it worked?
say it kept a 1 TB matrix on an ssd and processed it in batches.
@joe 1 TiB is small enough to keep entirely in RAM 🙂 but in general you would just mmap() the file(s) in which the matrix is stored. Solving linear programs boils down to lots of matrix-vector multiplications. Storage + bandwidth will be the limiting factor. Either you have lots of fast disks like M.2 drives or you use the best InfiniBand cards available and lots of them and stream the matrix instead.
A socialist economy would likely find it beneficial to build a machine that does nothing but matrix-vector multiply, fed with oodles of fiber optics. InfiniBand can already do something similar if I'm not mistaken: basically RDMA with the ability to do linear algebra on the numbers coming in over the cable without involving the CPU. Think a network card with lots of RAM that can do multiply+add on incoming data. So MPI_Reduce(), but in hardware. Feed that from a huge database.
Can't you do matrix multiplication with analog electronic systems? Maybe there's some room for optimizing hardware there as is happening in AI fields.
@casperadmin I've seen papers suggesting reviving analog computation using modern semiconductor fabrication techniques. There are some experimental chips on the market, but the performance was lackluster last time I looked. An array of analog multipliers and integrators could be made to work quite fast for dense linear algebra. For sparse systems I think it's much trickier.
On the AI accelerator end you have things like Nvidia's tensor cores. But again the problem is that you need to actually get the data into and out of the cards. The computation is cheap relative to shuffling the data around, because each non-zero is used only once per matrix-vector multiply, you can't keep the whole matrix in memory, and you don't really benefit from caching. Besides, your biggest speedups are going to come from things like better preconditioners.
@casperadmin Interesting paper. It implements dense matrix-vector multiply and achieves about 1% accuracy which is what I'd expect and is perfectly adequate for planning. Unfortunately we want to do sparse matrix-vector multiply which doesn't benefit from such hardware.
Ah I see. In that case I did some more research and I found these papers which suggest using a cross-point resistive memory array that appears to work well for sparse matrices. I'd be curious what your thoughts are.
@casperadmin Using op-amps to solve linear systems is an interesting idea, and memristor technology is promising. As an electronics guy I can foresee some problems setting up a thing with 100,000,000,000 op-amps that could outcompete a digital equivalent. You also need a switching fabric of some kind to be able to realize the sparsity pattern.
I had a brain wave yesterday and realized that an efficient storage method is perhaps to store a path that visits all non-zeroes as a series of hopefully small deltas. To get from {10, 10} to {13, 9} we use {3, -1}.
If non-zeroes are somewhat clustered then we expect the distances between them to be small. This amounts to solving a relaxed version of the Traveling Salesman Problem. We don't need an optimal solution, just a not-terrible one.
Once we have a decent path we store it in a particular way on disk, compressed using some off-the-shelf fast compressor like lz4.
Partition the matrix into submatrices no larger than 2^32 x 2^32. Let each entry in RAM be struct { _Float16 value; int32_t delta_i, delta_j; } packed, or 10 B. If we guess the number of non-zeroes per column to be q = 160 then this is no more than 3.2 TiB which fits comfortably in 12 TiB of RAM. If the matrix is still too big then subdivide it. A matrix of say 200,000,000,000x100,000,000,000 is then split up into 5000 such submatrices.
Transform the deltas to an unsigned form for storage. 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4. This way the MSBs are almost always zero.
The final piece of this proposal is to "transpose" the data, storing each byte (or bit) of the struct in its own file. This operation is sometimes called swizzling. Files that would be all zeroes could be omitted.
Because the values are normalized we expect lots of redundancy in the significant bits (except sign), compressing probably to somewhere around 12 bits on average. For the deltas it's harder to know. If they're small enough that most deltas are in the range {-128, -128} .. {127, 127} then perhaps 2x8 bits. Total: 28 bits.
Matrix-vector multiply in this case involves unpacking the data first, but fast formats like lz4 decompress at speeds approaching memcpy(), so that shouldn't be the bottleneck.
@thardin so those deltas are the distances between non-zeros in the matrices? That's really clever. Couldn't you store the same info with fewer pairs by looking at the edges between zero and non-zero data?
So like you have an array of
9 8 3 0 0 0 0 75
You'd only need the position pair {4, 7} (1-based) to encode the run of zeros, so instead of mapping all the non-zeros, we'd only need to map the zeros that come before or after non-zeros.
Sorry if this is something you've already thought of, but an idea I've had after reading this.
@casperadmin That'd be a kind of run length encoding. It would certainly work, but I suspect due to the dominance of zeroes over non-zeroes that it won't be as efficient. Imagine how a matrix with every non-zero near the diagonal would be coded with your scheme vs mine.
It also strikes me that one could encode the deltas as the index into a spiral centered on the current non-zero. That way you only get a single value that is likely small, instead of two small 32-bit values. Even more optimal would be to omit already-encoded non-zeroes from the enumeration in that spiral, but I think that would be computationally too costly.