Fast Transforms for Sparsity

There are a number of ways you can use the fast Walsh-Hadamard Transform (WHT) for sparse systems.
https://en.wikipedia.org/wiki/Hadamard_transform
Since a change in one single input element alters all the output elements one way or another, the transform provides full connectivity at a very low cost.
You can use it to provide full initial connectivity before a sparse net. A slight problem with that is the transform takes a spectrum of the input data. That can be dealt with by applying a fixed, randomly chosen pattern of sign flips to the input data before calculating the transform, which results in an effectively random projection. Sub-random projections are also possible.
https://www.kdnuggets.com/2021/07/wht-simpler-fast-fourier-transform-fft.html
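
To make that concrete, here is a minimal, unoptimized sketch (the names wht and random_projection are just my own, not from any library): an in-place fast WHT costing n·log2(n) add/subtract operations, with a fixed random sign-flip pattern applied first to give a cheap random projection.

import numpy as np

def wht(x):
    # In-place fast Walsh-Hadamard transform (unnormalized).
    # Length of x must be a power of two; cost is n*log2(n) adds/subtracts.
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n = 8
signs = rng.choice([-1.0, 1.0], size=n)   # fixed, randomly chosen sign-flip pattern

def random_projection(v):
    # Sign-flip then transform: spreads each input element over every output.
    return wht(signs * v)

v = np.zeros(n)
v[3] = 1.0                                # a single nonzero input element...
print(random_projection(v))               # ...shows up in all n outputs as +/-1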

You can also directly use the fast WHT to create neural networks. You can view the WHT as a fixed dense layer of weighted sums with a calculation cost of n·log2(n) add/subtract operations. That is far cheaper than a conventional dense weighted-sum layer, which costs n² fused multiply-add operations.
There is a big problem though… there is nothing to adjust. If you create a neural network using the WHT for the weighted sums you end up with a frozen neural network that does something. Who knows what, but something.
The solution is to use individually adjustable parametric activation functions.
Then you can have a complete neural network layer for n·log2(n) add/subtract operations and n multiplies, using 2n parameters, for example.
https://ai462qqq.blogspot.com/
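
As one possible reading of that (not the exact code behind the blog posts), a layer could be the fixed WHT followed by a two-slope activation per element, one slope for non-negative values and one for negative values. That gives 2n adjustable parameters and n multiplies on top of the n·log2(n) add/subtract operations. A sketch, reusing the wht function from the earlier snippet:

import numpy as np
# Reuses wht() from the earlier sketch.

class WHTLayer:
    # Fixed WHT "weighted sums" followed by individually adjustable
    # two-slope activations: f(y_i) = a_i*y_i if y_i >= 0 else b_i*y_i.
    # Per layer: n*log2(n) adds/subtracts, n multiplies, 2n parameters.
    def __init__(self, n, rng):
        self.a = rng.normal(0.0, 1.0, n)   # slopes for non-negative pre-activations
        self.b = rng.normal(0.0, 1.0, n)   # slopes for negative pre-activations

    def forward(self, x):
        y = wht(x.copy())                  # fixed, parameter-free weighted sums
        return np.where(y >= 0, self.a * y, self.b * y)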

I’m reluctantly on Twitter too:
https://twitter.com/SeanOCo07854298

I trimmed the Blogspot post down to the point where there is no information left on fast transform fixed-filter-bank neural networks.
Sorry. I did that to try to drive sales to something.
Here is an alternative:
https://archive.org/details/afrozenneuralnetwork

There is an x86 library for the fast Walsh-Hadamard transform.
https://github.com/FALCONN-LIB/FFHT
Since only add and subtract operations are used in the WHT, the time cost is dominated by data movement rather than actual compute, so processors with large L1/L2 caches are expected to do better.
Because of technical constraints most processors tend to have similar L1 cache sizes, so in practice you would decide based on L2 and maybe L3 cache size.
How the WHT performs on GPUs I really don’t know.

You can see from the example numbers given on GitHub that the FFHT code is quite speedy.
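
If I remember the repo correctly it also ships Python bindings, with the transform exposed as ffht.fht() operating in place on a power-of-two length array. Treat the call below as an assumption and check the README for the exact signature:

import numpy as np
import ffht   # Python bindings built from the FFHT repo above (assumed API)

a = np.random.randn(2**20).astype(np.float32)   # length must be a power of two
ffht.fht(a)                                     # in-place fast Walsh-Hadamard transform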

Then all you need to create a neural network is to sandwich parametric non-linear functions between WHTs, with the parameters providing the necessary adjustable behavior.

You might also apply a random or sub-random pattern of sign flips to the input of the net to stop the first WHT from producing an unwanted spectral response bias on natural inputs like images.
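
Putting the pieces together, here is a sketch of that recipe under my own naming (it builds on the wht and WHTLayer snippets above, and is only one way of arranging things): sign-flip the input, then stack WHT-plus-parametric-activation layers.

import numpy as np
# Builds on wht() and WHTLayer from the earlier sketches.

class FixedFilterBankNet:
    # Sign-flip the input, then alternate fixed WHTs with adjustable
    # parametric activations; only the activation parameters are trainable.
    def __init__(self, n, depth, rng):
        self.signs = rng.choice([-1.0, 1.0], size=n)  # counters spectral bias of natural inputs
        self.layers = [WHTLayer(n, rng) for _ in range(depth)]

    def forward(self, x):
        y = self.signs * x
        for layer in self.layers:
            y = layer.forward(y)
        return wht(y.copy())               # one extra fixed mixing stage at the end (a design choice)

rng = np.random.default_rng(1)
net = FixedFilterBankNet(n=64, depth=3, rng=rng)
out = net.forward(rng.normal(size=64))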

That is a very simple recipe really. However, you have to DIY everything, including the training algorithms.
You won’t find ready-made options in the standard Python neural network frameworks, for example.
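
Purely as an illustration of the DIY point (the posts above don't pin down a training algorithm, and this is certainly not the "official" one), here is a naive random-mutation hill climb over the activation parameters of the sketch network. You could just as well hand-code backpropagation, since the WHT itself is linear.

import numpy as np
# Illustrative only: hill-climbing over the 2n-per-layer activation
# parameters of the FixedFilterBankNet sketch defined earlier.

def train(net, inputs, targets, steps=1000, step_size=0.01, seed=0):
    rng = np.random.default_rng(seed)

    def loss():
        return np.mean([np.sum((net.forward(x) - t) ** 2)
                        for x, t in zip(inputs, targets)])

    best = loss()
    for _ in range(steps):
        layer = net.layers[rng.integers(len(net.layers))]
        da = rng.normal(0.0, step_size, layer.a.shape)
        db = rng.normal(0.0, step_size, layer.b.shape)
        layer.a += da
        layer.b += db                      # try a small random mutation
        new = loss()
        if new <= best:
            best = new                     # keep the improvement
        else:
            layer.a -= da
            layer.b -= db                  # otherwise revert
    return best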