Getting the core instruction set right is where the value of tinygrad lies.
The maximum tensor size is 6-D, needed for pool2d.
Unary Ops
===
These are the simplest to reason about, and have pointwise memory access.
A and B always have the same shape.
Forward : A -> B
Backward (binary): (B', A) -> A'
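
A minimal numpy sketch of the pattern (not tinygrad's implementation), using ReLU as the example op:

    import numpy as np

    def relu_forward(A):
        return np.maximum(A, 0)          # A -> B, pointwise

    def relu_backward(grad_B, A):
        return grad_B * (A > 0)          # (B', A) -> A', itself a pointwise binary op
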
Reduce Ops (with axis)
===
These take an axis argument. B is smaller than A.
Max and Sum are pretty different; do we really need Max?
Forward : A -> B
Backward : B' -> A'
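
A minimal numpy sketch of Sum with an axis argument (Max would also need the saved input to route gradients, which is part of why the two differ):

    import numpy as np

    def sum_forward(A, axis):
        return A.sum(axis=axis, keepdims=True)    # A -> B, B is smaller than A

    def sum_backward(grad_B, A_shape):
        return np.broadcast_to(grad_B, A_shape)   # B' -> A', expand back along the reduced axis
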
Binary Ops (with broadcasting)
===
Memory access is also pointwise.
Broadcasting adds complexity: the broadcast input is aliased.
Unbroadcasting for the grad is a sum over the broadcast axes, but it should be fused with the ternary op.
Forward : (A, B) -> C
Backward (ternary): (C', A, B) -> (A', B')
C.shape = max(A.shape, B.shape)
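
A minimal numpy sketch with Mul as the example op; the standalone unbroadcast helper is only for illustration, since the text above argues it should be fused into the ternary backward op:

    import numpy as np

    def unbroadcast(grad, shape):
        # sum the gradient back down to the pre-broadcast shape
        while grad.ndim > len(shape):
            grad = grad.sum(axis=0)
        for i, s in enumerate(shape):
            if s == 1 and grad.shape[i] != 1:
                grad = grad.sum(axis=i, keepdims=True)
        return grad

    def mul_forward(A, B):
        return A * B                                  # (A, B) -> C with broadcasting

    def mul_backward(grad_C, A, B):
        grad_A = unbroadcast(grad_C * B, A.shape)     # the ternary op reads C', A, B
        grad_B = unbroadcast(grad_C * A, B.shape)
        return grad_A, grad_B
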
Movement Ops (2 or 1)
===
Reshape, Transpose, Slice
Depending on your Tensor implementation, these are free.
Reshape is almost always free.
Slice can be made free, but probably shouldn't be.
Transpose is hard to make free except in trivial cases.
Regardless, these are "reindexings" of existing arrays.
Transpose and Slice are similar enough that I think they can be merged.
They should use a DMA engine.
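
A small numpy sketch of why these can be free: each one is just a new view (offset and strides) over the same buffer, and no data moves. A Tensor implementation that tracks strides gets the same behavior; one that always materializes data does not.

    import numpy as np

    A = np.arange(12, dtype=np.float32)
    R = A.reshape(3, 4)     # reshape: a new view, no copy
    T = R.T                 # transpose: still a view, but no longer contiguous
    S = R[:, 1:3]           # slice: a view with an offset and adjusted strides

    print(np.shares_memory(A, R), np.shares_memory(A, T), np.shares_memory(A, S))  # True True True
    print(R.flags['C_CONTIGUOUS'], T.flags['C_CONTIGUOUS'])                        # True False
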
Processing Ops (4)
===
Matmul is 1 matmul for forward, 2 for backward.
* All three are the same matmul, just with different operands transposed
* cublasSgemm()
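
A minimal numpy sketch of the three matmuls (cublasSgemm would be the GPU call for each one):

    import numpy as np

    def matmul_forward(A, B):
        return A @ B                   # C = A B

    def matmul_backward(grad_C, A, B):
        grad_A = grad_C @ B.T          # same gemm, B transposed
        grad_B = A.T @ grad_C          # same gemm, A transposed
        return grad_A, grad_B
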
Conv2D is very complex. It seems to need three kernels:
* cudnnConvolutionForward()
* cudnnConvolutionBackwardData()
* cudnnConvolutionBackwardFilter()
NOTE: Tensor Cores require that the tensors be in the NHWC data layout
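
A minimal single-channel, stride-1, no-padding numpy sketch of the three pieces; the cuDNN calls above are the batched, multi-channel, tensor-core versions of these loops:

    import numpy as np

    def conv2d_forward(x, w):
        # x: (H, W), w: (kH, kW); valid cross-correlation, stride 1
        kH, kW = w.shape
        oH, oW = x.shape[0] - kH + 1, x.shape[1] - kW + 1
        out = np.zeros((oH, oW), dtype=x.dtype)
        for i in range(oH):
            for j in range(oW):
                out[i, j] = np.sum(x[i:i+kH, j:j+kW] * w)
        return out

    def conv2d_backward_filter(grad_out, x, kshape):
        # dL/dw: correlate the input with the output gradient
        kH, kW = kshape
        grad_w = np.zeros(kshape, dtype=x.dtype)
        for i in range(grad_out.shape[0]):
            for j in range(grad_out.shape[1]):
                grad_w += grad_out[i, j] * x[i:i+kH, j:j+kW]
        return grad_w

    def conv2d_backward_data(grad_out, w, xshape):
        # dL/dx: scatter each output gradient back through the filter
        kH, kW = w.shape
        grad_x = np.zeros(xshape, dtype=w.dtype)
        for i in range(grad_out.shape[0]):
            for j in range(grad_out.shape[1]):
                grad_x[i:i+kH, j:j+kW] += grad_out[i, j] * w
        return grad_x
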