1
0
Fork 0
Commit Graph

519 Commits (e76c916978e8fb707752b2a0e29ef7dab899210e)

Author SHA1 Message Date
Jeff Moe e76c916978 more tensor docstrings 2023-12-07 09:57:38 -07:00
Jeff Moe df9e8aa28f rm import docstrings 2023-12-06 18:38:13 -07:00
Jeff Moe d7a7f29ed4 couple more fn 2023-12-06 09:06:30 -07:00
Jeff Moe 3a5f68059c partial docstring for tinygrad tensor 2023-12-05 16:38:08 -07:00
Jeff Moe 661dcc5ed0 Reformat, uh, everything, with black 2023-12-04 22:01:04 -07:00
qazal 4380ccb169
Non fp32 math (#2264)
* `global_load` and `global_store` using buffer dtype

* `UOps.PHI` in all dtypes

* `UOps.ALU` in all dtypes

* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes

* -- endof implementation --
+tiny lint changes

* these tests require the fp16 extention

you can run them locally to confirm they're green: (GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)

`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`

skip the new test_linearizer_failures in CI GPU because of the fp16 extention

This passes on a real GPU since the extention is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`

see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)

* these tests fail in CI due to segfaults and CPU crashes

To confirm they're green locally, you can run the following commands:

1. For the tests skipped in test_ops.py (note: CLANG is very slow)

`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`

2. For the ONNX tests skipped in CLANG:

```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```

3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific

`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`

* Revert "these tests fail in CI due to segfaults and CPU crashes"

This reverts commit 15db570143.

* merge with cleanup-vectorized-hip-renders

* barely working HIP P1, ALU ops need a refactor?

* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) and half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) but can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:

```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```

* update the sign definition to avoid division by zero in all dtypes

* diff cleanup p1: why were these in the diff anyways

* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!

add ALU ops overloads for HIP

this will make HIP max work

handle mod

Revert "handle mod"

This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.

update max to use hmax

add HIP GEP render logic

enable CIFAR fp16 benchmark

test ops for HIP

back to store as float because this only works for float4 grouping right now

test_ops for hip!!

always sign

* back to the sign we had before because we cant do a backward pass on a Less node

* remove old hacks

HIP compiling test_ops in CI takes ~9 mins, not doing it for now

new HIP ALUs

* reduce accs done right

* refactor to function

* no device hacks

hacks p2

the other way

* LLVM ALU ops

half, float and double are all float

update max

* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool

* cleanup LLVM wrong code

* dummy change for the CUDA install glitch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-03 13:45:49 -08:00
George Hotz d6b404ac11
No dtype alloc (#2570)
* fix all allocs

* improve docs

* ugh fix fake alloc
2023-12-02 13:29:40 -08:00
George Hotz 6733425095
lower schedule (#2559)
* lower schedule

* remove RAND, and don't put load in the JIT yet

* better fix for that test
2023-12-01 19:17:46 -08:00
chenyu 67f4e03724
rewrite 0 size loadop into a CONST (#2556)
* rewrite 0 size loadop into a CONST

* check alloc size

* EMPTY is better

* Revert "EMPTY is better"

This reverts commit 574fe0f9ed28f1b97da5a81afdfd2cd5d9a94ff9.

* no ast is created

* fix test
2023-12-01 18:29:06 -05:00
George Hotz 2c363b5f0b
new style device (#2530)
* cpu tests pass

* torch works

* works

* metal works

* fix ops_disk

* metal jit works

* fix openpilot

* llvm and clang work

* fix webgpu

* docs are rly broken

* LRU works on metal

* delete comment

* revert name to ._buf. LRU only on Compiled

* changes

* allocator

* allocator, getting closer

* lru alloc

* LRUAllocator

* all pass

* metal

* cuda

* test examples

* linearizer

* test fixes

* fix custom + clean realize

* fix hip

* skip tests

* fix tests

* fix size=0

* fix MOCKHIP

* fix thneed

* copy better

* simple

* old style metal copy

* fix thneed

* np reshape

* give cuda a device
2023-11-30 17:07:16 -08:00
chenyu e56511b59a
more type annotation for tensor and lazy (#2528)
* more type annotation for tensor and lazy

* don't need that
2023-11-30 17:50:22 -05:00
chenyu 5db0cdfbd3
support list of ints (or other Tensorable) in tensor indices (#2520)
* support list of ints (or other Tensorable) in tensor indices

* enable some index test cases
2023-11-30 12:46:33 -05:00
qtkite cb507a9389
Remove the toCPU copy (#2445)
* Remove the rawbuffer copy in runtime/lib.py on line 44

* remove buffer view

* added metadata back, oops

* delayed cpu testcase

* whitespace

* whitespace

* buffer behavior as is

* Update test_jit.py
2023-11-27 20:37:13 -08:00
George Hotz 9e07824542
move device to device.py (#2466)
* move device to device.py

* pylint test --disable R,C,W,E --enable E0611

* fix tests
2023-11-27 11:34:37 -08:00
Paul Gustafson 58b1dd463e
Add error code to type: ignore (#2451)
Co-authored-by: Paul Gustafson <paul.gustafson@theambrusgroup.com>
2023-11-26 13:04:10 -08:00
Tobias Fischer 5326bbc9a6
fix causal mask in Tensor class (#2425)
* fixed causal mask in Tensor class

* added tests for scaled attention against pytorch

* cleaned up test formatting

* removed duplicate test
2023-11-24 18:38:18 -05:00
George Hotz 8ff2e13550
From teeny (#2426)
* changes from teenygrad work

* support not supporting ImageDType/PtrDType

* fixups from teeny
2023-11-24 12:50:56 -08:00
George Hotz 12023b6824
onnx ops cleanup (#2413)
* onnx ops cleanup

* revert those
2023-11-23 18:39:49 -08:00
George Hotz e4026dc197
don't pass lazybuffer to rawbuffer (#2400)
* don't pass lazybuffer to rawbuffer

* tensor comments
2023-11-23 09:40:28 -08:00
George Hotz 4f8f0ac139
minor cleanups, remove dead files (#2398)
* minor cleanups, remove dead files

* s.name

* use disk

* pytest passes on mac
2023-11-23 09:01:50 -08:00
George Hotz 66c75f30c6
remove triton (#2396) 2023-11-23 07:40:59 -08:00
George Hotz cbb8486779
ResNet training changes (update benchmark) (#2390)
* default arg for chunk

* bring back to_

* good changes

* new set

* unused hash

* fix optim

* new torch loader

* fix test lr scheduler
2023-11-22 17:41:12 -08:00
chenyu 6add808f6a
support tuple shape input for rand and empty (#2367) 2023-11-19 20:20:39 -05:00
chenyu 03968622a2
Pretty multinomial (#2365)
* pretty multinomial

p, cdf_normalized -> weight, cdf
symmetric unsqueeze / squeeze
check num_sample > 0

TODO: how do we want to handle 0/0 in general?

* no 0-dim input

* single sum
2023-11-19 15:10:10 -05:00
George Hotz c7b38b324b
A beautiful MNIST training example (#2272)
* beautiful mnist

* beautiful mnist example

* from tinygrad import Tensor

* more beautiful

* the jit is super core tinygrad

* globalcounters reset on jit run

* symlinks and exclude

* beautiful_cartpole

* evaluate is it's own function

* no symlinks

* more beautiful

* jit reset for double speed

* type hinting for JIT

* beautiful_mnist gets 98%

* beautiful_mnist < 4s with BEAM=2

* better cartpole

* use actor critic

* zero_grad got lost

* delete double relu

* stable cartpole with PPO

* beautiful_cartpole is more beautiful

* REPLAY_BUFFER

* beautiful stuff typechecks

* None support in shape

* hp tuning
2023-11-17 19:42:43 -08:00
chenyu 74e6b6c9fc
types (#2346) 2023-11-17 18:49:24 -05:00
chenyu aa01a63b3f
cleanup of lines / unused / types (#2336) 2023-11-16 21:15:32 -05:00
George Hotz 3baaf298d6
two stage cumsum in tensor.py (#2331)
* two stage cumsum in tensor.py

* 2 more kernels for llama cumsum

* gpt-2 and llama use fast multinomial
2023-11-16 12:09:53 -08:00
chenyu 27f4c26312
fix getitem slice when end < start (#2329) 2023-11-16 11:20:27 -05:00
George Hotz 294e71de15
remove lines (unused code) (#2319)
* remove lines

* uhh, i'm tired

* that function never worked

* types for ast_parse
2023-11-15 14:36:11 -08:00
Marcello Fuschi b8d460d203
Add Tensor.multinomial (#2295)
* add Tensor.multinomial only with replacement

* add support for 2D input in Tensor.multinomial

* fix multinomial output shape

* allow passing replacement=False to Tensor.multinomial when num_samples=1

* improve tests for Tensor.multinomial

* fix edge case in Tensor.multinomial

* Tensor.multinomial no more staticmethod
2023-11-15 11:38:39 -08:00
chenyu 9a20bc08d6
Tensor(None) is Tensor([]) (#2316) 2023-11-15 13:49:18 -05:00
chenyu f1f863c953
allow 0-dim array to broadcast into zero shape tensor (#2315)
* allow 0-dim array to broadcast into zero shape tensor

* not in
2023-11-15 13:12:21 -05:00
chenyu 123a0b86b2
support zero in shape (#2303)
* zero in shape start

* no assert for that

* if output size is 0, return without exec

* tweak

* strides

* reduce over non-zero

* shrink and expand

* fix import

* test_elementwise where

* cannot reshape from size 0 to size 1

* compiled backend reduce over 0

* zeros for numpy

* reduce over 0 and keepdim resulted in 1

* reduce empty set default values

* compare with same input

* pad test case

* cat test case

* torch does not support that?
2023-11-15 11:57:48 -05:00
chenyu 175cdbe815
fix pad None will value (#2308) 2023-11-14 23:57:05 -05:00
rodfer 53c5baa8b6
add dilation to avg_pool2d (#2270)
* add dilation to avg_pool2d

* avg_pool_fix

* avg_pool_fix

* woo

* oops

* force it correct

---------

Co-authored-by: rodfer0x80 <rodfer0x80@proton.me>
Co-authored-by: zibokapi <zibokapi@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-13 08:47:56 -08:00
valar 123ea051e6
refactor/ci: delete many `# type: ignore` (#2281)
* refactor/ci: delete many `# type: ignore`

* replace `axis.__class__ is int` with `isinstance(axis, int)` to make mypy happy
* add `--warn-unused-ignores` to mypy flag

refs #2240

* ci: move `--warn-unused-ignores` flag to mypy config

refs #2240
2023-11-12 11:04:20 -08:00
chenyu 453f48ce02
pad None means (0,0) (#2273) 2023-11-11 09:50:26 -08:00
chenyu a753c8e071
examples of new GPT2 and JIT change (#2261)
* var_vals are global

* working with global ish

* better

* fix export model

* fix tests

* better kv cache

* does it run?

* use where for kvmask

* fix excessive var_vals

* fix import

* how does multigpu use this?

* llama kinda work

* faster and simpler

* cleanup

* fix conversation mode

* test cleanups

* fix one more test

* test cleanup

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-11-10 15:07:02 -05:00
George Hotz 8ba7ced7f9
extract const if it's const (#2193)
* extract const if it's const

* fix if statement

* fast math issue

* fix graphing and casting

* disable flaky copyout test
2023-10-31 18:52:35 -07:00
George Hotz b245f1307e
add exp2 (#2192) 2023-10-31 17:48:42 -07:00
George Hotz e0201922e3
Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
chenyu 9215bccb41
Tensor.uniform set default to standard uniform (#2158)
* Tensor.uniform set default to standard uniform

* clean up test to reuse function
2023-10-27 16:15:30 -04:00
George Hotz c5edb3c374
train value net, improve API, add BCE (#2047)
* api cleanups, BCE losses

* valuenet

* fixup examples

* learning okay

* add valuenet runner

* net improvements

* net improvements

* 40% win rate
2023-10-12 07:56:38 -07:00
George Hotz 41bfeb2c1e
start work on auto opt (#2034)
* start work on auto opt

* lin failure

* not beating hcopt

* greedy

* timing is fast

* codegen.search

* greedy search in handcode_opt

* track running gflops

* clean up those files

* no failure
2023-10-11 12:54:53 -07:00
qazal e40f141203
Refactor and add more unit tests for disktensors (#2022)
* testing with the test_ops pattern

* add assign test

* flake8 complaining about single line fn

* slice 2d and minor cleanup

* make assign_slice a one-liner

* we dont need to repeat the same lambda twice, default tinygrad_fxn to be np_fxn

* back assign fn for np array

* implement __setitem__ in tensor.py

* dont re-slice the ret tesnsor

* one liner assign

* drop the permute test
2023-10-09 18:46:29 -07:00
George Hotz cea4cbfc7a
move image+kopt to features (#2015)
* move image+kopt to features

* fix tests

* debug prints (unrelated)
2023-10-07 15:41:08 -07:00
George Hotz ffa33d743a
good changes from openpilot_compile2 (#2000)
* good changed from openpilot_compile2

* float32 image type was wrong

* cleaner way to write that + a test
2023-10-06 13:33:24 -07:00
George Hotz de5d603ec1
corealize + remove realize from lazybuffer (#1968)
* corealize + remove realize from lazybuffer

* fix multigpu

* fix graph
2023-10-04 10:59:31 -07:00
George Hotz d449b3bef1
think about removing realize from lazybuffer (#1965)
* remove realize from lazybuffer

* okay fine, back that off

* fix tests maybe

* fix test
2023-10-04 07:18:58 -07:00