Commit Graph

77 Commits (deepcrayon)

Author SHA1 Message Date
James Roberts 0d405fd5bc
Parallelize CI tests (#535) 2023-02-06 15:27:44 -06:00
George Hotz 90529d3750
tests are 20% faster (#529)
* pytorch CPU

* no cache, it's slower

* pytorch cpu for real

* remove double onnx
2023-02-06 09:56:14 -06:00
George Hotz 039de1b332 oops, pytest is for testing 2023-02-06 09:30:12 -06:00
George Hotz 6eb0e6a650 shuffle deps: always tqdm, make linting category 2023-02-06 09:27:01 -06:00
Jacky Lee ad4f6aa2cf
Add test for quick_gelu (#526)
* Add test for quick_gelu

* Bump PyTorch version for approximate
2023-02-03 20:01:39 -08:00
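For context on what #526 exercises: quick_gelu is the cheap GELU approximation x * sigmoid(1.702 * x) (the variant popularized by GPT/CLIP), and the `approximate=` argument to `torch.nn.functional.gelu` only exists in newer PyTorch releases, which is presumably what forced the version bump. A minimal NumPy reference for the function under test (an illustration, not the actual test code):

```python
import numpy as np

def quick_gelu(x: np.ndarray) -> np.ndarray:
    # QuickGELU: x * sigmoid(1.702 * x), a cheap approximation of exact GELU
    return x * (1.0 / (1.0 + np.exp(-1.702 * x)))

x = np.linspace(-3.0, 3.0, num=7, dtype=np.float32)
print(quick_gelu(x))  # near zero for negative inputs, close to x for large positive ones
```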
George Hotz cd97b036cc
A Triton backend for tinygrad (#470)
* triton can add

* print stuff from triton

* write out file

* ops triton working

* reduce ops

* sort of works

* Triton bugfixes & implementation of remaining ops (#490)

* padding

* support pow, max, relu, gt0

* allocate return buffer

* Fix reduce

* Add tests for power op

* Fix triton illegal memory accesses and memory leak (#512)

* Fix mypy issue

* Add triton to setup.py

* Replace torch with pycuda

* Use one cuda stream for data transfer and kernels

* Remove triton submodule

* Fix memory leak by using weakrefs for caching

* Fix memory access by adding valid as mask for load (see the sketch after this entry)

* Fix invalid kernel launches by flattening the grid (#515)

---------

Co-authored-by: Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
2023-02-01 11:53:57 -08:00
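Two bullets above name standard Triton techniques, sketched here for context (illustrative code, not the kernel from #470). Passing a "valid" mask to `tl.load`/`tl.store` keeps out-of-bounds lanes from ever touching memory, and a flattened 1-D grid avoids invalid launch dimensions; for the leak, the usual fix is holding cache entries via `weakref` (e.g. `weakref.WeakValueDictionary`) so cached objects can still be garbage collected.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    valid = offs < n                       # mask off lanes past the end of the buffer
    x = tl.load(x_ptr + offs, mask=valid)  # masked lanes never read memory
    y = tl.load(y_ptr + offs, mask=valid)
    tl.store(out_ptr + offs, x + y, mask=valid)

n = 1000
x, y = torch.randn(n, device="cuda"), torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK=256)  # flattened 1-D grid
```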
George Hotz bd8a5c2ced
Simple CUDA Runtime (#480)
* factor out opencl runtime

* don't use CL outside the runtime

* cuda runtime adds (see the sketch after this entry)

* final_dimension

* tests pass with CUDA backend

* more cuda

* cuda simpler

* retain old functionality

* linter and typing

* move globalcounters out of runtimes

* oops, GlobalCounters in cuda

* MAX_OUTPUT_SHAPE=3 is fine for CUDA
2023-01-27 16:26:24 -08:00
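For a sense of what the "cuda runtime adds" bullet involves at the lowest level: with pycuda you compile CUDA C at runtime and launch the kernel by hand. A generic sketch of that mechanism, not the code from #480:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a context on the default GPU)
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add(float *out, const float *a, const float *b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a[i] + b[i];
}
""")
add = mod.get_function("add")

n = 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
out = np.empty_like(a)
add(cuda.Out(out), cuda.In(a), cuda.In(b), np.int32(n),
    block=(256, 1, 1), grid=(n // 256, 1))  # In/Out handle the host<->device copies
```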
Jacky Lee 026ba78526
Add commit hooks (#478)
* Add pre-commit hook

* We need ret

* Fix some type definitions
2023-01-26 22:24:31 -08:00
George Hotz bfd4f4e35c testdocker 2023-01-09 12:41:52 -08:00
George Hotz b8c94a67c9
Simple chonker (#431)
* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (#428)

* ASTKernel class

* Make tinygrad work with older python version (#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial (see the sketch after this entry)

* simple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
2022-11-10 23:17:09 -08:00
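On the "Use partialmethod instead of partial" bullet above: `functools.partialmethod` implements the descriptor protocol, so a partially-applied function assigned as a class attribute still binds `self` when called on an instance, which a plain `functools.partial` does not. A toy version of the pattern (hypothetical names, not tinygrad's actual op registration):

```python
from functools import partialmethod

class Tensor:
    def _binary_op(self, other, op_name):
        # stand-in for dispatching to a real backend op
        return f"{op_name}({self!r}, {other!r})"

    # partialmethod is a descriptor, so `self` binds correctly here;
    # `add = functools.partial(_binary_op, op_name="add")` would not
    add = partialmethod(_binary_op, op_name="add")
    mul = partialmethod(_binary_op, op_name="mul")

print(Tensor().add(3))  # -> "add(<__main__.Tensor object at ...>, 3)"
```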
George Hotz 92ed87b0a5 bump version to 0.4.0 2022-11-08 08:44:42 -08:00
George Hotz b132de677d
tinygrad.nn (#367)
* tinygrad.nn

* flake8

* working on pylint

* more pylint

* more pylint

* pylint passes

* networkx

* mypy can't infer that type

* junk
2022-08-18 07:41:00 -07:00
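As an illustration of the kind of layer a tinygrad.nn module collects, a minimal fully-connected layer over tinygrad Tensors might look like the sketch below; this is hypothetical, and the classes actually added in #367 (and their exact API) may differ:

```python
from tinygrad.tensor import Tensor

class Linear:
    # hypothetical nn-style layer, not necessarily what #367 shipped
    def __init__(self, in_features: int, out_features: int):
        self.weight = Tensor.uniform(in_features, out_features)
        self.bias = Tensor.zeros(out_features)

    def __call__(self, x: Tensor) -> Tensor:
        return x.dot(self.weight) + self.bias

out = Linear(784, 10)(Tensor.uniform(32, 784))  # batch of 32 flattened MNIST images
```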
Nicklas Boman 64d986bc8b
add mypy to ci testing (#353) 2022-07-03 15:11:35 -07:00
George Hotz 0d82cfd587 huh, torch 1.12 broke it. remove unused requirements.txt and pin torch 1.11 2022-07-02 23:07:59 -07:00
George Hotz a710b3a210 it's a real test now 2022-06-11 11:33:33 -07:00
George Hotz 8440dbfa5d support inputs 2022-06-11 11:21:45 -07:00
George Hotz 082089d1c7 install requires pillow 2021-10-30 16:00:33 -07:00
Liam bcf1518309
All devices are equal! (#196)
* Update all devices to be tested

ANE, CPU and OCL all now support all tests.

However, tests are not currently passing on GPU, and I cannot test on ANE.

The failing GPU tests are not an issue caused by this update; they have not
been passing due to a missing "six" installation requirement.

OpenCL tests have not been run since commit 1a1c63a08b.

Devices have 3 types and are handled by a new DeviceTypes enum. (The goal
is to revert to Tensor.<type>, but this current setup allows for keyword
argument defaults: `device=DeviceType.CPU`; see the sketch after this entry.)

All references to Tensor.GPU/CPU/ANE have been converted to the
corresponding `DeviceTypes` enum.

Refactor the conversion code to allow any-device-to-any-device
conversion.

* Add six dependency in requirements.txt

* Resolve failure to run tests

Move six into gpu required installs. Remove six from standard
installation.

* Remove repeated data conversion

* Refactor method names

Also reduce code with .to and .to_

* Dynamic device handlers

* Refactor DeviceTypes -> Device

* Add mem copy profiling back

* test_backward_pass_diamond_model passing

* Resolve Sum issue on GPU

* Revert batchnorm2d tests

* Update README with updated API

* ANE testing with

* Last minute line gains
2020-12-15 23:44:08 -08:00
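The enum-plus-keyword-default pattern the commit message describes (see the parenthetical above) reconstructs roughly as follows; the names are illustrative, not the exact definition from #196:

```python
from enum import Enum

class DeviceTypes(Enum):
    CPU = "cpu"
    GPU = "gpu"
    ANE = "ane"

def zeros(shape, device: DeviceTypes = DeviceTypes.CPU):
    # a keyword-argument default is the stated reason an enum
    # beats the old Tensor.CPU / Tensor.GPU / Tensor.ANE attributes
    print(f"allocating {shape} on {device.name}")

zeros((2, 2))                          # defaults to CPU
zeros((2, 2), device=DeviceTypes.GPU)
```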
Liam 34b38dd4d0
Extra install requirements. (#164)
* Testing install requirements

* GPU install requirements
2020-12-09 02:22:47 -08:00
George Hotz 06504a5824 bump version 2020-11-08 09:34:07 -08:00
Marcel Bischoff d24363f421
Update setup.py (#49)
I think `:=` in tinygrad/test/test_mnist.py actually needs 3.8
2020-11-02 18:09:31 -08:00
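The reasoning checks out: `:=` (PEP 572, the walrus operator) is a SyntaxError on Python 3.7 and earlier, so the file will not even parse there; the usual guard in setup.py is `python_requires=">=3.8"`. A two-line demonstration:

```python
values = [1, 2, 3]
# assign and test in one expression; a SyntaxError before Python 3.8
if (n := len(values)) > 2:
    print(f"list has {n} elements")
```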
George Hotz 0b68c08de0 literally just bump version for picture on pypi 2020-10-27 08:14:22 -07:00
George Hotz 6b5982b6b3 push pypi 2020-10-27 08:13:15 -07:00
George Hotz 43591a1e71 make the example simpler 2020-10-26 09:19:20 -07:00
George Hotz 64bd4f7936 lol, it's not 1.0 2020-10-26 09:11:32 -07:00
Göktuğ Karakaşlı 8d80726207 two spaces 2020-10-26 18:54:55 +03:00
Göktuğ Karakaşlı cc9bd45b44 add setup.py and change imports to relative 2020-10-26 18:19:50 +03:00