tinygrab

deepcrayon

tinygrab

Author	SHA1	Message	Date
Rory Clear	553688f12a	update metal matmul and matvec for compile api (#2238 )	2023-11-08 08:08:35 -08:00
George Hotz	2f7aab3d13	move optimize_local_size (#2221 ) * move optimize_local_size * interpret_ast	2023-11-05 21:00:52 -08:00
chenyu	f582ec56d5	Replace (getenv("CI", "") != "") with helpers.CI (#2213 )	2023-11-03 15:20:44 -07:00
George Hotz	f17bc16f46	simple runtime args (#2211 ) * simple runtime args * fix some tests * fix abstractions and triton * fix search	2023-11-03 12:31:29 -07:00
George Hotz	ddbc6eecaf	some refactors in the realization (#2206 ) * some refactors * delete old kernel search	2023-11-02 19:51:28 -07:00
George Hotz	03cf0afa4f	move all to compile api (#2203 ) * move metal+clang to compile api * all to the new style * remove binary arg * fix triton * fixup tests * fix clang * diskcache is generic * __wrapped__ * compile_gpu * fix thneed * keep the src in the ASTRunner * lib * move compile_gpu * compile_gpu in device * put compiler in astrunner * test reverts * triton compiler * ugh, that too	2023-11-01 23:01:32 -07:00
George Hotz	8932816816	remove arm64, caching for cuda (#2201 ) * remove arm64, caching for cuda * caching in llvm * switch cache_compiled to new cache * fix clang * caching for metal * fix pylint * cleanups * perf_counter and binary	2023-11-01 18:44:00 -07:00
George Hotz	7103b716c4	merge kernel and optimizer (#2200 ) * merge kernel and optimizer * linearize is reentrant * move global/local size * clean up linearizer copy * remove unneeded lin copies * stop linearizing twice * oops, that should be None	2023-11-01 15:20:01 -07:00
George Hotz	33bb650e94	use mad in opencl (#2198 ) Co-authored-by: Comma Device <device@comma.ai>	2023-11-01 10:40:08 -07:00
Comma Device	2e9982fe2d	fastvits example that's 10% faster	2023-10-31 21:48:23 -07:00
George Hotz	8ba7ced7f9	extract const if it's const (#2193 ) * extract const if it's const * fix if statement * fast math issue * fix graphing and casting * disable flaky copyout test	2023-10-31 18:52:35 -07:00
George Hotz	5aaa8a0cc1	fix shape	2023-10-31 11:36:19 -07:00
George Hotz	a27c9f9de5	openpilot compile2 (#2189 ) * try compile2 * pass to thneed * fix tanh onnx	2023-10-31 11:08:58 -07:00
forcefieldsovereign	f294bdd681	fixed imports (#2185 )	2023-10-30 22:07:17 -07:00
Akshay Kashyap	018bd29e37	Enable Multi-Output Export (#2179 ) * Enable Multi-Output Export * Add test * Update examples and lint * fix padding * test ops * dummy commit to rerun test * revert cuda lint * Enforce tuple/list of tensors * subscripted generics * put back webgpu test * Re-enable WebGPU Efficientnet test	2023-10-30 18:42:26 -07:00
chenyu	6c58bf3e9c	in time_linearizer, allocate a scratch buffer if output buffer is also input (#2152 ) * in time_linearizer, allocate a scratch buffer if output buffer is also input * move scratch buffer creation outside search	2023-10-28 07:17:41 -10:00
George Hotz	e0201922e3	Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142 ) * stable diffusion < 324ms * revert swap action * fix tests due to more sum splitting * REDUCEOP_SPLIT_THRESHOLD env var * added from unaligned np test (#2134) * align cpu buffer before copy into cl buffer (#2135) * remove shelve from handcode_resnet50_opt.py (#2139) * Add dictionary keys to reduce db size (#2131) * work * ignore beam cache * dictionary keys are generic * minor db cleanups * fix baseline and extract dataset * fix training * log likelihood * more lin to feats * sts * training policynet * net sort of works * dedup * refactor, stupid new actions * fix uops deduping * BEAM_ESTIMATE --------- Co-authored-by: chenyu <chenyu@fastmail.com> Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>	2023-10-27 10:53:06 -10:00
chenyu	0ca0e9ee5e	exclude ast with variables from beam search (#2140 ) * exclude ast with variables from beam search * test that * add to CI	2023-10-25 16:35:29 -04:00
wozeparrot	c29653605e	hip multigpu training (#1878 ) * feat: move to hip * feat: special path for RawBufferTransfer * feat: initial rawbuffertransfer * feat: hip ipc * feat: working hip ipc * feat: need to base device without args * feat: close mem handle * feat: modified test * feat: more multihip stuff * clean: cleanup * feat: cleaner * feat: don't crash * feat: test more * clean: way cleaner hip wrapper * feat: barrier * feat: barrier * feat: this breaks stuff * feat: we can use empty here * feat: maybe fix tests * feat: maybe fix tests again? * fix: probably fix tests * feat: no waiting here * feat: wait here * feat: much larger test * feat: need to sync here * feat: make this async * feat: no waiting! * feat: cut here * feat: sync copy * feat: random imports * feat: much cleaner world * feat: restore this * feat: restore this * clean: cleanup * feat: set this	2023-10-24 17:35:53 -04:00
nimlgen	2e89fd264f	Refactor hipgraph (#2141 ) * refactor hip graph * linter happy * happy liner	2023-10-24 15:45:56 -04:00
George Hotz	cea2bc7964	Add dictionary keys to reduce db size (#2131 ) * work * ignore beam cache * dictionary keys are generic * minor db cleanups * fix baseline and extract dataset * fix training * log likelihood	2023-10-24 10:49:22 -04:00
George Hotz	6dc8eb5bfd	universal disk cache (#2130 ) * caching infra for tinygrad * nons tr key * fix linter * no shelve in beam search * beam search caching * check tensor cores with beam too * pretty print * LATEBEAM in stable diffusion	2023-10-22 10:56:57 -07:00
George Hotz	abeba8f1fc	optimization: get actions in CI (#2125 ) * get actions in CI * actually run the test * pythonpath	2023-10-20 12:22:01 -07:00
Sean D'Souza	999c95ea29	fix: hlb cifar types (#2099 )	2023-10-17 19:23:50 -07:00
Ahmed Harmouche	2b5ea7d9cb	Fix output Float32Array size in webgpu export (#2096 )	2023-10-17 15:28:19 -07:00
Szymon Ożóg	4bef1591f0	Disable ocelot cache + fix matvec in triton (#2010 ) * Revert "disable flaky triton test" This reverts commit `1e15fdaee7`. * Update test.yml * check if has shared for matvec * disable ocelot cache for triton * disable ocelot cache * disable ocelot cache * pass shared to triton uops tests * temporary debugs for CI crash * Revert "temporary debugs for CI crash" This reverts commit `fee3ea96c8`. * Revert "triton isn't tested, and allows this refactor (#2007)" This reverts commit `dea8bb0938`. * add runtime_args to every renderer, move triton local size override to runtime args * Add binary to args, correct type returned * update to new loops * Update test.yml	2023-10-17 10:33:32 -07:00
geohotstan	5ed630204b	Add ONNX to CI for other backends (#2069 ) * some cleanup * move continue back * more more more * added to CI * try * try intentionally break some tests * wtf * del True for test * yay tests broke, now pls no break * try AGAIN * gahy * lol * try * move over constant * moved over MORE * move shrink over * trailing lines * try CUDA CI * try again * boom * oops * improved comments * try: disable some flags and disable CUDA * try breaking tests * traceback has too much info so add --tb=no * revert forced CI failure * add comments and del unused imports * oooooooo using regular debug try enable tb * intentionally break tests * added tb back. Maybe not too verbose * strip whitespcae * missed something * Shape op int32 -> int64 * oops missed something * add some types * get rid of crazy 1 liners in pad op * actually test Split this time LOL * strip that whitespace	2023-10-17 09:33:54 -07:00
George Hotz	1bf4aef0f5	fix image dtype cmp (#2089 ) * fix image dtype cmp * print that with debug 3	2023-10-16 17:52:38 -07:00
George Hotz	a7b18ac325	try beam search on device (#2085 ) * try beam search on device * fix beam with nolocals * ops too --------- Co-authored-by: Comma Device <device@comma.ai>	2023-10-16 12:52:42 -07:00
George Hotz	c36d306606	KOPT is over, BEAM is upstream (#2071 ) * create cache for q learning * make linter happy * global beam * where it belongs * bugfix * ditch the kopt, use the beam * faster lin and DEBUG=2 okay * remove kopt, move search to features	2023-10-16 09:46:03 -07:00
George Hotz	5472a14544	openpilot compile2 (#1977 ) * start compile2 * tweak * why are there two more kernels? * minor cleanups * don't break onnx tests * add __metadata__ support to safetensors * no early realize in onnx * cleanups * bugfix * clean up image type, add optimize * opt to match old * try that * opt work * run compile2 * optimizer * prt more * prerealize * imp * NOLOCALS works * no locals means no locals * support fractional globals * all locals welcome * int that * cleanups * show gemv regression * clean up diff * use idx for the cond * nolocals --------- Co-authored-by: Comma Device <device@comma.ai>	2023-10-15 20:39:46 -07:00
George Hotz	49bcfec383	0s in the action space (#2070 ) * 0s in the action space * simpler * skip duplicate actions	2023-10-14 11:22:48 -07:00
George Hotz	4124cf1df5	cleanup tensor cores, expose exclude local upcast (#2064 ) * expose exclude_local_upcast * convert apply tensor cores to ops * update comment * put LOCAL back to what it was, BEAM is better than way	2023-10-14 09:21:03 -07:00
George Hotz	90c777d815	remove apply_auto_opt (#2063 )	2023-10-13 07:44:14 -07:00
George Hotz	6f1810af2d	with unroll, the action space goes from 161 -> 127 (#2060 ) * with unroll, the action space goes from 161 -> 127 * more reliable instrumentation * beam search is so op * beam bugfix	2023-10-12 20:52:23 -07:00
George Hotz	c5edb3c374	train value net, improve API, add BCE (#2047 ) * api cleanups, BCE losses * valuenet * fixup examples * learning okay * add valuenet runner * net improvements * net improvements * 40% win rate	2023-10-12 07:56:38 -07:00
George Hotz	0ba629c7b9	add world dataset (#2045 )	2023-10-11 15:54:30 -07:00
George Hotz	0c3b6f13a8	Latest opt (#2044 ) * split out actions * rl algorithm	2023-10-11 15:46:14 -07:00
George Hotz	41bfeb2c1e	start work on auto opt (#2034 ) * start work on auto opt * lin failure * not beating hcopt * greedy * timing is fast * codegen.search * greedy search in handcode_opt * track running gflops * clean up those files * no failure	2023-10-11 12:54:53 -07:00
chenyu	1c980517c5	s/var_vals_from_ast/vars_from_ast (#2038 )	2023-10-10 20:21:55 -07:00
George Hotz	f139060103	Rewrite hand coded opt with action space (#2030 ) * tests passing * hand coded opt with new abstractions * simpler opts * split out tensor cores	2023-10-10 07:38:38 -07:00
George Hotz	16ca8410f8	op logger + replay (#2021 ) * logops * fix dtype printing * needs inf * ops dataset * minor improvements * 12k kernels * opt can compile * graph flops	2023-10-08 15:10:18 -07:00
George Hotz	8db92bd060	fix tvm gemm example	2023-10-08 05:57:41 -07:00
Francis Lam	dece9958f8	wmma: clean up to make WMMA arg order consistent (#2014 ) also add cache defeat to extra/gemm/simple_matmul.py	2023-10-07 17:45:40 -07:00
George Hotz	6ee9cae44f	don't extract CIFAR every time / use the cache	2023-10-07 12:33:50 -07:00
George Hotz	dea8bb0938	triton isn't tested, and allows this refactor (#2007 ) * triton isn't tested * cuda buffer	2023-10-07 07:29:59 -07:00
Roelof van Dijk	26fcc8dff6	fix: remove runtime imports (#1982 ) fix: import what is used probably monkeypatched fix: import revert selective import	2023-10-07 05:23:08 -07:00
George Hotz	f54959e5cd	move print tree into graph (#2003 ) * move print tree into graph * add winograd profiling test * change pre-commit to run ruff first	2023-10-07 04:39:21 -07:00
Ahmed Harmouche	2114dc13d1	Allow multi-input model export (#1995 ) * Allow multi-input model export * Add model export unit test * Fix efficientnet compilation * Only run model export test on JIT supported devices * Skip export model test if not EXPORT_SUPPORTED_DEVICE	2023-10-07 04:13:34 -07:00
George Hotz	ffa33d743a	good changes from openpilot_compile2 (#2000 ) * good changed from openpilot_compile2 * float32 image type was wrong * cleaner way to write that + a test	2023-10-06 13:33:24 -07:00
Francis Lam	0ba75c4370	optimizer: add matvec optimizations (#1972 ) * optimizer: add matvec optimizations * renderer: fix alignment of shared memory in opencl	2023-10-04 14:16:27 -07:00
George Hotz	de5d603ec1	corealize + remove realize from lazybuffer (#1968 ) * corealize + remove realize from lazybuffer * fix multigpu * fix graph	2023-10-04 10:59:31 -07:00
nimlgen	2ea1dd3e87	no process() in Linearizer (#1966 ) * no process() in Linearizer * more process() clean up	2023-10-04 07:18:42 -07:00
George Hotz	717451a244	Revert "optimizer: add matvec optimizations (#1753 )" (#1959 ) This reverts commit `f520323054`.	2023-10-03 00:28:42 -07:00
Francis Lam	f520323054	optimizer: add matvec optimizations (#1753 ) * optimizer: add matvec optimizations * Update optimizer.py --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-10-03 00:01:59 -07:00
David Hou	8e9db88474	expand after expr_idxs in Linearizer.global_load (#1818 ) * small changes * expand in terms of substitute, directly expand g_idxs g_valid * delete expand_ops * don't compare using hash * any instead of in thanks gijskoning Co-authored-by: Gijs Koning <gijs-koning@live.nl> * support tc * testing code * no more create_rednode * maxsize none in view/node * oops * undo * typing * oops * oops * lmao * lmao * add expand multi test * Node.iter_idxs * type * type * delete checks! * clean up a little? * expand_idx in symbolic * un-golf * play around with types >.> * test_substitute and also remove an incorrect test? * get rid of range * Update symbolic.py * split out view cache change * split out flat components change * reduce diff * reduce diff * add some float4 tests * fix --------- Co-authored-by: Gijs Koning <gijs-koning@live.nl>	2023-09-29 10:33:34 -07:00
Francis Lam	f445e056ed	wmma: add test and tensor core shape (#1925 )	2023-09-28 18:04:28 -07:00
Yixiang Gao	094d3d71be	with Tensor.train() (#1935 ) * add with.train * remove the rest TODOs * fix pyflake * fix pyflake error * fix mypy	2023-09-28 18:02:31 -07:00
George Hotz	c36d0e3bd8	tvm import hook	2023-09-28 09:24:32 -07:00
George Hotz	adab724caa	schedule2, keep the tests working with small changes (#1932 ) * lazy cleanups * ast functions take in LazyOps * op instead of self.op * _base for mops * fix contiguous * start schedule * test_schedule * fix openpilot * more tests * bugfix and test skip * work * make sure things get freed * fix zerosized tensors * fix failing test * fix ceil and friends * fix openpilot * disable training * disable test collectives	2023-09-28 09:14:43 -07:00
nimlgen	45f02393f0	HipGraph support (#1880 ) * init hip graph * optimize args update * cache symbolic in jit * remove NOSTAT * init BasicBatchExecutor * symbolic infer cache per jit instance * basicbatchexec is defualt for compiled * batch_exec is taken from ASTRunner * no infer cache * batched execution of hip graph * add comment about hip graph batches * readable hip graph	2023-09-24 20:14:36 +08:00
Szymon Ożóg	58296c079d	Make Triton work again (#1547 ) * Move ops_triton to runtime and remove errors from deprecated code * Remove deprecated AST Kernel * Remove deprecated buffer * Add TritonProgram * Triton Buffer * Use RawCUDABuffer * triton_compile * Added new parameter * pass _buf to program * remove deprecated include * Added triton tests * Deprecated includes removed * remove double print * Disable float4 support * Disable float4 support * variable load fix * Track local size * Add pycuda to triton dependencies * Merge test.yml * install cuda packages for testing * merge double package install * remove emulated from triton tests * upscale local index to power of 2 and add masking * cuda envs * Add TernaryOps * ConstOp loading * proper function name * remove deprecated variables * get global program from name * const ops match local shape * Enable test_nn * remove deprecated import * fix linter error * Add wait logic * Add local size override * accumulate local shapes instead of using max shape * Merge triton tests into global tests * fix envs in testing * Old testing routine * split file into renderer and program * remove print and starting whitespace * pretty ptx print on debug 5 * linter errors * ignore triton saturation tests * ignore test example * remove pytorch cpu extra index * Add triton to existing testing routine * use triton tests * disable cuda backend in triton tests * use cudacpu in tests * print used device * Print device default * Remove print * ensure we are running triton backend * update variable signatures * update dtypes for load * infinity render fixed * limit global size * negative infinity now properly rendered * split chain with parentheses for and node * Add option to disable shared memory, disable for triton * missing import * Properly index and mask conditional load * use mask only if not loading a block pointer * nan support * fix symbolic tests to include chain split * proper masking for stores * Implemented bool dtype * Add mod * fix loads for variables with valid range * merge triton with cuda runtime * merge from master * run triton tests with cuda * Correct target when running from triton * conftest with triton compiler config * use triton nightly * verbose tests for triton * capture stdout * fix function depth when exiting multiple loops * add render valid function for readabilty * fix mask for local loops * add _arg_int32 datatype * fix dims for conditional loads * enable non float stores * correct variable dtypes * fix type for arg_int32 * remove junk * Added get max function for range based var.max * remove deprecated code * Fix triton ptxas path * Fix testing for CI * clamp local size by max local size instead of always running max * Disable matmul test in triton cpu * rerun tests * Disable broken test in triton cpu * whitespace removed * rerun tests again * Disable TestSymbolicOps for triton * update to new uops * linter fix * ignore test/extra * linting fix * Update tinygrad/renderer/triton.py Co-authored-by: Gijs Koning <gijs-koning@live.nl> * remove deprecated line * quotes type fix * linter * Remove unnecesary lines * UnaryOps.NEG * dont define constants * Linting fix * Disable tests that are broken in ocelot * remove trailing whitespace * reduce line count * linting fix * update to new uast * New looping style * Update to new uast * make AST runner work with triton * linting fix * set renderer var for testing * disable local for ocelot * reenable all tests for ocelot * Pass shared to cuda * Don't group if the backend doesn't support shared mem * use working gpuocelot branch * enable all tests * enable local for ocelot * cleanup * Update test.yml * update cache key * reenable test symbolic and extra * Update test.yml * Revert "Update test.yml" (rerun tests) This reverts commit `98c0630ee5`. * Revert "fix symbolic tests to include chain split" This reverts commit `22a9a4c9cd`. * Revert "split chain with parentheses for and node" This reverts commit `7499a7004e`. * use global size from linearizer * rename newvar to dtype to match other renderers * join program start lines * simplify code that adds axis to local dims * assign r[u] in ssa * We no longer need to replace target in src * we no longer need to cast indices to int by hand * Update triton.py(rerun tests) * Update triton.py(rerun tests) * Update triton.py(rerun tests) --------- Co-authored-by: Gijs Koning <gijs-koning@live.nl> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-09-23 14:17:12 +08:00
qazal	d0e752003d	fixes (#1893 )	2023-09-22 07:20:27 +08:00
wozeparrot	009a99a0b1	feat: way cleaner hip wrapper (#1895 )	2023-09-22 07:20:03 +08:00
kormann	864746d6aa	polish print_tree (#1868 ) * fix * isinstance	2023-09-21 11:13:10 +08:00
chenyu	3ec301c2d7	apply view.py patch (#1844 )	2023-09-10 17:32:15 -07:00
kormann	7ac65a93b4	utils.printtree (#1816 ) * utils.printtree * linter compliance * rename to print_tree	2023-09-07 23:08:57 -07:00
George Hotz	4613c9e77c	add tvm example, formatting (#1813 ) * add tvm example * no realize	2023-09-07 11:50:41 -07:00
Pavol Rusnak	52a92bf95d	use class Foo: instead of class Foo(): (#1797 ) * use class Foo: instead of class Foo(): * add ruff linter, copy settings from .flake8 to ruff.toml	2023-09-06 12:20:25 -07:00
geohotstan	9af5645ba3	onnx full passing (#1076 ) * 1 * 83 failed * learning how git works * lol idk * zero shape aaaa * space lol * aaa * test check * haha * fixed gather * 73 failing * 71 failing * 68 failing * added some debug * fking resize * lol * 62 failing * 58 failling fucking did nearest resize hell yeah * clean up * 56 failing * janitor duty * lol * 53 failing * hi mom * 50 failing * added linear interp, but coord_trans is wrong * did lin interpolation woohoo * 43 failing * 40 failing * temporary Gather fix * 39 failing * fixed slice onnxver<10 * 37 failing * 35 failing * excluded tests that use float64 * 32 failing with hacks * added _batchnorm() for 3D 5D batchnorm, 29 failing * changed ALLOWED_KERNEL_COUNT from 199 to 207 * added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit * support Round op * added storage_order/indices maxpool, 27 failing * support maxunpool, 25 failures * support Gradient, 23 failures * merged new where * added Adam * cleanups * added Momentum and Nesterov Momentum * added Adagrad * support sequence_type, 20 failing * ugh git * I give up on cubic interp :D, 9 failing * sexy 1 liner gather, much improved, wow * polished gather to make it shine bright like a diamond * clean 1 liner for gather * improved readability of gather * uhh * clean up * more clean up * WHITEspace * implemented SoftmaxCrossEntropyLoss op * added comments and cleaned up if statements * update * thank based wozeparrot for pow and new GatherElements * CPU and TORCH all pass \| cast float64 -> float32 for all fromCPU() * _nearest_gather() failing on yolo * reverted ops_cpu change and added assert in Resize * added comments for resize for multiple channels * oops * merge * test * switched np.pad to Tensor.pad for constant padding * gah * gah2 * sexy reflect pad with movementops -> add * delete commented out lines * edge mode pad sexy as well * trying out model_benchmark * revert gitignore change lol * init * Revert "init" This reverts commit `682bf2073a`. * wrote cast workaround for CPU, CPU and TORCH all pass * wrote cast workaround for CPU, CPU and TORCH all pass * skipped tests w/ 0 shape for METAL and GPU * excluded tests for CLANG, CPU, TORCH, CLANG pass * fixed hacky ConvTranspose * gotta figure out autopad * UOps.STORE support cast bool -> float * small fix for fast gather * reverted 0 shape skipped tests * oops missed a file * added comment * fixed slice op hack * First commit to pr * More trig ops * More trig ops * format * isinf support * More ops * changed onnx_ops to use our new gather :D * Det op bug fix * rebase * fixed some tests * det broken and slow * fixed compress to use new gather * implemented argmax argmin * support variable types in type_proto * support Upsample and Identity sequence * we support float64 now and tinygrad support automatic broadcasting * added EyeLike op * resize does support multiple channels now actually * yolov8 onnx runs successfully * added batch size 1 * oops * finally fixed type_proto I think * fixed some llvm bugs * del whitespaces * added ZenginU Format PR * test * oops * added float64 exclude tests back * more skipped tests * try * ok openpilot pass * flake8 pass * woooooohooo * revert external_model_benchmark changes * perf tested gather * removed promote types from ops_cpu * numerical errors from 1681 is fixed --------- Co-authored-by: ZenginU <umutzengin00@gmail.com>	2023-09-05 13:23:32 -07:00
George Hotz	56abe04e4b	disable assembly (#1755 )	2023-09-04 09:41:20 -07:00
wozeparrot	bf05534c6e	hip multidevice (#1728 ) * feat: hip multidevice support + p2p * feat: default device	2023-09-01 06:46:13 -07:00
Karan Handa	a8aa13dc91	[ready] Replacing os with pathlib (#1708 ) * replace os.path with pathlib * safe convert dirnames to pathlib * replace all os.path.join * fix cuda error * change main chunk * Reviewer fixes * fix vgg * Fixed everything * Final fixes * ensure consistency * Change all parent.parent... to parents	2023-08-30 10:41:08 -07:00
nimlgen	1c0449e190	add cache collector (#1595 ) * init cache collector * add test_cache_collector.py * switch GlobalCounters.cache to CacheCollector * init jit models test * jitted SD * add debug msg to print loaded bufs count * moved cache collctor to jit * clearer SD * no double device import	2023-08-28 19:59:55 -07:00
George Hotz	a6d842af7a	move device to ops (#1646 ) * move device to ops * mlops types * 2 lines	2023-08-23 08:30:17 -07:00
George Hotz	718ced296c	move state to nn/state (#1619 )	2023-08-22 07:36:24 -07:00
Umut Zengin	f720682beb	np.argmax to Tensor.argmax (#1608 ) * to tensor argmax * removed keepdim * training update	2023-08-21 15:22:29 -07:00
Yixiang Gao	4d54afb6df	sparse cat cross entropy (#1597 ) * add sparse cat cross entropy * minor fix * add log_softmax into loss function * add test * update docs * fix training loss * add device	2023-08-21 14:14:54 -07:00
George Hotz	2e60920317	Revert "sparse cat cross entropy (#1591 )" (#1596 ) This reverts commit `f0ee850e98`.	2023-08-21 10:04:26 -07:00
Yixiang Gao	f0ee850e98	sparse cat cross entropy (#1591 ) * add sparse cat cross entropy * minor fix * add log_softmax into loss function * add test * update docs	2023-08-21 09:56:41 -07:00
Yixiang Gao	8d6662a741	.cpu().numpy() -> .numpy() (#1594 ) * .cpu().numpy() -> .numpy() * restore ops_torch * restore test_speed_v_torch	2023-08-21 09:53:29 -07:00
George Hotz	e464442adf	WMMA for 7900XTX (#1563 ) * go * hip no LRU * work * works * 16 TFLOPS * 29 TFLOPS * 30 TFLOPS * never mind, it's 60 TFLOPS * fix metal WMMA * put hip alloc back	2023-08-19 09:07:23 -07:00
chenyu	ae39cf84ab	Symbolic Shape JIT main PR (#1353 ) * Symbolic Shape JIT update tests 2 variables symbolic ops, adding more tests test passing cleanup * more test cases * single flag * review update * jit attention one piece * realize * symbolic_jit test for cuda * old artifact * works with cuda gpu but failed ci * CUDACPU	2023-08-18 14:39:55 -07:00
wozeparrot	50decf0d45	train cifar using multigpu (#1529 ) * feat: train cifar using multigpu * feat: split eval batch across 5 * feat: cleaner allreduce * feat: 93.88% * feat: cleaner batch chunking from bert * feat: cleaner grad sync * feat: tinygrad argmax * feat: make it work with different gpu counts * feat: move some stuff into the normal __init__ * feat: autodetect gpu count * feat: move import inside	2023-08-18 09:35:44 -07:00
wozeparrot	15150d60c4	fix: small fix for lru on hip (#1567 )	2023-08-18 09:18:38 -07:00
Ethan Sorrell	cb62911f6b	PTX Reintegration and Passing Tests (#1512 ) * move assembly, assembly_ptx * successful but broken rendering of ptx asm * clear ins before render asm * slightly less broken :') * we needed thread syncs * fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half * Fix runtime_args for gpuocelot * our casts were flipped on both ends * more casting * add ternary where op * dealing with storing/loading bool * add test for casting to bool from negative * Fix args.valid on ConstOp * add to CI, TODO: fix runtime_args for test_uops * fix placement of runtime_args to work with lazy.Device * undo ci changes so I can push * fix lints * start cleanup and fix things we broke fixing lints * add checks for PTX specifc asm instructions * revert added test -- doesn't pass on llvm * skip tests for underflow,overflow * another fix for how we're setting runtime args * Less broken cleanup * add to CI * add more env variables for ci test * fix ci to install pycuda for ptx * ci: copy cuda test command * cleanup * assert to make sure we're actually running ptx in ci * remove test assert * move is_ptx arg * move assembly, assembly_ptx back to extras * fix imports * initial merge fixes * clear registers, fix UOps.LOAD with invalid value * draft merge fixes * remove prints * quick lint and merge fixes * cleanup * remove PTXProgram wrapper * final cleanup * temp change for ci rerun * ci rerun * rollback ISA version	2023-08-16 16:20:20 -07:00
JaSpa99	491e85597a	Run onnx commavq model (#1537 ) * try to run commavq * fix 0 dim, start implementing new ops - Implement EmbedLayerNormalization - Implement Attention * SkipLayerNormalization and FastGelu * use original torch model, cast inputs * fix some ops: - properly do Cast - Attention: bi- and unidirectional - FastGelu: add bias before gelu * cleanup onnx_ops.py * add validation option to benchmark * cleanup imports * add checks incase onnx2torch implements ops in future * run onnx instead of original torch * just skip gpu on m1 * reactivate the other models * check for strange params & squash whitespace * cleanup * fix causal mask Attention * Range doesn't need int cast * embedding vocab_counter same dtype as input * no need to cast * always validate, fix PosixPath ort --------- Co-authored-by: George Hotz <george@comma.ai>	2023-08-16 12:24:40 -07:00
George Hotz	f8109b830c	promote assembly to the main codebase (#1544 ) * promote assembly to the main codebase * not namedtuple	2023-08-14 22:47:45 -07:00
Steven Anderson	93a36c3659	Arm (#1421 ) * testing new memops * better debugging * testing padded conv * branching with load * refactoring a bit * first try * fixing bugs * fixing some * eq * eq2 * do not use x's * working * fixing imm * getting things working * refactor * pow not working * working except one * refactor: one store mem * refactor: global load * refactor: imm * refactor: cleaning * fixing big offsets * refactor with ci * try ci * typo * another typo * ubuntu default * forgot git * do i need git? * missing packages * adding python-dev * with cache? * buildx action * buildx name issue? * maybe now? * python3 * newline warning * maybe now * i actually need this * ci should work now * improved caching * fixing cache * maybe now it will cache * this * testing cache * trying again * load * missing platform * caching gha * testing cache * full testing * typo * now? * why * adding checkout back * bad formatting * fixing convention issues * supporting python * adding CI flag * testing all * better comments * adding debugging * takes 12x longer * does it output progress now? * ignore models for speed * fixing merge * excluding conv_transpose2d * only 2 test cuz is to slow * another approach * let's see * faster duh * my bad * T_T * typo * sup * with output? * comment test * comment test * comment test * :? * no comment * with cache * back to normal * testing that ci works * back to passing * trying again * does it create another entry * does it create another entry? * build local * hey * Revert "excluding conv_transpose2d" This reverts commit `cc7348de03`. * does it cache if done before? * does it cache? * done * adding test ops * bad formatting * no need for this * working static mem * sum 1d * add ndim * better reg import * fix stack * back to np * working except for softmax * 5 failing * no pogress * remove keystone * remove keystone * testops passing * cleanups * more cleanup * typo * ci * ci2 * cond import * ci3 * ci4 * ci4 * ci5 * ci5 * ci6 * aligment * test all * correct test * err read_unmapped * passing test * ignore for speed * ignore for speed * ci7 * cleanup * remove docker * fixing merge * fixing bugs * add skipload for const ops * comments * First merge to master: Renderer * fix emulation * passing all tests arm64 * cleaning * fix handcoded binary * cleaning * fix errs * fix runtime arg binary * clean git diff * fix and clean * fixing metal test * cleaning * fix metal test * ci ~8 min * fix pylint and clang * cache the files in ops_clang --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-08-14 19:29:30 -07:00
Szymon Ożóg	330fb7b1a3	Print more meaningfull hip error messages (#1530 )	2023-08-12 07:16:20 -07:00
wozeparrot	29d5801387	distributed collectives (#1519 ) * feat: world * feat: tests * feat: no more backwards * feat: recv into * feat: whoops * feat: test in ci * feat: some debug logging * feat: workflow naming * feat: need to set pythonpath * feat: just send to same device * feat: allreduce * feat: test * feat: need contiguous * feat: test in ci * feat: exit with correct code * feat: don't need that * feat: opencl wait_for just doesn't work * feat: synchronize on out * feat: try? * feat: try again? * feat: add extra realizes * feat: print * feat: seed * feat: tol * feat: test ones and zeros * feat: remove print * feat: are you just flaky * feat: seperate scatter and gather? * feat: just try synchronizing * feat: remove print again * feat: bring back difference * feat: no sync * feat: revert that * feat: back to wait_for * fix: typo	2023-08-11 10:22:07 -07:00
wozeparrot	7e7c9001e9	distributed world (#1481 ) * feat: world * feat: tests * feat: no more backwards * feat: recv into * feat: whoops * feat: test in ci * feat: some debug logging * feat: workflow naming * feat: need to set pythonpath * feat: just send to same device	2023-08-10 10:00:51 -07:00
George Hotz	c417cd3c97	fast HIP gemm -> 100 TFLOPS (#1476 ) * fast HIP gemm * wmma * correct b * fix spilling * 60 TFLOPS * 64 TFLOPS * 65 TFLOPS	2023-08-09 06:54:15 -07:00
Yixiang Gao	6480a1a180	CIFAR 94.03% (#1340 ) * add disk_tensor * fix jit * new baseline before whitening * whitening through torch * whiting done currently at 91.65% * 91.99% * clean up mixup and 92.3% * clean up 92.30% * 92.49% before searching for new hyper-parameters * fix CI * fix white space * add whitening init in test * refactor, update hyperpara, 92.72% * converting whiting to tinygrad operation * update CI kernels count for CIFAR * add pad reflect * add random crop 92.53% * update hyperpara 93% * 93.15% on docker container, need to refactor the assignment for hyper param * print out weights and bias to be separated * bias/non-bias params separated * fix whitespace * clean up * refactor hyper-param with dict * refactor lr schedular params * fix whitespace * fix cross entropy loss * fix whitespace * move opt hyp to hyp dict * minor fixup * adjust model, loss scaling * 92.74% while using half of compute as before * update hyp for cutmix * random shuffle during batches * clean up * updating the model * update ConvGroup * disable gradients for batchnorm layer weights * whitespace * 93.92% * clean up * finally 94%git add .! * rewrite whitening to remove dependency on torch * whitespace * remove dependency on torch, 93.91% * back to 94.03% * clean up * update test_real_world	2023-08-08 15:13:24 -07:00
George Hotz	d24f936501	just cmplt (#1493 ) * just cmplt * fix maximum * don't save, there's no backward * ugh, no slot either * eq is a scam	2023-08-08 13:58:10 -07:00
Roelof van Dijk	0ce7511110	fix: is not use with a literal (#1487 ) Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>	2023-08-08 07:35:30 -07:00
Diogo	4dc8595069	simple exporting models (#1344 ) * unified exporting * json exporting * ignore more * simplified buffer export * added dtypes * added assert * swift example * fix tests * linter * remove whitespace * fixed tests * remove swift example * remove unintended changes * allow callable models to be used * whitespace * more readable json export * name change * whitespace * whitespace	2023-08-01 09:35:48 -07:00
David Hou	3300d0aeaf	syncthreads before wmma (#1389 ) (venv) chaos@tiny3:~/tinygrad$ KX=2 KY=2 N=2048 python extra/gemm/hip_matmul.py 4194304 289.60 us, would be 59322.55 GFLOPS matmul, 173.80 GB/s	2023-07-31 17:05:49 -07:00
George Hotz	37fa7e96fb	Revert "update editorconfig, enforce via CI (#1343 )" (#1380 ) This reverts commit `da2efecbe2`.	2023-07-31 10:35:50 -07:00
Pavol Rusnak	da2efecbe2	update editorconfig, enforce via CI (#1343 ) * update editorconfig to set unix-style newlines and trim whitespace * add editorconfig github action to the CI * fix whitespace	2023-07-30 18:44:30 -07:00

1 2 3 4 5 ...

478 Commits (deepcrayon)