Commit Graph

2997 Commits (01503ca90dd4d246ae1441d5604c5ad49207d0e2)

Author SHA1 Message Date
Jeff Moe 01503ca90d sphinx, crufty 2023-12-04 22:00:18 -07:00
Jeff Moe 8e35afc276 ignore tmp 2023-12-04 21:21:52 -07:00
Jeff Moe 4b032f66b0 Initial translation strings 2023-12-04 20:55:50 -07:00
Jeff Moe ad752fc417 Set up for tinygrab from Parrot stubs 2023-12-04 20:52:35 -07:00
Jeff Moe cf472ad527 Pull in Sphinx build config, etc from Parrot 2023-12-04 20:40:20 -07:00
Jeff Moe e29858ae64 tinygrab readme 2023-12-04 20:27:18 -07:00
Jeff Moe 6185369c64 mv upstream README 2023-12-04 20:21:13 -07:00
chenyu 6ba6349c97
JIT=0 llama.py should not jit (#2609) 2023-12-04 20:21:07 -05:00
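A minimal sketch of the behavior this fix enforces, assuming tinygrad's getenv helper and a top-level TinyJit export (import paths vary across tinygrad versions); this is not the actual llama.py diff:

```
# Hypothetical sketch: gate jitting on the JIT env var so JIT=0 disables it.
from tinygrad import TinyJit          # assumed top-level export
from tinygrad.helpers import getenv   # reads the env var, coerced to int

def maybe_jit(fn):
  # With JIT=0 set, hand back the plain function instead of a TinyJit wrapper.
  return TinyJit(fn) if getenv("JIT", 1) else fn
```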
George Hotz 41d696145d hotfix: forking works okay in HIP now 2023-12-04 21:59:18 +00:00
George Hotz 09b6e254a3
hip compile speed (#2606) 2023-12-04 13:47:40 -08:00
nimlgen 19a0a839db
fix used resources in metal graph (#2604) 2023-12-04 13:45:51 -08:00
Yixiang Gao fde44aed76
update hip_matmul with new abstraction (#2605) 2023-12-04 13:37:10 -08:00
George Hotz 5540f6e966 hotfix: make_half4 2023-12-04 09:58:34 -08:00
Amrit Sahu e8d6a6ef2e
view.reshape without symbolic (#2218)
* handle reshape of contiguous subparts with explicit mask

* remove the add/remove ones logic in reshape

* accommodate ones in accumulate logic

* make multiply commutative

* fix linting

* make mypy happy

* add test for commutative mul

* merge dimensions in shape_strides for 1 range masks

* add offsets for merging

* fix linting

* add back explicit 1 reshapes

* fix mypy errors

* fix accumulate by including state

* include non-zero stride dimension in acc

* small cleanup

* more compact to_shape_strides

* more logical cleanup

* compress more

* compress reshape mask

* adding some comments

* small bug fix

* improve test coverage

* remove the explicit add/remove ones logic

* small bug in test

* enable test_reshape_splitting_combining

* small fix

* 10 lines less to_shape_strides

* shorten reshape mask

* some more cleanup

* more cleanup

* introduce some symbols for compactness

* more symbols

* even cleaner

* use fewer symbols, it became less readable

* remove merge_views from view.reshape

* change to_shape_strides to _merge_dims (see the sketch after this entry)

* improve readability

* fix corner case

* cleanup

* better handling of 1 <= Variable('i',1,10) & new_dim = Variable('i',1,10)

* rewrite _reshape_mask for readability

* fix white space

* add comment

* nice shorthands for readability

* add proof in docs

* small nit

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-04 12:46:53 -05:00
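The _merge_dims rename in the entry above is about collapsing adjacent (shape, stride) pairs. A minimal sketch of that idea, with illustrative names rather than tinygrad's exact code, assuming a non-empty shape:

```
# Hypothetical sketch: merge adjacent dims when the outer stride walks the
# inner block contiguously, i.e. outer_stride == inner_shape * inner_stride.
from typing import List, Tuple

def merge_dims(shape: Tuple[int, ...], strides: Tuple[int, ...]) -> List[Tuple[int, int]]:
  merged = [(shape[0], strides[0])]
  for s, st in zip(shape[1:], strides[1:]):
    prev_s, prev_st = merged[-1]
    if prev_st == s * st:            # contiguous: fold into the previous dim
      merged[-1] = (prev_s * s, st)
    else:
      merged.append((s, st))
  return merged

print(merge_dims((4, 3, 2), (6, 2, 1)))  # contiguous throughout -> [(24, 1)]
```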
George Hotz 664475f247
vals is an argument (#2599)
* vals is an argument

* don't even know how that's legal python
2023-12-03 21:50:43 -08:00
George Hotz fcd0b2ee6c
fix multigpu on tinybox (#2595)
* fix multigpu on tinybox

* fixed multigpu
2023-12-03 16:48:07 -08:00
George Hotz 61c0113928 test external_multi_gpu.py (and works in CUDA) 2023-12-03 15:57:13 -08:00
George Hotz bbeba8ec85
use default dict for external_model_benchmark (#2592)
* device default

* Device.DEFAULT

* half max for cuda

* CUDA_INCLUDE_PATH

* closer to working

* cuda fixups

* Update ops_cuda.py
2023-12-03 15:25:43 -08:00
chenyu 550817389a
enable test_sample for all backend (#2593) 2023-12-03 17:20:27 -05:00
chenyu a58736fdf1
print DEBUG=2 stats after stats update (#2590) 2023-12-03 17:13:37 -05:00
George Hotz bc012f26b9 hotfix, disable model inference benchmark on NVIDIA 2023-12-03 13:52:41 -08:00
qazal 4380ccb169
Non fp32 math (#2264)
* `global_load` and `global_store` using buffer dtype

* `UOps.PHI` in all dtypes

* `UOps.ALU` in all dtypes

* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes

* -- endof implementation --
+tiny lint changes

* these tests require the fp16 extension

you can run them locally to confirm they're green (the GPT2 test is broken in master for Mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)):

`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`

skip the new test_linearizer_failures in CI GPU because of the fp16 extension

This passes on a real GPU since the extension is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`

see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)

* these tests fail in CI due to segfaults and CPU crashes

To confirm they're green locally, you can run the following commands:

1. For the tests skipped in test_ops.py (note: CLANG is very slow)

`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`

2. For the ONNX tests skipped in CLANG:

```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```

3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186); I just made it more specific

`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`

* Revert "these tests fail in CI due to segfaults and CPU crashes"

This reverts commit 15db570143.

* merge with cleanup-vectorized-hip-renders

* barely working HIP P1, ALU ops need a refactor?

* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)), while half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) which can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:

```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```

* update the sign definition to avoid division by zero in all dtypes (see the sketch after this entry)

* diff cleanup p1: why were these in the diff anyways

* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!

add ALU ops overloads for HIP

this will make HIP max work

handle mod

Revert "handle mod"

This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.

update max to use hmax

add HIP GEP render logic

enable CIFAR fp16 benchmark

test ops for HIP

back to store as float because this only works for float4 grouping right now

test_ops for hip!!

always sign

* back to the sign we had before because we can't do a backward pass on a Less node

* remove old hacks

HIP compiling test_ops in CI takes ~9 mins, not doing it for now

new HIP ALUs

* reduce accs done right

* refactor to function

* no device hacks

hacks p2

the other way

* LLVM ALU ops

half, float and double are all float

update max

* update test_uops: cmplt is always a bool in the real linearizer, and assertAlmostEqual is wrong when ret is bool

* clean up wrong LLVM code

* dummy change for the CUDA install glitch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-03 13:45:49 -08:00
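The sign bullet in the entry above points at a real pitfall: a sign built as x / abs(x) is undefined at 0. A minimal comparison-based sketch that is safe in every dtype; illustrative only, not the exact definition this PR landed:

```
# Hypothetical sketch: sign without division, so x == 0 is well-defined.
def sign(x):
  # (x > 0) - (x < 0) yields +1, 0, or -1 and never divides.
  return (x > 0) - (x < 0)
```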
chenyu 1ac958a058
update pytest marks and CI test filters (#2587)
* remove pytest marks

* test more stuff

* fine revert some

* add that mark back

* skip that

* hmm LLVM does not work on ubuntu

* too slow on CUDA CI

* dup test
2023-12-03 15:20:44 -05:00
nimlgen 88a5c368d4
fix metal graph with var_vals (#2583) 2023-12-03 09:24:36 -08:00
qazal f180cac8f0
wgsl renderer cleanup: use the same const render, reuse cast render logic (#2579)
* share const render

* should just use render_cast here
2023-12-03 09:24:00 -08:00
qazal ab2d4d8d29
Fix cl import in the copy_speed test and cifar example (#2586)
* fix CL import

* update test to only run on GPU

* update hlb_cifar too
2023-12-03 09:22:07 -08:00
chenyu 3226b3d96b
enable the jit random test (#2580) 2023-12-02 20:25:23 -05:00
chenyu 09c9794f3f
clean external_test_opt.py (#2578) 2023-12-02 19:51:08 -05:00
George Hotz 171543fc8d
cleanups to save lines and files (#2577)
* runtime/graph -> features/graph

* put all the cstyle renderers in cstyle

* same line for those

* how did that pass mypy
2023-12-02 16:29:56 -08:00
George Hotz a9a76639c8
that's not needed (#2574) 2023-12-02 16:01:29 -08:00
chenyu 875c34bfc4
minor lazy tweak before rewrite (#2573) 2023-12-02 18:23:33 -05:00
qazal fa1d4dd14b
implement MAX in other dtypes (#2572) 2023-12-02 15:21:59 -08:00
nimlgen 065495e0c9
save a few lines in ops_gpu (#2564)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-02 15:05:22 -08:00
Oleg Rybalko 5e87083783
Whisper + LLAMA + VITS (#2332)
* feat: working voice 2 text using whisper

* feat: added llama generation

* feat: vits init

* feat: more accurate voice conversion

* feat: support for tts and working pipeline for the first pass

* fix: linter checks

* refactored vits initialization and inference, added mmts-tts support

* fixed process sync and now we can have an infinite conversation

* reuse output stream to remove overhead of creating a new one each time

* added pre-prompt configuration with yaml files

* adjusted code to merge PR which changed whisper

* optimized whisper: it's blazing fast now, and the line count is reduced

* added better debug printing

* use jitted encode function for whisper, added timings and removed response delim to save speed on generating those tokens

* fixed hf convert and now it's working with tinyllama

* added tinyllama config

* refactored code and made it work with all llama models

* prettier order

* prettier order

* fixed suffix for tinyllama and refactored convert_from_hf

* added missing parameters

* fixed stream release and added missing params

* jitted dp and encoder

* jitted flow forward

* removed re-init of espeak on each call to save time

* jitted generator forward for blazing fast tts

* added contextmanager for displaying a chat log

* removed whitespace for pylint

* updated code to support latest fetch func

* wait for llama eos token and pass params from cli to llama

* listen for a non-fixed amount of time

* refactored code a bit

* removed thresholding and now the output streams directly to whisper

* tokenize llama output for vits batch size to work and stream each sentence to a speaker

* changed speaker

* whisper is now printing on the same line

* don't trigger llama on whisper output in parens

* added tinyllama chat model

* adjusted code to work with tinyllama chat model

* removed unused cli arg

* autofetch tokenizer and tinyllama model. add 3 chat tokens to the tokenizer

* fixed issue with long sentences by chunking them

* support for multiline llama output

* prettified log output

* adjusted sentence length

* remove quote from response to avoid funny tts

* fixed prompts

* added missing parameter
2023-12-02 15:03:46 -08:00
qazal 47cec4caf3
int operations shouldn't have a fast math flag (#2571) 2023-12-02 14:53:36 -08:00
George Hotz d6b404ac11
No dtype alloc (#2570)
* fix all allocs

* improve docs

* ugh fix fake alloc
2023-12-02 13:29:40 -08:00
chenyu c8774713c5
lazy cleanup (#2567) 2023-12-02 13:21:43 -05:00
George Hotz 5068e99d18
refactor to remove extra kernel params (#2563)
* refactor to have compiled kernel

* bugfixes

* docs/beautiful.py

* revert that

* fix tests
2023-12-02 00:32:25 -08:00
George Hotz 27481b9206
Switch ops_gpu -> gpuctypes (#2532)
* ops_gpu is go

* fix size 0

* fix image, and add more tests

* nerf openpilot test, doesn't test thneed

* run the schedule

* better

* oops, new inputs

* delete pyopencl

* Update ops_gpu.py
2023-12-01 22:30:21 -08:00
qazal 99ee2ec37a
Refactor code_for_op to accept a dtype (#2555)
* update cstyle renderers to take a dtype in code_for_op

* implement NEG for bools in LLVM

* update triton

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-01 22:05:28 -08:00
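A minimal sketch of the refactor above: each op renderer also receives a dtype, which is what lets a backend emit dtype-specific code such as a bool NEG. Names and string templates are illustrative, not the real cstyle tables:

```
# Hypothetical sketch: dtype-aware op rendering, loosely in the code_for_op shape.
code_for_op = {
  "NEG": lambda x, dtype: f"(!{x})" if dtype == "bool" else f"(-{x})",
  "MAX": lambda a, b, dtype: f"max({a},{b})",
}

print(code_for_op["NEG"]("v0", "bool"))   # -> (!v0)
print(code_for_op["NEG"]("v0", "float"))  # -> (-v0)
```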
George Hotz 82fd932921
Lower schedule 2 (#2561)
* ls2

* fix types

* simpler

* cleaner
2023-12-01 20:25:49 -08:00
George Hotz 217cda81ba hotfix: no metalgraph if there's weird ops 2023-12-01 19:32:55 -08:00
George Hotz 6733425095
lower schedule (#2559)
* lower schedule

* remove RAND, and don't put load in the JIT yet

* better fix for that test
2023-12-01 19:17:46 -08:00
Christopher Mauri Milan 077567f62d
Remove as_buffer for TORCH (#2554)
* remove as_buffer for torch

* enable torch zerocopy if on cpu

* remove as_buffer even on torch:cpu
2023-12-01 18:51:38 -08:00
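The zero-copy bullet above leans on a standard fact: on CPU, torch.from_numpy shares storage with the source array, so no buffer copy is needed. A minimal sketch, independent of tinygrad's actual ops_torch code:

```
# torch.from_numpy shares memory with the numpy array on CPU (no copy).
import numpy as np
import torch

a = np.zeros(4, dtype=np.float32)
t = torch.from_numpy(a)
a[0] = 1.0
assert t[0].item() == 1.0  # the write through numpy is visible in the tensor
```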
chenyu 05a5357dd9
fix handcode_resnet50_opt.py (#2558) 2023-12-01 20:51:21 -05:00
chenyu 86fbd413f3
update test_real_world configs (#2557) 2023-12-01 20:03:52 -05:00
andresgit 00523d5656
New fix accessing elements created by padding (#2529)
* pad slice test cases, many failing

* fix failing test cases

check the mask if we are outside the base buffer,
and also create a multi-view if, in that case, we reshape to an empty shape

* real_offset calculation more readable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-01 19:08:10 -05:00
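A minimal sketch of the kind of case the pad/slice tests above cover: a slice that lives entirely in the padded region, where every element must come from the mask rather than the base buffer. This uses tinygrad's public Tensor API; the exact failing cases in the PR may differ:

```
# A slice wholly inside the padding must read zeros, not out-of-bounds memory.
from tinygrad.tensor import Tensor

t = Tensor([1.0, 2.0, 3.0]).pad(((2, 2),))  # -> [0, 0, 1, 2, 3, 0, 0]
print(t[0:2].numpy())                       # entirely outside the base buffer: [0. 0.]
```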
George Hotz bfdce1f0e7 hotfix: make openpilot test deterministic 2023-12-01 15:37:23 -08:00
George Hotz 9c306be282 better name for fast path 2023-12-01 15:32:47 -08:00
chenyu 67f4e03724
rewrite 0 size loadop into a CONST (#2556)
* rewrite 0 size loadop into a CONST

* check alloc size

* EMPTY is better

* Revert "EMPTY is better"

This reverts commit 574fe0f9ed28f1b97da5a81afdfd2cd5d9a94ff9.

* no ast is created

* fix test
2023-12-01 18:29:06 -05:00
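The rewrite above means a zero-element load never reaches the allocator: it becomes a CONST and, as the bullets say, no ast is created. A minimal sketch of the allocator-side guard that pairs with it, with illustrative names:

```
# Hypothetical sketch: with 0-size loads rewritten into CONST upstream,
# the allocator can insist on a strictly positive size.
import math

def alloc(shape, itemsize=4):
  size = math.prod(shape) * itemsize
  assert size > 0, "0-size loadop should have been rewritten into a CONST"
  return bytearray(size)
```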