* feat: working voice-to-text using whisper
* feat: added llama generation
* feat: vits init
* feat: more accurate voice conversion
* feat: support for tts and working pipeline for the first pass
* fix: linter checks
* refactored vits initialization and inference, added mmts-tts support
* fixed process sync and now we can have an infinite conversation
* reuse output stream to remove overhead of creating a new one each time
* added pre-prompt configuration with yaml files
* adjusted code to merge PR which changed whisper
* optimized whisper, now it's blazing fast and also reduced number of lines
* added better debug printing
* use jitted encode function for whisper, added timings and removed response delim to save speed on generating those tokens
* fixed hf convert and now it's working with tinyllama
* added tinyllama config
* refactored code and made it work with all llama models
* prettier order
* prettier order
* fixed suffix for tinyllama and refactored convert_from_hf
* added missing parameters
* fixed stream release and added missing params
* jitted dp and encoder
* jitted flow forward
* removed re-init of espeak on each call to save time
* jitted generator forward for blazing fast tts
* added contextmanager for displaying a chat log
* removed whitespace for pylint
* updated code to support latest fetch func
* wait for llama eos token and pass params from cli to llama
* listen for a variable amount of time
* refactored code a bit
* removed thresholding and now the output streams directly to whisper
* tokenize llama output for vits batch size to work and stream each sentence to a speaker
* changed speaker
* whisper is now printing on the same line
* don't trigger llama on whisper output in parens
* added tinyllama chat model
* adjusted code to work with tinyllama chat model
* removed unused cli arg
* autofetch tokenizer and tinyllama model. add 3 chat tokens to the tokenizer
* fixed issue with long sentences by chunking them
* support for multiline llama output
* prettified log output
* adjusted sentence length
* remove quote from response to avoid funny tts
* fixed prompts
* added missing parameter
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* torch and numpy don't share ops anymore
* that should be filtered out elsewhere
* still const
* graph + enet example cleanup
* hmm, we do still need it because of symbolic
* add name support
* use fetch in gpt2
* remove requests from main lib, networkx also optional
* umm, keep that assert
* updates to fetch
* i love the walrus so much
* stop bundling mnist with tinygrad
* err, https
* download cache names
* add DOWNLOAD_CACHE_VERSION
* need env.
* ugh, wrong path
* replace get_child
* dynamically update help if MODEL_PARAMS changes and default size is the 1st
* beautiful mnist
* beautiful mnist example
* from tinygrad import Tensor
* more beautiful
* the jit is super core tinygrad
* globalcounters reset on jit run
* symlinks and exclude
* beautiful_cartpole
* evaluate is its own function
* no symlinks
* more beautiful
* jit reset for double speed
* type hinting for JIT
* beautiful_mnist gets 98%
* beautiful_mnist < 4s with BEAM=2
* better cartpole
* use actor critic
* zero_grad got lost
* delete double relu
* stable cartpole with PPO
* beautiful_cartpole is more beautiful
* REPLAY_BUFFER
* beautiful stuff typechecks
* None support in shape
* hp tuning
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda works
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* WIP: Stable diffusion WebGPU port
* Load whole model: split safetensor to avoid Chrome allocation limit
* Gitignore .DS_Store, remove debug print
* Clip tokenizer in JS
* WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS
* e2e stable diffusion flow
* Create initial random latent tensor in JS
* SD working e2e
* Log if some weights were not loaded properly
* Remove latent_tensor.npy used for debugging
* Cleanup, remove useless logs
* Improve UI
* Add progress bar
* Remove .npy files used for debugging
* Add clip tokenizer as external dependency
* Remove alphas_cumprod.js and load it from safetensors
* Refactor
* Simplify a lot
* Dedup base when limiting elementwise merge (webgpu)
* Add return type to safe_load_metadata
* Do not allow run when webgpu is not supported
* Add progress bar, refactor, fix special names
* Add option to choose between local and huggingface weights
* lowercase tinygrad :)
* fp16 model dl, decompression client side
* Cache f16 model in browser, better progress
* Cache miss recovery
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* stable diffusion < 324ms
* revert swap action
* fix tests due to more sum splitting
* REDUCEOP_SPLIT_THRESHOLD env var
* added from unaligned np test (#2134)
* align cpu buffer before copy into cl buffer (#2135)
* remove shelve from handcode_resnet50_opt.py (#2139)
* Add dictionary keys to reduce db size (#2131)
* work
* ignore beam cache
* dictionary keys are generic
* minor db cleanups
* fix baseline and extract dataset
* fix training
* log likelihood
* more lin to feats
* sts
* training policynet
* net sort of works
* dedup
* refactor, stupid new actions
* fix uops deduping
* BEAM_ESTIMATE
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
* feat: move to hip
* feat: special path for RawBufferTransfer
* feat: initial rawbuffertransfer
* feat: hip ipc
* feat: working hip ipc
* feat: need to base device without args
* feat: close mem handle
* feat: modified test
* feat: more multihip stuff
* clean: cleanup
* feat: cleaner
* feat: don't crash
* feat: test more
* clean: way cleaner hip wrapper
* feat: barrier
* feat: barrier
* feat: this breaks stuff
* feat: we can use empty here
* feat: maybe fix tests
* feat: maybe fix tests again?
* fix: probably fix tests
* feat: no waiting here
* feat: wait here
* feat: much larger test
* feat: need to sync here
* feat: make this async
* feat: no waiting!
* feat: cut here
* feat: sync copy
* feat: random imports
* feat: much cleaner world
* feat: restore this
* feat: restore this
* clean: cleanup
* feat: set this