# Distributed Computing
HOWTO: set up and run Tensorflow on multiple nodes.
This guide is specific to this particular configuration.
# Software
Main software in use:
* Debian
* Proxmox
* Ceph
* Python 3
* Tensorflow
* Jupyter
* `clusterssh`
# Installation
Major steps.
1. Install Proxmox on bare metal.
1. Clone Debian KVM Nodes.
1. Set up nodes.
1. Install Tensorflow.
1. Set up Ceph.
## Proxmox
Setting up Proxmox is outside the scope of this document.
All you really need is some virtual machines, however
they are created.
* https://www.proxmox.com/en/proxmox-ve
## Set up nodes
```
# On main workstation or node where you built tensorflow:
NODES="ml1 ml2 ml3 ml4 ml5"
```
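The `NODES` list can drive a simple loop that runs the same command on every worker. The following is a minimal dry-run sketch, assuming passwordless SSH to each node is already set up; `node_commands` is a hypothetical helper name, and it only prints the `ssh` commands rather than executing them.

```shell
# Same node list as above.
NODES="ml1 ml2 ml3 ml4 ml5"

# Hypothetical helper: print the ssh command that would run "$@" on each node.
# This is a dry run; nothing is executed on the nodes.
node_commands() {
  for node in $NODES; do
    echo "ssh $node $*"
  done
}

# Example: show what would be run to refresh package lists everywhere.
node_commands sudo apt update
```

Once the printed commands look right, piping the output to `sh` would run them one node at a time. `clusterssh` remains the interactive alternative.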
```
# On worker nodes:
sudo apt update
sudo apt install python3-pip sshfs jq
pip3 install --upgrade --user pip
# Make sure the new `pip3` at `~/.local/bin/pip3` comes first in `$PATH`.
pip3 install --upgrade --user -r requirements-node.txt
# If you have cloned the tensorflow repo, test with:
#python3 ~/devel/tensorflow/tensorflow/tensorflow/python/distribute/multi_worker_continuous_run_test.py
```
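
The `$PATH` ordering mentioned in the comments above can be checked and fixed with a few lines of shell. This is a sketch, assuming the user-level pip was installed to `~/.local/bin` as described; adding it to a shell startup file such as `~/.profile` is one option, not a requirement.

```shell
# Prepend ~/.local/bin to PATH unless it is already the first entry.
case "$PATH" in
  "$HOME/.local/bin":*) ;;                      # already in front, nothing to do
  *) PATH="$HOME/.local/bin:$PATH" ;;
esac
export PATH

# Show which pip3 the shell now resolves, if any.
command -v pip3 || echo "pip3 not found in PATH"
```

If the last command prints a path other than `~/.local/bin/pip3`, the upgraded pip is not the one being used.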
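
For multi-worker runs, Tensorflow reads the cluster layout from the `TF_CONFIG` environment variable on each node. The sketch below builds that JSON from the same node names used in this document. The port `2222` and the helper name `tf_config` are assumptions made for illustration, not something this setup prescribes.

```shell
NODES="ml1 ml2 ml3 ml4 ml5"
PORT=2222   # assumed port; use whatever is open between the nodes

# Build the comma-separated list of quoted "host:port" worker addresses.
workers=""
for node in $NODES; do
  workers="${workers:+$workers, }\"$node:$PORT\""
done

# Hypothetical helper: print the TF_CONFIG JSON for the worker at index $1.
tf_config() {
  printf '{"cluster": {"worker": [%s]}, "task": {"type": "worker", "index": %d}}\n' \
    "$workers" "$1"
}

# Example: configuration for the first node (ml1).
tf_config 0
```

On each node, the result would be exported before starting the training script, for example `export TF_CONFIG="$(tf_config 0)"` on `ml1`, with the index changing per node.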