2020-01-16 14:36:52 -07:00
# Distributed Computing

HOWTO: set up and run TensorFlow on multiple nodes.

These notes are specific to this particular configuration.

# Software

Main software in use:

* Debian
* Proxmox
* Ceph
* Python 3
* TensorFlow
* Jupyter
* `clusterssh`

# Installation

Major steps:

1. Install Proxmox on bare metal.
1. Clone Debian KVM nodes.
1. Set up nodes.
1. Install TensorFlow.
1. Set up Ceph.

## Proxmox

Setting up Proxmox is outside the scope of this document.
All you really need is some virtual machines, however they are created.

* https://www.proxmox.com/en/proxmox-ve

## Set up nodes

```
# On the main workstation, or on the node where you built TensorFlow:
NODES="ml1 ml2 ml3 ml4 ml5"
```
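
The `NODES` list can drive a simple loop when `clusterssh` is not convenient. The following is only a sketch: the helper name `run_on_nodes` and the `RUNNER` override are hypothetical and not part of the original setup, and it assumes key-based SSH access to each node already works.

```shell
# Hypothetical helper: run one command on every node listed in $NODES.
# Falls back to the node names used in this document if NODES is unset.
NODES="${NODES:-ml1 ml2 ml3 ml4 ml5}"

run_on_nodes() {
    for node in $NODES; do
        # RUNNER defaults to ssh; set RUNNER=echo for a dry run that
        # only prints the node name and the command.
        "${RUNNER:-ssh}" "$node" "$@" || echo "FAILED: $node" >&2
    done
}

# Example (not executed here): run_on_nodes uname -n
```

A dry run with `RUNNER=echo run_on_nodes uname -n` prints one line per node, which is a cheap way to check the list before touching the machines.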

```
# On worker nodes:
sudo apt update
sudo apt install python3-pip sshfs jq
pip3 install --upgrade --user pip
# Make sure the new `pip3` at `~/.local/bin/pip3` comes first in `$PATH`.
pip3 install --upgrade --user -r requirements-node.txt

# If you have cloned the tensorflow repo, test with:
#python3 ~/devel/tensorflow/tensorflow/tensorflow/python/distribute/multi_worker_continuous_run_test.py
```
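
The `$PATH` step in the block above can be done by hand, or with something like the following sketch. It assumes a POSIX shell and that the user-level `pip3` was installed to `~/.local/bin`, as the comment in the block says. To make the change permanent, the same lines would go in a shell startup file such as `~/.profile`, which is an assumption about the node setup.

```shell
# Prepend ~/.local/bin to PATH unless it is already the first entry.
case "$PATH" in
    "$HOME/.local/bin":*) ;;                       # already in front
    *) PATH="$HOME/.local/bin:$PATH"; export PATH ;;
esac

# Show which pip3 the shell will now pick up, if any.
command -v pip3 || echo "pip3 not found in PATH" >&2
```

If `command -v pip3` still prints a system path such as `/usr/bin/pip3`, the user-level upgrade probably did not install a `pip3` script into `~/.local/bin`.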