Distributed Computing
HOWTO: set up and run TensorFlow on multiple nodes. These instructions are specific to this particular configuration.
Software
Main software in use:
- Debian
- Proxmox
- Ceph
- Python 3
- TensorFlow
- Jupyter
- clusterssh
Installation
Major steps:
- Install Proxmox on bare metal.
- Clone Debian KVM Nodes.
- Set up nodes.
- Install TensorFlow.
- Set up Ceph.
Proxmox
Setting up Proxmox is outside the scope of this document. All you really need is some virtual machines, however they are created.
Set up nodes
# On the main workstation, or on the node where you built TensorFlow:
NODES="ml1 ml2 ml3 ml4 ml5"
# On worker nodes:
sudo apt update
sudo apt install python3-pip sshfs jq
pip3 install --upgrade --user pip
# Make sure the new `pip3` at `~/.local/bin/pip3` comes first in `$PATH`.
pip3 install --upgrade --user -r requirements-node.txt
# If you have cloned the tensorflow repo, test with:
#python3 ~/devel/tensorflow/tensorflow/tensorflow/python/distribute/multi_worker_continuous_run_test.py
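The `NODES` variable set above can also drive a plain `ssh` loop when you want to run the same command on every worker without opening a clusterssh session. The following is a minimal sketch, not part of the original setup: `run_on_nodes` and `DRY_RUN` are hypothetical names introduced here, and it assumes key-based `ssh` login already works for each host in `NODES`. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper (not from the original setup): run one command on
# every host listed in $NODES. Assumes key-based ssh login to each node.
NODES="${NODES:-ml1 ml2 ml3 ml4 ml5}"

# DRY_RUN=1 (the default in this sketch) prints the ssh commands instead
# of executing them. Set DRY_RUN=0 to actually connect.
DRY_RUN="${DRY_RUN:-1}"

run_on_nodes() {
    # $NODES is intentionally unquoted so it splits on whitespace.
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh $node $*"
        else
            ssh "$node" "$@"
        fi
    done
}

# Example: check which Python 3 version each node has.
run_on_nodes python3 --version
```

Commands that need `sudo` will usually require a terminal or passwordless sudo on the nodes, so for the `apt` steps clusterssh may still be the more convenient choice.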