Distributed Computing

HOWTO set up and run Tensorflow on multiple nodes. This is specific to this particular configuration.

Software

Main software in use:

  • Debian
  • Proxmox
  • Ceph
  • Python 3
  • Tensorflow
  • Jupyter
  • clusterssh

Installation

Major steps.

  1. Install Proxmox on bare metal.
  2. Clone Debian KVM Nodes.
  3. Set up nodes.
  4. Install Tensorflow.
  5. Set up Ceph.

Proxmox

Setting up Proxmox is outside the scope of this document. All you really need is some virtual machines, however they are created.

Set up nodes

# On main workstation or node where you built tensorflow:
NODES="ml1 ml2 ml3 ml4 ml5"
# On worker nodes:
sudo apt update
sudo apt install python3-pip sshfs jq
pip3 install --upgrade --user pip
# Make sure the new `pip3` at `~/.local/bin/pip3` comes first in `$PATH`.
pip3 install --upgrade --user -r requirements-node.txt

# If you have cloned the tensorflow repo, test with:
#python3 ~/devel/tensorflow/tensorflow/tensorflow/python/distribute/multi_worker_continuous_run_test.py
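
The `NODES` list above can also drive the worker setup from the main workstation instead of typing the commands on each node. The following is a minimal sketch, not part of the original procedure: `run_on_nodes` and `DRY_RUN` are hypothetical names, and it assumes key-based SSH login and passwordless `sudo` on every node. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: run one command on every node listed in NODES.
# Assumes key-based SSH login and passwordless sudo on each node.
NODES="ml1 ml2 ml3 ml4 ml5"

# DRY_RUN=1 (the default in this sketch) prints the ssh commands
# instead of executing them. Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run_on_nodes() {
    # $1: command string to run on each node
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            printf 'ssh %s %s\n' "$node" "$1"
        else
            # -n keeps ssh from reading stdin inside the loop
            ssh -n "$node" "$1"
        fi
    done
}

# Same packages as the manual worker-node step above.
run_on_nodes "sudo apt update && sudo apt install -y python3-pip sshfs jq"
```

Review the printed commands first, then rerun with `DRY_RUN=0` once they look right. The `pip3` steps are left out of the sketch because they depend on `requirements-node.txt` already being present on each node.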