Distributed Computing

HOWTO: set up and run TensorFlow on multiple nodes. These notes are specific to this particular configuration.

Software

Main software in use:

  • Debian
  • Proxmox
  • Ceph
  • Python 3
  • TensorFlow
  • Jupyter
  • clusterssh

Installation

Major steps.

  1. Install Proxmox on bare metal.
  2. Clone Debian KVM nodes.
  3. Set up nodes.
  4. Install TensorFlow.
  5. Set up Ceph.

Proxmox

Setting up Proxmox is outside the scope of this document. All you really need is some virtual machines, however they are created.

Set up nodes

# On main workstation or node where you built tensorflow:
NODES="ml1 ml2 ml3 ml4 ml5"
for i in $NODES; do
	scp -p tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl "$i":
done
# On worker nodes:
sudo apt update
sudo apt install python3-pip sshfs
# XXX deps...
pip3 install --upgrade setuptools
pip3 install --user tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl
pip3 install --user simplejson
pip3 install --user pillow
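
The copy step above can be previewed before anything is sent over the network. The following is a minimal sketch, not part of the original setup: the `run` and `copy_wheel` helpers and the `DRY_RUN` variable are hypothetical names, while the node list and wheel filename are the ones used above.

```shell
#!/bin/sh
# Sketch only: `run`, `copy_wheel`, and DRY_RUN are illustrative names.
NODES="${NODES:-ml1 ml2 ml3 ml4 ml5}"
WHEEL="${WHEEL:-tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Copy the wheel to the home directory on every node; stop at the first failure.
copy_wheel() {
    for node in $NODES; do
        run scp -p "$WHEEL" "$node:" || {
            echo "copy to $node failed" >&2
            return 1
        }
    done
}

copy_wheel
```

With the default `DRY_RUN=1` the script only prints the `scp` commands; setting `DRY_RUN=0` in the environment performs the copies.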

Alternatively, install the upstream TensorFlow packages from PyPI. Upgrade pip with pip itself first; with the older distribution-provided pip you will get TensorFlow 1.x.

pip3 install --user --upgrade pip
# Make sure the new `pip3` at `~/.local/bin/pip3` comes first in `$PATH`.
# Install TensorFlow:
pip3 install --user tensorflow
# If that fails because of the PATH, call the new pip3 directly:
~/.local/bin/pip3 install --user tensorflow
# Confirm which version was installed:
pip3 list | grep tensorflow
# TensorFlow ships a number of tests that can be run, such as:
python3 ~/devel/tensorflow/tensorflow/tensorflow/python/distribute/multi_worker_continuous_run_test.py
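
Because an outdated pip can leave a node with TensorFlow 1.x, it may be worth checking the installed major version on each node. A minimal sketch, assuming `python3` is on the `PATH`; `is_tf2` and `check_tensorflow` are hypothetical helper names, and `tf.__version__` is the version string that TensorFlow exposes.

```shell
# is_tf2 VERSION: succeed if the major version number is 2 or higher.
is_tf2() {
    major="${1%%.*}"                     # "2.1.0" -> "2"
    [ "$major" -ge 2 ] 2>/dev/null       # non-numeric input counts as failure
}

# check_tensorflow: import TensorFlow and report whether it is a 2.x release.
check_tensorflow() {
    version="$(python3 -c 'import tensorflow as tf; print(tf.__version__)' 2>/dev/null)" || {
        echo "tensorflow is not importable" >&2
        return 1
    }
    if is_tf2 "$version"; then
        echo "OK: tensorflow $version"
    else
        echo "too old: tensorflow $version" >&2
        return 1
    fi
}
```

Calling `check_tensorflow` on a worker node returns non-zero if the import fails or a 1.x release is found, so it can be used in a loop over the nodes.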

Usage

top

Meh

# For running some TensorFlow tests:
pip3 install --user portpicker
# For other examples/tests:
#pip3 install --user opencv-python
sudo apt install python3-opencv
pip3 install --user pandas
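
After installing these extras, a quick import check shows whether anything is still missing on a node. This is a sketch under assumptions: `missing_modules` is a hypothetical helper, and the import names (`cv2` for python3-opencv, `PIL` for pillow) are the usual ones for these packages.

```shell
# missing_modules MODULE...: print each Python module that fails to import.
# Returns non-zero if at least one module is missing.
missing_modules() {
    status=0
    for mod in "$@"; do
        if ! python3 -c "import $mod" >/dev/null 2>&1; then
            echo "$mod"
            status=1
        fi
    done
    return "$status"
}

# Import names for the packages installed in this document.
missing_modules portpicker cv2 pandas simplejson PIL \
    || echo "some modules are missing" >&2
```

Any module name printed by the last command still needs to be installed on that node.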