Distributed Crunching #4

Open
opened 2020-01-16 18:51:59 -07:00 by jebba · 2 comments

Set up Tensorflow to use multiple nodes.

jebba self-assigned this 2020-01-16 20:47:48 -07:00
jebba added the Server, AI, ML labels 2020-01-16 20:57:12 -07:00
Poster
Owner

ml1, ml2, ml3, ml4, ml5 are set up with Debian Buster. Ready to install Tensorflow, etc.
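
For reference, a minimal sketch of how the cluster layout could be described once Tensorflow is installed, assuming the multi-worker setup reads it from the `TF_CONFIG` environment variable as Tensorflow's multi-worker strategies do. The hostnames come from the comment above; the port number and the helper name `build_tf_config` are hypothetical, chosen only for illustration.

```python
import json
import os

# Hostnames are the five nodes listed above.
WORKER_HOSTS = ["ml1", "ml2", "ml3", "ml4", "ml5"]
# Assumption: any free port reachable between the nodes would do.
PORT = 2222


def build_tf_config(task_index, hosts=WORKER_HOSTS, port=PORT):
    """Return the TF_CONFIG JSON string for the worker at `task_index`."""
    if not 0 <= task_index < len(hosts):
        raise ValueError(f"task_index must be in 0..{len(hosts) - 1}")
    cluster = {"worker": [f"{host}:{port}" for host in hosts]}
    task = {"type": "worker", "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


if __name__ == "__main__":
    # Example: the value that would be exported on ml3 (index 2).
    # It has to be set before the distribution strategy is created.
    os.environ["TF_CONFIG"] = build_tf_config(2)
    print(os.environ["TF_CONFIG"])
```

Every node gets the same `cluster` section and only the `task` index differs, so the same script can run on all five machines with a different index on each.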

Poster
Owner

It will correctly (afaict) distribute for Sequential() and model.compile() but fails on model.fit().

Some startup warnings to clean up:

WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']

Also looks like a small thing to fix:

Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
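
A sketch of what that message describes, assuming the input pipeline is a `tf.data.Dataset`. The helper name `use_data_sharding` is hypothetical, and this only follows the warning's own suggestion; it is not confirmed to affect the crash below.

```python
def use_data_sharding(dataset):
    """Return `dataset` with its auto-shard policy set to DATA."""
    # Imported inside the helper so the file still loads on a node
    # where Tensorflow has not been installed yet.
    import tensorflow as tf

    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA
    )
    return dataset.with_options(options)


if __name__ == "__main__":
    import tensorflow as tf

    # Stand-in dataset for illustration only; the real one would be
    # whatever is passed to model.fit().
    train_ds = tf.data.Dataset.range(32).batch(8)
    train_ds = use_data_sharding(train_ds)
```

With the DATA policy each worker reads the whole dataset and keeps only its own share of the elements, instead of Tensorflow trying to split the input by file.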

Main problem: during fit(), in the first epoch after running for a while, it dies with this:

Train for 121 steps, validate for 123 steps
Epoch 1/4
2020-01-18 18:55:17.738187: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Invalid argument: Upper bound check fail for input 1 from node Mkl2Tf/_77 to node scoped_allocator_concat_1_1 input bounds = [0x7fa85486f880, 0x7fa85486ff40] backing_tensor bounds = [0x7fa5d84e1040, 0x7fa600ef9100]
         [[{{node scoped_allocator_concat_1_1}}]]
Additional GRPC error information:
{"created":"@1579398917.740498537","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Upper bound check fail for input 1 from node Mkl2Tf/_77 to node scoped_allocator_concat_1_1 input bounds = [0x7fa85486f880, 0x7fa85486ff40] backing_tensor bounds = [0x7fa5d84e1040, 0x7fa600ef9100]\n\t [[{{node scoped_allocator_concat_1_1}}]]","grpc_status":3}
...
Reference: spacecruft/satnogs-wut#4