Python API of Rabit
This page documents the Python API of rabit.
Reliable Allreduce and Broadcast Library.
Author: Tianqi Chen
rabit.allreduce(data, op, prepare_fun=None)
Perform allreduce, return the result.
Parameters: - data (numpy array) – Input data.
- op (int) – Reduction operator; one of MIN, MAX, SUM, BITOR.
- prepare_fun (function) – Lazy preprocessing function. If it is not None, prepare_fun(data) is called before the allreduce is performed, to initialize the data. If the result of the allreduce can be recovered directly, prepare_fun is NOT called.
Returns: result – The result of allreduce, with the same shape as data.
Return type: array_like
Notes
This function is not thread-safe.
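The call pattern described above can be sketched as follows. This is a minimal sketch, not code from the rabit sources: the helper names `global_sum` and `lazy_global_max` and the `fill` callback are hypothetical, and the reduction operators are assumed to be exposed as module-level constants such as `rabit.SUM` and `rabit.MAX`. The rabit module is passed in as an argument (in a real script you would `import rabit` and call `rabit.init()` first) so the helpers can also be tried with a stand-in object.

```python
def global_sum(rabit, local_values):
    """Element-wise sum of `local_values` over all workers.

    `rabit` is the imported rabit module (assumed to be initialized already);
    `local_values` is this worker's numpy array.
    """
    # Assumption: the operators are module-level constants (rabit.SUM, ...).
    return rabit.allreduce(local_values, rabit.SUM)


def lazy_global_max(rabit, buffer, fill):
    """Element-wise max over all workers, filling `buffer` lazily.

    `fill` is a hypothetical callback that writes this worker's values into
    `buffer` in place. It is passed through prepare_fun, so it is skipped
    when the allreduce result can be recovered directly.
    """
    def prepare(data):
        fill(data)

    return rabit.allreduce(buffer, rabit.MAX, prepare_fun=prepare)
```

Passing the expensive fill step through `prepare_fun` rather than running it before the call is what lets a recovering worker skip it.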
rabit.broadcast(data, root)
Broadcast object from one node to all other nodes.
Parameters: - data (any type that can be pickled) – Input data. If the current rank is not root, this can be None.
- root (int) – Rank of the node to broadcast data from.
Returns: object – the result of broadcast.
Return type: object
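A common use of broadcast is to let one worker produce a value and hand it to everyone else. The sketch below assumes a hypothetical helper name `share_config` and a hypothetical `load_config` callable; the rabit module is passed in as an argument so the helper can be tried with a stand-in object.

```python
def share_config(rabit, load_config, root=0):
    """Return the same configuration object on every worker.

    `load_config` is a hypothetical zero-argument callable that only the
    root worker needs to be able to run (for example, it reads a local file).
    The returned object must be picklable.
    """
    # Non-root workers may pass None; only the root's value is sent.
    config = load_config() if rabit.get_rank() == root else None
    return rabit.broadcast(config, root)
```

Every worker has to make the same `broadcast` call with the same `root`, including the workers that pass `None`.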
rabit.checkpoint(global_model, local_model=None)
Checkpoint the model.
This marks the end of a stage of execution. Each call to checkpoint increases the version number by one.
Parameters: - global_model (any type that can be pickled) – Globally shared model/state. The caller must guarantee that global_model is the same on all nodes when calling this function.
- local_model (any type that can be pickled) – Local model that is specific to the current node/rank. This can be None when no local state is needed.
Notes
local_model requires explicit replication of the model for fault tolerance, which adds a replication cost to the checkpoint call. global_model does not need explicit replication. Use global_model alone whenever possible.
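The difference between the two arguments can be sketched as below. The helper name `save_progress` and the parameter names `weights` and `shard_state` are hypothetical, and the rabit module is passed in as an argument so the helper can be tried with a stand-in object.

```python
def save_progress(rabit, weights, shard_state=None):
    """Checkpoint after finishing one stage and return the new version.

    `weights` stands for state that is identical on every worker (for
    example, the result of an allreduce). `shard_state` stands for state
    that only this worker holds.
    """
    if shard_state is None:
        # Preferred form: the global model needs no explicit replication.
        rabit.checkpoint(weights)
    else:
        # The local model is replicated for fault tolerance, which adds cost.
        rabit.checkpoint(weights, shard_state)
    return rabit.version_number()
```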
rabit.finalize()
Finalize the rabit engine.
Call this function after all jobs have finished.
rabit.get_processor_name()
Get the processor name.
Returns: name – The name of the processor (host).
Return type: str
rabit.get_rank()
Get rank of current process.
Returns: rank – Rank of current process.
Return type: int
rabit.get_world_size()
Get the total number of workers.
Returns: n – Total number of processes.
Return type: int
rabit.init(args=None, lib='standard')
Initialize the rabit module. Call this once before using anything else.
Parameters: - args (list of str, optional) – The list of arguments used to initialize rabit. Usually you pass in sys.argv. Defaults to sys.argv when it is None.
- lib ({'standard', 'mock', 'mpi'}) – Type of library we want to load
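The overall lifecycle of a worker (initialize once, query rank and world size, do the work, finalize) might look like the sketch below. The helper name `run_worker` and the `work` callable are hypothetical, and the rabit module is passed in as an argument so the helper can be tried with a stand-in object.

```python
def run_worker(rabit, argv, work):
    """Initialize rabit, run `work`, then finalize.

    `argv` is the argument list to hand to rabit (usually sys.argv).
    `work` is a hypothetical callable taking (rank, world_size).
    """
    # Must happen once, before any other rabit call.
    rabit.init(argv)

    rank = rabit.get_rank()
    world_size = rabit.get_world_size()
    result = work(rank, world_size)

    # Call after all jobs have finished.
    rabit.finalize()
    return result
```

In a real script this would typically be invoked as `run_worker(rabit, sys.argv, work)` after `import rabit` and `import sys`.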
rabit.load_checkpoint(with_local=False)
Load the latest checkpoint.
Parameters: with_local (bool, optional) – Whether the checkpoint contains a local model.
Returns: tuple – If with_local is True, (version, global_model, local_model); otherwise (version, global_model). If the returned version is 0, no model has been checkpointed yet, and the returned global_model and local_model are None.
Return type: tuple
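Because the returned version counts the checkpoints made so far, it can double as the index of the next stage to run. The sketch below shows one way to write a resumable loop on that basis. The helper name `train` and the `init_model` and `update` callables are hypothetical, and the rabit module is passed in as an argument so the helper can be tried with a stand-in object.

```python
def train(rabit, num_rounds, init_model, update):
    """Run `num_rounds` stages, resuming from the latest checkpoint.

    `init_model()` builds the starting model; `update(model, round_index)`
    returns the model after one more stage. Both are hypothetical callables.
    """
    version, model = rabit.load_checkpoint()
    if version == 0:
        # Nothing has been checkpointed yet, so `model` is None.
        model = init_model()

    # A restarted worker skips the rounds that were already checkpointed.
    for round_index in range(version, num_rounds):
        model = update(model, round_index)
        rabit.checkpoint(model)
    return model
```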
rabit.tracker_print(msg)
Print message to the tracker.
This function can be used to report progress information to the tracker.
Parameters: msg (str) – The message to be printed to tracker.
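One possible way to report progress is sketched below. Restricting the message to rank 0 is a design choice of this sketch (it avoids one copy of the line per worker), not something the API requires. The helper name `report_progress` and its parameters are hypothetical, and the rabit module is passed in as an argument so the helper can be tried with a stand-in object.

```python
def report_progress(rabit, round_index, loss):
    """Send one progress line to the tracker, from rank 0 only."""
    if rabit.get_rank() == 0:
        rabit.tracker_print("round %d: loss=%.4f\n" % (round_index, loss))
```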
rabit.version_number()
Return the version number of the currently stored model.
This is the number of calls to checkpoint made so far.
Returns: version – Version number of the currently stored model.
Return type: int