lib_config module

Management of CorrelX configuration files.

lib_config.create_directories(directories, v=0, file_log=sys.stdout)[source]

Create the directories given in a list of path strings.

lib_config.get_conf_out_dirs(master_name, hadoop_dir, app_dir, conf_dir, suffix_conf, output_dir, suffix_out, v=1, file_log=sys.stdout)[source]

Get paths for configuration and output folders. App and Conf directories are modified with master’s name.

Parameters:

master_name : str
master node hostname.
hadoop_dir : str
hadoop base folder.
app_dir : str
base app folder.
conf_dir : str
base configuration folder.
suffix_conf : str
suffix for configuration folder.
output_dir : str
base output folder.
suffix_out : str
suffix for output folder (only the string after the last “/” is taken).
v : int
verbose if 1.
file_log : file handler
handler for log file.
Returns:

app_dir : str
path to app folder (modified for this master).
conf_dir : str
path to configuration folder (for modified config file and bash scripts).
hadoop_conf_dir : str
path to hadoop configuration folder for master node.
hadoop_default_conf_dir : str
path to hadoop default configuration folder (to be used at slaves nodes).
output_dir : str
path in local filesystem for output file.
Having a separate folder associated with the master node allows multiple deployments to run in the same cluster when
the local filesystem is NFS.
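The per-master path derivation can be sketched as follows; the function and the exact join rules are illustrative assumptions, not the real internals:

```python
import os

def master_scoped_dirs(master_name, app_dir, conf_dir, suffix_conf,
                       output_dir, suffix_out):
    """Sketch: derive per-master folders so that several deployments can
    coexist on a shared (NFS) filesystem. Names are illustrative."""
    app_dir_m = os.path.join(app_dir, master_name)
    conf_dir_m = os.path.join(conf_dir, master_name + suffix_conf)
    # only the string after the last "/" of suffix_out is taken
    out_suffix = suffix_out.rsplit("/", 1)[-1]
    output_dir_m = os.path.join(output_dir, master_name + out_suffix)
    return app_dir_m, conf_dir_m, output_dir_m
```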
lib_config.get_config_mod_for_this_master(config_file, config_suffix, master_node, script_arg_zero)[source]
Create a new configuration file from the original (which is used as a template), overwriting all instances of “localhost” with the master node name, “~” with the home folder of the current user, “localuser” with the current user, and “localpath” with the path of script_arg_zero (mapred_cx.py).
Parameters:

config_file : str
path to CorrelX configuration file.
config_suffix : str
suffix to be added to resulting configuration file.
master_node : str
master node name.
script_arg_zero : str
path given for the main script (mapred_cx.py).
Returns:

new_config_file : str
configuration file plus suffix.

TO DO:

Move new configuration file into folder with logs for this job.
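The substitutions described above can be sketched with plain string replacement; this is a hedged illustration, not the actual implementation:

```python
import getpass
import os

def render_config_template(template_text, master_node, script_arg_zero):
    """Sketch of the substitutions described above (illustrative only)."""
    user = getpass.getuser()
    home = os.path.expanduser("~")
    return (template_text
            .replace("localhost", master_node)
            .replace("~", home)
            .replace("localuser", user)
            .replace("localpath",
                     os.path.dirname(os.path.abspath(script_arg_zero))))
```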
lib_config.get_configuration(file_log, config_file, timestamp_str, v=0)[source]

Read parameters from configuration file “configh.conf”.

Parameters:

file_log : handler to file
handler to log file.
config_file : str
path to CorrelX configuration file.
timestamp_str : str
suffix to be added to temporary data folder (where media will be split).
v : int
verbose if 1.
Returns:

MAPPER : str
Python mapper (.py).
REDUCER : str
Python reducer (.py).
DEPENDENCIES : str
Comma separated list of Python files required for mapper and reducer (1.py,2.py,etc).
PACKETS_PER_HDFS_BLOCK : int
Number of VDIF frames per file split.
CHECKSUM_SIZE : int
Number of bytes for checksum.
SRC_DIR : str
Folder with Python sources for mapper, reducer and dependencies.
APP_DIR : str
Folder to place mapper, reducer and dependencies in all nodes (in master-associated folder).
CONF_DIR : str
Base working folder for configuration files (to be updated later for this master).
TEMPLATES_CONF_DIR : str
Folder with templates for core-site.xml,yarn-site.xml,mapred-site.xml,hdfs-site.xml.
TEMPLATES_ENV_DIR : str
Folder with templates for hadoop-env.sh, etc.
HADOOP_DIR : str
Path to Hadoop home folder.
HADOOP_CONF_DIR : str
Path to Hadoop configuration folder (to be updated later for this master).
NODES : str
File to write list of nodes to host the cluster (one node per line).
MAPPERSH : str
File to write bash script for mapper (call to python script with all arguments).
REDUCERSH : str
File to write bash script for reducer (call to python script with all arguments).
JOBSH : str
File to write bash script for job request for Hadoop (call to python script with all arguments).
PYTHON_X : str
Path to Python executable.
USERNAME_MACHINES : str
Username for ssh into the cluster machines.
MAX_SLAVES : int
Maximum number of worker nodes (-1 no maximum).
SLAVES : str
Filename for Hadoop slaves file.

MASTERS : str
Filename for Hadoop masters file.
MASTER_IS_SLAVE : bool
Boolean, if 1 the master also launches a nodemanager (takes part in mapreduce).
HADOOP_TEMP_DIR : str
Folder for Hadoop temporary folders.
DATA_DIR : str
Path with media input files.
DATA_DIR_TMP : str
Path to folder to place splits of input file before moving them to the distributed filesystem.
HDFS_DATA_DIR : str
Path in the HDFS distributed filesystem to move input splits.
HADOOP_START_DELAY : str
Number of seconds to wait after every interaction with Hadoop during the cluster initialization.
HADOOP_STOP_DELAY : str
Number of seconds to wait after every interaction with Hadoop during the cluster termination.
PREFIX_OUTPUT : str
Prefix for output file.
HADOOP_TEXT_DELIMITER : str
Text delimiter for input splits (lib_mapredcorr.run_mapreduce_sh).
OUTPUT_DIR : str
Folder in local filesystem to place output file.
OUTPUT_SYM : str
Folder within experiment configuration folders to place symbolic link to output file.
RUN_PIPELINE : bool
Boolean, if 1 will run in pipeline mode.
RUN_HADOOP : bool
Boolean, if 1 will run Hadoop.
MAX_CPU_VCORES : int
Maximum number of virtual CPU cores.
HDFS_REPLICATION : int
Number of copies of each input split in HDFS.
OVER_SLURM : bool
Boolean, 1 to run in a cluster where the local filesystem is NFS (or synchronized among all nodes).
HDFS_COPY_DELAY : int
Number of seconds to wait after every interaction with Hadoop during file distribution to HDFS.
FFT_AT_MAPPER : bool
Boolean, if 0 FFT is done at reducer (default).
INI_FOLDER : str
Folder with experiment .ini files.
INI_STATIONS : str
Stations ini file name.
INI_SOURCES : str
Sources ini file name.
INI_DELAY_MODEL : str
Delay model ini file name.
INI_DELAYS : str
Delay polynomials ini file name.
INI_MEDIA : str
Media ini file name.
INI_CORRELATION : str
Correlation ini file name.
INTERNAL_LOG_MAPPER
[remove] currently default 0.
INTERNAL_LOG_REDUCER
[remove] currently default 0.
ADJUST_MAPPERS : float
Force number of mappers computed automatically to be multiplied by this number.
ADJUST_REDUCERS : float
Force number of reducers computed automatically to be multiplied by this number.
FFTS_PER_CHUNK
[Remove] Number of DFT windows per mapper output, -1 by default (whole frame)
TEXT_MODE : bool
True by default.
USE_NOHASH_PARTITIONER : bool
True to use NoHash partitioner.
USE_LUSTRE_PLUGIN : bool
True to use Lustre plugin for Hadoop.
LUSTRE_USER_DIR : str
Absolute path for the Lustre working path (used in mapreduce job).
LUSTRE_PREFIX : str
Path in Lustre to precede HDFS_DATA_DIR when using Lustre.
ONE_BASELINE_PER_TASK : int
0 by default (if 1, old implementation allowed scaling with one baseline per task in the reducers).
MIN_MAPPER_CHUNK
[Remove] Chunk constraints for mapper.
MAX_MAPPER_CHUNK
[Remove] Chunk constraints for mapper.
TASK_SCALING_STATIONS : int
0 by default (if 1, old implementation allowed linear scaling per task in the reducers).
SORT_OUTPUT : bool
If 1 will sort lines in output file.
BM_AVOID_COPY : bool
If 1 will not split and copy input files if this has already been done previously (for benchmarking).
BM_DELETE_OUTPUT : bool
If 1 will not retrieve output file from distributed filesystem (for benchmarking).
TIMEOUT_STOP : int
Number of seconds to wait before terminating nodes during cluster stop routine.
SINGLE_PRECISION : bool
If 1 computations will be done in single precision.
PROFILE_MAP : int
if 1 will generate call graphs with timing information for mapper (requires Python Call Graph package),
if 2 will use cProfile.
PROFILE_RED : int
if 1 will generate call graphs with timing information for reducer (requires Python Call Graph package),
if 2 will use cProfile.

Configuration:

All constants taken from const_config.py and const_hadoop.py.


TO DO:

OVER_SLURM: explain better, and check assumptions.
Remove INTERNAL_LOG_MAPPER and INTERNAL_LOG_REDUCER.
Remove FFTS_PER_CHUNK,MIN_MAPPER_CHUNK and MAX_MAPPER_CHUNK.
Check that SINGLE_PRECISION is followed in mapper and reducer.
lib_config.get_list_configuration_files(config_file)[source]

Get list of Hadoop configuration files.

Parameters:

config_file : str
Path to CorrelX configuration file.
Returns:

list_configurations : list
List of sections from the configuration file, associated with the lists in “pairs_config” below.
pairs_config : list
List with one sub-list of [parameter, value] pairs per section ([[param0,value0],[param1,value1],...]), used to update
the Hadoop configuration files later.
lib_config.get_log_file(config_file, suffix='', output_log_folder='e')[source]

Get logging files.

Parameters:

config_file : str
path to CorrelX configuration file.
suffix : str
suffix (with timestamp) to be added to log filename.
output_log_folder : str
suffix to be added to log file path.
Returns:

file_log : handler to file
handler to log file.
temp_log : str
path to temporary (buffer) file for system calls.
lib_config.get_nodes_file(config_file)[source]

Get name of file with list of nodes (from config file).

Parameters:

config_file : str
path to CorrelX configuration file.
Returns:

file_read_nodes : str
path to hosts file.
lib_config.is_this_node_master(master, temp_log, v=0, file_log=sys.stdout)[source]

Devised in case the script is run in parallel on many nodes. Currently simply used to enforce that only one node runs as master.

Parameters:

master : str
master node name.
temp_log : str
path to temporary file for system calls (buffer).
v : int
verbose if 1.
file_log : file handler
handler for log file.
Returns:

this_is_master : int
1 if current node is the master, 0 otherwise.
my_name : str
current node name.
my_ip : str
current node IP address.

TO DO:

Simplify this.
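A minimal sketch of the master check (assumed logic, not the actual source):

```python
import socket

def check_if_master(master):
    """Sketch: compare this node's hostname/IP against the master name."""
    my_name = socket.gethostname()
    try:
        my_ip = socket.gethostbyname(my_name)
    except socket.gaierror:
        my_ip = ""                    # name resolution unavailable
    this_is_master = 1 if master in (my_name, my_ip) else 0
    return this_is_master, my_name, my_ip
```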
lib_config.override_configuration_parameters(forced_configuration_string, config_file, v=1, file_log=sys.stdout)[source]

This function takes the string of parameters passed to the main script and overrides the corresponding parameters in the configuration file. This simplifies batch testing.

Parameters:

forced_configuration_string : str
Comma separated list of parameter0=value0,parameter1=value1,...
config_file : str
Path to CorrelX configuration file.
v : int
Verbose if 1.
file_log : file handler
Handler for log file.
Returns:

N/A

Assumptions:

Assuming that C_H_MAPRED_RED_OPTS is higher than C_H_MAPRED_MAP_OPTS, so the first value is
taken for C_H_MAPRED_CHILD_OPTS.


Notes:

For new parameters in configh.conf:
(1) Add constants for CLI in const_config.py.
(2) Check/add constants for hadoop configuration files in const_hadoop.py (if applicable).
(3) Add parameter reading in get_configuration().
(4) Add option in if-structure below.
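Parsing the forced_configuration_string into [parameter, value] pairs can be sketched as follows (a hedged illustration, not the actual implementation):

```python
def parse_forced_params(forced_configuration_string):
    """Sketch: turn 'parameter0=value0,parameter1=value1,...' into
    [parameter, value] pairs ready for overriding the configuration."""
    pairs = []
    if forced_configuration_string:
        for item in forced_configuration_string.split(","):
            param, _, value = item.partition("=")
            pairs.append([param.strip(), value.strip()])
    return pairs
```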
lib_config.overwrite_nodes_file(nodes_list, nodes_file, v=0, file_log=sys.stdout)[source]

Overwrite the nodes file (in case fewer nodes are requested than are available).

Parameters:

nodes_list : list of str
names of the nodes in the allocation.
nodes_file : str
path to nodes file.
v : int
verbose if 1.
file_log : file handler
handler for log file.
lib_config.reduce_list_nodes(num_slaves, nodes_list, v=1, file_log=sys.stdout)[source]

Reduce list of nodes given a maximum number of nodes.

Parameters:

num_slaves : int
maximum number of slaves (-1 for no maximum).
nodes_list : list of str
names of nodes.
Returns:

num_slaves : int
number of nodes in the updated list.
nodes_list : list of str
updated list of nodes.
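A minimal sketch of this behavior:

```python
def reduce_list_nodes_sketch(num_slaves, nodes_list):
    """Sketch: cap the node list at num_slaves (-1 means no maximum)."""
    if num_slaves >= 0:
        nodes_list = nodes_list[:num_slaves]
    return len(nodes_list), nodes_list
```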
lib_config.update_config_param(source_file, pairs, v=0, file_log=sys.stdout)[source]
Update a list of [parameter, value] pairs in a configuration file. This should work for any .ini file, but it is
used to override parameters in the CorrelX configuration file.
Parameters:

source_file : str
configuration file (.ini).
pairs : list
list of [parameter,value].
v : int
verbose if 1.
file_log : file handler
handler for log file.
Returns:

N/A

TO DO:

Parameters that are not found are currently not added; this should be reported.