Run a PocketCoffea Analysis with law#
Introduction#
A typical analysis with PocketCoffea consists of several steps, such as dataset preparation, dataset processing, and plotting, which depend on each other. To handle these dependencies, the Python package law can be used to structure the analysis. With law you define tasks that depend on each other by declaring their requirements and outputs.
import law


class MyTask(law.Task):
    def requires(self):
        return MyOtherTask.req(self)

    def output(self):
        return law.LocalFileTarget("output.txt")

    def run(self):
        self.output().dump("Hello, World!")
law checks whether the output of a task exists. If the output is missing, law runs the task together with any of its dependencies whose outputs are also missing.
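The check-then-run behaviour can be illustrated with a toy model in plain Python. This is only a sketch and not law's implementation: the names ToyTask, build, store and executed are hypothetical, and outputs are tracked in an in-memory set instead of files.

```python
# Toy model of output-driven scheduling (NOT law's actual implementation).
# ToyTask, build, store and executed are hypothetical names for illustration.

store = set()    # names of outputs that already "exist"
executed = []    # order in which tasks were run


class ToyTask:
    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)

    def complete(self):
        # a task counts as complete when its output exists
        return self.name in store

    def run(self):
        executed.append(self.name)
        store.add(self.name)


def build(task):
    """Run `task` and, before it, every requirement whose output is missing."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep)
    task.run()


datasets = ToyTask("datasets")
histograms = ToyTask("histograms", requires=[datasets])
plots = ToyTask("plots", requires=[histograms])

store.add("datasets")  # pretend this output was produced in an earlier session
build(plots)
print(executed)        # ['histograms', 'plots']
```

Because the "datasets" output already exists, only the two downstream tasks run. Calling build(plots) a second time runs nothing.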
The following tasks are available in PocketCoffea (pocket_coffea/law_tasks/tasks):
CreateDatasets: prepare datasets for processing
JetCalibration: build jet calibration factory
Runner: run the analysis on the datasets
Plotter: create histogram plots from the processed datasets
PlotSystematics: create histogram plots for systematic shifts
Analysis Setup#
For a general setup of an analysis with PocketCoffea see the analysis example. Here we focus on the setup with law.
Two additional files are needed to set up the analysis with law:
law.cfg: configuration file for law
setup.sh: setup script to set environment variables and index law tasks
So the directory structure looks like this:
analysis-config
| config.py
| law.cfg
| setup.sh
|
└── datasets
| | datasets_definition.json
|
└── parameters
| | object_preselection.yaml
| | plotting.yaml
| | ...
The setup file could look like this:
# setup PATHs and configurations for law
setup_analysis() {
    # get important paths
    orig="${PWD}"
    this_file="$( [ ! -z "${ZSH_VERSION}" ] && echo "${(%):-%x}" || echo "${BASH_SOURCE[0]}" )"
    this_dir="$( cd "$( dirname "${this_file}" )" && pwd )"
    cd "${orig}"
    # === analysis ===
    export ANALYSIS_STORE="/net/scratch/analysis_outputs"
    # === law ===
    export LAW_CONFIG_FILE="${this_dir}/law.cfg"
    export LAW_HOME="${this_dir}/.law"
    if which law &> /dev/null; then
        # source law's bash completion script
        source "$( law completion )" ""
        # index law and check if it was successful
        law index -q
        return_code=$?
        if [ ${return_code} -ne 0 ]; then
            echo "failed to index law with error code ${return_code}"
            return 1
        else
            echo "law tasks were successfully indexed"
        fi
    fi
}
setup_analysis "$@"
The ANALYSIS_STORE variable specifies the output directory for the analysis. Every task creates a subdirectory with its output.
law configuration#
A detailed description of the configuration of law can be found in the configuration section of the law documentation.
The important part is the [modules] section, which lists all the modules that you want to use in your analysis.
[modules]
# must be accessible to python (PYTHONPATH)
pocket_coffea.law_tasks.tasks.datasets
pocket_coffea.law_tasks.tasks.runner
pocket_coffea.law_tasks.tasks.plotting
# custom tasks
custom_tasks.tasks
You can also add custom tasks here. Their modules must be importable, i.e. accessible via PYTHONPATH, so you should add the following to your setup.sh:
export PYTHONPATH="${this_dir}:${PYTHONPATH}"
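If indexing fails, a quick way to see which entries of the [modules] section are not importable is a small check like the one below. It is a sketch under the assumption that the section can be read with Python's configparser using allow_no_value=True; missing_modules is a hypothetical helper and not part of law or PocketCoffea.

```python
import configparser
import importlib.util


def missing_modules(cfg_text):
    """Return the [modules] entries that cannot be found on the current sys.path."""
    # allow_no_value lets configparser read bare module names (keys without "=")
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.optionxform = str  # keep module names case-sensitive
    parser.read_string(cfg_text)

    missing = []
    for name in parser["modules"]:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # raised when a parent package of a dotted name cannot be imported
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Stand-in configuration: "json" is a standard-library module that is always
# importable, "custom_tasks.tasks" is only found if its directory is on PYTHONPATH.
example_cfg = """
[modules]
json
custom_tasks.tasks
"""

print(missing_modules(example_cfg))
```

To check a real setup, pass the contents of your law.cfg instead of example_cfg, after sourcing setup.sh so that PYTHONPATH is set.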
Running law tasks#
Getting an Overview#
You can get an overview of all available tasks if you index law in verbose mode:
law index -v
Tasks have different parameters which control their behavior. The available parameters are listed if you press TAB twice after the task name on the command line.
law run CreateDatasets <TAB><TAB>
To get an extensive list with descriptions of all parameters you can use the --help flag:
law run CreateDatasets --help
Tasks depend on each other, so you can use the --print-deps <DEPTH> flag to get an overview of the dependencies of a task, where <DEPTH> is an integer that specifies the depth of the dependency tree (-1 displays all dependencies).
law run Plotter --print-deps -1
This just prints the dependencies. To check which tasks have already been executed and which are still missing you can use the --print-status <DEPTH> flag.
law run Plotter --print-status -1
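The meaning of <DEPTH> can be pictured with a toy dependency listing in plain Python. This is not law's code, and the three-task chain is an assumed example for illustration, not the real PocketCoffea task graph; dep_lines is a hypothetical helper.

```python
# Toy illustration of a depth-limited dependency listing (not law's own code).
# The chain below is an ASSUMED example, not the actual PocketCoffea task graph.

deps = {
    "Plotter": ["Runner"],
    "Runner": ["CreateDatasets"],
    "CreateDatasets": [],
}


def dep_lines(task, max_depth, depth=0):
    """Return indented task names down to max_depth; a negative value means no limit."""
    lines = ["  " * depth + task]
    if max_depth < 0 or depth < max_depth:
        for dep in deps[task]:
            lines.extend(dep_lines(dep, max_depth, depth + 1))
    return lines


print("\n".join(dep_lines("Plotter", 1)))   # the task and its direct requirements
print("\n".join(dep_lines("Plotter", -1)))  # the full tree
```

In this sketch a depth of 0 lists only the task itself, 1 adds its direct requirements, and -1 walks the whole tree.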
Executing Tasks#
To execute a task, use law run followed by the task name and its parameters. Some parameters have defaults, so you don’t have to specify them. To override a default, pass the parameter with the new value.
Let’s assume your configuration file is located at config/config.py, you want to run on lxplus with the dask executor, and you want to scale out to 50 workers. In the plots you don’t want to show the data, and you want the y-axis to be logarithmic. You can run the plotting task like this:
law run Plotter --cfg config/config.py --version version01 --executor dask@lxplus --scaleout 50 --blind True --log-scale
The version parameter is used to create a new directory in the output directory, for example to separate different configurations. The parameters --blind and --log-scale are both boolean parameters, so to set them to True you can just specify them without a value.
If a task has already been executed and you want to rerun it, you can use the --remove-output <DEPTH> flag, where <DEPTH> can be a single integer or a comma-separated tuple of up to three values. The first value is an integer that specifies the depth of the dependency tree. For the second value you can choose between d (dry), i (interactive) and a (all). The third value is a boolean that specifies whether the task should be executed after the removal (1) or not (0).
law run Plotter --cfg config.py --remove-output 0,i,1
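To make the structure of the value explicit, the following sketch splits a string such as 0,i,1 into its three parts. It only mirrors the format described above; parse_remove_output is a hypothetical helper and not law's own parser, and parts that are left out are returned as None rather than guessing law's defaults.

```python
# Illustrative parser for the "<depth>[,<mode>[,<run>]]" value described above.
# parse_remove_output is a hypothetical helper, not part of law.

MODES = {"d": "dry", "i": "interactive", "a": "all"}


def parse_remove_output(value):
    """Split a value such as "0,i,1" into (depth, mode, run_afterwards)."""
    parts = value.split(",")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 comma-separated values, got {value!r}")

    depth = int(parts[0])

    # parts that were not given stay None; no default is assumed here
    mode = None
    if len(parts) > 1:
        if parts[1] not in MODES:
            raise ValueError(f"unknown mode {parts[1]!r}, expected one of d, i, a")
        mode = MODES[parts[1]]

    run_afterwards = None
    if len(parts) > 2:
        if parts[2] not in ("0", "1"):
            raise ValueError(f"expected 0 or 1 as third value, got {parts[2]!r}")
        run_afterwards = parts[2] == "1"

    return depth, mode, run_afterwards


print(parse_remove_output("0,i,1"))  # (0, 'interactive', True)
```

Read this way, the command above removes only the output of Plotter itself (depth 0), asks interactively before removing, and then runs the task again.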