Run a PocketCoffea Analysis with law#

Introduction#

A typical analysis with PocketCoffea consists of several steps, such as e.g. dataset preparation, dataset processing and plotting, which depend on each other. To handle these dependencies the python package law can be used to structure the analysis. With law you can define tasks that depend on each other by defining requirements and outputs.

 1import law
 2
 3
 4class MyTask(law.Task):
 5    def requires(self):
 6        return MyOtherTask.req(self)
 7
 8    def output(self):
 9        return law.LocalFileTarget("output.txt")
10
11    def run(self):
12        self.output().dump("Hello, World!")

law checks wether the output of a tasks exists and runs the task and its dependencies if the output is missing.

The following tasks are available in PocketCoffea (pocket_coffea/law_tasks/tasks):

  • CreateDatasets: prepare datasets for processing

  • JetCalibration: build jet calibration factory

  • Runner: run the analysis on the datasets

  • Plotter: create histogram plots from the processed datasets

  • PlotSystematics: create histogram plots for systematic shifts

Analysis Setup#

For a general setup of an analysis with PocketCoffea see the analysis example. Here we focus on the setup with law.

Two additional files are needed to setup the analysis with law:

  • law.cfg: configuration file for law

  • setup.sh: setup script to set environment variables and index law tasks

So the directory structure looks like this:

analysis-config
|   config.py
|   law.cfg
|   setup.sh
|
└── datasets
|   |   datasets_definition.json
|
└── parameters
|   |   object_preselection.yaml
|   |   plotting.yaml
|   |   ...

The setup file could look like this:

# setup PATHs and configurations for law

setup_analysis() {
    # get important paths
    orig="${PWD}"
    this_file="$( [ ! -z "${ZSH_VERSION}" ] && echo "${(%):-%x}" || echo "${BASH_SOURCE[0]}" )"
    this_dir="$( cd "$( dirname "${this_file}" )" && pwd )"
    cd "${orig}"

    # === analysis ===
    export ANALYSIS_STORE="/net/scratch/analysis_outputs"

    # === law ===
    export LAW_CONFIG_FILE="${this_dir}/law.cfg"
    export LAW_HOME="${this_dir}/.law"

    if which law &> /dev/null; then
        # source law's bash completion script
        source "$( law completion )" ""
        # index law and check if it was successful
        law index -q
        return_code=$?
        if [ ${return_code} -ne 0 ]; then
            echo "failed to index law with error code ${return_code}"
            return 1
        else
            echo "law tasks were successfully indexed"
        fi
    fi

}

setup_analysis "$@"

The ANALYSIS_STORE variable is used to specify the output directory for the analysis. Every task creates a subdirectory with its output.

law configuration#

A detailed description of the configuration of law can be found in the configuration section of the law documentation.

The important part is the [modules] section, where all the modules that you want to use in your analysis are listed.

[modules]

# must be accessible to python (PYTHONPATH)
pocket_coffea.law_tasks.tasks.datasets
pocket_coffea.law_tasks.tasks.runner
pocket_coffea.law_tasks.tasks.plotting

# custom tasks
custom_tasks.tasks

You can also add custom tasks here, which must be accessible via PYTHONPATH. So you should add the following to your setup.sh:

export PYTHONPATH="${this_dir}:${PYTHONPATH}"

Running law tasks#

Getting an Overview#

You can get an overview of all available tasks if you index law in verbose mode:

law index -v

Tasks have different parameters which control the behavior of the task. The available parameters will be listed if you press TAB twice after the task name in the command line.

law run CreateDatasets <TAB><TAB>

To get an extensive list with descriptions of all parameters you can use the --help flag:

law run CreateDatasets --help

Tasks depend on each other, so you can use the --print-deps <DEPTH> flag to get an overview of the dependencies of a task, where <DEPTH> is an integer that specifies the depth of the dependency tree (-1 displays all dependencies).

law run Plotter --print-deps -1

This just prints the dependencies. To check which tasks have already been executed and which are still missing you can use the --print-status <DEPTH> flag.

law run Plotter --print-status -1

Executing Tasks#

To execute a task you simply use law run followed by the task name and the parameters. Some parameters have defaults, so you don’t have to specify them. If you want to overwrite a default parameter you can do this by specifying the parameter with the new value.

Let’s assume you have your configuration file in a folder config/config.py, you want to run on lxplus with the dask executor and you want to scale out to 50 workers. In the plots you dont want to plot the data and you want the y-axis to be logarithmic. You can run the plotting task like this:

law run Plotter --cfg config/config.py --version version01 --executor dask@lxplus --scaleout 50 --blind True --log-scale 

The version parameter is used to create a new directory in the output directory to for example separate different configurations. The parameters --blind and --log-scale are both boolean parameters, so to set them to True you can just specify them without a value.

If a task has already been executed and you want to rerun it you can use the --remove-output <DEPTH> flag, where <DEPTH> can be an integer or a tuple. The first integer specifies the depth of the dependency tree. For the second value you can choose between d (dry), i (interactive) and a (all). The third value is a boolean that specifies if the task should be executed after the removal (1) or not (0).

law run Plotter --cfg config.py --remove-output 0,i,1