Datasets handling#

The framework assigns a specific meaning to the terms dataset, part, sample, and subsample.

  • Dataset: a dataset is a set of NanoAOD files containing events generated with the same configuration.

    Each dataset has a set of metadata, a unique datataking period, and an isMC attribute. Each dataset groups files from one or more “CMS dataset” names (e.g. "/ttHTobb_M125_TuneCP5_13TeV-powheg-pythia8/RunIISummer.....-v2/NANOAODSIM"). A dataset can have a part metadata, which is usually also part of the name: for example, the dataset WjetsHT200-500_2016APV should have the metadata part=HT200-500. Inside the framework, the part metadata can be used to define customizations.


The dataset name must be unique inside the framework and is used to identify the output objects. The script described below makes sure that the final dataset name is composed of the user-defined label, the datataking period (and the era for data), and the part metadata.

  • Sample: a sample is a set of events representing a common physics process.

    The sample name can be seen as the category of events: multiple datasets may have the same sample name. Inside the framework the sample label is used to categorize the events and to customize weights, variables and categories.

  • Subsample: a subsample is a subset of a sample: it can be seen as a categorization applied only to events part of a specific sample.

    Subsamples are configured in a specific way using Cut objects: see the Configuration docs for more details.
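
The relation between datasets and samples can be pictured as a many-to-one mapping. The sketch below is only an illustration of that idea: the dataset names and the helper `group_by_sample` are hypothetical and are not part of the framework.

```python
from collections import defaultdict

# Hypothetical dataset -> sample mapping, for illustration only.
datasets = {
    "DYJetsToLL_M-50_2018": "DYJetsToLL",
    "DYJetsToLL_M-10to50_2018": "DYJetsToLL",
    "TTTo2L2Nu_2018": "TTTo2L2Nu",
}


def group_by_sample(dataset_to_sample):
    """Return {sample: [dataset names]}, keeping the input order."""
    groups = defaultdict(list)
    for dataset, sample in dataset_to_sample.items():
        groups[sample].append(dataset)
    return dict(groups)


samples = group_by_sample(datasets)
# Two datasets share the DYJetsToLL sample; one dataset forms TTTo2L2Nu.
print(samples)
```

Weights, variables and categories customized for a sample would then apply to every dataset listed under it.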

Datasets definition files#

Input datasets for the analyses are defined in a JSON file following the syntax below:

{
    "DYJetsToLL_M-50":{
        "sample": "DYJetsToLL",
        "json_output": "datasets/DYJetsToLL_M-50.json",
        "files":[
            { "das_names": 
                ["/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL18NanoAODv9-106X_upgrade2018_realistic_v16_L1v1-v2/NANOAODSIM"],
              "metadata": {
                  "year":"2018",
                  "isMC": true,
                  "xsec": 6077.22,
                  "part": "M-50"
              }
            }
        ]
    },

    "DATA_SingleMuon": {
        "sample": "DATA_SingleMuon",
        "json_output": "datasets/DATA_SingleMuon.json",
        "files": [
            {
                "das_names": [
                    "/SingleMuon/Run2018A-UL2018_MiniAODv2_NanoAODv9-v2/NANOAOD"
                ],
                "metadata": {
                    "year": "2018",
                    "isMC": false,
                    "primaryDataset": "SingleMuon",
                    "era": "A"
                }
            },
            {
                "das_names": [
                    "/SingleMuon/Run2018B-UL2018_MiniAODv2_NanoAODv9-v2/NANOAOD"
                ],
                "metadata": {
                    "year": "2018",
                    "isMC": false,
                    "primaryDataset": "SingleMuon",
                    "era": "B"
                }
            }
         ]
    }
}

The framework uses this definition to build the actual dataset JSON file.

  • The user defined label makes the dataset unique.

  • The sample key is the one used internally in the framework to identify the sample type (see explanation above).

  • The json_output defines the output location for the list of files.

The same dataset can contain different groups of dataset (DAS) names, each with a separate metadata dictionary. Each group is interpreted by the script to create a unique set of files, with a unique label built as {user_defined_label}__{part}__{year}_{Era}.
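
Before running the build script, it can be handy to check that a definition file has the shape described above. The following is a minimal sketch, not the framework's own validation: the function `check_definitions` is hypothetical, and which keys it treats as mandatory is an assumption inferred from the example definition.

```python
import json

# Keys inferred from the example definition in this section; treating them as
# mandatory is an assumption of this sketch, not a documented schema.
ENTRY_KEYS = ("sample", "json_output", "files")
METADATA_KEYS = ("year", "isMC")


def check_definitions(definitions):
    """Return a list of problems found in a dataset definitions dict."""
    problems = []
    for label, entry in definitions.items():
        for key in ENTRY_KEYS:
            if key not in entry:
                problems.append(f"{label}: missing '{key}'")
        for index, group in enumerate(entry.get("files", [])):
            where = f"{label}, file group {index}"
            if not group.get("das_names"):
                problems.append(f"{where}: missing or empty 'das_names'")
            metadata = group.get("metadata", {})
            for key in METADATA_KEYS:
                if key not in metadata:
                    problems.append(f"{where}: missing metadata '{key}'")
            # Assumption: data (isMC false) carries an era, as in the example.
            if metadata.get("isMC") is False and "era" not in metadata:
                problems.append(f"{where}: data without 'era'")
    return problems


# A deliberately incomplete definition: the data group has no "era".
definitions = json.loads("""
{
    "DATA_SingleMuon": {
        "sample": "DATA_SingleMuon",
        "json_output": "datasets/DATA_SingleMuon.json",
        "files": [
            {
                "das_names": ["/SingleMuon/Run2018A-UL2018_MiniAODv2_NanoAODv9-v2/NANOAOD"],
                "metadata": {"year": "2018", "isMC": false, "primaryDataset": "SingleMuon"}
            }
        ]
    }
}
""")

problems = check_definitions(definitions)
for problem in problems:
    print(problem)
```

Returning a list of messages instead of raising on the first problem lets all issues in a large definition file be reported in one pass.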

To build the JSON dataset, run the following script:

   ___       _ __   _____       __               __ 
  / _ )__ __(_) /__/ / _ \___ _/ /____ ____ ___ / /_
 / _  / // / / / _  / // / _ `/ __/ _ `(_-</ -_) __/

usage: [-h] [--cfg CFG] [-k KEYS [KEYS ...]] [-d] [-o] [-c] [-s] [-l LOCAL_PREFIX] [-ws WHITELIST_SITES [WHITELIST_SITES ...]] [-bs BLACKLIST_SITES [BLACKLIST_SITES ...]] [-rs REGEX_SITES]

Build dataset fileset in json format

optional arguments:
  -h, --help            show this help message and exit
  --cfg CFG             Config file with parameters specific to the current run
  -k KEYS [KEYS ...], --keys KEYS [KEYS ...]
                        Dataset keys to select
  -o, --overwrite       Overwrite existing file definition json
  -c, --check           Check file existance in the local prefix
  -s, --split-by-year   Split output files by year
  -l LOCAL_PREFIX, --local-prefix LOCAL_PREFIX
                        Local prefix
  -ws WHITELIST_SITES [WHITELIST_SITES ...]
                        List of sites in the whitelist
  -bs BLACKLIST_SITES [BLACKLIST_SITES ...]
                        List of sites in the blacklist
  -rs REGEX_SITES, --regex-sites REGEX_SITES
                        Regex to filter sites

The DBS and Rucio services are queried to get information about the requested CMS datasets.

More than one version of the JSON dataset is saved:

  • one dataset configuration file containing the remote files with an explicit path, without using the AAA xrootd redirector (this can help with uproot misbehaviour with the redirector).

  • one _redirector.json dataset, containing the root:// prefix to use the AAA xrootd redirector.

  • one with a local prefix (passed with the -l option), referring to files on the local disk of the machine (no xrootd).
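
The difference between the redirector and the local-prefix versions comes down to what is prepended to each logical file name. The sketch below illustrates this under stated assumptions: the helper functions, the file name and the local prefix are hypothetical, and the redirector host is only an example value. It does not reproduce the script's actual code.

```python
def redirector_url(lfn, host):
    """Build an xrootd URL for a logical file name via a redirector host.

    The logical file name starts with "/", so the result contains the
    double slash that separates the host from an absolute path.
    """
    return f"root://{host}/{lfn}"


def local_path(lfn, local_prefix):
    """Build a plain filesystem path by prepending a local prefix."""
    return local_prefix.rstrip("/") + lfn


# Hypothetical logical file name and prefixes, for illustration only.
lfn = "/store/mc/example/file.root"
remote = redirector_url(lfn, "xrootd-cms.infn.it")
local = local_path(lfn, "/data/cms/")

print(remote)
print(local)
```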

The dataset output files can be split by year, to facilitate the bookkeeping, with the --split-by-year option.

It is recommended to run on local files, if present, or to use the version of the dataset with direct links to files (no AAA redirector). One can select or exclude the desired sites using the whitelist, blacklist, and regex options.

For example:

Restricting the dataset source to Europe (recommended when working from lxplus):

    --cfg datasets/datasets_definitions.json -o -rs 'T[123]_(FR|IT|DE|BE|CH|UK)_\w+'

Restricting the dataset source to two whitelisted sites:

    --cfg datasets/datasets_definitions.json -o -ws T3_CH_PSI T2_CH_CSCS

Blacklisting CERN and requesting the dataset from sites in CH:

    --cfg datasets/datasets_definitions.json -o -bs T0_CH_CERN -rs 'T[123]_CH_\w+'
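
To see which sites a given combination of whitelist, blacklist and regex would keep, the selection logic can be mimicked in a few lines of Python. This is a sketch of the idea only: `select_sites` is a hypothetical helper, the site list is an example, and matching the regex against the full site name is an assumption about how the script applies it.

```python
import re


def select_sites(sites, whitelist=None, blacklist=None, regex=None):
    """Keep the sites that pass the whitelist, blacklist and regex filters."""
    pattern = re.compile(regex) if regex else None
    selected = []
    for site in sites:
        if whitelist and site not in whitelist:
            continue
        if blacklist and site in blacklist:
            continue
        # Assumption: the regex has to match the whole site name.
        if pattern and not pattern.fullmatch(site):
            continue
        selected.append(site)
    return selected


# Example site names, for illustration only.
sites = ["T0_CH_CERN", "T2_CH_CSCS", "T3_CH_PSI",
         "T2_US_MIT", "T1_FR_CCIN2P3", "T2_DE_DESY"]

europe = select_sites(sites, regex=r"T[123]_(FR|IT|DE|BE|CH|UK)_\w+")
swiss = select_sites(sites, blacklist=["T0_CH_CERN"], regex=r"T[123]_CH_\w+")
two_sites = select_sites(sites, whitelist=["T3_CH_PSI", "T2_CH_CSCS"])

print(europe)
print(swiss)
print(two_sites)
```

Note that the European regex already excludes T0_CH_CERN, because the character class `[123]` does not match the tier `0`.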

Datasets building output#

The output of the script is the actual input of the coffea processing. It contains metadata and the explicit list of files to be analyzed.

Moreover, the output contains the total number of events in the dataset (from DBS) and the size of the dataset in bytes.
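
In the output file the metadata values are stored as strings (for instance "isMC": "True"), so code that reads the file directly may want to convert them first. The helper `parse_metadata` below is a hypothetical sketch of such a conversion, not something the framework is known to provide. The values are copied from the excerpt in this section.

```python
def parse_metadata(metadata):
    """Return a copy of a metadata dict with typed values."""
    parsed = dict(metadata)
    # bool("False") would be True, so compare against the literal string.
    parsed["isMC"] = metadata["isMC"] == "True"
    for key, cast in (("xsec", float), ("nevents", int), ("size", int)):
        if key in metadata:
            parsed[key] = cast(metadata[key])
    return parsed


raw = {
    "sample": "DYToMuMu",
    "year": "2023",
    "isMC": "True",
    "xsec": "6077.22",
    "nevents": "9710000",
    "size": "9115860727",
}

meta = parse_metadata(raw)
print(meta["isMC"], meta["xsec"], meta["nevents"], meta["size"])
```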

    "DYToMuMu_M-50_2023": {
        "metadata": {
            "das_names": "['/DYToMuMu_M-20_TuneCP5_13p6TeV-pythia8/Run3Winter23NanoAOD-GTv4Digi_GTv4_MiniGTv4_NanoGTv4_126X_mcRun3_2023_forPU65_v4-v2/NANOAODSIM']",
            "sample": "DYToMuMu",
            "year": "2023",
            "isMC": "True",
            "xsec": "6077.22",
            "nevents": "9710000",
            "size": "9115860727"
        "files": [