Statistical Analysis#

PocketCoffea ships a small toolkit (pocket_coffea.utils.stat) to turn the histograms stored in a .coffea output into CMS Combine datacards. The workflow is:

  1. Declare processes — group the analysis samples into the physics processes that become columns of the datacard (MCProcess, DataProcess, collected in MCProcesses / DataProcesses).

  2. Declare systematic uncertaintieslnN normalization terms and shape terms mapped to the histogram variations stored in the output (SystematicUncertainty, collected in Systematics).

  3. Build one Datacard per category — each datacard slices one category from one variable histogram and writes a .txt card plus a .root file with the nominal and varied templates.

  4. Combine the cardscombine_datacards writes a shell script that runs combineCards.py and text2workspace.py to merge all categories into a single workspace.

The classes are importable from two places:

1from pocket_coffea.utils.stat import (
2    MCProcess, DataProcess, MCProcesses, DataProcesses,
3    SystematicUncertainty, Systematics, Datacard,
4)
5from pocket_coffea.utils.stat.combine import Datacard, combine_datacards

(combine_datacards lives in pocket_coffea.utils.stat.combine; Datacard is re-exported from the package root.)

Key convention — processes are split per year. Every (process, year) pair becomes its own datacard column, named "{process}_{year}". Data is the single exception: it is written as one data_obs column summed over years. This is what lets you correlate/decorrelate systematics across years simply by choosing which years a SystematicUncertainty covers.

The snippets below use generic, illustrative names (signal, ttbar, singletop, …, categories SR / CR1 / CR2). Replace them with your own sample, process and variable names.


1. Loading the output#

Everything starts from a merged PocketCoffea output. You need the histogram for each variable you want to fit, the datasets_metadata (maps datasets → samples → data-taking periods) and the cutflow (used for the data_obs observation and to skip empty datasets).

1from coffea.util import load
2
3df = load("output_all.coffea")
4datasets_metadata = df["datasets_metadata"]
5cutflow           = df["cutflow"]
6
7years = ["2022_preEE", "2022_postEE", "2023_preBPix", "2023_postBPix", "2024"]
8label = "run3"

Pick the variables to fit and, optionally, the rebinning. Each fit category is associated with one variable histogram:

 1hist_dict = {
 2    "SR":  df["variables"]["discriminant"],
 3    "CR1": df["variables"]["control_var_1"],
 4    "CR2": df["variables"]["nJets"],
 5}
 6
 7# Optional per-category rebinning; `None` keeps the histogram's native binning.
 8bins_edges_dict = {
 9    "SR":  [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
10    "CR1": None,
11    "CR2": [4, 5, 6, 12],
12}

The keys of hist_dict are the categories — the values of the histogram’s category axis. One Datacard is built per key below.


2. Processes#

MC processes#

An MCProcess maps one datacard process to a list of analysis samples (the sample names as they appear in datasets_metadata), for a set of years:

1mc_processes = MCProcesses([
2    MCProcess(name="signal",    samples=["SignalSampleA", "SignalSampleB"], years=years, is_signal=True),
3    MCProcess(name="ttbar",     samples=["TTbar"],     years=years, is_signal=False, has_rateParam=True),
4    MCProcess(name="singletop", samples=["SingleTop"], years=years, is_signal=False),
5    MCProcess(name="vjets",     samples=["Vjets"],     years=years, is_signal=False),
6    MCProcess(name="diboson",   samples=["VV"],        years=years, is_signal=False),
7])

MCProcess fields:

field meaning
name datacard process name (the per-year columns become name_year).
samples list of analysis sample names summed into this process.
years years for which the process is defined.
is_signal True → process id ≤ 0 in Combine (signal). Required (no default).
has_rateParam True → a free rateParam (SF_<name>, range [0,5]) floats this process. Default False.
label optional display label; defaults to name.

MCProcesses is an ordered, name-indexable dict subclass. Its signal_processes / background_processes / n_processes helpers all expand to the per-year column names. Because it is a dict, you can build processes programmatically — handy when splitting one physics process into several template bins (e.g. by jet multiplicity or a categorisation index), each as its own (optionally floating) process:

 1template_bins = [
 2    ("bin_lo", ["BkgSample__bin_lo_a", "BkgSample__bin_lo_b"]),
 3    ("bin_mid", ["BkgSample__bin_mid_a", "BkgSample__bin_mid_b"]),
 4    ("bin_hi", ["BkgSample__bin_hi_a", "BkgSample__bin_hi_b"]),
 5]
 6
 7for suffix, samples in template_bins:
 8    name = f"bkg_{suffix}"
 9    mc_processes[name] = MCProcess(
10        name=name, samples=samples, years=years,
11        is_signal=False, has_rateParam=True,
12    )

Data process#

DataProcess declares the observation. Combine expects the name data_obs, and exactly one data process is supported per datacard:

1data_processes = DataProcesses([
2    DataProcess(
3        name="data_obs",
4        samples=["DATA_EGamma", "DATA_SingleMuon"],
5        years=years,
6    )
7])

If data_processes is omitted, the card is written with observation -1 (an Asimov/blinded placeholder).


3. Systematic uncertainties#

A SystematicUncertainty has these fields:

field meaning
name internal identifier; also the default coffea variation name to look up.
typ "lnN" or "shape". Note the spelling — the field is typ, not type.
processes list of process names the uncertainty applies to, or a dict {process: value} for per-process values.
years years the uncertainty applies to (controls correlation — see below).
value lnN: the κ factor (scalar, or (down, up) tuple). shape: conventionally 1.0. Omit only when processes is a dict.
datacard_name optional override for the nuisance name written in the card (defaults to name).
coffea_name_alias optional override for the variation name looked up in the histogram (string, or {process: variation} dict).

Systematics collects them; it keys on datacard_name, so each nuisance written to a card must have a unique datacard_name.

lnN — normalization#

A flat log-normal. Scalar, asymmetric tuple, or per-process dict:

 1# symmetric
 2SystematicUncertainty(name="lumi_2024", typ="lnN",
 3                      processes=[p.name for p in mc_processes.values()],
 4                      years=["2024"], value=1.016)
 5
 6# asymmetric (down, up)
 7SystematicUncertainty(name="some_unc", typ="lnN", processes=["ttbar"],
 8                      years=years, value=(0.98, 1.05))
 9
10# per-process values in one nuisance (no `value` argument)
11SystematicUncertainty(name="xsec", typ="lnN",
12                      processes={"ttbar": 1.06, "diboson": 1.05},
13                      years=years)

Correlation is controlled by years. Several SystematicUncertainty objects that share the same nuisance datacard_name but cover different years stay correlated within a group and uncorrelated across groups. A common pattern is to correlate luminosity across eras of the same main year but decorrelate different main years:

 1lumi_systematics = {
 2    "2022_preEE": 1.014, "2022_postEE": 1.014,
 3    "2023_preBPix": 1.013, "2023_postBPix": 1.013,
 4    "2024": 1.016,
 5}
 6
 7# group eras by main year (2022 / 2023 / 2024) ...
 8lumi_by_main_year = {}
 9for era, value in lumi_systematics.items():
10    main_year = era.split("_")[0]
11    lumi_by_main_year.setdefault(main_year, {"eras": [], "value": value})["eras"].append(era)
12
13# ... and emit one correlated nuisance per main year
14for main_year, data in lumi_by_main_year.items():
15    sys_unc.append(SystematicUncertainty(
16        name=f"lumi_{main_year}", typ="lnN",
17        processes=[p.name for p in mc_processes.values()],
18        years=data["eras"], value=data["value"],
19    ))

Per-process cross-section normalizations, correlated across years but uncorrelated between processes:

1norm = {"singletop": 1.05, "diboson": 1.05, "vjets": 1.05}
2
3for process, value in norm.items():
4    sys_unc.append(SystematicUncertainty(
5        name=f"process_norms_{process}", typ="lnN",
6        processes=[process], value=value, years=years,
7    ))

shape — template variations#

A shape systematic points at an Up/Down histogram variation that the processor already stored on the histogram’s variation axis (value=1.0 by convention). For each process and shift the code looks up the coffea variation f"{coffea_name}{shift}"; the variation written to the card is named f"{datacard_name}{shift}".

  • name is the default coffea variation name.

  • datacard_name renames the nuisance in the card.

  • coffea_name_alias overrides the variation name read from the histogram — most usefully as a dict, when the same physical uncertainty is stored under different variation names for different processes.

If a requested variation is missing from a sample’s histogram, the code falls back to the nominal template (and prints a notice); if any variation differs from nominal by more than 100% in some bin, a warning is printed (_check_shapes).

Variations correlated across years use one nuisance covering all years:

 1shapes_corr_by_year = [
 2    "sf_ele_reco", "sf_mu_id", "ele_scale", "muon_scale", "met_unclustered",
 3    # ...
 4]
 5
 6for syst in shapes_corr_by_year:
 7    sys_unc.append(SystematicUncertainty(
 8        name=syst, typ="shape",
 9        processes=all_process_names, years=years, value=1.0,
10    ))

Variations decorrelated by year are emitted once per year, folding the year into the nuisance name while pointing back at the same coffea variation via coffea_name_alias:

 1shapes_split_by_year = ["pileup", "btag_stat"]
 2
 3for syst in shapes_split_by_year:
 4    for year in years:
 5        sys_unc.append(SystematicUncertainty(
 6            name=f"{syst}_{year}", typ="shape",
 7            processes=all_process_names, years=[year], value=1.0,
 8            datacard_name=f"{syst}_{year}",   # per-year nuisance in the card
 9            coffea_name_alias=syst,           # but read the year-agnostic variation
10        ))

Per-process variation names use coffea_name_alias as a dict. This is the right tool when the same physical uncertainty is stored under different variation names for different process groups (for example a flavour-split scale factor where some processes carry a group-specific variation while the rest use a generic one):

 1# {process: variation_name_as_stored_in_the_histogram}
 2alias = {
 3    "ttbar":     "sf_btag_groupA",
 4    "singletop": "sf_btag_groupA",
 5    "vjets":     "sf_btag_groupB",
 6    "diboson":   "sf_btag_groupB",
 7}
 8
 9sys_unc.append(SystematicUncertainty(
10    name="sf_btag", typ="shape",
11    processes=list(alias.keys()), years=years, value=1.0,
12    coffea_name_alias=alias,
13))

Multiple independent shape sources (e.g. the regrouped JES sources) are usually generated from a dict mapping each variation name to the list of years it applies to — the list encodes the correlation:

 1jes_sources = {
 2    "JES_Absolute":      years,            # correlated across years
 3    "JER":               years,
 4    "JES_Absolute_2024": ["2024"],         # per-year, decorrelated
 5    # ...
 6}
 7
 8for syst, syst_years in jes_sources.items():
 9    sys_unc.append(SystematicUncertainty(
10        name=syst, typ="shape",
11        processes=all_process_names, years=syst_years, value=1.0,
12    ))

Collect everything:

1systematics = Systematics(sys_unc)
2print("lnN:",   [s.name for s in systematics.values() if s.typ == "lnN"])
3print("shape:", [s.name for s in systematics.values() if s.typ == "shape"])

4. Building the datacards#

One Datacard is built per category. The constructor signature is:

argument meaning
histograms the histogram for this category's variable (a value of hist_dict).
datasets_metadata df["datasets_metadata"].
cutflow df["cutflow"]; used for data_obs and to skip datasets empty in presel.
years years to include.
mc_processes the MCProcesses container.
systematics the Systematics container.
category the category (key of hist_dict) to slice from the histogram's category axis.
data_processes the DataProcesses container; default None (no observation row).
mcstat True (default) adds an autoMCStats line; or pass a dict with threshold / include_signal / hist_mode.
bins_edges optional rebinning edges; None keeps native binning.
bin_prefix optional prefix for the Combine bin name.
bin_suffix optional suffix; defaults to "_".join(years). The bin name is [prefix_]category[_suffix].
verbose True (default) prints diagnostics about skipped datasets / missing variations.
 1datacards = {}
 2
 3for cat, histograms in hist_dict.items():
 4    datacard = Datacard(
 5        histograms=histograms,
 6        datasets_metadata=datasets_metadata,
 7        cutflow=df["cutflow"],
 8        years=years,
 9        mc_processes=mc_processes,
10        data_processes=data_processes,
11        systematics=systematics,
12        category=cat,
13        bin_suffix=label,
14        bins_edges=bins_edges_dict[cat],
15    )
16
17    card_name   = f"datacard_{cat}_{label}.txt"
18    shapes_name = f"shapes_{cat}_{label}.root"
19    datacard.dump(directory=output_dir, card_name=card_name, shapes_name=shapes_name)
20    datacards[card_name] = datacard

Datacard.dump(directory, card_name, shapes_name) writes the text card and a ROOT file holding every process_year_nominal and process_year_<nuis>{Up,Down} template into directory. Useful attributes/properties: datacard.bin (the Combine bin name), datacard.observation, datacard.rate(process).

autoMCStats is enabled by default with threshold=0, include_signal=0, hist_mode=1 (see the Combine bin-wise stats docs). Pass mcstat=False to disable it, or a dict to override the values.

A rateParam is emitted automatically for every non-signal process declared with has_rateParam=True, as SF_<name> rateParam * <name>_<year> 1 [0,5].

Formatting caveat. The card columns are padded by process-name length. When processes carry long per-year suffixes the rate/systematic columns can render tightly; you may want to post-process the written card to widen the spacing. This is cosmetic — combineCards.py/text2workspace.py parse the card regardless.


5. Combining categories#

combine_datacards writes a shell script that runs combineCards.py over all per-category cards and then text2workspace.py to produce the workspace:

1combine_datacards(
2    datacards,                               # {filename: Datacard}
3    directory=output_dir,
4    path=f"combine_datacards_{label}.sh",    # script to write (must end in .sh)
5    card_name=f"datacard_combined_{label}.txt",
6    workspace_name=f"workspace_{label}.root",
7    channel_masks=False,
8)

The generated script combines the cards using each datacard’s bin name as the channel label (combineCards.py <bin>=<filename> ... > combined.txt) and then builds the workspace. Set channel_masks=True to append --channel-masks to text2workspace.py (useful for blinding or fitting subsets of categories).

Run the generated .sh inside a CMSSW + Combine environment to obtain the final workspace, which is then passed to combine.