Statistical Analysis#
PocketCoffea ships a small toolkit (pocket_coffea.utils.stat) to turn the
histograms stored in a .coffea output into CMS Combine
datacards. The workflow is:
Declare processes — group the analysis samples into the physics processes that become columns of the datacard (
MCProcess,DataProcess, collected inMCProcesses/DataProcesses).Declare systematic uncertainties —
lnNnormalization terms andshapeterms mapped to the histogram variations stored in the output (SystematicUncertainty, collected inSystematics).Build one
Datacardper category — each datacard slices one category from one variable histogram and writes a.txtcard plus a.rootfile with the nominal and varied templates.Combine the cards —
combine_datacardswrites a shell script that runscombineCards.pyandtext2workspace.pyto merge all categories into a single workspace.
The classes are importable from two places:
1from pocket_coffea.utils.stat import (
2 MCProcess, DataProcess, MCProcesses, DataProcesses,
3 SystematicUncertainty, Systematics, Datacard,
4)
5from pocket_coffea.utils.stat.combine import Datacard, combine_datacards
(combine_datacards lives in pocket_coffea.utils.stat.combine; Datacard is
re-exported from the package root.)
Key convention — processes are split per year. Every
(process, year)pair becomes its own datacard column, named"{process}_{year}". Data is the single exception: it is written as onedata_obscolumn summed over years. This is what lets you correlate/decorrelate systematics across years simply by choosing which years aSystematicUncertaintycovers.
The snippets below use generic, illustrative names (signal, ttbar,
singletop, …, categories SR / CR1 / CR2). Replace them with your own
sample, process and variable names.
1. Loading the output#
Everything starts from a merged PocketCoffea output. You need the histogram for
each variable you want to fit, the datasets_metadata (maps datasets → samples
→ data-taking periods) and the cutflow (used for the data_obs observation
and to skip empty datasets).
1from coffea.util import load
2
3df = load("output_all.coffea")
4datasets_metadata = df["datasets_metadata"]
5cutflow = df["cutflow"]
6
7years = ["2022_preEE", "2022_postEE", "2023_preBPix", "2023_postBPix", "2024"]
8label = "run3"
Pick the variables to fit and, optionally, the rebinning. Each fit category is associated with one variable histogram:
1hist_dict = {
2 "SR": df["variables"]["discriminant"],
3 "CR1": df["variables"]["control_var_1"],
4 "CR2": df["variables"]["nJets"],
5}
6
7# Optional per-category rebinning; `None` keeps the histogram's native binning.
8bins_edges_dict = {
9 "SR": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
10 "CR1": None,
11 "CR2": [4, 5, 6, 12],
12}
The keys of hist_dict are the categories — the values of the histogram’s
category axis. One Datacard is built per key below.
2. Processes#
MC processes#
An MCProcess maps one datacard process to a list of analysis samples (the
sample names as they appear in datasets_metadata), for a set of years:
1mc_processes = MCProcesses([
2 MCProcess(name="signal", samples=["SignalSampleA", "SignalSampleB"], years=years, is_signal=True),
3 MCProcess(name="ttbar", samples=["TTbar"], years=years, is_signal=False, has_rateParam=True),
4 MCProcess(name="singletop", samples=["SingleTop"], years=years, is_signal=False),
5 MCProcess(name="vjets", samples=["Vjets"], years=years, is_signal=False),
6 MCProcess(name="diboson", samples=["VV"], years=years, is_signal=False),
7])
MCProcess fields:
| field | meaning |
|---|---|
name |
datacard process name (the per-year columns become name_year). |
samples |
list of analysis sample names summed into this process. |
years |
years for which the process is defined. |
is_signal |
True → process id ≤ 0 in Combine (signal). Required (no default). |
has_rateParam |
True → a free rateParam (SF_<name>, range [0,5]) floats this process. Default False. |
label |
optional display label; defaults to name. |
MCProcesses is an ordered, name-indexable dict subclass. Its
signal_processes / background_processes / n_processes helpers all expand to
the per-year column names. Because it is a dict, you can build processes
programmatically — handy when splitting one physics process into several
template bins (e.g. by jet multiplicity or a categorisation index), each as its
own (optionally floating) process:
1template_bins = [
2 ("bin_lo", ["BkgSample__bin_lo_a", "BkgSample__bin_lo_b"]),
3 ("bin_mid", ["BkgSample__bin_mid_a", "BkgSample__bin_mid_b"]),
4 ("bin_hi", ["BkgSample__bin_hi_a", "BkgSample__bin_hi_b"]),
5]
6
7for suffix, samples in template_bins:
8 name = f"bkg_{suffix}"
9 mc_processes[name] = MCProcess(
10 name=name, samples=samples, years=years,
11 is_signal=False, has_rateParam=True,
12 )
Data process#
DataProcess declares the observation. Combine expects the name data_obs, and
exactly one data process is supported per datacard:
1data_processes = DataProcesses([
2 DataProcess(
3 name="data_obs",
4 samples=["DATA_EGamma", "DATA_SingleMuon"],
5 years=years,
6 )
7])
If data_processes is omitted, the card is written with observation -1 (an
Asimov/blinded placeholder).
3. Systematic uncertainties#
A SystematicUncertainty has these fields:
| field | meaning |
|---|---|
name |
internal identifier; also the default coffea variation name to look up. |
typ |
"lnN" or "shape". Note the spelling — the field is typ, not type. |
processes |
list of process names the uncertainty applies to, or a dict {process: value} for per-process values. |
years |
years the uncertainty applies to (controls correlation — see below). |
value |
lnN: the κ factor (scalar, or (down, up) tuple). shape: conventionally 1.0. Omit only when processes is a dict. |
datacard_name |
optional override for the nuisance name written in the card (defaults to name). |
coffea_name_alias |
optional override for the variation name looked up in the histogram (string, or {process: variation} dict). |
Systematics collects them; it keys on datacard_name, so each nuisance written
to a card must have a unique datacard_name.
lnN — normalization#
A flat log-normal. Scalar, asymmetric tuple, or per-process dict:
1# symmetric
2SystematicUncertainty(name="lumi_2024", typ="lnN",
3 processes=[p.name for p in mc_processes.values()],
4 years=["2024"], value=1.016)
5
6# asymmetric (down, up)
7SystematicUncertainty(name="some_unc", typ="lnN", processes=["ttbar"],
8 years=years, value=(0.98, 1.05))
9
10# per-process values in one nuisance (no `value` argument)
11SystematicUncertainty(name="xsec", typ="lnN",
12 processes={"ttbar": 1.06, "diboson": 1.05},
13 years=years)
Correlation is controlled by years. Several SystematicUncertainty
objects that share the same nuisance datacard_name but cover different years
stay correlated within a group and uncorrelated across groups. A common pattern
is to correlate luminosity across eras of the same main year but decorrelate
different main years:
1lumi_systematics = {
2 "2022_preEE": 1.014, "2022_postEE": 1.014,
3 "2023_preBPix": 1.013, "2023_postBPix": 1.013,
4 "2024": 1.016,
5}
6
7# group eras by main year (2022 / 2023 / 2024) ...
8lumi_by_main_year = {}
9for era, value in lumi_systematics.items():
10 main_year = era.split("_")[0]
11 lumi_by_main_year.setdefault(main_year, {"eras": [], "value": value})["eras"].append(era)
12
13# ... and emit one correlated nuisance per main year
14for main_year, data in lumi_by_main_year.items():
15 sys_unc.append(SystematicUncertainty(
16 name=f"lumi_{main_year}", typ="lnN",
17 processes=[p.name for p in mc_processes.values()],
18 years=data["eras"], value=data["value"],
19 ))
Per-process cross-section normalizations, correlated across years but uncorrelated between processes:
1norm = {"singletop": 1.05, "diboson": 1.05, "vjets": 1.05}
2
3for process, value in norm.items():
4 sys_unc.append(SystematicUncertainty(
5 name=f"process_norms_{process}", typ="lnN",
6 processes=[process], value=value, years=years,
7 ))
shape — template variations#
A shape systematic points at an Up/Down histogram variation that the processor
already stored on the histogram’s variation axis (value=1.0 by convention).
For each process and shift the code looks up the coffea variation
f"{coffea_name}{shift}"; the variation written to the card is named
f"{datacard_name}{shift}".
nameis the default coffea variation name.datacard_namerenames the nuisance in the card.coffea_name_aliasoverrides the variation name read from the histogram — most usefully as a dict, when the same physical uncertainty is stored under different variation names for different processes.
If a requested variation is missing from a sample’s histogram, the code falls
back to the nominal template (and prints a notice); if any variation differs
from nominal by more than 100% in some bin, a warning is printed
(_check_shapes).
Variations correlated across years use one nuisance covering all years:
1shapes_corr_by_year = [
2 "sf_ele_reco", "sf_mu_id", "ele_scale", "muon_scale", "met_unclustered",
3 # ...
4]
5
6for syst in shapes_corr_by_year:
7 sys_unc.append(SystematicUncertainty(
8 name=syst, typ="shape",
9 processes=all_process_names, years=years, value=1.0,
10 ))
Variations decorrelated by year are emitted once per year, folding the year
into the nuisance name while pointing back at the same coffea variation via
coffea_name_alias:
1shapes_split_by_year = ["pileup", "btag_stat"]
2
3for syst in shapes_split_by_year:
4 for year in years:
5 sys_unc.append(SystematicUncertainty(
6 name=f"{syst}_{year}", typ="shape",
7 processes=all_process_names, years=[year], value=1.0,
8 datacard_name=f"{syst}_{year}", # per-year nuisance in the card
9 coffea_name_alias=syst, # but read the year-agnostic variation
10 ))
Per-process variation names use coffea_name_alias as a dict. This is the
right tool when the same physical uncertainty is stored under different variation
names for different process groups (for example a flavour-split scale factor
where some processes carry a group-specific variation while the rest use a
generic one):
1# {process: variation_name_as_stored_in_the_histogram}
2alias = {
3 "ttbar": "sf_btag_groupA",
4 "singletop": "sf_btag_groupA",
5 "vjets": "sf_btag_groupB",
6 "diboson": "sf_btag_groupB",
7}
8
9sys_unc.append(SystematicUncertainty(
10 name="sf_btag", typ="shape",
11 processes=list(alias.keys()), years=years, value=1.0,
12 coffea_name_alias=alias,
13))
Multiple independent shape sources (e.g. the regrouped JES sources) are usually generated from a dict mapping each variation name to the list of years it applies to — the list encodes the correlation:
1jes_sources = {
2 "JES_Absolute": years, # correlated across years
3 "JER": years,
4 "JES_Absolute_2024": ["2024"], # per-year, decorrelated
5 # ...
6}
7
8for syst, syst_years in jes_sources.items():
9 sys_unc.append(SystematicUncertainty(
10 name=syst, typ="shape",
11 processes=all_process_names, years=syst_years, value=1.0,
12 ))
Collect everything:
1systematics = Systematics(sys_unc)
2print("lnN:", [s.name for s in systematics.values() if s.typ == "lnN"])
3print("shape:", [s.name for s in systematics.values() if s.typ == "shape"])
4. Building the datacards#
One Datacard is built per category. The constructor signature is:
| argument | meaning |
|---|---|
histograms |
the histogram for this category's variable (a value of hist_dict). |
datasets_metadata |
df["datasets_metadata"]. |
cutflow |
df["cutflow"]; used for data_obs and to skip datasets empty in presel. |
years |
years to include. |
mc_processes |
the MCProcesses container. |
systematics |
the Systematics container. |
category |
the category (key of hist_dict) to slice from the histogram's category axis. |
data_processes |
the DataProcesses container; default None (no observation row). |
mcstat |
True (default) adds an autoMCStats line; or pass a dict with threshold / include_signal / hist_mode. |
bins_edges |
optional rebinning edges; None keeps native binning. |
bin_prefix |
optional prefix for the Combine bin name. |
bin_suffix |
optional suffix; defaults to "_".join(years). The bin name is [prefix_]category[_suffix]. |
verbose |
True (default) prints diagnostics about skipped datasets / missing variations. |
1datacards = {}
2
3for cat, histograms in hist_dict.items():
4 datacard = Datacard(
5 histograms=histograms,
6 datasets_metadata=datasets_metadata,
7 cutflow=df["cutflow"],
8 years=years,
9 mc_processes=mc_processes,
10 data_processes=data_processes,
11 systematics=systematics,
12 category=cat,
13 bin_suffix=label,
14 bins_edges=bins_edges_dict[cat],
15 )
16
17 card_name = f"datacard_{cat}_{label}.txt"
18 shapes_name = f"shapes_{cat}_{label}.root"
19 datacard.dump(directory=output_dir, card_name=card_name, shapes_name=shapes_name)
20 datacards[card_name] = datacard
Datacard.dump(directory, card_name, shapes_name) writes the text card and a
ROOT file holding every process_year_nominal and process_year_<nuis>{Up,Down}
template into directory. Useful attributes/properties: datacard.bin (the
Combine bin name), datacard.observation, datacard.rate(process).
autoMCStats is enabled by default with threshold=0, include_signal=0,
hist_mode=1 (see the Combine bin-wise stats docs).
Pass mcstat=False to disable it, or a dict to override the values.
A rateParam is emitted automatically for every non-signal process declared
with has_rateParam=True, as SF_<name> rateParam * <name>_<year> 1 [0,5].
Formatting caveat. The card columns are padded by process-name length. When processes carry long per-year suffixes the
rate/systematic columns can render tightly; you may want to post-process the written card to widen the spacing. This is cosmetic —combineCards.py/text2workspace.pyparse the card regardless.
5. Combining categories#
combine_datacards writes a shell script that runs combineCards.py over all
per-category cards and then text2workspace.py to produce the workspace:
1combine_datacards(
2 datacards, # {filename: Datacard}
3 directory=output_dir,
4 path=f"combine_datacards_{label}.sh", # script to write (must end in .sh)
5 card_name=f"datacard_combined_{label}.txt",
6 workspace_name=f"workspace_{label}.root",
7 channel_masks=False,
8)
The generated script combines the cards using each datacard’s bin name as the
channel label (combineCards.py <bin>=<filename> ... > combined.txt) and then
builds the workspace. Set channel_masks=True to append --channel-masks to
text2workspace.py (useful for blinding or fitting subsets of categories).
Run the generated .sh inside a CMSSW + Combine environment to obtain the final
workspace, which is then passed to combine.