pocket_coffea.utils.stat package

Submodules#

pocket_coffea.utils.stat.combine module#

Datacard Class and Utilities for CMS Combine Tool

class pocket_coffea.utils.stat.combine.Datacard(histograms: dict[str, dict[str, Hist]], datasets_metadata: dict[str, dict[str, dict]], cutflow: dict[str, dict[str, float]], years: list[str], mc_processes: MCProcesses, systematics: Systematics, category: str, data_processes: DataProcesses | None = None, mcstat: bool | dict = True, bins_edges: list[float] | None = None, bin_prefix: str | None = None, bin_suffix: str | None = None, verbose: bool = True, shape_only_for_rateparam: bool = False, rateparam_norm_categories: list[str] | None = None)#

Bases: object

Datacard containing processes, systematics and write utilities.

Parameters:

histograms (dict[str, dict[str, hist.Hist]]) – Dict with histograms for each sample
datasets_metadata (dict[str, dict[str, dict]]) – Metadata for datasets
cutflow (dict[str, dict[str, float]]) – Cutflow information for datasets
years (list[str]) – Years of data taking
mc_processes (MCProcesses) – Monte Carlo (signal and background) processes to include in the datacard
systematics (Systematics) – systematic uncertainties
category (str) – Category in datacard
data_processes (DataProcesses, optional) – Data processes, defaults to None
mcstat (bool | dict, optional) – Whether to include MC statistical uncertainties. A dict with the options accepted by Combine can be passed instead of a bool. Defaults to True
bins_edges (list[float], optional) – Bin edges for rebinning histograms, defaults to None
bin_prefix (str, optional) – prefix for the bin name, defaults to None
bin_suffix (str, optional) – suffix for the bin name, defaults to None

property adjust_columns#

property adjust_first_column#

property adjust_syst_colum#

property bin: str#: Name of the bin in the datacard

compute_rateparam_shape_scales() → dict#

Compute shape-only rescaling factors for processes with a free rateParam.

For each (rateParam process, shape systematic, shift) the factor is Σ nominal / Σ varied, summed with flow over all rateparam_norm_categories and over the process years where the systematic applies. Applying it to every Up/Down template holds the process total fixed (the normalization the rateParam already absorbs) while preserving the bin-to-bin shape and the region/year migration. The totals are observable-independent (flow included), so every per-category Datacard derives the same factors.

Returns: mapping (process_name, systematic.datacard_name, shift) -> float
Return type: dict
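The factor described above can be illustrated with a minimal sketch. It is not the package's implementation: the helper name `shape_only_scale`, the plain NumPy arrays standing in for histograms with flow bins, and the fallback to 1.0 for an empty varied total are all assumptions made for illustration.

```python
import numpy as np


def shape_only_scale(nominal_by_region, varied_by_region):
    """Hypothetical helper: sum(nominal) / sum(varied) over all regions."""
    total_nominal = sum(float(np.sum(v)) for v in nominal_by_region.values())
    total_varied = sum(float(np.sum(v)) for v in varied_by_region.values())
    if total_varied == 0.0:
        # Assumption: leave the template untouched when the varied total is zero.
        return 1.0
    return total_nominal / total_varied


# Toy yields for one process in two regions (arrays include the flow bins).
nominal = {"SR": np.array([10.0, 20.0, 30.0]), "CR": np.array([40.0])}
varied_up = {"SR": np.array([12.0, 24.0, 30.0]), "CR": np.array([44.0])}

scale = shape_only_scale(nominal, varied_up)

# Applying one common factor to every region keeps the process total fixed
# while the relative shape and the SR/CR migration of the variation survive.
rescaled_up = {region: scale * counts for region, counts in varied_up.items()}
```

Because the same factor multiplies every region, the summed yield of the rescaled variation equals the nominal total, which is the normalization a free rateParam would otherwise absorb.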

content(shapes_filename: str) → str#

Generate the content of the datacard.

Parameters: shapes_filename (str) – The filename of the ROOT file containing the shape histograms.
Returns: Content of the datacard as a string.
Return type: str

create_shape_histogram_dict(is_data: bool = False) → dict[str, Hist]#

Create a dictionary of histograms for each process and systematic.

Parameters: is_data (bool, optional) – Flag to indicate if the datacard is for data, defaults to False
Returns: Dictionary of histograms; keys are process_systematic
Return type: dict[str, hist.Hist]

dump(directory: PathLike, card_name: str = 'datacard.txt', shapes_name: str = 'shapes.root') → None#

Dump datacard and shapes to a directory.

Parameters:

directory (os.PathLike) – Directory to dump the datacard and shapes
card_name (str, optional) – name of the datacard file, defaults to "datacard.txt"
shapes_name (str, optional) – name of the shapes file, defaults to "shapes.root"

expectation_section() → str#

get_datasets_by_sample(sample: str, year: str | None = None) → list[str]#

Retrieve the list of dataset names for a given sample and optionally a specific year.

Parameters:

sample (str) – The sample name for which to retrieve datasets.
year (str, optional, default=None) – The year (data-taking period) to filter datasets. If None (default), datasets from all years in self.years are returned.

Returns:

List of dataset names corresponding to the sample (and year, if specified).

Return type:

list[str]

property imax#: Number of bins in the datacard

is_empty_dataset(dataset: str) → bool#: Check if dataset is empty

property jmax#: Number of background processes + number of signal processes - 1

property kmax#: Number of nuisance parameters in the datacard
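As a rough illustration of how these three counters relate to the process lists, the sketch below builds the header lines of a Combine datacard. The function name and the trailing descriptive text on each line are assumptions; what `preamble()` actually writes is not documented here.

```python
def header_counts_sketch(n_bins, signal_processes, background_processes, n_nuisances):
    """Hypothetical helper mirroring the imax/jmax/kmax definitions above."""
    imax = n_bins
    # jmax counts all processes minus one, as stated for the jmax property.
    jmax = len(background_processes) + len(signal_processes) - 1
    kmax = n_nuisances
    return "\n".join(
        [
            f"imax {imax} number of bins",
            f"jmax {jmax} number of processes minus 1",
            f"kmax {kmax} number of nuisance parameters",
        ]
    )


# One bin, one signal, two backgrounds, five nuisance parameters (toy values).
header = header_counts_sketch(1, ["ttH"], ["ttbar", "wjets"], 5)
```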

property mcstat_config: dict#: Return the configuration for MC statistics.

mcstat_section() → str#

property observation#: Number of observed events in the datacard

observation_section() → str#

preamble() → str#

rate(process: str, systematic='nominal') → float#

Rate of a process in the datacard.

Negative bins are clipped to zero before the shape template is written to ROOT (Combine cannot handle negative bin contents, see _clip_negative_bins). The rate reported here must match the integral of that written template, otherwise Combine would rescale the clipped shape to a rate that no longer equals its integral. So we apply the same clipping before summing (in-range bins only, consistent with how both .sum() and _clip_negative_bins treat the flow bins).
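The consistency requirement can be seen in a small sketch, assuming the in-range bin contents are available as a plain array. The helper name `clipped_rate` is hypothetical and only mimics the clip-then-sum order described above.

```python
import numpy as np


def clipped_rate(bin_contents):
    """Hypothetical helper: clip negative bins to zero, then sum in-range bins."""
    clipped = np.clip(np.asarray(bin_contents, dtype=float), 0.0, None)
    return float(clipped.sum())


# Toy in-range bin contents with two slightly negative bins.
in_range_bins = [5.0, -0.4, 2.5, -0.1]

naive_rate = float(np.sum(in_range_bins))  # sums the unclipped contents
rate = clipped_rate(in_range_bins)         # matches the integral of the written template
```

Here the unclipped sum is smaller than the integral of the clipped template, so quoting it as the rate would make Combine scale the written shape down.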

rate_parameters_section() → str#

rearrange_histograms(is_data: bool = False, category: str | None = None) → Hist#

Rearrange histograms from pocket_coffea output format to match processes and systematics in one histogram.

Parameters:

is_data (bool, optional) – Flag to indicate if the datacard is for data, defaults to False
category (str, optional) – Category to select; defaults to self.category. Allows building the rearranged single-category histogram for any region from the same input.

Returns:

Rearranged histogram

Return type:

hist.Hist

shape_section(shapes_name: str) → str#: Shapes line of the datacard, in the form: shapes process channel file histogram [histogram_with_systematics]

property shape_variations: list[str]#

systematics_section() → str#

pocket_coffea.utils.stat.combine.combine_datacards(datacards: dict[Datacard], directory: str, path: str = 'combine_cards.sh', card_name: str = 'datacard_combined.txt', workspace_name: str = 'workspace.root', channel_masks: bool = False) → None#

Write the bash script to combine datacards from different categories.

Parameters:

datacards (dict[Datacard]) – Dictionary mapping output filenames to Datacard objects to combine.
directory (str) – Directory to save the bash script and combined datacard.
path (str) – Path (relative to directory) for the bash script file. Must end with .sh.
card_name (str) – Name of the combined datacard file.
workspace_name (str) – Name of the output workspace file.
channel_masks (bool) – Whether to add the --channel-masks option to text2workspace.py.
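The kind of script this function writes can be sketched as follows. This is an assumption about its general shape, not the actual output: the helper name, the use of the bin name as channel label, and the exact command layout are hypothetical. Only `combineCards.py`, `text2workspace.py`, its `-o` option and `--channel-masks` are taken from Combine itself.

```python
def combine_script_sketch(
    cards,
    card_name="datacard_combined.txt",
    workspace_name="workspace.root",
    channel_masks=False,
):
    """Hypothetical helper; `cards` maps datacard filename -> channel name."""
    # combineCards.py takes "channel=file" pairs and prints the merged card.
    pairs = " ".join(f"{channel}={filename}" for filename, channel in cards.items())
    lines = ["#!/bin/bash", f"combineCards.py {pairs} > {card_name}"]

    workspace_cmd = f"text2workspace.py {card_name} -o {workspace_name}"
    if channel_masks:
        workspace_cmd += " --channel-masks"
    lines.append(workspace_cmd)
    return "\n".join(lines) + "\n"


script = combine_script_sketch(
    {"datacard_SR.txt": "SR", "datacard_CR.txt": "CR"},
    channel_masks=True,
)
```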

pocket_coffea.utils.stat.combine.format_rate(value, sig=6)#

Format a datacard rate preserving its magnitude.

The previous implementation sliced the plain float repr to a fixed width (f"{rate}"[:10]), which silently dropped the exponent of small rates (e.g. 3.4567890123e-05 -> "3.45678901", inflating the rate by a factor of 1e5). A significant-figure format keeps the exponent and stays compact.
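The failure mode and one possible fix can be reproduced in a few lines. The function below is a sketch, assuming the significant-figure format is the `g` presentation type; the real `format_rate` may differ in detail.

```python
def format_rate_sketch(value, sig=6):
    """Hypothetical stand-in: format with `sig` significant figures."""
    return f"{value:.{sig}g}"


small_rate = 3.4567890123e-05

# Old behaviour: slicing the repr cuts off the exponent.
sliced = f"{small_rate}"[:10]

# Significant-figure formatting keeps the exponent.
formatted = format_rate_sketch(small_rate)
```

Parsing `sliced` back as a float gives a value about 1e5 times larger than `small_rate`, while `formatted` round-trips to within the requested precision.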

pocket_coffea.utils.stat.processes module#

Physical Processes as Dataclasses and Utilities

class pocket_coffea.utils.stat.processes.DataProcess(name: str, samples: Iterable, label: str | None = None, *, years: Iterable)#

Bases: Process

Class to store information of a Data process

Parameters:

name – Name of the process
samples – Iterable of sample names associated with the process
years – Iterable of years the process is relevant for
label – Label for the process, defaults to name if not specified

Inherits from Process and sets is_data to True by default.

years: Iterable#

class pocket_coffea.utils.stat.processes.DataProcesses(processes: list[DataProcess])#

Bases: dict[str, DataProcess]

Custom dict to store information of multiple data processes.

Parameters: processes (list[DataProcess]) – List of data processes

class pocket_coffea.utils.stat.processes.MCProcess(name: str, samples: Iterable, label: str | None = None, *, is_signal: bool, years: Iterable, has_rateParam: bool = False)#

Bases: Process

Class to store information of a Monte Carlo process

Parameters:

name – Name of the process
samples – Iterable of sample names associated with the process
years – Iterable of years the process is relevant for
is_signal – Whether the process is a signal process
has_rateParam – Whether the process has a rate parameter, defaults to False
label – Label for the process, defaults to name if not specified

Inherits from Process and sets is_data to False by default.

has_rateParam: bool = False#

is_signal: bool#

years: Iterable#

class pocket_coffea.utils.stat.processes.MCProcesses(processes: list[MCProcess])#

Bases: dict[str, MCProcess]

Custom dict to store information of multiple MC processes.

Parameters: processes (list[MCProcess]) – List of MC processes

property background_processes: list[str]#: Names of all Background Processes.

property n_processes: int#: Number of Processes

property signal_processes: list[str]#: Names of all Signal MC Processes.
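A minimal stand-in shows how a dict-based container can expose these properties. The class names are hypothetical, and keying the dict by process name is an assumption inferred from the base class dict[str, MCProcess].

```python
from dataclasses import dataclass


@dataclass
class MCProcessSketch:
    """Hypothetical stand-in carrying only the fields needed here."""

    name: str
    is_signal: bool


class MCProcessesSketch(dict):
    """Hypothetical stand-in: dict of processes keyed by name (assumption)."""

    def __init__(self, processes):
        super().__init__({p.name: p for p in processes})

    @property
    def signal_processes(self):
        return [name for name, p in self.items() if p.is_signal]

    @property
    def background_processes(self):
        return [name for name, p in self.items() if not p.is_signal]

    @property
    def n_processes(self):
        return len(self)


procs = MCProcessesSketch(
    [
        MCProcessSketch("ttH", is_signal=True),
        MCProcessSketch("ttbar", is_signal=False),
        MCProcessSketch("wjets", is_signal=False),
    ]
)
```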

class pocket_coffea.utils.stat.processes.Process(name: str, samples: Iterable, label: str | None = None)#

Bases: object

Class to store information of a physical process

Parameters:

name – Name of the process
samples – Iterable of sample names associated with the process
label – Label for the process, defaults to name if not specified
is_data – Whether the process is data (needs to be set by subclasses)

Note

It is recommended to use the MCProcess or DataProcess subclasses directly. This base class is primarily for shared attributes and methods.

is_data: bool#

label: str = None#

name: str#

samples: Iterable#

pocket_coffea.utils.stat.systematics module#

Systematic Uncertainties and Utilities for Statistical Analysis

class pocket_coffea.utils.stat.systematics.SystematicUncertainty#

Bases: object

Store information about one systematic uncertainty.

Parameters:

name – Name of the systematic uncertainty.
typ – Type of the systematic uncertainty (e.g. ‘shape’, ‘lnN’).
processes – List or tuple of process names affected, or a dict mapping process names to values.
years – List or tuple of years the uncertainty applies to.
value – Value (float or tuple of floats) of the uncertainty for all processes, or None if using a dict for processes.
datacard_name – Name of the systematic uncertainty in the datacard. Defaults to name if not specified.
coffea_name_alias –
Name of the shape variation as stored in the coffea output histograms. Use this when the coffea variation name differs from the canonical name — most commonly when one logical systematic is recorded under different names per process (e.g. parton-shower weights named differently for different generators). Can be a single string applied to all processes, or a dict mapping process names to per-process alias strings. Processes missing from the dict fall back to name. Defaults to name if not specified.

Note: as a plain string this field is largely redundant with name — if you only need a global rename, just set name to the coffea variation name and let datacard_name carry the datacard-side label. coffea_name_alias earns its keep in the dict form, where the alias varies by process.
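The fallback rules for the alias can be sketched with a small stand-in dataclass. The class name, the process names and the variation names are hypothetical; only the three-way behaviour (no alias, string alias, dict alias with fallback to name) follows the description above.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SystematicSketch:
    """Hypothetical stand-in with only the naming-related fields."""

    name: str
    datacard_name: Optional[str] = None
    coffea_name_alias: Union[str, dict, None] = None

    def __post_init__(self):
        # datacard_name defaults to name if not specified.
        if self.datacard_name is None:
            self.datacard_name = self.name

    def get_coffea_name(self, process: str) -> str:
        alias = self.coffea_name_alias
        if alias is None:
            return self.name
        if isinstance(alias, dict):
            # Processes missing from the dict fall back to name.
            return alias.get(process, self.name)
        return alias


# One logical parton-shower systematic stored under a different name for one process.
ps_isr = SystematicSketch(
    name="ISR",
    datacard_name="ps_isr",
    coffea_name_alias={"ttbar": "ISR_powheg"},
)
```

With this setup, `ps_isr.get_coffea_name("ttbar")` resolves to the per-process alias, while any process not listed in the dict resolves to `"ISR"`.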

coffea_name_alias: str | dict[str, str] = None#

datacard_name: str = None#

get_coffea_name(process: str) → str#

Return the coffea variation alias for a given process.

Falls back to name when a dict alias does not list process.

name: str#

processes: list[str] | tuple[str] | dict[str, float]#

typ: str#

value: float | tuple[float] = None#

years: list[str] | tuple[str]#

class pocket_coffea.utils.stat.systematics.Systematics(systematics: list[SystematicUncertainty])#

Bases: dict[str, SystematicUncertainty]

Store information of a list of systematic uncertainties

get_systematics_by_process(process: Process) → list[SystematicUncertainty]#: List of Systematics that affect a specific process.

get_systematics_by_type(syst_type: str) → dict[SystematicUncertainty]#: Dict of Systematics of a specific type.

list_type(syst_type: str) → list[str]#: List of Names of Systematics of a specific type.

n_systematics() → int#: Number of Systematics

property variations_names: list[str]#: List of Names of Shape Variations.

Module contents#

class pocket_coffea.utils.stat.DataProcess(name: str, samples: Iterable, label: str | None = None, *, years: Iterable)#

Bases: Process

Class to store information of a Data process

Parameters:

name – Name of the process
samples – Iterable of sample names associated with the process
years – Iterable of years the process is relevant for
label – Label for the process, defaults to name if not specified

Inherits from Process and sets is_data to True by default.

is_data: bool#

name: str#

samples: Iterable#

years: Iterable#

class pocket_coffea.utils.stat.DataProcesses(processes: list[DataProcess])#

Bases: dict[str, DataProcess]

Custom dict to store information of multiple data processes.

Parameters: processes (list[DataProcess]) – List of data processes

class pocket_coffea.utils.stat.Datacard(histograms: dict[str, dict[str, Hist]], datasets_metadata: dict[str, dict[str, dict]], cutflow: dict[str, dict[str, float]], years: list[str], mc_processes: MCProcesses, systematics: Systematics, category: str, data_processes: DataProcesses | None = None, mcstat: bool | dict = True, bins_edges: list[float] | None = None, bin_prefix: str | None = None, bin_suffix: str | None = None, verbose: bool = True, shape_only_for_rateparam: bool = False, rateparam_norm_categories: list[str] | None = None)#

Bases: object

Datacard containing processes, systematics and write utilities.

Parameters:

histograms (dict[str, dict[str, hist.Hist]]) – Dict with histograms for each sample
datasets_metadata (dict[str, dict[str, dict]]) – Metadata for datasets
cutflow (dict[str, dict[str, float]]) – Cutflow information for datasets
years (list[str]) – Years of data taking
mc_processes (MCProcesses) – Monte Carlo (signal and background) processes to include in the datacard
systematics (Systematics) – systematic uncertainties
category (str) – Category in datacard
data_processes (DataProcesses, optional) – Data processes, defaults to None
mcstat (bool | dict, optional) – Whether to include MC statistical uncertainties. A dict with the options accepted by Combine can be passed instead of a bool. Defaults to True
bins_edges (list[float], optional) – Bin edges for rebinning histograms, defaults to None
bin_prefix (str, optional) – prefix for the bin name, defaults to None
bin_suffix (str, optional) – suffix for the bin name, defaults to None

property adjust_columns#

property adjust_first_column#

property adjust_syst_colum#

property bin: str#: Name of the bin in the datacard

compute_rateparam_shape_scales() → dict#

Compute shape-only rescaling factors for processes with a free rateParam.

Returns: mapping (process_name, systematic.datacard_name, shift) -> float
Return type: dict

content(shapes_filename: str) → str#

Generate the content of the datacard.

Parameters: shapes_filename (str) – The filename of the ROOT file containing the shape histograms.
Returns: Content of the datacard as a string.
Return type: str

create_shape_histogram_dict(is_data: bool = False) → dict[str, Hist]#

Create a dictionary of histograms for each process and systematic.

Parameters: is_data (bool, optional) – Flag to indicate if the datacard is for data, defaults to False
Returns: Dictionary of histograms; keys are process_systematic
Return type: dict[str, hist.Hist]

dump(directory: PathLike, card_name: str = 'datacard.txt', shapes_name: str = 'shapes.root') → None#

Dump datacard and shapes to a directory.

Parameters:

directory (os.PathLike) – Directory to dump the datacard and shapes
card_name (str, optional) – name of the datacard file, defaults to "datacard.txt"
shapes_name (str, optional) – name of the shapes file, defaults to "shapes.root"

expectation_section() → str#

get_datasets_by_sample(sample: str, year: str | None = None) → list[str]#

Retrieve the list of dataset names for a given sample and optionally a specific year.

Parameters:

sample (str) – The sample name for which to retrieve datasets.
year (str, optional, default=None) – The year (data-taking period) to filter datasets. If None (default), datasets from all years in self.years are returned.

Returns:

List of dataset names corresponding to the sample (and year, if specified).

Return type:

list[str]

property imax#: Number of bins in the datacard

is_empty_dataset(dataset: str) → bool#: Check if dataset is empty

property jmax#: Number of background processes + number of signal processes - 1

property kmax#: Number of nuisance parameters in the datacard

property mcstat_config: dict#: Return the configuration for MC statistics.

mcstat_section() → str#

property observation#: Number of observed events in the datacard

observation_section() → str#

preamble() → str#

rate(process: str, systematic='nominal') → float#

Rate of a process in the datacard.

rate_parameters_section() → str#

rearrange_histograms(is_data: bool = False, category: str | None = None) → Hist#

Rearrange histograms from pocket_coffea output format to match processes and systematics in one histogram.

Parameters:

is_data (bool, optional) – Flag to indicate if the datacard is for data, defaults to False
category (str, optional) – Category to select; defaults to self.category. Allows building the rearranged single-category histogram for any region from the same input.

Returns:

Rearranged histogram

Return type:

hist.Hist

shape_section(shapes_name: str) → str#: Shapes line of the datacard, in the form: shapes process channel file histogram [histogram_with_systematics]

property shape_variations: list[str]#

systematics_section() → str#

class pocket_coffea.utils.stat.MCProcess(name: str, samples: Iterable, label: str | None = None, *, is_signal: bool, years: Iterable, has_rateParam: bool = False)#

Bases: Process

Class to store information of a Monte Carlo process

Parameters:

name – Name of the process
samples – Iterable of sample names associated with the process
years – Iterable of years the process is relevant for
is_signal – Whether the process is a signal process
has_rateParam – Whether the process has a rate parameter, defaults to False
label – Label for the process, defaults to name if not specified

Inherits from Process and sets is_data to False by default.

has_rateParam: bool = False#

is_data: bool#

is_signal: bool#

name: str#

samples: Iterable#

years: Iterable#

class pocket_coffea.utils.stat.MCProcesses(processes: list[MCProcess])#

Bases: dict[str, MCProcess]

Custom dict to store information of multiple MC processes.

Parameters: processes (list[MCProcess]) – List of MC processes

property background_processes: list[str]#: Names of all Background Processes.

property n_processes: int#: Number of Processes

property signal_processes: list[str]#: Names of all Signal MC Processes.

class pocket_coffea.utils.stat.SystematicUncertainty#

Bases: object

Store information about one systematic uncertainty.

Parameters:

name – Name of the systematic uncertainty.
typ – Type of the systematic uncertainty (e.g. ‘shape’, ‘lnN’).
processes – List or tuple of process names affected, or a dict mapping process names to values.
years – List or tuple of years the uncertainty applies to.
value – Value (float or tuple of floats) of the uncertainty for all processes, or None if using a dict for processes.
datacard_name – Name of the systematic uncertainty in the datacard. Defaults to name if not specified.
coffea_name_alias –
Name of the shape variation as stored in the coffea output histograms. Use this when the coffea variation name differs from the canonical name — most commonly when one logical systematic is recorded under different names per process (e.g. parton-shower weights named differently for different generators). Can be a single string applied to all processes, or a dict mapping process names to per-process alias strings. Processes missing from the dict fall back to name. Defaults to name if not specified.

Note: as a plain string this field is largely redundant with name — if you only need a global rename, just set name to the coffea variation name and let datacard_name carry the datacard-side label. coffea_name_alias earns its keep in the dict form, where the alias varies by process.

coffea_name_alias: str | dict[str, str] = None#

datacard_name: str = None#

get_coffea_name(process: str) → str#

Return the coffea variation alias for a given process.

Falls back to name when a dict alias does not list process.

name: str#

processes: list[str] | tuple[str] | dict[str, float]#

typ: str#

value: float | tuple[float] = None#

years: list[str] | tuple[str]#

class pocket_coffea.utils.stat.Systematics(systematics: list[SystematicUncertainty])#

Bases: dict[str, SystematicUncertainty]

Store information of a list of systematic uncertainties

get_systematics_by_process(process: Process) → list[SystematicUncertainty]#: List of Systematics that affect a specific process.

get_systematics_by_type(syst_type: str) → dict[SystematicUncertainty]#: Dict of Systematics of a specific type.

list_type(syst_type: str) → list[str]#: List of Names of Systematics of a specific type.

n_systematics() → int#: Number of Systematics

property variations_names: list[str]#: List of Names of Shape Variations.
