pocket_coffea.utils package#

Subpackages#

pocket_coffea.utils.stat package

Submodules#

pocket_coffea.utils.benchmarking module#

pocket_coffea.utils.benchmarking.print_processing_stats(output, start_time, workers)#: Prints processing statistics using rich.Table.

pocket_coffea.utils.build_jets_calibrator module#

Code to build the JEC/JER and JES uncertainties, taken from andrzejnovak/boostedhiggs.

pocket_coffea.utils.build_jets_calibrator.build(params, filter_years=None)#: Build the factory objects from the list of JEC files for each era for ak4 and ak8 jets and save them on disk in cloudpickle format.

pocket_coffea.utils.configurator module#

class pocket_coffea.utils.configurator.Configurator(workflow, parameters, datasets, skim, preselections, categories, weights, variations, variables, weights_classes=None, calibrators=None, columns=None, workflow_options=None, save_skimmed_files=None, do_postprocessing=True)#

Bases: object

Main class driving the configuration of a PocketCoffea analysis. The Configurator groups the several aspects that define an analysis run:

- skims, preselections, categorization
- output: variables and columns
- datasets
- weights and variations and the objects providing them
- workflow
- analysis parameters

The running environment configuration is not part of the Configurator class.

The available Weights are taken from the list of weights classes passed to the Configurator.

clone()#: Create a copy of the configurator in the loaded=False state

filter_dataset(nfiles)#

load()#: This function loads the configuration for samples/weights/variations and creates the necessary objects for the processor to use. It also loads the workflow

load_columns_config(wcfg)#

load_cuts_and_categories(skim: list, preselections: list, categories)#: This function loads the list of cuts and groups them in categories. Each cut is identified by a unique id (see Cut class definition)

load_datasets()#

load_subsamples()#

load_variations_config(wcfg, variation_type)#: This function loads the variations definition and prepares a list of weights to be applied for each sample and category

load_weights_config(wcfg)#: This function loads the weights definition and prepares a list of weights to be applied for each sample and category

load_workflow()#

perform_checks()#

save_config(output)#

set_filesets_manually(filesets)#: This function sets the filesets directly, usually before the configuration is loaded. This is useful to pickle an unloaded version of the configuration restricting the filesets a priori. It is used in the condor submission script. The filesets_loaded attribute is set to True to avoid reloading the datasets.

pocket_coffea.utils.configurator.format(data, indent=0, width=80, depth=None, compact=True, sort_dicts=True)#

pocket_coffea.utils.cutflow_utils module#

pocket_coffea.utils.dataset module#

class pocket_coffea.utils.dataset.Dataset(name, cfg, sites_cfg=None, sort_replicas: str = 'geoip', append_parents=False)#

Bases: object

check_samples()#

down_file = <parsl.app.python.PythonApp object>#

download()#

get_samples(files)#

save(append=True, overwrite=False, split=False)#

class pocket_coffea.utils.dataset.Sample(name, das_names, sample, metadata, sites_cfg, sort_replicas: str = 'geoip', **kwargs)#

Bases: object

check_files(prefix)#

get_entries_uproot(file_path)#: Queries a single file for the number of events.

get_filelist()#: Function to get the dataset filelist from DAS and from Rucio. From DAS we get the general info about the dataset (event count, file size), whereas from rucio we get the specific path at the sites without the redirector (it helps with xrootd access in coffea).

get_parentlist(inplace=False)#: Function to get the parent dataset filelist from DAS. The parent list is included as additional metadata in the sample’s dict.

get_sample_dict(redirector=True, prefix='root://xrootd-cms.infn.it//')#

pocket_coffea.utils.dataset.build_datasets(cfg, keys=None, overwrite=False, download=False, check=False, split_by_year=False, local_prefix=None, allowlist_sites=None, include_redirector=False, blocklist_sites=None, prioritylist_sites=None, regex_sites=None, sort_replicas='geoip', parallelize=4)#

pocket_coffea.utils.dataset.do_dataset(key, config, local_prefix, allowlist_sites, include_redirector, blocklist_sites, prioritylist_sites, regex_sites, sort_replicas: str = 'geoip', **kwargs)#

pocket_coffea.utils.export module#

pocket_coffea.utils.export.export_coffea_output_to_root(coffea_output: PathLike, output_dir: PathLike, variables: Iterable[str], categories: Iterable[str], years: Iterable[str]) → None#

Export pocket_coffea output to root files.

Parameters:

coffea_output (os.PathLike) – Path to coffea output
output_dir (os.PathLike) – Output directory for root files
variables (Iterable[str]) – Names of Variables to export
categories (Iterable[str]) – Names of categories to export
years (Iterable[str]) – Years to export

Raises:

ValueError – Raise if variables are not present in the coffea output

The output is saved in the output directory. Each year and category are saved as subdirectories, with each variable saved as a root file.

pocket_coffea.utils.export.save_histogram_to_root(hist_dict: dict[str, dict[str, Hist]], year: str, category: str, output_file: PathLike) → None#

Save histograms for one variable to a root file.

Parameters:

hist_dict (dict[str, dict[str, hist.Hist]]) – Dictionary of histograms as returned by the processor for one variable
year (str) – Year to save
category (str) – Category to save
output_file (os.PathLike) – Root file to save histograms

The histograms for each sample are summed over all datasets. If the histogram has a variation axis, each variation is saved separately.

pocket_coffea.utils.filter_output module#

pocket_coffea.utils.filter_output.compare_dict_types(d1, d2, path='')#

Recursively compare the types of values between two dictionaries.

Parameters:

d1 (dict) – The first dictionary to compare.
d2 (dict) – The second dictionary to compare.
path (str) – The current path of nested keys being checked.
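
The recursive comparison described above can be sketched in a few lines. This is a minimal illustration and not the library's implementation: the name compare_types_sketch is hypothetical, and returning a sorted list of mismatch strings is an assumption, since the return value of compare_dict_types is not documented here.

```python
def compare_types_sketch(d1, d2, path=""):
    """Collect type mismatches between two nested dictionaries."""
    mismatches = []
    for key in d1.keys() & d2.keys():  # only keys present in both inputs
        here = f"{path}/{key}" if path else str(key)
        v1, v2 = d1[key], d2[key]
        if type(v1) is not type(v2):
            mismatches.append(f"{here}: {type(v1).__name__} != {type(v2).__name__}")
        elif isinstance(v1, dict):
            # Same type and both dicts: descend, extending the key path.
            mismatches.extend(compare_types_sketch(v1, v2, here))
    return sorted(mismatches)


a = {"sumw": {"ttbar": 1.0}, "cutflow": {"initial": 10}}
b = {"sumw": {"ttbar": 1}, "cutflow": [1, 2]}
print(compare_types_sketch(a, b))
```

The path argument plays the same role as in the signature above: it records where in the nested structure a mismatch was found.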

pocket_coffea.utils.filter_output.filter_dictionary(d, string)#

pocket_coffea.utils.filter_output.filter_output_by_category(o, categories)#

pocket_coffea.utils.filter_output.filter_output_by_year(o, year)#

pocket_coffea.utils.filter_output.get_datasets_in_output(o)#

Return the set of dataset names present in a coffea output.

cutflow[‘initial’] holds exactly one key per processed dataset (MC and data, raw or postprocessed), see workflows/base.py. sum_genweights and datasets_metadata[‘by_dataset’] are used as robust fallbacks.
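
A minimal sketch of this lookup, assuming the output is a plain nested dict. The function name datasets_in_output_sketch and the order in which the two fallbacks are tried are assumptions; only the three keys come from the description above.

```python
def datasets_in_output_sketch(output):
    """Return the set of dataset names found in a coffea-style output dict."""
    initial = output.get("cutflow", {}).get("initial")
    if initial:
        return set(initial)
    # Fallbacks named in the description; their order here is an assumption.
    if output.get("sum_genweights"):
        return set(output["sum_genweights"])
    by_dataset = output.get("datasets_metadata", {}).get("by_dataset", {})
    return set(by_dataset)


out = {"cutflow": {"initial": {"TTTo2L2Nu_2018": 100, "DATA_2018": 50}}}
print(sorted(datasets_in_output_sketch(out)))
```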

pocket_coffea.utils.filter_output.remove_datasets_from_output(o, datasets)#

Recursively delete every entry keyed by a name in datasets from a coffea output dict, in place, wherever it appears.

Dataset names are unique strings (e.g. TTTo2L2Nu_2018) that appear as dict keys at various depths (sum_genweights, sumw/sumw2/cutflow/columns per category or sample, variables/processing_metadata per variable->sample, datasets_metadata[‘by_dataset’]) and inside the datasets_metadata[‘by_datataking_period’][year][sample] sets. A single generic walk removes them everywhere without hardcoding each structure. This relies on dataset names not colliding with category/sample/variable keys, which holds for the <sample>_<year> naming convention. Returns o.
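
The generic walk can be illustrated as follows. This is a sketch under the assumption that the output contains only dicts, sets and leaf values; remove_datasets_sketch is a hypothetical name and the real function may handle more container types.

```python
def remove_datasets_sketch(obj, datasets):
    """Delete every dict key and set member named in `datasets`, in place."""
    names = set(datasets)
    if isinstance(obj, dict):
        for key in list(obj):  # copy the keys: the dict is mutated in the loop
            if key in names:
                del obj[key]
            else:
                remove_datasets_sketch(obj[key], names)
    elif isinstance(obj, set):
        obj -= names  # in-place set difference
    return obj


out = {
    "sum_genweights": {"TTTo2L2Nu_2018": 1.0, "DY_2018": 2.0},
    "cutflow": {"initial": {"TTTo2L2Nu_2018": 10, "DY_2018": 20}},
    "datasets_metadata": {
        "by_datataking_period": {"2018": {"TTTo2L2Nu": {"TTTo2L2Nu_2018"}}}
    },
}
result = remove_datasets_sketch(out, ["TTTo2L2Nu_2018"])
```

Because the walk never looks at key names other than the dataset names, it does not need to know the layout of the output, which is the design choice the description above relies on.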

pocket_coffea.utils.histogram module#

pocket_coffea.utils.histogram.rebin_hist(bins_edges: list[float] | int, histograms: dict[str, dict[str, Hist]]) → dict[str, dict[str, Hist]]#

pocket_coffea.utils.load_output module#

pocket_coffea.utils.load_output.load_output(file)#

pocket_coffea.utils.logging module#

class pocket_coffea.utils.logging.LogFormatter(color, *args, **kwargs)#

Bases: Formatter

COLOR_CODES = {10: '\x1b[1;30m', 20: '\x1b[0;37m', 30: '\x1b[1;33m', 40: '\x1b[1;31m', 50: '\x1b[1;35m'}#

RESET_CODE = '\x1b[0m'#

format(record, *args, **kwargs)#

Format the specified record as text.

The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.
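
The keys of COLOR_CODES (10 to 50) are the standard logging level numbers, from DEBUG to CRITICAL. One way a formatter can use such a table is sketched below; ColorFormatterSketch is a hypothetical class, and wrapping the whole formatted line in the color code is an assumption, since how LogFormatter applies the codes is not described here.

```python
import logging


class ColorFormatterSketch(logging.Formatter):
    """Wrap each formatted record in an ANSI color chosen by log level."""

    COLOR_CODES = {
        logging.DEBUG: "\x1b[1;30m",
        logging.INFO: "\x1b[0;37m",
        logging.WARNING: "\x1b[1;33m",
        logging.ERROR: "\x1b[1;31m",
        logging.CRITICAL: "\x1b[1;35m",
    }
    RESET_CODE = "\x1b[0m"

    def __init__(self, color, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.color = color  # True to colorize, False for plain text

    def format(self, record):
        text = super().format(record)
        if self.color and record.levelno in self.COLOR_CODES:
            return f"{self.COLOR_CODES[record.levelno]}{text}{self.RESET_CODE}"
        return text


record = logging.LogRecord("demo", logging.ERROR, "demo.py", 1, "boom", None, None)
colored = ColorFormatterSketch(True, "%(message)s").format(record)
plain = ColorFormatterSketch(False, "%(message)s").format(record)
```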

pocket_coffea.utils.logging.setup_logging(console_log_output, console_log_level, console_log_color, logfile_file, logfile_log_level, logfile_log_color, log_line_template)#

pocket_coffea.utils.logging.try_and_log_error(error_file, exit_on_error=False)#

Decorator to catch exceptions and log them to a specified error file.

Parameters:

error_file (str) – Path to the error log file. The parent directory will be created if missing.
exit_on_error (bool, optional) – If True, prints the full traceback and exits the program when an error occurs. If False, logs the error and continues execution. Default is False.
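
A decorator with this behaviour could look like the sketch below. It is not the library's code: the name try_and_log_error_sketch is hypothetical, and appending the traceback to the file and returning None after a logged failure are assumptions.

```python
import functools
import os
import sys
import traceback


def try_and_log_error_sketch(error_file, exit_on_error=False):
    """Decorator factory: append any exception's traceback to `error_file`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                parent = os.path.dirname(error_file)
                if parent:
                    os.makedirs(parent, exist_ok=True)  # create missing directory
                with open(error_file, "a") as handle:
                    handle.write(traceback.format_exc())
                if exit_on_error:
                    traceback.print_exc()
                    sys.exit(1)
                return None  # assumption: a failed call yields None

        return wrapper

    return decorator
```

With exit_on_error=False a failing call is recorded and execution continues, which suits long loops over many datasets where one failure should not stop the rest.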

pocket_coffea.utils.network module#

pocket_coffea.utils.network.check_port(port)#

pocket_coffea.utils.network.get_proxy_path() → str#: Checks if the VOMS proxy exists and if it is valid for at least 1 hour. If it exists, returns its path.

pocket_coffea.utils.plot_efficiency module#

class pocket_coffea.utils.plot_efficiency.EfficiencyMap(shape, config, year, outputdir, mode='standard')#

Bases: object

compute_efficiency(cat, var, era=None)#: Compute the data and MC efficiency, the scale factor and the corresponding uncertainties for a given category cat and a variation var. If the computation has to be performed for a specific data-taking era, also the argument era has to be specified.

define_1d_figures(cat, syst, save_plots=True)#: Define the figures of the 1D histogram plots.

define_datamc(cat, var, era=None)#: Define the data and MC dictionaries used for slicing the histograms self.h_data and self.h_mc, for .

define_systematics()#: Define the list of systematics, given the variations.

define_variations(syst)#: Define the variations, given a systematic uncertainty syst.

initialize_stack()#: Initialize the lists and dictionaries to save the scale factor corrections in a stack.

plot1d(cat, syst, var, save_plots=True, era=None)#: Function to plot a 1D efficiency or scale factor for a given category cat and a variation var. To save the output plots, the flag save_plots has to be set to True. If the computation has to be performed for a specific data-taking era, also the argument era has to be specified.

plot2d(cat, syst, var, save_plots=True, era=None)#: Function to plot a 2D efficiency or scale factor for a given category cat and a variation var. To save the output plots, the flag save_plots has to be set to True. If the computation has to be performed for a specific data-taking era, also the argument era has to be specified.

save1d(save_plots)#: Function to save the 1D plots as png files if save_plots is set to True.

save2d(cat, syst, var, label, save_plots, era=None)#: Function that saves the 2D plots as png files if save_plots is set to True. The category cat, the systematic uncertainty syst and the variation var have to be specified. The argument label is required to specify the map that needs to be plotted.

save_corrections()#: Function to save the dictionary of corrections containing the scale factor value, the x-axis (and y-axis for 2D maps) and the data-taking year.

pocket_coffea.utils.plot_efficiency.plot_efficiency_maps(shape, config, year, outputdir, save_plots=False)#: Function to plot 1D and 2D efficiencies and scale factors and save the corrections dictionaries for the systematic variations included in the input histograms.

pocket_coffea.utils.plot_efficiency.plot_efficiency_maps_splitHT(shape, config, year, outputdir, save_plots=False)#: Function to plot 1D and 2D efficiencies and scale factors and save the corrections dictionaries for the HT systematic variation.

pocket_coffea.utils.plot_efficiency.plot_efficiency_maps_spliteras(shape, config, year, outputdir, save_plots=False)#: Function to plot 1D and 2D efficiencies and scale factors and save the corrections dictionaries for the data-taking era systematic variation.

pocket_coffea.utils.plot_efficiency.plot_ratio(x, y, ynom, yerrnom, xerr, edges, xlabel, ylabel, syst, var, opts, ax, data=False, sf=False, **kwargs)#: Function to plot the uncertainty band corresponding to the variation of an efficiency or scale factor in the ratio plot on an axis ax. To plot the data efficiency variation, the flag data has to be set to True. To plot the scale factor variation, the flag sf has to be set to True.

pocket_coffea.utils.plot_efficiency.plot_residue(x, y, ynom, yerrnom, xerr, edges, xlabel, ylabel, syst, var, opts, ax, data=False, sf=False, **kwargs)#: Function to plot the uncertainty band corresponding to the variation of an efficiency or scale factor in the residue plot on an axis ax. To plot the data efficiency variation, the flag data has to be set to True. To plot the scale factor variation, the flag sf has to be set to True.

pocket_coffea.utils.plot_efficiency.plot_variation(x, y, yerr, xerr, xlabel, ylabel, syst, var, opts, ax, data=False, sf=False, **kwargs)#: Function to plot a variation of an efficiency or scale factor on an axis ax. To plot the data efficiency variation, the flag data has to be set to True. To plot the scale factor variation, the flag sf has to be set to True.

pocket_coffea.utils.plot_efficiency.stack_sum(stack)#: Returns the sum histogram of a stack (hist.stack.Stack) of histograms.

pocket_coffea.utils.plot_efficiency.uncertainty_efficiency(eff, den, sumw2_num=None, sumw2_den=None, mc=False)#: Returns the uncertainty on an efficiency eff=num/den given the efficiency eff, the denominator den. For MC efficiency also the sum of the squared weights of numerator and denominator (sumw2_num, sumw2_den) have to be passed as argument and the flag mc has to be set to True.

pocket_coffea.utils.plot_efficiency.uncertainty_sf(eff_data, eff_mc, unc_eff_data, unc_eff_mc)#: Returns the uncertainty on a scale factor given the data and MC efficiency (eff_data, eff_mc) and the corresponding uncertainties (unc_eff_data, unc_eff_mc).
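
For orientation, the textbook versions of these two quantities are sketched below. They are assumptions and not necessarily the formulas used by uncertainty_efficiency and uncertainty_sf: the first is the binomial uncertainty for unweighted counts, the second is standard error propagation for a ratio with uncorrelated inputs. Both function names are hypothetical.

```python
import math


def binomial_uncertainty_sketch(eff, den):
    """Binomial uncertainty on eff = num/den for unweighted counts."""
    return math.sqrt(eff * (1.0 - eff) / den)


def sf_uncertainty_sketch(eff_data, eff_mc, unc_eff_data, unc_eff_mc):
    """Uncertainty on sf = eff_data / eff_mc, assuming uncorrelated inputs."""
    sf = eff_data / eff_mc
    # Relative uncertainties add in quadrature for a ratio.
    relative = math.hypot(unc_eff_data / eff_data, unc_eff_mc / eff_mc)
    return sf * relative


unc_eff = binomial_uncertainty_sketch(0.5, 100)
unc_sf = sf_uncertainty_sketch(0.9, 0.6, 0.09, 0.06)
```

For weighted MC events the binomial expression no longer applies directly, which is consistent with the signature above asking for sumw2_num and sumw2_den when mc is True.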

pocket_coffea.utils.plot_functions module#

pocket_coffea.utils.plot_functions.plot_shapes_comparison(df, var, shapes, title=None, ylog=False, output_folder=None, figsize=(8, 9), dpi=100, lumi_label='$137/fb$ (13 TeV)', outputfile=None)#

This function plots the comparison between different shapes, specified in the format shapes = [(sample, cat, year, variation, label), ...]

The sample, cat and year are used to retrieve the shape from the df; the label is used in the plotting. The ratios of all the shapes with respect to the first one in the list are printed.

The plot is saved if outputfile is not None.

pocket_coffea.utils.plot_sf module#

pocket_coffea.utils.plot_sf.plot_variation_correctionlib(file, axis_x, systematics, plot_dir, **kwargs)#

pocket_coffea.utils.plot_utils module#

pocket_coffea.utils.rucio module#

pocket_coffea.utils.rucio.get_dataset_files_from_dbs(dataset_name: str, dbs_instance: str = 'prod/global')#: This function queries the DBS server to get information about the location of each block in a CMS dataset. It is used instead of the rucio replica query when the dataset is not available in rucio.

pocket_coffea.utils.rucio.get_dataset_files_replicas(dataset, allowlist_sites=None, include_redirector=False, blocklist_sites=None, prioritylist_sites=None, regex_sites=None, mode='full', partial_allowed=False, client=None, scope='cms', sort: str = 'geoip', invalid_list=[])#

Query the Rucio server to get information about the location of all the replicas of the files in a CMS dataset.

The sites can be filtered in different ways:

- allowlist_sites: list of sites to select from. If the file is not found there, raise an Exception.
- blocklist_sites: list of sites to avoid. If no site is left for a file, raise an Exception.
- prioritylist_sites: list of prioritized sites. These sites are sorted to the front if available and sort is 'priority'.
- regex_sites: regular expression to restrict the list of sites.

The fileset returned by the function is controlled by the mode parameter:

- "full": returns the full set of replicas and sites (passing the filtering parameters)
- "first": returns the first replica found for each file
- "best": to be implemented (ServiceX..)
- "roundrobin": try to distribute the replicas over different sites

Parameters:

dataset (str)
allowlist_sites (list)
blocklist_sites (list)
prioritylist_sites (list)
regex_sites (list)
mode (str, default "full")
client (rucio Client, optional)
partial_allowed (bool, default False)
scope (rucio scope, "cms")
sort (str, default 'geoip') – Sort replicas (for details check rucio documentation)
invalid_list (list) – A list of invalid files for this dataset (to be excluded in the output). Rucio does not know of invalid files, so these need to be obtained beforehand from DAS.

Returns:

files (list) – depending on the mode option:
- If mode=="full", returns the complete list of replicas for each file in the dataset.
- If mode=="first", returns only the first replica for each file.
sites (list) – depending on the mode option:
- If mode=="full", returns the list of sites where the file replica is available for each file in the dataset.
- If mode=="first", returns a list of sites for the first replica of each file.
sites_counts (dict) – Metadata counting the coverage of the dataset by site
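
The site filtering and the mode selection can be illustrated without a Rucio client. The sketch below works on plain lists of site names; both function names are hypothetical, the regex is taken as a single pattern string, and the round-robin rule (rotate by file index) is an assumption about what distributing replicas over sites could mean.

```python
import re


def filter_sites_sketch(sites, allowlist=None, blocklist=None, regex=None):
    """Apply allowlist, blocklist and regex filters to one file's sites."""
    selected = list(sites)
    if allowlist:
        selected = [s for s in selected if s in allowlist]
    if blocklist:
        selected = [s for s in selected if s not in blocklist]
    if regex:
        selected = [s for s in selected if re.search(regex, s)]
    if not selected:
        raise RuntimeError("no site left after filtering")
    return selected


def pick_sites_sketch(sites_per_file, mode="full"):
    """Choose sites for each file according to `mode`."""
    if mode == "full":
        return [list(sites) for sites in sites_per_file]
    if mode == "first":
        return [[sites[0]] for sites in sites_per_file]
    if mode == "roundrobin":
        # Rotate through each file's candidate sites using the file index.
        return [[sites[i % len(sites)]] for i, sites in enumerate(sites_per_file)]
    raise NotImplementedError(mode)


sites = ["T2_IT_Pisa", "T2_DE_DESY", "T1_US_FNAL_Disk"]
only_t2 = filter_sites_sketch(sites, blocklist=["T2_DE_DESY"], regex=r"^T2_")
spread = pick_sites_sketch([sites, sites, sites], mode="roundrobin")
```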

pocket_coffea.utils.rucio.get_rucio_client(proxy=None) → Client#

Open a client to the CMS rucio server using x509 proxy.

Parameters:

proxy (str, optional) – Use the provided proxy file if given; otherwise use voms-proxy-info to get the currently active one.

Returns:

nativeClient – Rucio client

Return type:

rucio.Client

pocket_coffea.utils.rucio.get_xrootd_sites_map()#

The mapping between RSE (sites) and the xrootd prefix rules is read from /cvmfs/cms.cern.ch/SITECONF/*site*/storage.json.

This function returns the list of xrootd prefix rules for each site.

pocket_coffea.utils.rucio.query_dataset(query: str, client=None, tree: bool = False, datatype='container', scope='cms')#

This function uses the rucio client to query for containers or datasets.

Parameters:

query (str) – Query to filter datasets / containers with the rucio list_dids function
client (rucio client)
tree (bool) – If True, return the results splitting the dataset name in parts
datatype (str) – "container" or "dataset", in rucio terminology: a "container" is a CMS dataset, a "dataset" is a CMS block
scope (str) – Rucio instance, "cms"

Returns:

list of containers/datasets
if tree==True, returns the list of datasets and also a dictionary decomposing the dataset names into their 1st part and a list of the available 2nd parts.
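
The decomposition returned when tree is True can be pictured with the sketch below. It assumes the usual three-part CMS dataset name (/first/second/tier); dataset_tree_sketch is a hypothetical helper and the dataset names are made up for illustration.

```python
from collections import defaultdict


def dataset_tree_sketch(names):
    """Group dataset names by first part, listing the available second parts."""
    tree = defaultdict(list)
    for name in names:
        parts = name.strip("/").split("/")
        if len(parts) >= 2:
            tree[parts[0]].append(parts[1])
    return dict(tree)


names = [
    "/PrimaryA/CampaignX-v1/NANOAODSIM",
    "/PrimaryA/CampaignX-v2/NANOAODSIM",
    "/PrimaryB/CampaignX-v1/NANOAODSIM",
]
tree = dataset_tree_sketch(names)
```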

pocket_coffea.utils.run module#

pocket_coffea.utils.run.get_runner(executor, chunksize, maxchunks, skipbadfiles, schema, format, error_log_file, exit_on_error=True)#

Create and return a Coffea Runner wrapped with error logging, given the specified configuration parameters.

Parameters:

executor (str) – The executor type for the Coffea Runner (e.g., 'futures', 'iterative').
chunksize (int) – The number of events per chunk to process.
maxchunks (int) – The maximum number of chunks to process.
skipbadfiles (bool) – Whether to skip bad files during processing.
schema (coffea.nanoevents.schemas.BaseSchema) – The schema to use for NanoEvents.
format (str) – The file format (e.g., 'root').
error_log_file (str) – Path to the error log file for logging exceptions.
exit_on_error (bool, optional) – If True, exits the program on error after logging. Default is True.

Returns:

A Coffea Runner instance configured with the specified parameters.

Return type:

Runner

pocket_coffea.utils.skim module#

pocket_coffea.utils.skim.apply_skim_sumgenweights_override(accumulator, filesets)#

Override accumulator[‘sum_genweights’] and accumulator[‘sum_signOf_genweights’] from the authoritative dataset-level totals embedded in the dataset metadata at skim time.

Background: when a skim drops every event of an input chunk, that chunk’s contribution to the original sum_genweight is lost from the per-chunk reconstruction in BaseProcessorABC.process (the sum(skimRescaleGenWeight * genWeight) line that runs only on surviving events), because no ROOT file is written for the empty chunk. To recover it, the skim job persists the pre-skim dataset-level total into the new dataset JSON via save_skimed_dataset_definition; we read it back here and replace the (possibly under-counted) reconstructed total before rescale_sumgenweights runs.

No-op when the dataset is not flagged isSkim or when the metadata doesn’t carry the new fields — older skim outputs continue to use the per-chunk reconstruction.

Returns the list of datasets whose totals were overridden, for logging. Kept dependency-free (stdlib only) so it can be unit-tested without the heavy executor / omegaconf stack.
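
The override logic can be sketched with plain dictionaries. Several details below are assumptions and are marked as such: the layout filesets[dataset]["metadata"], the representation of the isSkim flag, and above all the metadata field name sum_genweights_before_skim, which is hypothetical. Only sum_genweights is handled; sum_signOf_genweights would follow the same pattern.

```python
def override_sumgenweights_sketch(accumulator, filesets):
    """Replace reconstructed totals with the ones stored in skim metadata."""
    overridden = []
    for dataset, fileset in filesets.items():
        meta = fileset.get("metadata", {})
        # Assumption: the flag is stored as a bool or as the string "True".
        if str(meta.get("isSkim", "False")) != "True":
            continue
        # Hypothetical field name for the pre-skim dataset-level total.
        total = meta.get("sum_genweights_before_skim")
        if total is None:
            continue  # older skim output: keep the per-chunk reconstruction
        accumulator["sum_genweights"][dataset] = float(total)
        overridden.append(dataset)
    return overridden


acc = {"sum_genweights": {"A_2018": 90.0, "B_2018": 10.0, "C_2018": 5.0}}
filesets = {
    "A_2018": {"metadata": {"isSkim": "True", "sum_genweights_before_skim": "100.0"}},
    "B_2018": {"metadata": {"isSkim": "True"}},
    "C_2018": {"metadata": {"isSkim": "False", "sum_genweights_before_skim": "7.0"}},
}
changed = override_sumgenweights_sketch(acc, filesets)
```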

pocket_coffea.utils.skim.copy_file(fname: str, localdir: str, location: str, subdirs: List[str] | None = None)#

pocket_coffea.utils.skim.is_rootcompat(a)#: Is it a flat or 1-d jagged array?

pocket_coffea.utils.skim.save_skimed_dataset_definition(processing_out, fileout, check_initial_events=True, skip_initial_events_check_datasets=None)#

Build the skimmed dataset JSON from a (merged) processing output.

By default the number of initial events in the dataset metadata must match the cutflow["initial"] count for every dataset, otherwise an exception is raised (some input chunk was lost). skip_initial_events_check_datasets is a list of dataset names for which this mismatch is tolerated: a warning is printed instead of raising, which is useful when a corrupted input file had to be skipped on purpose. check_initial_events=False disables the check entirely for all datasets.
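
The interplay of the two options can be sketched as a standalone check. This illustrates the rule described above and is not the library's code: check_initial_events_sketch is a hypothetical name, and taking the expected counts as a dict of dataset name to event count is an assumption.

```python
def check_initial_events_sketch(expected, cutflow_initial,
                                check_initial_events=True,
                                skip_initial_events_check_datasets=None):
    """Compare expected event counts with cutflow['initial'] per dataset."""
    if not check_initial_events:
        return []  # check disabled for all datasets
    tolerated = set(skip_initial_events_check_datasets or [])
    warned = []
    for dataset, n_expected in expected.items():
        n_seen = cutflow_initial.get(dataset, 0)
        if n_seen == n_expected:
            continue
        message = f"{dataset}: expected {n_expected} initial events, found {n_seen}"
        if dataset not in tolerated:
            raise RuntimeError(message)
        print(f"WARNING: {message}")  # tolerated mismatch: warn and go on
        warned.append(dataset)
    return warned


expected = {"A_2018": 100, "B_2018": 50}
seen = {"A_2018": 100, "B_2018": 40}
warned = check_initial_events_sketch(
    expected, seen, skip_initial_events_check_datasets=["B_2018"]
)
```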

pocket_coffea.utils.skim.uproot_writeable(events)#: Restrict to columns that uproot can write compactly

pocket_coffea.utils.time module#

pocket_coffea.utils.time.wait_until(time)#: Wait until the given time

pocket_coffea.utils.utils module#

pocket_coffea.utils.utils.adapt_chunksize(nevents, run_options)#: Helper function to adjust the chunksize so that each worker has at least a chunk to process. If the number of available workers exceeds the maximum number of workers for a given dataset, the chunksize is reduced so that all the available workers are used to process the given dataset.
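
The adjustment can be sketched as follows. The run_options keys "chunksize" and "scaleout" (number of available workers) are assumptions, as are the function name and the exact rounding; the sketch only shows the idea of shrinking chunks when there are fewer chunks than workers.

```python
import math


def adapt_chunksize_sketch(nevents, run_options):
    """Return a chunksize small enough to give every worker a chunk."""
    chunksize = run_options["chunksize"]  # assumed key
    workers = run_options["scaleout"]     # assumed key: available workers
    max_useful_workers = math.ceil(nevents / chunksize)
    if workers > max_useful_workers:
        # Too few chunks: shrink them so the dataset splits into `workers` chunks.
        chunksize = max(1, math.ceil(nevents / workers))
    return chunksize


small = adapt_chunksize_sketch(1000, {"chunksize": 500, "scaleout": 10})
large = adapt_chunksize_sketch(1_000_000, {"chunksize": 1000, "scaleout": 10})
```

In the first call only two chunks of 500 events would exist for ten workers, so the chunksize drops to 100; in the second call there are already more chunks than workers and the chunksize is left unchanged.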

pocket_coffea.utils.utils.add_to_path(p)#

pocket_coffea.utils.utils.dump_ak_array(akarr: Array, fname: str, location: str, subdirs: List[str] | None = None) → None#: Dump an awkward array to disk at location/’/’.join(subdirs)/fname.

pocket_coffea.utils.utils.get_nano_version(events, params, year)#: Helper function to get the nano version from the events metadata or from the default parameters.

pocket_coffea.utils.utils.get_random_seed(metadata, salt='')#: Generate a random seed based on the current file and entry range being processed. This ensures that different files and different entry ranges will produce different seeds, while the same file and entry range will always produce the same seed. An optional salt can be provided to further differentiate the seed generation for different functions.
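
One common way to obtain such a reproducible seed is to hash the identifying fields. The sketch below assumes the metadata dict carries the keys filename, entrystart and entrystop, and it uses SHA-256 truncated to 32 bits; both choices are assumptions, and random_seed_sketch is a hypothetical name.

```python
import hashlib


def random_seed_sketch(metadata, salt=""):
    """Derive a reproducible 32-bit seed from file name and entry range."""
    # Assumed metadata keys; the real function may read different fields.
    key = (
        f"{metadata['filename']}:{metadata['entrystart']}:"
        f"{metadata['entrystop']}:{salt}"
    )
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


chunk = {"filename": "a.root", "entrystart": 0, "entrystop": 1000}
seed = random_seed_sketch(chunk)
seed_again = random_seed_sketch(dict(chunk))
```

A cryptographic hash is used here instead of Python's built-in hash() because the latter is randomized per process for strings, which would break reproducibility across workers.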

pocket_coffea.utils.utils.load_config(cfg, do_load=True, save_config=True, outputdir=None)#: Helper function to load a Configurator instance from a user defined python module

pocket_coffea.utils.utils.load_failed_jobs(outputdir)#

Load the list of failed job names from a JSON file.

Parameters:

outputdir (str) – Output directory where the failed_jobs.json file is located

Returns:

List of dataset or group names that failed processing, or None if file doesn’t exist

Return type:

list of str or None

pocket_coffea.utils.utils.path_import(absolute_path)#

pocket_coffea.utils.utils.replace_at_indices(a, indices, a_corrected, array_builder)#

Replace elements of array a at positions specified by indices with values from a_corrected. indices is a jagged array where each sub-array contains the indices to be replaced for the corresponding sub-array in a. a_corrected is a jagged array with the same shape as indices, containing the new values to insert. The shape of a is different from that of indices and a_corrected, but they share the same outer dimension.

Parameters:

a (array) – Original array to be modified
indices (array) – Indices where replacements should occur
a_corrected (array) – Corrected values to insert (same shape as indices)
array_builder (ak.ArrayBuilder, optional) – ArrayBuilder to use. If None, a new one is created.

Returns:

array – Modified copy of a
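
The replacement rule is easiest to see on nested Python lists. The sketch below stands in for the awkward-array version: it uses plain lists instead of awkward arrays, omits the ArrayBuilder argument, and the name replace_at_indices_sketch is hypothetical.

```python
def replace_at_indices_sketch(a, indices, a_corrected):
    """Return a copy of jagged list `a` with selected elements replaced."""
    result = []
    for row, row_indices, row_values in zip(a, indices, a_corrected):
        new_row = list(row)  # copy so the input is left untouched
        for idx, value in zip(row_indices, row_values):
            new_row[idx] = value
        result.append(new_row)
    return result


a = [[1.0, 2.0, 3.0], [], [4.0, 5.0]]
indices = [[0, 2], [], [1]]
a_corrected = [[10.0, 30.0], [], [50.0]]
replaced = replace_at_indices_sketch(a, indices, a_corrected)
```

All three inputs share the outer dimension (three rows here), while the inner lengths of indices and a_corrected differ from those of a, matching the shape relation stated above.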

pocket_coffea.utils.utils.save_failed_jobs(failed_jobs_list, outputdir)#

Save the list of failed job names to a JSON file.

Parameters:

failed_jobs_list (list of str) – List of dataset or group names that failed processing
outputdir (str) – Output directory where the failed_jobs.json file will be saved
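
The save and load pair amounts to a JSON round trip through failed_jobs.json. The sketch below assumes the file holds a plain JSON list of names; the function names are hypothetical and the real file layout may differ.

```python
import json
import os


def save_failed_jobs_sketch(failed_jobs_list, outputdir):
    """Write the failed job names to <outputdir>/failed_jobs.json."""
    os.makedirs(outputdir, exist_ok=True)
    with open(os.path.join(outputdir, "failed_jobs.json"), "w") as handle:
        json.dump(list(failed_jobs_list), handle, indent=2)


def load_failed_jobs_sketch(outputdir):
    """Read the failed job names back, or return None if the file is missing."""
    path = os.path.join(outputdir, "failed_jobs.json")
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)
```

Returning None for a missing file lets the caller distinguish "no record of a previous run" from an empty list, which means a run was recorded and nothing failed.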
