robustipy package
Submodules
robustipy.figures module
- robustipy.figures.axis_formatter(ax: Axes, ylabel: str, xlabel: str, title: str, side: str = 'left') None[source]
Apply consistent styling to a Matplotlib Axes: grids, fonts, labels, and title placement.
- Parameters:
ax (matplotlib.axes.Axes) – The axes object to format.
ylabel (str) – Label text for the y-axis.
xlabel (str) – Label text for the x-axis.
title (str) – Title text for the plot.
side ({'left', 'right'}, default='left') – Side on which to draw the y-axis label and title. With ‘left’, the y-label is on the left and the title is aligned slightly left. With ‘right’, the y-label is on the right and the title is aligned to the right side of the axes.
- Returns:
This function modifies ax in place and does not return anything.
- Return type:
None
- robustipy.figures.plot_bdist(results_object, oddsratio: bool, specs: List[List[str]] | None = None, ax: Axes | None = None, title: str = '', despine_left: bool = True, legend_bool: bool = False, bw_adjust: float = 0.5, highlights: bool = True, colormap: str | Colormap = 'viridis') Axes[source]
Plot density-scaled histograms and KDEs of coefficient distributions. The KDE is smoothed with a bandwidth adjustment factor (bw_adjust).
- Parameters:
results_object (object)
oddsratio (bool) – If True, exponentiate the coefficient estimates before plotting.
specs (list of control-lists, optional) – Up to three specs to highlight. Default: None (no highlights).
ax (matplotlib.axes.Axes, optional) – Axes on which to draw; if None a new (4×3) figure and axes are created.
title (str, default='') – Title to display above the plot.
despine_left (bool, default=True) – If True, move y-axis ticks & label to the right spine; otherwise keep on the left.
legend_bool (bool, default=False) – If True, draw a custom legend for the highlighted specifications.
bw_adjust (float, default=0.5) – Bandwidth adjustment factor for the KDE; larger values make the curve smoother.
highlights (bool, default=True) – If True, highlights the full model and the null model in the plot.
colormap (str or Colormap, default='viridis') – Colormap used for highlighted specifications.
- Returns:
ax – The axes containing the completed plot.
- Return type:
matplotlib.axes.Axes
- robustipy.figures.plot_bma(results_object, colormap: str | Colormap, ax: Axes, feature_order: Sequence[str], title: str = '') Axes[source]
Plot Bayesian Model Averaging (BMA) inclusion probabilities as a horizontal bar chart.
- Parameters:
results_object (object) – Must implement compute_bma(), returning a DataFrame with columns ‘control_var’ and ‘probs’.
colormap (str or Colormap) – Matplotlib colormap name or object used to pick the bar color.
ax (matplotlib.axes.Axes) – Axes on which to draw the horizontal bar chart.
feature_order (sequence of str) – Ordered list of control variable names to display on the y-axis.
title (str, default='') – Title to display above the plot.
- Returns:
ax – The axes containing the completed BMA plot.
- Return type:
matplotlib.axes.Axes
- robustipy.figures.plot_curve(results_object, loess: bool = True, ci: float = 1, oddsratio: bool = False, specs: List[List[str]] | None = None, ax: Axes | None = None, highlights: bool = True, inset: bool = True, title: str = '', colormap: str | Colormap = 'viridis') Axes[source]
Plot the specification-curve of median and CI for coefficient estimates.
- Parameters:
results_object (object) – Must expose .summary_df (with columns ‘median’), .specs_names, .estimates (DataFrame of bootstrap draws), .draws, .kfold, and .inference dict.
loess (bool, default=True) – Whether to smooth the lower/upper CI bounds with LOESS.
ci (float, default=1) – The confidence-level (e.g. 0.95 for a 95% interval).
oddsratio (bool, default=False) – If True, exponentiate the estimates before plotting.
specs (list of control-lists, optional) – Up to three specs to highlight. Default: None (no highlights).
ax (matplotlib.axes.Axes, optional) – Axes to draw on. Default: current axes.
colormap (str or Colormap, default='viridis') – Colormap for highlights and related elements.
title (str, optional) – Title text for the axes.
highlights (bool, default=True) – If True, highlights the full model and the null model in the plot.
inset (bool, default=True) – If True, adds an inset with the full model and null model highlights.
- Returns:
The axes containing the plot.
- Return type:
matplotlib.axes.Axes
- robustipy.figures.plot_hexbin_log(results_object, ax: plt.Axes, fig: plt.Figure, oddsratio: bool, colormap: str | cm.Colormap, title: str = '') None[source]
Plot a hex-bin density of full-sample coefficient estimates vs. log-likelihood.
- Parameters:
results_object (object) –
- Must expose:
all_b/all_b_exp: list/array of full-sample coefficient arrays
summary_df[‘ll’] or summary_df[‘ll_gain_per_obs’]: corresponding likelihood metric values
ax (matplotlib.axes.Axes) – The axes on which to draw the hex-bin.
fig (matplotlib.figure.Figure) – Parent figure (needed to place the colorbar).
oddsratio (bool) – If True, use exponentiated estimates for plotting.
colormap (str or Colormap) – Name or object of a Matplotlib colormap.
title (str, optional) – Title displayed above the plot (default: ‘’).
- Return type:
None
- robustipy.figures.plot_hexbin_r2(results_object, ax: plt.Axes, fig: plt.Figure, oddsratio: bool, colormap: str | cm.Colormap, title: str = '', side: str = 'left') None[source]
Hex-bin density plot of boot-strapped coefficient estimates versus in-sample \(R^2\), together with a marginal colour-bar of observation counts.
- Parameters:
results_object (Any) – Must expose results_object.estimates and results_object.r2_values, each supporting .stack() to obtain 1-d views.
ax (matplotlib.axes.Axes) – Target axes.
fig (matplotlib.figure.Figure) – Parent figure, needed for colour-bar geometry.
oddsratio (bool) – If True, use exponentiated estimates for plotting.
colormap (str | matplotlib.colors.Colormap) – Matplotlib-compatible colormap.
title (str, optional) – Axes title.
side ({'left', 'right'}, optional) –
'left' – conventional layout: y-axis on the left, colour-bar on the right.
'right' – mirror layout: y-axis (ticks, label, spine) on the right, colour-bar on the left; the left spine is removed.
- Returns:
Draws in place on ax.
- Return type:
None
- Raises:
ValueError – If side is not ‘left’ or ‘right’.
Notes
Only the presentation layer is mirrored; the data are not transformed.
- robustipy.figures.plot_ic(results_object, ic: str, specs: List[List[str]] | None = None, ax: Axes | None = None, colormap: str = 'viridis', title: str = '', despine_left: bool = True) Axes[source]
- Plots the information criterion (IC) curve, colouring:
“No Controls” in the first colormap colour
Each user‐highlighted spec in the next colours
“Full Model” in the last colormap colour
- robustipy.figures.plot_kfolds(results_object, colormap: str | Colormap, ax: Axes | None = None, title: str = '', despine_left: bool = True, tau: float = 0.6) Axes[source]
Plot the cross-validation metric distribution (density + histogram), with an adaptive legend positioned safely around the tallest bars.
- Parameters:
results_object (object) –
- Must expose:
summary_df : pandas.DataFrame containing column ‘av_k_metric’
name_av_k_metric : str, the metric name (e.g. ‘r-squared’, ‘rmse’)
colormap (str or Colormap) – Matplotlib colormap name or object used for plotting.
ax (matplotlib.axes.Axes, optional) – Axes on which to draw; if None a new (4×3) figure and axes are created.
title (str, default='') – Title to display above the plot.
despine_left (bool, default=True) – If True, move y-axis ticks & label to the right spine; otherwise keep on the left.
tau (float in (0,1), default=0.6) – Safety factor for legend placement: bars taller than tau*ylim are considered “in the way” and flip the legend to the opposite side.
- Returns:
ax – The axes containing the completed plot.
- Return type:
matplotlib.axes.Axes
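The tau rule described above can be illustrated with a minimal sketch. The function name legend_side, the preferred argument, and the midpoint split of the bars are assumptions made for illustration; they are not taken from robustipy's source.

```python
from typing import Sequence


def legend_side(bar_heights: Sequence[float], y_max: float,
                tau: float = 0.6, preferred: str = "left") -> str:
    """Return 'left' or 'right' for the legend, flipping away from tall bars."""
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    half = len(bar_heights) // 2
    # Bars sharing the preferred side of the axes (assumed split at the midpoint).
    same_side = bar_heights[:half] if preferred == "left" else bar_heights[half:]
    # A bar is "in the way" when it is taller than tau * y_max.
    blocked = any(h > tau * y_max for h in same_side)
    if not blocked:
        return preferred
    return "right" if preferred == "left" else "left"
```

With the default tau=0.6 and a y-limit of 10, any bar above 6 on the preferred side moves the legend to the other side.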
- robustipy.figures.plot_results(results_object, loess: bool = True, ci: float = 0.95, specs: List[List[str]] | None = None, ic: str | None = None, colormap: str | Colormap = 'viridis', figsize: Tuple[int, int] = (16, 16), ext: str = 'pdf', figpath=None, highlights=True, oddsratio=False, project_name: str = None, spec_matrix_bins: int = 128, spec_matrix_threshold: int = 128) None[source]
Plots the coefficient estimates, IC curve, and distribution plots for the given results object.
- Parameters:
results_object (object) – An OLSResult-like object (must expose attributes y_name, x_name, shap_return, summary_df, specs_names, etc.).
loess (bool, default=True) – Whether to apply LOESS smoothing to the coefficient–specification curve.
ci (float, default=0.95) – The confidence interval to use.
specs (list of list of str, optional) – Up to three specs (lists of control names) to highlight in the curve, IC, and distribution panels.
ic (str, optional) – Information criterion name to plot (one of ‘aic’,’bic’,’hqic’).
colormap (str or Colormap, default='viridis') – Colormap used consistently for all panels.
figsize ((width, height), default=(16,16)) – Size of the full figure in inches.
figpath (str or Path, optional) – Directory in which to save outputs; if None, uses current working dir.
ext (str, default='pdf') – File extension to save each panel (e.g. ‘png’,’pdf’).
project_name (str, default=None) – Directory and filename prefix under ./figures/.
spec_matrix_bins (int, default=128) – Number of bins to use for the binned spec matrix heatmap.
spec_matrix_threshold (int, default=128) – Minimum number of specifications required to switch from dots to heatmap.
oddsratio (bool, default=False) – Whether to exponentiate the coefficients (e.g. for odds ratios).
highlights (bool, default=True) – Whether to highlight certain specifications.
Notes
Saves a combined “_all” figure plus individual panels. When len(y_name) == 1 the panels are named _R2hexbin, _OOS, _curve, _LLhexbin, _SHAP, _BMA, _IC, and _bdist; when len(y_name) > 1 only a subset of these is saved.
- robustipy.figures.plot_spec_matrix(results_object, ax: Axes | None = None, order_idx: Sequence[int] | None = None, controls: Sequence[str] | None = None, oddsratio: bool = False, title: str = '', bins: int | None = None, heatmap_threshold: int = 128, colormap: str | Colormap = 'viridis', cbar_ax: Axes | None = None, cbar_width: float = 0.04, cbar_width_fig: float | None = None) Axes[source]
Plot a dot matrix indicating which controls are included in each specification.
- Parameters:
results_object (object) – Must expose .specs_names and (optionally) .controls.
ax (matplotlib.axes.Axes, optional) – Axes to draw on. Default: current axes.
order_idx (sequence of int, optional) – Ordering of specifications along the x-axis. If None, uses the specification-curve ordering (sorted by median).
controls (sequence of str, optional) – Controls to show on the y-axis. Defaults to results_object.controls if available; otherwise uses the union of spec controls.
oddsratio (bool, default=False) – If True, use exponentiated estimates for ordering (to match plot_curve).
title (str, optional) – Title text for the axes.
bins (int, optional) – If provided, aggregate specifications into this many bins and plot inclusion rates as a heatmap (0–1) instead of individual dots.
heatmap_threshold (int, default=128) – Minimum number of specifications required to switch from dots to heatmap.
colormap (str or Colormap, default='viridis') – Colormap used for heatmap shading (dot matrix uses a fixed blue tone).
cbar_ax (matplotlib.axes.Axes, optional) – If provided, draw the heatmap colorbar inside this axes instead of creating an inset colorbar on the right.
cbar_width (float, default=0.04) – Colorbar width as a fraction of the heatmap axes width (or the cbar axis width when cbar_ax is provided).
cbar_width_fig (float, optional) – Absolute colorbar width as a fraction of the figure width. If provided, this overrides cbar_width when cbar_ax is given, ensuring consistent absolute thickness across panels.
- Returns:
The axes containing the plot.
- Return type:
matplotlib.axes.Axes
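The binned heatmap described for the bins parameter amounts to computing, per bin, the share of specifications that include each control. The sketch below shows one way to do that with contiguous, near-equal bins; the function name inclusion_rates and the binning scheme are assumptions, and the actual aggregation inside plot_spec_matrix may differ.

```python
from typing import Dict, List, Sequence, Set


def inclusion_rates(specs: Sequence[Set[str]], controls: Sequence[str],
                    bins: int) -> Dict[str, List[float]]:
    """Share of specifications in each bin that include each control (0 to 1)."""
    if bins < 1:
        raise ValueError("bins must be a positive integer")
    if not specs:
        raise ValueError("specs must not be empty")
    n = len(specs)
    bins = min(bins, n)  # never create more bins than specifications
    # Contiguous, near-equal bins over the (already ordered) specifications.
    edges = [i * n // bins for i in range(bins + 1)]
    rates: Dict[str, List[float]] = {c: [] for c in controls}
    for lo, hi in zip(edges[:-1], edges[1:]):
        chunk = specs[lo:hi]
        for c in controls:
            rates[c].append(sum(c in s for s in chunk) / len(chunk))
    return rates
```

Each list in the returned dictionary is one heatmap row, ordered along the x-axis in the same order as specs.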
- robustipy.figures.shap_violin(ax: Axes, shap_values: ndarray | List[ndarray] | Explanation, features: ndarray | DataFrame | List[str] | None = None, feature_names: List[str] | None = None, max_display: int = 10, color: str | Sequence | None = None, alpha: float = 1.0, cmap: str = 'viridis', use_log_scale: bool = False, title: str = '', clear_yticklabels: bool = False, cbar_ax: Axes | None = None, cbar_width: float = 0.04, cbar_width_fig: float | None = None) List[str][source]
Create a SHAP beeswarm plot, colored by feature values when they are provided.
- Parameters:
ax (matplotlib.axes.Axes) – Axes on which to draw the plot.
shap_values (array-like or Explanation) – SHAP value matrix (#samples×#features), or a list thereof for multiclass, or a SHAP Explanation object.
features (array-like, DataFrame, or list of str, optional) – Feature value matrix (#samples×#features), or just a feature_names list. Default: None (no coloring).
feature_names (list of str, optional) – Names of each feature. Default: None (will infer or auto‐label).
max_display (int, default=10) – Maximum number of top features (by mean absolute SHAP value) to show.
color (str or sequence, optional) – Single color for all points when no feature values given.
alpha (float, default=1.0) – Opacity for scatter points.
cmap (str, default='viridis') – Colormap for coloring points.
use_log_scale (bool, default=False) – If True, use symlog x-axis scaling.
title (str, optional) – Title text for the axes.
clear_yticklabels (bool, default=False) – If True, hide the y-tick labels.
cbar_ax (matplotlib.axes.Axes, optional) – If provided, draw the colorbar inside this axes instead of creating an inset colorbar on the right.
cbar_width (float, default=0.04) – Colorbar width as a fraction of the axes width (or the cbar axis width when cbar_ax is provided).
cbar_width_fig (float, optional) – Absolute colorbar width as a fraction of the figure width. If provided, this overrides cbar_width when cbar_ax is given, ensuring consistent absolute thickness across panels.
- Returns:
Ordered list of feature names actually plotted.
- Return type:
List[str]
- robustipy.figures.title_setter(ax: Axes, title: str, side: str = 'left') None[source]
Set a title on ax, aligned on the left but positioned differently depending on whether the y-axis is on the left or right.
- Parameters:
ax (matplotlib.axes.Axes) – The axes whose title you wish to set.
title (str) – The title text.
side ({'left', 'right'}, default='left') –
‘left’: standard positioning.
’right’: shifts the title so it doesn’t overlap a right-side y-axis.
robustipy.models module
robustipy.models
This module implements multivariate regression classes for Robust Inference. It includes classes for OLS (OLSRobust and OLSResult) and logistic regression (LRobust) analysis, along with utilities for model merging, plotting, and Bayesian model averaging.
- class robustipy.models.LRobust(*, y: List[str], x: List[str], data: DataFrame, model_name: str = 'Logistic Regression Robust')[source]
Bases: BaseRobust
A class to perform logistic regression analysis using statsmodels.Logit.
- Parameters:
y (list[str]) – Name of the dependent binary variable, supplied as a one-element list. Multiple binary outcomes are not currently supported in a single LRobust fit.
x (list[str]) – Predictor column name(s) included in every specification. The first element is treated as the reported focal estimand; any additional elements are fixed predictors that do not vary across the control-subset space.
data (pandas.DataFrame) – The dataset containing y, x, and any optional controls.
model_name (str, default='Logistic Regression Robust') – Custom label for this model, used in outputs and plots.
- results
Populated after calling .fit(). Contains all coefficient, p-value, and metric outputs.
- Type:
- fit(*, controls: List[str], group: str | None = None, draws: int = None, sample_size: int | None = None, kfold: int = None, oos_metric: str = None, n_cpu: int | None = None, seed: int | None = None, rescale_x: bool | None = False, rescale_y: bool | None = False, rescale_z: bool | None = False, compute_shap: bool = True, threshold: int = 1000000) LRobust[source]
Fit the logistic regression models over the specification space and bootstrap samples.
- Parameters:
controls (list of str) – Names of optional control variables to include in every spec.
group (str, optional) – Grouping variable used for grouped cross-validation and cluster bootstrap resampling. Logistic fixed-effects demeaning is not currently implemented.
draws (int, default=None) – Number of bootstrap resamples per specification.
sample_size (int, optional) – Number of observations per bootstrap draw; defaults to full dataset.
kfold (int, default=None) – Folds for out-of-sample CV; set to 0 to disable.
oos_metric (str, default=None) – Metric to compute on held-out folds. Options: {‘pseudo-r2’, ‘mcfaddens-r2’, ‘imv’, ‘rmse’, ‘cross-entropy’}.
n_cpu (int, optional) – Number of parallel jobs; defaults to all available.
seed (int, optional) – Random seed for reproducibility.
rescale_y (bool, default=False) – Rescale the dependent variable.
rescale_x (bool, default=False) – Rescale the x variable.
rescale_z (bool, default=False) – Rescale the z variables.
compute_shap (bool, default=True) – If True, compute SHAP values for plotting. Set False for fit-only profiling.
threshold (int, default=1000000) – Warn if draws * n_specs exceeds this.
- Returns:
self – Self, with .results populated as an OLSResult.
- Return type:
LRobust
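The group parameter above mentions cluster bootstrap resampling. As a rough illustration of that idea, the sketch below resamples whole groups with replacement and returns the row indices of the drawn clusters. The function name cluster_bootstrap_indices is hypothetical, and drawing as many clusters as there are distinct groups is an assumption; robustipy's own resampling may be organised differently.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence


def cluster_bootstrap_indices(groups: Sequence[Hashable],
                              seed: Optional[int] = None) -> List[int]:
    """Resample whole groups with replacement and return the row indices."""
    rng = random.Random(seed)
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for row, g in enumerate(groups):
        members[g].append(row)
    labels = list(members)
    # Draw as many clusters as there are distinct groups, with replacement.
    drawn = rng.choices(labels, k=len(labels))
    # Every row of a drawn cluster enters the resample together.
    return [row for g in drawn for row in members[g]]
```

Because rows of the same group always travel together, within-group dependence is preserved in each resample.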
- get_results() Any[source]
Get the results of the logistic regression.
- Returns:
results – Object containing the regression results.
- Return type:
Any
- multiple_y() None[source]
- Build the lists
self.y_composites – pandas Series, one per composite Y
self.y_specs – tuple[str], names that form that composite
If self.composite_sample is a positive int, draw that many random non-empty subsets of the raw Y columns before we create any Series. Otherwise enumerate all non-empty subsets (original behaviour).
- class robustipy.models.MergedResult(*, y: str, specs: Sequence[Sequence[str]], estimates: ndarray | DataFrame, p_values: ndarray | DataFrame, r2_values: ndarray | DataFrame)[source]
Bases: Protoresult
Combine and summarize results exclusively from one or more OLSResult runs.
- Parameters:
y (str) – Dependent variable name shared by all merged results.
specs (Sequence[Sequence[str]]) – List of specifications; each inner sequence names the controls defining one spec.
estimates (array-like or pandas.DataFrame) – Coefficient estimates for each spec (rows) and bootstrap draw (columns).
p_values (array-like or pandas.DataFrame) – Corresponding p-values for each estimate.
r2_values (array-like or pandas.DataFrame) – R² values for each spec and draw.
- y_name
Name of the dependent variable.
- Type:
str
- specs_names
Each entry is the set of control variables defining a spec.
- Type:
pandas.Series[frozenset]
- estimates
Coefficient estimates by spec and draw.
- Type:
pandas.DataFrame
- p_values
P-values by spec and draw.
- Type:
pandas.DataFrame
- r2_values
R² values by spec and draw.
- Type:
pandas.DataFrame
- summary_df
Per-spec summary with median, min, max, and 95% confidence intervals.
- Type:
pandas.DataFrame
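To make the summary_df description concrete, the sketch below computes a median, min, max, and a 95% percentile interval from the bootstrap draws of a single specification. The function name summarise_spec and the keys ci_down and ci_up are illustrative assumptions, and a percentile interval is only one possible way to build the interval; the columns and method robustipy actually uses are not stated here.

```python
import statistics
from typing import Dict, Sequence


def summarise_spec(draws: Sequence[float]) -> Dict[str, float]:
    """Median, min, max and a 95% percentile interval for one specification."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(draws, n=40, method="inclusive")
    return {
        "median": statistics.median(draws),
        "min": min(draws),
        "max": max(draws),
        "ci_down": cuts[0],   # 2.5th percentile
        "ci_up": cuts[-1],    # 97.5th percentile
    }
```

Applying the function to each row of an estimates table (one row per spec, one column per draw) would give one summary row per specification.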
- merge(result_obj: OLSResult, left_prefix: str, right_prefix: str) MergedResult[source]
Merge the current OLSResult object with another, tagging each specification with a prefix to indicate origin.
- Parameters:
result_obj (OLSResult) – Another OLSResult object to merge.
left_prefix (str) – Prefix to tag the specifications from the current object.
right_prefix (str) – Prefix to tag the specifications from the result_obj object.
- Returns:
A new MergedResult object containing combined estimates and metadata.
- Return type:
- Raises:
TypeError – If result_obj is not an instance of OLSResult or prefixes are not strings.
ValueError – If the dependent variable names do not match between the two objects.
- plot(loess: bool = True, ci: float = 1, specs: List[List[str]] | None = None, colormap: str = 'viridis', figsize: Tuple[int, int] = (16, 14), ext: str = 'pdf', figpath: str = None, highlights: bool = True, project_name: str = None, oddsratio: bool = False) plt.Figure[source]
Plot specification results highlighting up to three specs.
- Parameters:
loess (bool) – Whether to apply LOESS smoothing to confidence intervals.
specs (list of list of str, optional) – Specifications to highlight.
colormap (str) – Matplotlib colormap name.
figsize (tuple) – Figure size (width, height).
ext (str) – File extension for saving.
figpath (str or Path, optional) – Directory in which to save outputs; if None, uses current working dir.
project_name (str) – Prefix for saved figure.
highlights (bool) – Whether to highlight specs.
oddsratio (bool, default=False) – Whether to exponentiate the coefficients (e.g. for odds ratios).
- Returns:
Plot showing the regression results.
- Return type:
matplotlib.figure.Figure
- class robustipy.models.OLSResult(*, y: str, x: str, data: DataFrame, specs: list[frozenset[str]], all_predictors: list[list[str]], controls: list[str], draws: int, kfold: int, estimates: ndarray | DataFrame, estimates_ystar: ndarray | DataFrame, all_b: list[ndarray], all_p: list[ndarray], p_values: ndarray | DataFrame, p_values_ystar: ndarray | DataFrame, r2_values: ndarray | DataFrame, r2i_array: list[float], ll_array: list[float], aic_array: list[float], bic_array: list[float], hqic_array: list[float], ll_raw_array: list[float] | None = None, ll_null_array: list[float] | None = None, ll_gain_array: list[float] | None = None, ll_gain_per_obs_array: list[float] | None = None, nobs_array: list[int] | None = None, av_k_metric_array: list[float] | None = None, model_name: str, name_av_k_metric: str | None = None, shap_return: Any = None)[source]
Bases: Protoresult
Encapsulates the results of an OLSRobust run.
- y_name
Dependent variable name.
- Type:
str
- x_name
Main predictor name.
- Type:
str
- data
Original DataFrame used for all fits.
- Type:
pd.DataFrame
- specs_names
Specification sets (which controls are included, etc.).
- Type:
pd.Series[frozenset[str]]
- all_predictors
List of predictor+control sets for each specification.
- Type:
list[list[str]]
- controls
Pool of all control variables considered.
- Type:
list[str]
- draws
Number of bootstrap draws.
- Type:
int
- kfold
Number of folds for out-of-sample evaluation.
- Type:
int
- estimates
Shape (n_specs, draws), bootstrap estimates of β₁.
- Type:
pd.DataFrame
- p_values
Same shape, bootstrap p-values for β₁.
- Type:
pd.DataFrame
- estimates_ystar
Bootstrap estimates under the null (for joint inference).
- Type:
pd.DataFrame
- p_values_ystar
Bootstrap p-values under the null.
- Type:
pd.DataFrame
- r2_values
Shape (n_specs, draws), bootstrapped R².
- Type:
pd.DataFrame
- summary_df
Per-spec summary (median, CI, info criteria, cross-val metric).
- Type:
pd.DataFrame
- inference
Aggregated inference statistics (proportions, Stouffer’s Z, etc.).
- Type:
dict[str, Any]
- shap_return
Optional SHAP values and the matrix they came from.
- Type:
tuple[np.ndarray, pd.DataFrame] | None
- compute_bma() DataFrame[source]
Performs Bayesian Model Averaging (BMA) using BIC-implied priors.
- Returns:
DataFrame containing BMA results with control variable inclusion probabilities and average coefficients.
- Return type:
pd.DataFrame
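A common way to turn BIC values into model weights is the approximation w_m ∝ exp(-0.5 · (BIC_m − BIC_min)) under equal prior model probabilities; a control's inclusion probability is then the summed weight of the models that contain it. The sketch below implements that textbook approximation. Whether compute_bma() uses exactly this weighting is an assumption, and the function name bic_inclusion_probs is hypothetical.

```python
import math
from typing import Dict, FrozenSet, Sequence


def bic_inclusion_probs(specs: Sequence[FrozenSet[str]],
                        bics: Sequence[float]) -> Dict[str, float]:
    """Inclusion probability per control from BIC-based model weights."""
    if len(specs) != len(bics) or not specs:
        raise ValueError("specs and bics must be non-empty and of equal length")
    best = min(bics)
    # Subtract the minimum BIC before exponentiating for numerical stability.
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    weights = [r / total for r in raw]
    controls = sorted(set().union(*specs))
    # Sum the weights of every model that contains the control.
    return {c: sum(w for s, w in zip(specs, weights) if c in s) for c in controls}
```

A control that appears only in poorly fitting (high-BIC) models receives a probability close to 0, while one present in all well-fitting models approaches 1.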
- classmethod load(filename: str) OLSResult[source]
Loads an OLSResult object from a pickle file.
- Parameters:
filename – Path to the pickle file.
- Return type:
OLSResult
- merge(result_obj: OLSResult, left_prefix: str, right_prefix: str) MergedResult[source]
Merge this OLSResult with another, tagging each spec by prefix.
- Parameters:
result_obj (OLSResult) – Another result object with the same dependent variable.
left_prefix (str) – Tag to append to this object’s specifications.
right_prefix (str) – Tag to append to the other object’s specifications.
- Returns:
merged – A new MergedResult containing all specs, estimates, p_values, and r2_values from both.
- Return type:
- Raises:
TypeError – If result_obj is not an OLSResult, or prefixes are not strings.
ValueError – If the dependent variable names do not match.
- plot(loess: bool = True, specs: List[List[str]] | None = None, ic: str = 'aic', ci: float = 1, colormap: str = 'viridis', figsize: Tuple[int, int] = (12, 6), ext: str = ' pdf', figpath=None, project_name: str = 'no_project_name', highlights: bool = True, oddsratio: bool = False) plt.Figure[source]
Plots the regression results using specified options.
- Parameters:
loess (bool, default=True) – Whether to add a LOESS smoothed trend line.
specs (list of list of str, optional) – Up to three specific model specifications to highlight.
ic ({'bic', 'aic', 'hqic'}, default='aic') – Which information criterion to display.
ci (float, default=1) – Confidence interval.
colormap (str, default='viridis') – Name of the matplotlib colormap for the plot.
figpath (str or Path, optional) – Directory in which to save outputs; if None, uses current working dir.
figsize (tuple of int, default=(12, 6)) – Figure width and height in inches.
ext (str, default='pdf') – File extension if saving the figure (unused if not saving).
project_name (str, default='no_project_name') – Project identifier used in saved filename (unused if not saving).
highlights (bool, default=True) – Whether to highlight individual plots.
oddsratio (bool, default=False) – Whether to exponentiate the coefficients (e.g. for odds ratios).
- Returns:
fig – The figure object containing the plot.
- Return type:
matplotlib.figure.Figure
- Raises:
ValueError – If ic is not one of {‘bic’, ‘aic’, ‘hqic’}, or if more than three specs are given.
TypeError – If specs is provided but is not a list of lists of str, or any spec is not in the computed specifications.
- class robustipy.models.OLSRobust(*, y: List[str], x: List[str], data: DataFrame, model_name: str = 'OLS Robust')[source]
Bases: BaseRobust
Class for multivariate regression using OLS.
- Parameters:
y (list[str]) – Dependent variable column name(s). If multiple dependent variables are supplied, OLSRobust constructs non-empty standardised outcome composites.
x (list[str]) – Predictor column name(s) included in every specification. The first element is treated as the reported focal estimand; any additional elements are fixed predictors that do not vary across the control-subset space.
data (pandas.DataFrame) – The full dataset containing y, x, and any controls.
model_name (str, default='OLS Robust') – A custom label for this model run, used in outputs and plots
- fit(*, controls: List[str], group: str | None = None, draws: int = None, kfold: int = None, oos_metric: str = None, n_cpu: int | None = None, seed: int | None = None, composite_sample: int | None = None, z_specs_sample_size: int | None = None, rescale_y: bool | None = False, rescale_x: bool | None = False, rescale_z: bool | None = False, compute_shap: bool = True, threshold: int = 1000000) OLSRobust[source]
Fit the OLS models across the specification space and over bootstrap resamples.
This method explores a variety of control variable specifications by constructing different combinations (z-specs) and optionally sampling from this space. It performs bootstrap resampling and/or cross-validation depending on the arguments provided.
- Parameters:
controls (list of str) – Candidate control variables to include in model specifications.
group (str, optional) – Column name for grouping fixed effects. If provided, outcomes are de-meaned by group.
draws (int, optional) – Number of bootstrap resamples per specification. If None, bootstrapping is skipped.
kfold (int, optional) – Number of folds for out-of-sample evaluation. Requires oos_metric to be specified.
oos_metric ({'pseudo-r2', 'rmse'}, optional) – Metric to evaluate out-of-sample performance when kfold is set.
n_cpu (int, optional) – Number of parallel processes to use. Defaults to all available CPUs minus one if None.
seed (int, optional) – Random seed for reproducibility. Propagated to all random operations.
composite_sample (int, optional) – Number of non-empty outcome-composite subsets to sample when multiple dependent variables are supplied. If None, all non-empty outcome subsets are enumerated. Ignored for a single dependent variable.
z_specs_sample_size (int, optional) – Number of z specifications to randomly sample from the full set of possible combinations. If None, the full specification space is used.
rescale_y (bool, default=False) – If True, rescale the dependent variable to have mean 0 and standard deviation 1.
rescale_x (bool, default=False) – If True, rescale the x variable(s) to have mean 0 and standard deviation 1.
rescale_z (bool, default=False) – If True, rescale the z variable(s) to have mean 0 and standard deviation 1.
compute_shap (bool, default=True) – If True, compute SHAP values for plotting. Set False for fit-only profiling.
threshold (int, default=1_000_000) – If the total number of model fits (specs × draws × folds) exceeds this number, a warning is raised.
- Returns:
self – The fitted model instance. Results are stored in the .results attribute.
- Return type:
OLSRobust
Notes
At least one of draws or kfold must be set to perform model fitting.
z_specs_sample_size samples the covariate-subset space before fitting.
composite_sample samples the outcome-composite space before fitting when len(y) > 1.
This method may be computationally intensive; parallelisation is recommended via n_cpu.
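Before launching a large run it can help to estimate the workload that the threshold parameter guards against. The sketch below assumes the full specification space is every subset of the controls (2^k specifications, empty set included) and multiplies by draws and folds as described for threshold; the function name check_workload is hypothetical and the exact count used inside fit() may differ.

```python
import warnings
from typing import Optional


def check_workload(n_controls: int, draws: Optional[int], kfold: Optional[int],
                   threshold: int = 1_000_000) -> int:
    """Estimate specs x draws x folds and warn when it exceeds threshold."""
    n_specs = 2 ** n_controls  # every subset of the controls, empty set included
    # Treat an unset draws or kfold as a single pass.
    total = n_specs * (draws or 1) * (kfold or 1)
    if total > threshold:
        warnings.warn(f"{total:,} model fits exceed threshold {threshold:,}")
    return total
```

For example, 12 controls with 500 draws and 5 folds gives 4,096 × 500 × 5 = 10,240,000 fits, which is where sampling the space via z_specs_sample_size becomes attractive.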
- get_results() OLSResult[source]
Return the OLSResult object once .fit() has been called.
- Returns:
The result object encapsulating all analysis outputs.
- Return type:
OLSResult
- sample_z_specs(controls, z_specs_sample_size)[source]
Generate a sample of z specifications by randomly selecting subsets of control variables.
This method creates a set of binary masks to determine which subsets of the given controls should be included in each z specification. The number of subsets sampled is controlled by z_specs_sample_size.
- Parameters:
controls (list of str) – A list of control variable names from which to build z specifications.
z_specs_sample_size (int) – The number of z specifications to sample. Must be a positive integer.
- Returns:
space_n_sample (int) – The number of sampled z specifications.
z_specs_sample (list of tuple of str) – A list where each element is a tuple of selected control variable names representing one sampled z specification.
Notes
The method uses sample_z_masks to generate binary inclusion masks.
If self.seed is defined, it is passed to sample_z_masks to ensure reproducibility.
Each sampled mask determines a unique subset of controls.
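The mask-based sampling described in the notes can be sketched with integer bit masks: each integer below 2^k encodes one subset of k controls. The helper below is a standalone illustration, not the library's sample_z_masks; its name, its sampling without replacement, and its cap at the size of the full space are assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_control_subsets(controls: Sequence[str], sample_size: int,
                           seed: Optional[int] = None) -> List[Tuple[str, ...]]:
    """Sample distinct subsets of controls via random binary inclusion masks."""
    if sample_size < 1:
        raise ValueError("sample_size must be a positive integer")
    rng = random.Random(seed)
    n_masks = 2 ** len(controls)
    k = min(sample_size, n_masks)  # cannot draw more unique subsets than exist
    # Each integer in [0, 2**len(controls)) is a mask: bit i set means
    # controls[i] is included in that specification.
    masks = rng.sample(range(n_masks), k)
    return [
        tuple(c for i, c in enumerate(controls) if (mask >> i) & 1)
        for mask in masks
    ]
```

Sampling masks without replacement guarantees that every returned subset is unique, and fixing seed makes the draw reproducible.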
- robustipy.models.stouffer_method(p_values, *, two_sided=True, betas=None, p_values_ystar=None, betas_ystar=None, weights=None, clip_floor=1e-300, na_action='omit', warn: bool = True)[source]
Combine p-values via a Stouffer test aligned with OLSResult._compute_inference.
Uses observed full-sample p-values and coefficients to compute Z_obs. If null draws (p_values_ystar and betas_ystar) are supplied, the p-value is calibrated by two-sided Monte Carlo: p = (1 + sum(abs(Z_null) >= abs(Z_obs))) / (B + 1), where B is the number of null draws. Dependence is estimated from null z-vectors (PSD projection + ridge). If null draws are unavailable, the method falls back to an asymptotic two-sided p-value based on Z_obs.
- Returns:
(Z_obs, p_value). For two_sided=True this is a two-sided p-value.
- Return type:
tuple[float, float]
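The two ingredients described above, a combined Z statistic and a Monte Carlo calibrated p-value, can be sketched in a few lines. This is a simplified, unweighted illustration: it signs each z-score by its coefficient and ignores the dependence correction (PSD projection + ridge) that the documented function applies. The helper names stouffer_z and monte_carlo_p are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence

_STD_NORMAL = NormalDist()


def stouffer_z(p_values: Sequence[float], betas: Sequence[float],
               clip_floor: float = 1e-300) -> float:
    """Signed, unweighted Stouffer Z from two-sided p-values and coefficient signs."""
    zs = []
    for p, b in zip(p_values, betas):
        p = min(max(p, clip_floor), 1.0)  # keep p inside (0, 1]
        # Two-sided p to a non-negative z-score, via the lower tail for stability.
        z = -_STD_NORMAL.inv_cdf(p / 2.0)
        zs.append(math.copysign(z, b))
    return sum(zs) / math.sqrt(len(zs))


def monte_carlo_p(z_obs: float, z_null: Sequence[float]) -> float:
    """p = (1 + #{|Z_null| >= |Z_obs|}) / (B + 1), with B null draws."""
    exceed = sum(abs(z) >= abs(z_obs) for z in z_null)
    return (1 + exceed) / (len(z_null) + 1)
```

The +1 in numerator and denominator keeps the calibrated p-value strictly positive even when no null draw is as extreme as the observed statistic.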
robustipy.prototypes module
- class robustipy.prototypes.BaseRobust(*, y: list[str], x: list[str], data: DataFrame, model_name: str = 'BaseRobust')[source]
Bases: Protomodel
Base class for robust model estimation, including OLS and logistic.
Provides shared validation, bootstrapping, cross-validation, and composite outcome support.
- y
Dependent variable column names.
- Type:
list of str
- x
Independent variable column names.
- Type:
list of str
- data
Input dataset containing variables in y, x, controls.
- Type:
pandas.DataFrame
- model_name
Custom label for the model run.
- Type:
str
- results
Fitted result object populated after fit().
- Type:
object
- parameters
Stores initialization parameters and any derived settings.
- Type:
dict
- fit(*, controls: List[str], group: str | None = None, draws: int = 500, kfold: int = 5, oos_metric: str = 'r-squared', n_cpu: int | None = None, seed: int | None = None) None[source]
Abstract fit method; must be overridden by subclasses.
- Parameters:
controls (List[str]) – Optional control variable names to include in specifications.
group (str, optional) – Column name for grouping (fixed effects) variable.
draws (int, default=500) – Number of bootstrap draws.
kfold (int, default=5) – Number of cross-validation folds.
oos_metric (str, default='r-squared') – Out-of-sample metric (‘r-squared’, ‘rmse’, etc.).
n_cpu (int, optional) – Number of CPU cores for parallel computation.
seed (int, optional) – Random seed for reproducibility.
- Raises:
NotImplementedError – Always, since this method must be implemented by subclasses.
- multiple_y() None[source]
- Build the lists
self.y_composites – pandas Series, one per composite Y
self.y_specs – tuple[str], names that form that composite
If self.composite_sample is a positive int, draw that many random non-empty subsets of the raw Y columns before we create any Series. Otherwise enumerate all non-empty subsets (original behaviour).
robustipy.utils module
- class robustipy.utils.IntegerRangeValidator(min_value, max_value)[source]
Bases: object
Validator that checks if an input value is an integer within a specified range.
- Parameters:
min_value (int) – The minimum allowed integer value (inclusive).
max_value (int) – The maximum allowed integer value (inclusive).
- Raises:
ValidationError – If the input is not an integer or is outside the specified range.
- Usage:
validator = IntegerRangeValidator(1, 10)
validator(_, current_value)  # Returns True if valid, raises ValidationError otherwise.
- exception robustipy.utils.ValidationError(*args, reason: str = '')[source]
Bases: Exception
Fallback so IntegerRangeValidator can raise a typed error safely.
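A callable validator paired with a typed error might look like the sketch below. The class names are deliberately different from the library's, the meaning of the first call argument (ignored here) is an assumption based on the usage line, and the error messages are invented for illustration.

```python
class SketchValidationError(Exception):
    """Typed error carrying a human-readable reason (hypothetical stand-in)."""

    def __init__(self, *args, reason: str = ""):
        super().__init__(*args)
        self.reason = reason


class RangeValidatorSketch:
    """Callable that accepts integers in [min_value, max_value], both inclusive."""

    def __init__(self, min_value: int, max_value: int):
        self.min_value = min_value
        self.max_value = max_value

    def __call__(self, _answers, current):
        # Parse via str() so that floats such as 3.5 and booleans are rejected.
        try:
            value = int(str(current).strip())
        except ValueError:
            raise SketchValidationError(current, reason="not an integer") from None
        if not self.min_value <= value <= self.max_value:
            raise SketchValidationError(
                current, reason=f"must be between {self.min_value} and {self.max_value}"
            )
        return True
```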
- robustipy.utils.all_subsets(ss)[source]
Generate all subsets of a given iterable.
- Parameters:
ss (iterable) – Input iterable.
- Returns:
A chain object containing all subsets of the input iterable.
- Return type:
itertools.chain
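One common way to build such a chain uses itertools.combinations for every subset size. The sketch below assumes the empty subset is included, which the documentation does not state; the function name is hypothetical.

```python
from itertools import chain, combinations


def all_subsets_sketch(ss):
    """Lazily yield every subset of ss as a tuple, from size 0 up to len(ss)."""
    items = list(ss)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


subsets = list(all_subsets_sketch(["a", "b", "c"]))
```

For n items this produces 2**n tuples, which is the power-set size that space_size reports.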
- robustipy.utils.calculate_imv_score(y_true, y_enhanced, null_mean=None)[source]
Calculate the InterModel Vigorish (IMV) score.
- Parameters:
y_true (array-like) – Binary validation labels.
y_enhanced (array-like) – Predicted probabilities from the fitted/enhanced model on the validation fold.
null_mean (float, optional) – Constant null-model probability. If provided, this should usually be the training-fold prevalence. If omitted, the validation-fold prevalence is used for backward compatibility.
- Returns:
Relative improvement of the enhanced model over the null model in IMV space.
- Return type:
float
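For orientation, the sketch below follows the published definition of the IMV as the author of this example understands it: each set of predictions is mapped to the weight w of an equally informative weighted coin, and the score is (w_enhanced - w_null) / w_null. It is an assumption that the library computes the score this way; the helper names, the clipping constant, and the bisection solver are illustrative choices.

```python
import math


def _mean_log_lik(y_true, probs, eps=1e-12):
    """Average Bernoulli log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(y_true)


def _coin_weight(log_a):
    """Solve w*log(w) + (1-w)*log(1-w) = log_a for w in [0.5, 1) by bisection."""
    def f(w):
        return w * math.log(w) + (1.0 - w) * math.log(1.0 - w)

    lo, hi = 0.5, 1.0 - 1e-12
    if log_a <= f(lo):   # no better than a fair coin
        return lo
    if log_a >= f(hi):   # essentially perfect predictions
        return hi
    for _ in range(100):  # f is increasing on [0.5, 1)
        mid = (lo + hi) / 2.0
        if f(mid) < log_a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def imv_sketch(y_true, y_enhanced, null_mean=None):
    """Relative gain in coin weight of the enhanced model over a constant null model."""
    if null_mean is None:
        null_mean = sum(y_true) / len(y_true)  # validation-fold prevalence
    w_null = _coin_weight(_mean_log_lik(y_true, [null_mean] * len(y_true)))
    w_enh = _coin_weight(_mean_log_lik(y_true, y_enhanced))
    return (w_enh - w_null) / w_null
```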
- robustipy.utils.concat_results(objs: List[OLSResult], de_dupe=True) OLSResult[source]
Concatenate multiple OLSResult objects into a single result object.
Per-specification fields (for example estimates, p-values, and spec names) are concatenated in lockstep. When de_dupe=True, exact duplicates in (y_name, x_name, spec) triplets are removed across all per-spec fields.
- robustipy.utils.decorator_timer(func: callable) callable[source]
Decorator to time function execution.
- Parameters:
func (callable) – Function to wrap.
- Returns:
Wrapped function returning (result, elapsed_seconds).
- Return type:
callable
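A timing decorator with this return contract can be sketched as below. The function name and the choice of time.perf_counter as the clock are assumptions.

```python
import time
from functools import wraps


def timer_sketch(func):
    """Wrap func so that calls return (result, elapsed_seconds)."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed

    return wrapper


@timer_sketch
def add(a, b):
    return a + b


total, seconds = add(2, 3)
```

Callers must unpack the tuple, because the wrapped function no longer returns its bare result.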
- robustipy.utils.get_colormap_colors(num_colors: int = 3, colormap: str | object = 'viridis') List[str][source]
Return num_colors evenly spaced colors from a Matplotlib colormap.
- Parameters:
num_colors (int, optional) – Number of colors to return. Must be >= 1. Defaults to 3.
colormap (str or matplotlib.colors.Colormap, optional) – Colormap name or object to sample from. Defaults to ‘viridis’.
- Returns:
A list of exactly num_colors hexadecimal color strings.
- Return type:
List[str]
- Raises:
TypeError – If num_colors is not an integer.
ValueError – If num_colors < 1.
- robustipy.utils.get_colors(specs: List[List[str]], color_set_name: str | None = 'Set1') List[Tuple[float, float, float, float]][source]
Generate a palette of colors for a list of specifications using a categorical colormap.
- Parameters:
specs (list of list of str) – Each inner list represents one specification (set of variable names).
color_set_name (str, optional) – Name of a Matplotlib qualitative colormap (default ‘Set1’).
- Returns:
A list of RGBA tuples, one per specification.
- Return type:
List[Tuple[float, float, float, float]]
- Raises:
ValueError – If specs is not a list of lists.
- robustipy.utils.get_selection_key(specs: List[List[str]]) List[frozenset][source]
Convert list of spec lists into list of frozensets.
- Parameters:
specs (list of list of str) – Each inner list is one specification.
- Returns:
Immutable keys for each specification.
- Return type:
list of frozenset
- Raises:
ValueError – If specs is not list of lists.
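The conversion can be sketched in a few lines. The function name and the error message are hypothetical.

```python
from typing import FrozenSet, List


def selection_key_sketch(specs: List[List[str]]) -> List[FrozenSet[str]]:
    """Turn each specification into an order-insensitive, hashable key."""
    if not isinstance(specs, list) or not all(isinstance(s, list) for s in specs):
        raise ValueError("specs must be a list of lists")
    return [frozenset(s) for s in specs]


keys = selection_key_sketch([["age", "income"], ["income", "age"], []])
```

Because frozensets ignore order, the first two keys compare equal, which makes them convenient for matching a specification regardless of how its controls were listed.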
- robustipy.utils.group_demean(x: DataFrame, group: str | None = None) DataFrame[source]
Demean the input data within groups.
- Parameters:
x (pd.DataFrame) – Input DataFrame.
group (str, optional) – Column name for grouping. Default is None.
- Returns:
Demeaned DataFrame.
- Return type:
pd.DataFrame
- robustipy.utils.is_interactive() bool[source]
- Return True if either:
we are inside a Jupyter notebook/lab, OR
we are running from a real terminal (both stdin and stdout are TTYs).
- robustipy.utils.join_sig_test(*, results_target, results_shuffled, sig_level, positive)[source]
Calculate joint significance test for the entire specification curve.
- Parameters:
- Returns:
Estimated p-value for the joint significance test.
- Return type:
float
- robustipy.utils.logistic_regression_sm(y, x) dict[source]
Perform logistic regression based on statsmodels.Logit.
- Parameters:
y (array-like) – Dependent variable values.
x (array-like) – Independent variable values. The matrix should be shaped as (number of observations, number of independent variables).
- Returns:
Dictionary containing regression results, including coefficients, p-values, log-likelihood, AIC, BIC, and HQIC.
- Return type:
dict
- robustipy.utils.logistic_regression_sm_stripped(y, x) dict[source]
Perform logistic regression using statsmodels with stripped output.
- Parameters:
y (array-like) – Dependent variable values.
x (array-like) – Independent variable values. The matrix should be shaped as (number of observations, number of independent variables).
- Returns:
A dictionary containing regression coefficients (‘b’) and corresponding p-values (‘p’) for each independent variable.
- Return type:
dict
- robustipy.utils.make_inquiry(model_name, y, data, draws, kfolds, oos_metric, n_cpu, seed)[source]
Prompt the user for missing inputs if in an interactive environment; otherwise, silently fall back to default values.
- Returns:
(draws, kfolds, oos_metric, n_cpu, seed)
- Return type:
tuple[int, int, str, int, int]
- robustipy.utils.mcfadden_r2(y_true, y_prob, insample_mean)[source]
Compute McFadden’s pseudo R-squared for logistic regression.
- robustipy.utils.prepare_asc(asc_path: str) Tuple[str, List[str], List[str], str, DataFrame][source]
Load and preprocess the ASC example dataset for illustration.
- Parameters:
asc_path (str) – Path to the Stata (.dta) file containing ASC data.
- Returns:
y (str): Dependent variable name. x (List[str]): Continuous predictor names. c (List[str]): Control variable names. group (str): Grouping variable name (‘pidp’). ASC_df (pd.DataFrame): Cleaned DataFrame.
- Return type:
tuple
- robustipy.utils.prepare_union(path_to_union: str) Tuple[str, List[str], str, DataFrame][source]
Load and preprocess the classic union dataset for example analyses.
- Parameters:
path_to_union (str) – Path to the Stata (.dta) file containing union data.
- Returns:
y (str): Dependent variable name (‘log_wage’). c (List[str]): Control variable names. x (str): Treatment variable name (‘union’). final_data (pd.DataFrame): Cleaned DataFrame ready for modeling.
- Return type:
tuple
- Raises:
FileNotFoundError – If the specified file does not exist.
- robustipy.utils.pseudo_r2(y_true: Sequence, y_pred: Sequence, mean_y_train: float) float[source]
Compute the pseudo-R² (1 - MSE_model / MSE_null), coercing inputs to floats.
- Parameters:
y_pred (Sequence) – Model predictions (can be list/array of floats or strings convertible to float).
y_true (Sequence) – True target values (same length as y_pred).
mean_y_train (float) – The baseline prediction (e.g. the training‐set mean of y).
- Returns:
Pseudo‐R² = 1 - (MSE_model / MSE_null).
- Return type:
float
- Raises:
ValueError – If the input lengths differ, if conversion to float fails, or if MSE_null is zero (which would cause a division by zero).
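The formula 1 - (MSE_model / MSE_null) can be written out directly. The sketch below is dependency-free; the function name and error messages are hypothetical, and the empty-input check is an added assumption.

```python
from typing import Sequence


def pseudo_r2_sketch(y_true: Sequence, y_pred: Sequence, mean_y_train: float) -> float:
    """1 - MSE_model / MSE_null, where the null model always predicts mean_y_train."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # float() raises ValueError for strings that are not numeric.
    truth = [float(v) for v in y_true]
    preds = [float(v) for v in y_pred]
    n = len(truth)
    mse_model = sum((t - p) ** 2 for t, p in zip(truth, preds)) / n
    mse_null = sum((t - float(mean_y_train)) ** 2 for t in truth) / n
    if mse_null == 0:
        raise ValueError("MSE_null is zero; pseudo-R2 is undefined")
    return 1.0 - mse_model / mse_null
```

A model no better than the training mean scores 0, perfect predictions score 1, and a model worse than the baseline scores below 0.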
- robustipy.utils.rescale(variable)[source]
Rescales the input variable to have zero mean and unit standard deviation.
- Parameters:
variable (array-like) – Input data to be rescaled. Can be a list, NumPy array, or similar structure.
- Returns:
out – The rescaled array with mean 0 and standard deviation 1. NaN values are ignored in the computation of mean and standard deviation.
- Return type:
ndarray
Notes
This function uses np.nanmean and np.nanstd to ignore NaN values during scaling.
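The NaN-aware scaling can be illustrated without NumPy. The sketch below uses the population standard deviation, which matches the default of np.nanstd (ddof=0), and returns a plain list rather than an ndarray; the function name is hypothetical.

```python
import math


def rescale_sketch(values):
    """Z-score values, ignoring NaNs when estimating the mean and standard deviation."""
    xs = [float(v) for v in values]
    valid = [x for x in xs if not math.isnan(x)]
    mean = sum(valid) / len(valid)
    # Population standard deviation (divides by n, not n - 1).
    std = math.sqrt(sum((x - mean) ** 2 for x in valid) / len(valid))
    # NaN inputs stay NaN in the output because NaN propagates through arithmetic.
    return [(x - mean) / std for x in xs]


scaled = rescale_sketch([1.0, 2.0, float("nan"), 3.0])
```

This sketch does not guard against an all-NaN or constant input, where the mean or the division would fail.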
- robustipy.utils.reservoir_sampling(generator: Iterable, k: int) List[source]
Uniformly sample k items from a streaming generator (reservoir sampling).
- Parameters:
generator (Iterable) – An iterator or generator yielding items.
k (int) – Number of samples to retain.
- Returns:
A list of k sampled items.
- Return type:
List
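Reservoir sampling is classically implemented with Algorithm R, sketched below. Whether the library uses this exact variant is not stated; the function name and the seed parameter are additions for illustration.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample_sketch(stream: Iterable, k: int, seed: Optional[int] = None) -> List:
    """Algorithm R: keep a uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample_sketch((n * n for n in range(1000)), k=5, seed=0)
```

The stream is consumed once and only k items are held in memory, so the approach works for generators too large to materialise. If the stream has fewer than k items, all of them are returned.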
- robustipy.utils.sample_y_masks(n_y: int, n_masks: int, seed: int | None = None) List[int][source]
Uniformly sample n_masks bit-masks from the non-empty power-set of n_y items without enumerating the 2^n_y possibilities.
- Returns:
Each mask is an int whose binary representation tells which outcomes enter the composite.
- Return type:
list[int]
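Sampling masks without enumerating the power set can be sketched by drawing random integers in the range 1 to 2^n_y - 1, since every such integer encodes exactly one non-empty subset. The sketch assumes the masks should be distinct and uses rejection of duplicates to achieve that; the function name is hypothetical and the library's sampling scheme may differ.

```python
import random
from typing import List, Optional


def sample_nonempty_masks_sketch(n_items: int, n_masks: int, seed: Optional[int] = None) -> List[int]:
    """Draw n_masks distinct integers from 1 .. 2**n_items - 1, each encoding a non-empty subset."""
    total = (1 << n_items) - 1  # number of non-empty subsets
    if n_masks > total:
        raise ValueError("cannot draw more distinct masks than non-empty subsets")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_masks:
        # Each draw is uniform over all non-empty masks; duplicates are simply redrawn.
        chosen.add(rng.randint(1, total))
    return sorted(chosen)


masks = sample_nonempty_masks_sketch(n_items=20, n_masks=4, seed=42)
```

Memory use depends on n_masks rather than on 2^n_items, which is the point of avoiding enumeration. Rejection stays cheap as long as n_masks is small relative to the number of subsets.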
- robustipy.utils.sample_z_masks(n_z: int, n_masks: int, seed: int | None = None) List[int][source]
Uniformly sample n_masks bit-masks from the power-set of n_z items without enumerating the 2^n_z possibilities.
- Returns:
Each mask is an int whose binary representation tells which controls enter the specification.
- Return type:
list[int]
- robustipy.utils.simple_ols(y, x) dict[source]
Perform simple ordinary least squares regression.
- Parameters:
y (array-like) – Dependent variable.
x (array-like) – Independent variables.
- Returns:
Dictionary containing regression results, including coefficients, p-values, log-likelihood, AIC, BIC, and HQIC.
- Return type:
dict
- robustipy.utils.space_size(iterable) int[source]
Calculate the size of the power set of the given iterable.
- Parameters:
iterable (iterable) – Input iterable.
- Returns:
Size of the power set of the input iterable.
- Return type:
int
- robustipy.utils.stripped_ols(y, x, add_const: bool = True) dict[source]
Perform Ordinary Least Squares (OLS) regression analysis with stripped output.
- Parameters:
y (array-like) – Dependent variable values.
x (array-like) – Independent variable values. The matrix should be shaped as (number of observations, number of independent variables).
add_const (bool, default True) – Whether to add a constant column for the intercept term. Set to False when using group-demeaned data (fixed effects), where the intercept is already absorbed by the demeaning.
- Returns:
A dictionary containing regression coefficients (‘b’) and corresponding p-values (‘p’) for each independent variable.
- Return type:
dict
- Raises:
ValueError – If inputs x or y are empty.
Notes
Missing values in x or y are not handled, and the function may produce unexpected results if there are missing values in the input data.
By default (add_const=True) the function adds a constant column to the independent variables matrix x to represent the intercept term in the regression equation; pass add_const=False to skip this.
Module contents
robustipy package initialization.
This module intentionally avoids global warning-hook side effects at import time. If compact robustipy-only warning formatting is desired, call enable_compact_warnings() explicitly.