Hypothesis Testing

Hypothesis Testing#

Test Statistics#

Test statistics for hypothesis testing.

class everwillow.hypotest.test_statistics.TestStatistic[source]

Bases: Module

Abstract base class for test statistics.

Test statistics compute likelihood ratios and store relevant data in the TestStatResult.extras dict. The statistical interpretation (p-values) is handled separately by Distribution classes.

Subclasses must implement:

_compute: Compute the core test statistic formula.

compute(nll_fn, params, observation, poi_key, poi_test, **fit_kwargs)[source]

Compute the test statistic.

Parameters:

nll_fn (Callable[[PyTree, PyTree], float]) – Negative log-likelihood function taking (params, observation).
params (State) – Initial parameter state.
observation (PyTree) – Observed data passed to nll_fn.
poi_key (str | tuple[str | int, ...]) – Canonical key for the parameter of interest, e.g. “mu”.
poi_test (float) – Test value for the POI.
**fit_kwargs (Any) – Additional arguments passed to fit().

Returns:

TestStatResult with value, test, q_asimov, and extras.

Return type:

TestStatResult

class everwillow.hypotest.test_statistics.CowanTestStatistic(mu_asimov=0.0)[source]

Bases: TestStatistic

Cowan test statistics with Asimov-based variance estimation.

There are five Cowan test statistics:

TMu: used for two-sided confidence intervals
TMuTilde: used for confidence intervals with a positive signal (Feldman-Cousins)
Q0: used for discovery tests (rejecting the \(\mu = 0\) hypothesis)
QMu: used for exclusion of a non-zero signal hypothesis
QMuTilde: used for exclusion of a non-zero signal hypothesis with a positive signal

This subclass extends TestStatistic with methods to efficiently compute the variance of \(\hat{\mu}\) using the Asimov dataset, as described in the Cowan paper.

Asimov data can be provided in two ways:

asimov_observation: pre-computed Asimov dataset.
predict_fn: generate Asimov at mu_asimov (default depends on the test statistic; override via the mu_asimov kwarg).

If neither is provided, q_asimov will be None. This can cause p-value computations for test statistics that require q_asimov to fail.

mu_asimov

Default POI value for Asimov dataset generation.

Type:: float

compute(nll_fn, params, observation, poi_key, poi_test, *, asimov_observation=None, predict_fn=None, mu_asimov=None, **fit_kwargs)[source]

Compute the test statistic.

Parameters:

nll_fn (Callable[[PyTree, PyTree], float]) – Negative log-likelihood function taking (params, observation).
params (State) – Initial parameter state.
observation (PyTree) – Observed data passed to nll_fn.
poi_key (str | tuple[str | int, ...]) – Canonical key for the parameter of interest, e.g. “mu”.
poi_test (float) – Test value for the POI.
asimov_observation (PyTree | None) – Pre-computed Asimov dataset.
predict_fn (Callable[[State], PyTree] | None) – Function to generate expected observation from parameters.
mu_asimov (float | None) – POI value at which to generate the Asimov dataset. Defaults to self.mu_asimov.
**fit_kwargs (Any) – Additional arguments passed to fit().

Returns:

TestStatResult with value, test, q_asimov, and extras.

Return type:

TestStatResult

class everwillow.hypotest.test_statistics.QTilde(mu_asimov=0.0)[source]

Bases: CowanTestStatistic

Modified profile likelihood ratio for upper limits (Eq. 16).

The test statistic is:

\[\begin{split}\tilde{q}_\mu = \begin{cases} -2 \ln \frac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(0, \hat{\hat{\theta}}(0))} & \text{if } \hat{\mu} < 0 \\ -2 \ln \frac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(\hat{\mu}, \hat{\theta})} & \text{if } 0 \leq \hat{\mu} \leq \mu \\ 0 & \text{if } \hat{\mu} > \mu \end{cases}\end{split}\]

where \(\hat{\hat{\theta}}(\mu)\) is the conditional MLE of the nuisance parameters given \(\mu\), and \(\hat{\mu}, \hat{\theta}\) are the unconditional MLEs. The boundary at \(\hat{\mu} > \mu\) protects against excluding signal when there is an upward fluctuation.

This is the standard test statistic for CLs upper limit calculations.

class everwillow.hypotest.test_statistics.QMu(mu_asimov=0.0)[source]

Bases: CowanTestStatistic

Profile likelihood ratio for upper limits (Eq. 14).

The test statistic is:

\[\begin{split}q_\mu = \begin{cases} -2 \ln \lambda(\mu) & \text{if } \hat{\mu} \leq \mu \\ 0 & \text{if } \hat{\mu} > \mu \end{cases}\end{split}\]

where \(\lambda(\mu) = L(\mu, \hat{\hat{\theta}}(\mu)) / L(\hat{\mu}, \hat{\theta})\) is the profile likelihood ratio. The boundary at \(\hat{\mu} > \mu\) protects against excluding signal when there is an upward fluctuation.

class everwillow.hypotest.test_statistics.Q0(mu_asimov=1.0)[source]

Bases: CowanTestStatistic

Discovery test statistic for testing \(\mu = 0\) (Eq. 12).

The test statistic is:

\[\begin{split}q_0 = \begin{cases} -2 \ln \lambda(0) & \text{if } \hat{\mu} \geq 0 \\ 0 & \text{if } \hat{\mu} < 0 \end{cases}\end{split}\]

where \(\lambda(0) = L(0, \hat{\hat{\theta}}(0)) / L(\hat{\mu}, \hat{\theta})\) is the profile likelihood ratio evaluated at \(\mu = 0\). The boundary at \(\hat{\mu} < 0\) prevents “discovery” of negative signal.

mu_asimov

Default POI value for Asimov generation. Defaults to 1.0 (signal hypothesis).

Type:: float

compute(nll_fn, params, observation, poi_key, poi_test, *, asimov_observation=None, predict_fn=None, mu_asimov=None, **fit_kwargs)[source]

Compute q_0 discovery test statistic.

Note

The poi_test argument is ignored; Q0 always tests μ=0 by design.

Return type:: TestStatResult

class everwillow.hypotest.test_statistics.TMu(mu_asimov=0.0)[source]

Bases: CowanTestStatistic

Profile likelihood ratio for two-sided confidence intervals (Eq. 8).

The test statistic is:

\[t_\mu = -2 \ln \lambda(\mu)\]

where \(\lambda(\mu) = L(\mu, \hat{\hat{\theta}}(\mu)) / L(\hat{\mu}, \hat{\theta})\) is the profile likelihood ratio. No boundary is applied; \(t_\mu\) can take any non-negative value regardless of \(\hat{\mu}\).

Distributions#

Distributions for converting test statistics to p-values.

class everwillow.hypotest.distributions.Distribution[source]

Bases: Module

Abstract base for test statistic distributions.

Subclasses must implement:

null_pval: p-value under null hypothesis (\(\mu'= \mu\) where \(\mu\) is the hypothesis being tested).
alt_pval: p-value under an alternative hypothesis (\(\mu'=0\) for exclusion, \(\mu'=1\) for discovery).

abstractmethod null_pval(result)[source]

p-value under the null hypothesis (\(\mu' = \mu\)).

Parameters:: result (TestStatResult) – Test statistic result.
Returns:: Null p-value, or None if required data (e.g. q_asimov) is missing.
Return type:: Array | None

abstractmethod alt_pval(result)[source]

p-value under an alternative hypothesis.

Parameters:: result (TestStatResult) – Test statistic result.
Returns:: Alternative p-value, or None if required data (e.g. q_asimov) is missing.
Return type:: Array | None

null_significance(result)[source]

Significance under the null hypothesis: \(Z = \Phi^{-1}(1 - p_\text{null})\).

Parameters:: result (TestStatResult) – Test statistic result.
Returns:: Significance Z, or None if pnull is None.
Return type:: Array | None

alt_significance(result)[source]

Significance under the alternative hypothesis: \(Z = \Phi^{-1}(1 - p_\text{alt})\).

Parameters:: result (TestStatResult) – Test statistic result.
Returns:: Significance Z, or None if palt is None.
Return type:: Array | None

expected_pvalues(result)[source]

Compute expected p-values at standard sigma bands.

Parameters:: result (TestStatResult) – Test statistic result.
Returns:: ExpectedBands with (pnull, palt) at each sigma level.
Raises:: NotImplementedError – If the distribution does not support expected p-value computation.
Return type:: ExpectedBands | None

class everwillow.hypotest.distributions.TMuAsymptotic[source]

Bases: Distribution

Asymptotic distribution for \(t_\mu\) (two-sided, Eq. 38).

Used with the \(t_\mu\) test statistic for two-sided confidence intervals.

cdf(q, mu, mu_prime, sigma)[source]

CDF: \(F(t_\mu \mid \mu') = \Phi(\sqrt{t} + \frac{\mu-\mu'}{\sigma}) + \Phi(\sqrt{t} - \frac{\mu-\mu'}{\sigma}) - 1\).

Return type:: Array

null_pval(result)[source]

Null p-value: \(p = 2(1 - \Phi(\sqrt{t_\mu}))\). No \(\sigma\) needed.

Return type:: Array

alt_pval(result)[source]

Alt p-value: \(p = 2 - \Phi(\sqrt{t} + \sqrt{q_A}) - \Phi(\sqrt{t} - \sqrt{q_A})\).

\(q_A = \mu^2/\sigma^2\) (Asimov under \(\mu'=0\)), so \(\sqrt{q_A} = \mu/\sigma = (\mu-\mu')/\sigma\).

Return type:: Array | None

class everwillow.hypotest.distributions.TMuTildeAsymptotic[source]

Bases: Distribution

Asymptotic distribution for \(\tilde{t}_\mu\) (two-sided with physical bound, Eq. 40/44).

Used with the \(\tilde{t}_\mu\) test statistic for two-sided tests with the physical constraint \(\mu \geq 0\). The CDF has a piecewise structure with the \(\Phi + \Phi - 1\) form in both regions (Eq. 44).

cdf(q, mu, mu_prime, sigma)[source]

CDF: \(F(\tilde{t}_\mu \mid \mu')\) — piecewise at threshold \(\mu^2/\sigma^2\).

Return type:: Array

null_pval(result)[source]

Null p-value (\(\mu' = \mu\)), where \(q_A = \mu^2/\sigma^2\).

\[\begin{split}p_{\mu'=\mu} = \begin{cases} 2\bigl(1 - \Phi(\sqrt{\tilde{t}})\bigr) & \text{if } \tilde{t} \leq q_A \\ 2 - \Phi(\sqrt{\tilde{t}}) - \Phi\!\left(\frac{\tilde{t} + q_A}{2\sqrt{q_A}}\right) & \text{if } \tilde{t} > q_A \end{cases}\end{split}\]

Return type:: Array | None

alt_pval(result)[source]

Alt p-value (\(\mu' = 0\)), where \(q_A = \mu^2/\sigma^2\).

\[\begin{split}p_{\mu'=0} = \begin{cases} 2 - \Phi(\sqrt{\tilde{t}} + \sqrt{q_A}) - \Phi(\sqrt{\tilde{t}} - \sqrt{q_A}) & \text{if } \tilde{t} \leq q_A \\ 2 - \Phi(\sqrt{\tilde{t}} + \sqrt{q_A}) - \Phi\!\left(\frac{\tilde{t} - q_A}{2\sqrt{q_A}}\right) & \text{if } \tilde{t} > q_A \end{cases}\end{split}\]

Return type:: Array | None

class everwillow.hypotest.distributions.Q0Asymptotic[source]

Bases: Distribution

Asymptotic distribution for \(q_0\) (discovery, Eq. 49).

Used with the \(q_0\) test statistic for discovery significance.

cdf(q, mu, mu_prime, sigma)[source]

CDF: \(F(q_0 \mid \mu') = \Phi(\sqrt{q_0} - \mu'/\sigma)\).

Return type:: Array

null_pval(result)[source]

Null p-value: \(p = 1 - \Phi(\sqrt{q_0})\). No \(\sigma\) needed.

Return type:: Array

alt_pval(result)[source]

Alt p-value: \(p = 1 - \Phi(\sqrt{q_0} - \sqrt{q_A})\).

\(q_A = \mu_\text{asimov}^2/\sigma^2\) (Asimov under signal), so \(\sqrt{q_A} = \mu_\text{asimov}/\sigma\).

Return type:: Array | None

expected_pvalues(result)[source]

Expected p-values at \(\pm N\sigma\) fluctuations under signal hypothesis.

\(q_A = \mu_\text{asimov}^2/\sigma^2\) (Asimov under signal), so \(\sqrt{q_A} = \mu_\text{asimov}/\sigma\). \(q = \max(0, \sqrt{q_A} + N)^2\). Upward fluctuations (\(+N\)) increase discovery significance, opposite to exclusion tests.

Parameters:: result (TestStatResult) – Must contain q_asimov for \(\sqrt{q_A}\).
Returns:: ExpectedBands with (pnull, palt) at each sigma level, or None if q_asimov is missing.
Return type:: ExpectedBands | None

class everwillow.hypotest.distributions.QMuAsymptotic[source]

Bases: Distribution

Asymptotic distribution for \(q_\mu\) (upper limit, Eq. 57).

Used with the \(q_\mu\) test statistic for upper limit calculations.

cdf(q, mu, mu_prime, sigma)[source]

CDF: \(F(q_\mu \mid \mu') = \Phi(\sqrt{q_\mu} - (\mu - \mu')/\sigma)\).

Return type:: Array

null_pval(result)[source]

Null p-value: \(p = 1 - \Phi(\sqrt{q_\mu})\). No \(\sigma\) needed.

Return type:: Array

alt_pval(result)[source]

Alt p-value: \(p = 1 - \Phi(\sqrt{q_\mu} - \sqrt{q_A})\).

\(q_A = \mu^2/\sigma^2\) (Asimov under \(\mu'=0\)), so \(\sqrt{q_A} = \mu/\sigma\).

Return type:: Array | None

expected_pvalues(result)[source]

Expected p-values at \(\pm N\sigma\) fluctuations under background-only.

\(q_A = \mu^2/\sigma^2\) (Asimov under \(\mu'=0\)), so \(\sqrt{q_A} = \mu/\sigma\). At band \(N\), the expected \(\hat{\mu} = N\sigma\), giving \(\sqrt{q} = \max(0, \mu/\sigma - N)\). Synthetic TestStatResult objects are passed through the existing null_pval/alt_pval methods to reuse the CDF logic.

Parameters:: result (TestStatResult) – Must contain q_asimov for \(\sigma\) extraction.
Returns:: ExpectedBands with (pnull, palt) at each sigma level, or None if q_asimov is missing.
Return type:: ExpectedBands | None

class everwillow.hypotest.distributions.QTildeAsymptotic[source]

Bases: Distribution

Asymptotic distribution for \(\tilde{q}_\mu\) (upper limit with physical bound, Eq. 64).

Used with the \(\tilde{q}_\mu\) test statistic for hypothesis testing with the physical constraint \(\mu \geq 0\). The CDF is piecewise at \(\tilde{q} = \mu^2/\sigma^2 = q_\text{asimov}\).

cdf(q, mu, mu_prime, sigma)[source]

CDF: \(F(\tilde{q}_\mu \mid \mu')\) — piecewise at threshold \(\mu^2/\sigma^2\).

Return type:: Array

null_pval(result)[source]

Null p-value (\(\mu' = \mu\)), where \(q_A = \mu^2/\sigma^2\).

\[\begin{split}p_{\mu'=\mu} = \begin{cases} 1 - \Phi(\sqrt{\tilde{q}}) & \text{if } \tilde{q} \leq q_A \\ 1 - \Phi\!\left(\frac{\tilde{q} + q_A}{2\sqrt{q_A}}\right) & \text{if } \tilde{q} > q_A \end{cases}\end{split}\]

Return type:: Array | None

alt_pval(result)[source]

Alt p-value (\(\mu' = 0\)), where \(q_A = \mu^2/\sigma^2\).

\[\begin{split}p_{\mu'=0} = \begin{cases} 1 - \Phi(\sqrt{\tilde{q}} - \sqrt{q_A}) & \text{if } \tilde{q} \leq q_A \\ 1 - \Phi\!\left(\frac{\tilde{q} - q_A}{2\sqrt{q_A}}\right) & \text{if } \tilde{q} > q_A \end{cases}\end{split}\]

Return type:: Array | None

expected_pvalues(result)[source]

Expected p-values at \(\pm N\sigma\) fluctuations under background-only.

At band \(N\), \(\hat{\mu} = N\sigma\), so the expected test statistic is (with \(q_A = \mu^2/\sigma^2\)):

\[\begin{split}\tilde{q}_\text{exp} = \begin{cases} \max(0,\; \mu/\sigma - N)^2 & \text{if } N \geq 0 \\ (\mu/\sigma)^2 - 2(\mu/\sigma)\,N & \text{if } N < 0 \end{cases}\end{split}\]

Parameters:: result (TestStatResult) – Must contain q_asimov for \(\sigma\) extraction.
Returns:: ExpectedBands with (pnull, palt) at each sigma level, or None if q_asimov is missing.
Return type:: ExpectedBands | None

class everwillow.hypotest.distributions.EmpiricalDistribution(q_null, q_alt=None)[source]

Bases: Distribution

Base class for distributions built from toy test statistics.

Stores the raw test statistic arrays from toy generation and provides the from_toys factory method. Subclass this and override null_pval / alt_pval to implement custom p-value computation methods (e.g. KDE smoothing, tail extrapolation).

q_null

Test statistics under the tested hypothesis (poi_test).

Type:: jax.Array

q_alt

Test statistics under the alternative hypothesis (poi_alt). None if poi_alt was not provided to the ToyGenerator.

Type:: jax.Array | None

classmethod from_toys(toys)[source]

Construct from a ToyResult.

Parameters:: toys (ToyResult) – Raw toy generation output containing q_null and optionally q_alt.
Returns:: An instance of this distribution class.
Return type:: EmpiricalDistribution

class everwillow.hypotest.distributions.SimpleEmpiricalDistribution(q_null, q_alt=None)[source]

Bases: EmpiricalDistribution

Empirical p-values via simple tail counting.

\(p_\text{null} = \text{fraction of } q_\text{null} \geq q_\text{obs}\), \(p_\text{alt} = \text{fraction of } q_\text{alt} \geq q_\text{obs}\).

null_pval(result)[source]

Empirical p-value under tested hypothesis: fraction of \(q_\text{null} \geq q_\text{obs}\).

Return type:: Array

alt_pval(result)[source]

Empirical p-value under alternative: fraction of \(q_\text{alt} \geq q_\text{obs}\).

Returns None if q_alt was not provided (no alternative toys generated).

Return type:: Array | None

expected_pvalues(result)[source]

Compute expected p-values at standard sigma bands using toy quantiles.

Uses quantiles of q_alt at standard sigma percentiles as synthetic test statistic values, then evaluates empirical p-values at each via _build_expected_bands.

Parameters:: result (TestStatResult) – Test statistic result (used as template for synthetic results).
Returns:: ExpectedBands with empirical p-values at each sigma level.
Raises:: ValueError – If q_alt is None (no alternative toys generated).
Return type:: ExpectedBands

Calculators#

Hypothesis test calculators.

class everwillow.hypotest.calculators.HypoTestCalculator(nll_fn, params, observation, poi_key, test_statistic=<factory>, distribution=<factory>)[source]

Bases: Module

Generic hypothesis test calculator.

Orchestrates hypothesis testing by: 1. Computing the test statistic on observed data 2. Delegating p-value computation to a Distribution object

The calculator stores all model-specific arguments at construction time, so test(poi_test) only takes the varying parameter. Additional keyword arguments to test() are forwarded to the test statistic.

nll_fn

Negative log-likelihood function taking (params, observation).

Type:: Callable[[jaxtyping.PyTree, jaxtyping.PyTree], float]

params

Initial parameter state.

Type:: everwillow._src.statelib.state.State

observation

Observed data passed to nll_fn.

Type:: jaxtyping.PyTree

poi_key

Canonical key for the parameter of interest, e.g. “mu”.

Type:: str | tuple[str | int, …]

test_statistic

Test statistic to use. Defaults to QTilde.

Type:: everwillow._src.inference.hypotest.test_statistics.TestStatistic

distribution

Distribution for p-value computation. Defaults to QTildeAsymptotic.

Type:: everwillow._src.inference.hypotest.distributions.Distribution

test(poi_test, **kwargs)[source]

Run hypothesis test.

Parameters:

poi_test (float) – Test value for the POI.
**kwargs (Any) – Forwarded to the test statistic. Includes both test-statistic-specific args (e.g. predict_fn, mu_asimov for Cowan test statistics) and fit options.

Returns:

HypoTestResult with observed p-values.

Return type:

HypoTestResult

cls(result)[source]

Compute CLs = pnull / palt from a hypothesis test result.

Parameters:: result (HypoTestResult) – HypoTestResult from test().
Returns:: CLs value, or None if either p-value is None.
Return type:: Array | None

expected(result)[source]

Compute expected p-values at standard sigma bands.

Delegates to the distribution’s expected_pvalues method.

Parameters:: result (HypoTestResult) – HypoTestResult from test().
Returns:: ExpectedBands with p-values at each sigma level.
Raises:: NotImplementedError – If the distribution does not support expected p-value computation.
Return type:: ExpectedBands | None

class everwillow.hypotest.calculators.AsymptoticCalculator(nll_fn, params, observation, poi_key, test_statistic=<factory>, distribution=<factory>, predict_fn=None, mu_asimov=0.0, asimov_observation=None)[source]

Bases: HypoTestCalculator

Calculator for Cowan et al. asymptotic hypothesis tests.

Extends HypoTestCalculator with Asimov dataset configuration. These fields are injected into the test statistic call automatically by test().

The Asimov dataset can be provided in two ways:

Pre-computed: pass asimov_observation directly. This is useful when the Asimov dataset is expensive to generate or when the model prediction function is not available (e.g. combined models with multiple observation channels).
On-the-fly: pass predict_fn and mu_asimov. The Asimov dataset is generated at each test() call by setting the POI to mu_asimov and calling predict_fn.

When both are provided, asimov_observation takes precedence and predict_fn / mu_asimov are ignored.

Example

>>> calc = AsymptoticCalculator(
...     nll_fn=nll_fn, params=params, observation=observed,
...     poi_key="mu", predict_fn=my_predict_fn,
... )
>>> result = calc.test(poi_test=1.0)

predict_fn

Function mapping parameter state to expected observation. Used to create the Asimov dataset at mu_asimov.

Type:: Callable[[everwillow._src.statelib.state.State], jaxtyping.PyTree] | None

mu_asimov

POI value for Asimov dataset generation. Defaults to 0.0 (background-only, for exclusion tests). Use 1.0 for discovery tests.

Type:: float

asimov_observation

Pre-computed Asimov dataset. When provided, this is used directly instead of generating one via predict_fn / mu_asimov.

Type:: jaxtyping.PyTree | None

test(poi_test, **kwargs)[source]

Run asymptotic hypothesis test.

Injects predict_fn, mu_asimov, and asimov_observation from init fields, unless overridden in kwargs.

Parameters:

poi_test (float) – Test value for the POI.
**kwargs (Any) – Additional arguments forwarded to the test statistic (e.g. fit options). Can override predict_fn, mu_asimov, or asimov_observation for one-off use.

Returns:

HypoTestResult with observed p-values.

Return type:

HypoTestResult

Results#

Result containers for hypothesis testing.

class everwillow.hypotest.results.TestStatResult(value, test, q_asimov=None, extras=<factory>)[source]

Bases: Module

Result of computing a test statistic.

value

Test statistic value.

Type:: jax.Array

test

POI value being tested (\(\mu\)).

Type:: jax.Array

q_asimov

Test statistic evaluated on Asimov data. None if not computed.

Type:: jax.Array | None

extras

Arbitrary additional data (e.g., fits, mu_hat).

Type:: dict[str, Any]

class everwillow.hypotest.results.ToyResult(q_null, q_alt=None)[source]

Bases: Module

Raw output from toy generation.

Contains the test statistic arrays under both hypotheses, decoupled from any particular p-value computation method.

q_null

Test statistic values under the tested hypothesis (poi_test).

Type:: jax.Array

q_alt

Test statistic values under the alternative hypothesis (poi_alt). None if poi_alt was not provided to the ToyGenerator.

Type:: jax.Array | None

class everwillow.hypotest.results.BandValues(minus_2sigma, minus_1sigma, median, plus_1sigma, plus_2sigma)[source]

Bases: Module

Scalar values at standard \(\pm N\sigma\) fluctuation bands.

Supports iteration via for name, value in bv, indexing via bv["median"], and len(bv) == 5. dict(bv) produces a {name: value} mapping, and BandValues(**dict(bv)) roundtrips.

minus_2sigma

Value at \(-2\sigma\) fluctuation.

Type:: jax.Array

minus_1sigma

Value at \(-1\sigma\) fluctuation.

Type:: jax.Array

median

Value at median (\(0\sigma\)).

Type:: jax.Array

plus_1sigma

Value at \(+1\sigma\) fluctuation.

Type:: jax.Array

plus_2sigma

Value at \(+2\sigma\) fluctuation.

Type:: jax.Array

class everwillow.hypotest.results.ExpectedBands(null_pvalue, alt_pvalue, cl_s, null_sig, alt_sig)[source]

Bases: Module

Expected quantities at standard sigma bands.

All derived quantities (CLs, significance) are eagerly computed at construction time so access is a simple attribute lookup.

null_pvalue

p-value under null hypothesis (\(p_\mu\)) at each band.

Type:: everwillow._src.inference.hypotest.results.BandValues

alt_pvalue

p-value under alternative hypothesis (\(\text{CL}_b\)) at each band.

Type:: everwillow._src.inference.hypotest.results.BandValues

cl_s

\(\text{CL}_s = p_\text{null}/p_\text{alt}\) at each band.

Type:: everwillow._src.inference.hypotest.results.BandValues

null_sig

Null significance \(\Phi^{-1}(1 - p_\text{null})\) at each band.

Type:: everwillow._src.inference.hypotest.results.BandValues

alt_sig

Alternative significance \(\Phi^{-1}(1 - p_\text{alt})\) at each band.

Type:: everwillow._src.inference.hypotest.results.BandValues

class everwillow.hypotest.results.HypoTestResult(q_obs, pnull, palt, test_stat_result)[source]

Bases: Module

Result of a hypothesis test.

q_obs

Observed test statistic value.

Type:: jax.Array

pnull

p-value under the tested hypothesis (poi_test).

Type:: jax.Array | None

palt

p-value under the alternative hypothesis (poi_alt / background-only).

Type:: jax.Array | None

test_stat_result

Full test statistic result with fit information.

Type:: everwillow._src.inference.hypotest.results.TestStatResult

Toys#

Toy generation for hypothesis testing.

class everwillow.hypotest.toys.ToyGenerator(test_statistic, ntoys=1000, map_fn=<function vmap>)[source]

Bases: Module

Generates toy experiments for hypothesis testing.

Creates toys under the null hypothesis (poi_null) and optionally under an alternative hypothesis (poi_alt). Returns a ToyResult with raw test statistic arrays that can be fed into any EmpiricalDistribution subclass for p-value computation.

test_statistic

Test statistic to compute for each toy.

Type:: everwillow._src.inference.hypotest.test_statistics.TestStatistic

ntoys

Number of toys per hypothesis. Defaults to 1000.

Type:: int

map_fn

Function that maps a scalar function over an array of keys. Defaults to jax.vmap. Replace with e.g. lambda fn: partial(jax.lax.map, fn, batch_size=8) to process toys in groups instead of all at once, or a Python loop for step-through debugging.

Type:: Callable

Example

>>> toy_gen = ToyGenerator(test_statistic=QTilde(), ntoys=10000)
>>> toys = toy_gen.generate(
...     nll_fn, params, observed, "mu", 1.0,
...     poi_alt=0.0,
...     key=jax.random.key(42),
...     predict_fn=my_predict_fn,
... )
>>> # Choose how to interpret the toys (open-world)
>>> dist = SimpleEmpiricalDistribution.from_toys(toys)
>>> # Use with HypoTestCalculator
>>> calc = HypoTestCalculator(
...     nll_fn=nll_fn,
...     params=params,
...     observation=observed,
...     poi_key="mu",
...     test_statistic=QTilde(),
...     distribution=dist,
... )
>>> result = calc.test(1.0)

generate(nll_fn, params, observation, poi_key, poi_null, *, poi_alt=None, key, sample_fn=None, predict_fn=None, **fit_kwargs)[source]

Generate toys and return raw test statistic arrays.

Parameters:

nll_fn (Callable[[PyTree, PyTree], float]) – Negative log-likelihood function taking (params, observation).
params (State) – Initial parameter state.
observation (PyTree) – Observed data (used to profile nuisance parameters).
poi_key (str | tuple[str | int, ...]) – Canonical key for the parameter of interest, e.g. “mu”.
poi_null (float) – Null hypothesis POI value. Toys generated under this hypothesis populate q_null. The test statistic is evaluated at this value for each toy.
poi_alt (float | None) – Alternative hypothesis POI value. If provided, toys are generated under both hypotheses. If None, only null toys are generated and q_alt will be None in the result. For exclusion tests, typically 0.0. For discovery, typically 1.0.
key (Key[Array, ''] | UInt32[Array, '2']) – JAX PRNG key for reproducibility.
sample_fn (Callable[[State, Key[Array, ''] | UInt32[Array, '2']], PyTree] | None) – Function to generate toy data. Called as sample_fn(params_state, key) -> toy_observation. If None, a default Poisson sampler is created using predict_fn.
predict_fn (Callable[[State], PyTree] | None) – Function returning expected observation given parameters. Used to create default Poisson sampler if sample_fn is None.
**fit_kwargs (Any) – Additional arguments passed to fit().

Returns:

ToyResult with q_null (always) and q_alt (if poi_alt provided).

Raises:

ValueError – If neither sample_fn nor predict_fn is provided.

Return type:

ToyResult

Upper Limits#

Upper limit finding via root search.

everwillow.hypotest.upper_limit.expected_upper_limit(band_objective_fn, bounds, level=0.05, **solver_kwargs)[source]

Find expected upper limits at each sigma band.

Calls upper_limit() five times — once per band — extracting the corresponding scalar from the BandValues returned by band_objective_fn.

Parameters:

band_objective_fn (Callable[[float], BandValues]) – Function mapping POI value to BandValues of the objective quantity (e.g. expected CLs at each sigma band).
bounds (tuple[float, float]) – (lower, upper) search range for POI value.
level (float) – Target level (default 0.05 for 95% CL).
**solver_kwargs (Any) – Additional arguments passed to upper_limit() (e.g., rtol, atol, max_steps).

Returns:

BandValues where each entry is the upper limit at that sigma band.

Return type:

BandValues

Example

>>> def band_cls_objective(poi):
...     result = calc.test(poi)
...     bands = calc.expected(result)
...     return bands.cl_s
>>> brazil = expected_upper_limit(band_cls_objective, bounds=(0, 5))
>>> for name, val in brazil:
...     print(f"  {name}: {float(val):.4f}")

everwillow.hypotest.upper_limit.upper_limit(objective_fn, bounds, level=0.05, *, solver=None, rtol=0.0001, atol=1e-06, max_steps=100)[source]

Find POI value where objective function equals target level.

Uses root-finding to find where objective_fn(poi) = level. Pure JAX implementation via optimistix, fully JIT-compatible.

This is a generic root finder - the user composes the objective function to implement their desired exclusion criterion (CLs, p_alt, etc.).

Parameters:

objective_fn (Callable[[float], Array]) – Function mapping POI value to quantity of interest. Must be JAX-traceable (no float() calls on traced values). Should be monotonic within bounds for reliable convergence.
bounds (tuple[float, float]) – (lower, upper) search range for POI value.
level (float) – Target value for the objective function (default 0.05).
solver (AbstractRootFinder | None) – Root-finding solver to use. Defaults to optx.Bisection.
rtol (float) – Relative tolerance for convergence (used by default solver).
atol (float) – Absolute tolerance for convergence (used by default solver).
max_steps (int) – Maximum iterations.

Returns:

POI value where objective_fn(poi) = level.

Return type:

Array

Note

The objective function is JIT-compiled by optimistix. Avoid calling float() or other Python operations that break JAX tracing.

Examples

>>> # Find where CLs = 0.05 (95% CL upper limit)
>>> def cls_objective(poi):
...     return calc.cls(calc.test(poi))
>>> limit = upper_limit(cls_objective, bounds=(0, 5), level=0.05)

>>> # With custom solver
>>> solver = optx.Newton(rtol=1e-5, atol=1e-5)
>>> limit = upper_limit(cls_objective, bounds=(0, 5), solver=solver)

everwillow.hypotest.upper_limit.upper_limit_scan(objective_fn, scan, level=0.05)[source]

Find POI value where objective equals target level via grid scan.

Evaluates the objective function on a grid of POI values, then interpolates to find where it crosses the target level. Fully JIT-compatible via jax.vmap and jnp.interp.

This is useful when: - The objective function is expensive and you want to reuse evaluations - You need to visualize the objective curve - Root-finding fails due to non-monotonicity

Parameters:

objective_fn (Callable[[float], Array]) – Function mapping POI value to quantity of interest. Must be JAX-traceable.
scan (Array) – Array of POI values to evaluate. Should be monotonically increasing and span the expected limit location.
level (float) – Target value for the objective function (default 0.05).

Returns:

POI value where objective_fn(poi) = level, found by interpolation.

Return type:

Array

Note

The accuracy depends on the density of scan points near the crossing. For CLs limits, the objective typically decreases as POI increases.

Examples

>>> # Scan CLs on a grid
>>> scan = jnp.linspace(0, 2, 50)
>>> limit = upper_limit_scan(
...     lambda poi: calc.cls(calc.test(poi)),
...     scan,
...     level=0.05,
... )

>>> # With finer grid near expected limit
>>> scan = jnp.concatenate([
...     jnp.linspace(0, 0.5, 10),
...     jnp.linspace(0.5, 1.5, 30),
...     jnp.linspace(1.5, 3, 10),
... ])
>>> limit = upper_limit_scan(cls_objective, scan, level=0.05)

everwillow.hypotest.upper_limit.upper_limit_toys(objective_fn, bounds, key, level=0.05, *, tol=0.01, max_iterations=100)[source]

Find POI value where stochastic objective equals target level.

Uses bisection search with fresh PRNG keys at each iteration. Pure JAX implementation via jax.lax.while_loop, fully JIT-compatible.

Parameters:

objective_fn (Callable[[float, Key[Array, ''] | UInt32[Array, '2']], Array]) – Function mapping (poi, key) to quantity of interest. Should be monotonic (typically decreasing) as POI increases. Must be JAX-traceable (no float() calls on traced values).
bounds (tuple[float, float]) – (lower, upper) search range for POI value.
key (Key[Array, ''] | UInt32[Array, '2']) – JAX PRNG key for reproducibility.
level (float) – Target value for the objective function (default 0.05).
tol (float) – Stop when objective is within tol of level (default 1e-2).
max_iterations (int) – Maximum bisection iterations (default 100).

Returns:

POI value where objective_fn(poi, key) ≈ level.

Return type:

Array

Note

The result has statistical uncertainty from Monte Carlo sampling. The tolerance should account for this.

Examples

>>> # CLs-based upper limit with toys
>>> def cls_objective(poi, key):
...     result = toy_calc(
...         nll_fn, params, "mu", poi,
...         sample_fn=sample_fn, nll_factory=nll_factory, key=key
...     )
...     return result.cl_s
>>> limit = upper_limit_toys(cls_objective, bounds=(0, 5), key=key)

Utilities#

Utilities for hypothesis testing.

everwillow.hypotest.utils.cl_s(pnull, palt)[source]

Compute \(\text{CL}_s = p_\text{null} / p_\text{alt}\).

\(\text{CL}_s = P(q \geq q_\text{obs} \mid \text{signal+background}) / P(q \geq q_\text{obs} \mid \text{background})\)

The CLs method protects against excluding signal hypotheses when there is no sensitivity: if palt is small (background also finds data unlikely), CLs stays large.

Parameters:

pnull (Array) – p-value under null hypothesis (\(\mu' = \mu\), signal+background).
palt (Array) – p-value under alternative hypothesis (\(\mu' = 0\), background-only).

Returns:

CLs value. Protected against division by zero.

Return type:

Array

everwillow.hypotest.utils.constrained_fit(nll_fn, params, observation, poi_fixed, **fit_kwargs)[source]

Perform constrained fit, merging POI constraint with user-fixed params.

Merges poi_fixed (the POI constraint from the test statistic) with any user-specified fixed params in fit_kwargs. When all parameters end up fixed, the NLL is evaluated directly without running the optimizer.

Parameters:

nll_fn (Callable[[PyTree, PyTree], float]) – Negative log-likelihood function taking (params, observation).
params (State) – Initial parameter state.
observation (PyTree) – Observed data passed to nll_fn.
poi_fixed (State) – State specifying the POI value to constrain.
**fit_kwargs (Any) – Additional arguments passed to fit(). If fixed is present, it is merged with poi_fixed (poi_fixed wins on overlapping keys).

Returns:

FitResult with fitted parameters and NLL value.

Return type:

FitResult

everwillow.hypotest.utils.make_asimov(predict_fn, params, poi_key, mu_asimov)[source]

Generate an Asimov dataset at a given POI value.

Sets the POI to mu_asimov in the parameter state and calls predict_fn to produce the expected observation.

Parameters:

predict_fn (Callable[[State], PyTree]) – Function mapping parameter state to expected observation.
params (State) – Parameter state (used as template).
poi_key (str | tuple[str | int, ...]) – Canonical key for the parameter of interest, e.g. “mu”.
mu_asimov (float) – POI value at which to generate the Asimov dataset.

Returns:

Expected observation (Asimov dataset).

Return type:

PyTree

everwillow.hypotest.utils.sigma_from_asimov(mu, q_asimov, mu_asimov=0.0)[source]

Extract \(\sigma\) (uncertainty on \(\hat{\mu}\)) from an Asimov test statistic.

Uses the relation \(t_{\mu,A} \approx (\mu - \mu')^2/\sigma^2\) to solve for \(\sigma\).

Parameters:

mu (Array) – POI value being tested.
q_asimov (Array) – Test statistic evaluated on Asimov data.
mu_asimov (float) – POI value used to generate the Asimov dataset. Defaults to 0.0 (background-only, for exclusion tests).

Returns:

Estimated \(\sigma = |\mu - \mu_\text{asimov}| / \sqrt{q_\text{asimov}}\).

Return type:

Array

everwillow.hypotest.utils.significance(p)[source]

Convert p-value to significance: \(Z = \Phi^{-1}(1 - p)\).

Parameters:: p (Array) – p-value (scalar or array).
Returns:: Significance Z.
Return type:: Array

Hypothesis Testing

Contents

Hypothesis Testing#

Test Statistics#

Distributions#

Calculators#

Results#

Toys#

Upper Limits#

Utilities#