Introduction to zmctp: Zero-Modified Complex Tri-parametric Pearson Distribution

Rasheedat Oladoja

2026-03-30

Abstract

The zmctp package extends the Complex Tri-parametric Pearson (CTP) distribution with zero-modified versions for overdispersed count data. This package addresses limitations of existing implementations when the parameter \(b\) approaches zero, providing distribution functions, maximum likelihood estimation, and diagnostic tools based on Rodríguez-Avi et al. (2003).

1. Introduction

1.1 Background

In count data analysis, the Poisson regression model is the standard approach when successive events occur independently at the same rate. A unique feature of the Poisson distribution is equi-dispersion: the mean and variance are equal. However, real-world data often violates this assumption:

When overdispersion occurs, the Negative Binomial (NB) model is commonly recommended. For cases with excess zeros, Zero-Inflated models (ZIP, ZINB) are used. However, these models have limitations, particularly when handling complex dispersion patterns.

1.2 The Complex TriParametric Pearson (CTP) Distribution

The CTP distribution, introduced by Rodríguez-Avi et al. (2003), belongs to the Pearson discrete distribution family and is derived from the Gaussian Hypergeometric Probability Distribution (GHPD). Unlike Poisson-based mixtures, the CTP provides a fundamentally different approach to modeling count data.

Mathematical Definition

The CTP distribution has probability mass function:

\[f(x|a,b,\gamma) = f_0 \frac{\Gamma(a+ib+x) \Gamma(a-ib+x)}{\Gamma(\gamma+x) \cdot x!}, \quad x = 0, 1, 2, ...\]

where: - \(a \in \mathbb{R}\) (real parameter, can be negative) - \(b \geq 0\) (imaginary parameter) - \(\gamma > 2a + 2\) (shape parameter, ensures variance exists) - \(i\) is the imaginary unit - \(f_0\) is the normalizing constant:

\[f_0 = \frac{\Gamma(\gamma-a-ib)\Gamma(\gamma-a+ib)}{\Gamma(\gamma)\Gamma(\gamma-2a)}\]

Moments

The distribution has closed-form expressions for its moments:

Mean: \[\mu = \frac{a^2 + b^2}{\gamma - 2a - 1}\]

Variance: \[\sigma^2 = \frac{\mu(\mu + \gamma - 1)}{\gamma - 2a - 2}\]

Dispersion: - Equidispersed if \(a = -\frac{\mu + 1}{2}\) - Underdispersed if \(a < -\frac{\mu + 1}{2}\) - Overdispersed if \(a > -\frac{\mu + 1}{2}\)

Notably, the CTP is overdispersed whenever \(a \geq 0\).

1.3 Why zmctp?

Problem with Existing Implementations

The cpd package (available on CRAN) implements the CTP distribution but has a significant limitation: when fitting data with large sample sizes or zero-inflation, it often estimates \(b \approx 0\), which reduces the model to a simpler form and loses flexibility.

The zmctp Solution

This package addresses these limitations through:

  1. Robust optimization with reparameterization (\(\gamma = 2a + 2 + e^\eta\))
  2. Zero-Modified CTP for excess/deficit zeros
  3. Better handling of small \(b\) values
  4. Comprehensive diagnostics and visualization tools

2. Basic Usage

2.1 Distribution Functions

The package provides standard distribution functions following R conventions:

# Probability mass function
dctp(0:5, a = 1, b = 0.5, gama = 5)
#> [1] 0.698789363 0.174697341 0.061871975 0.027253132 0.013839481 0.007765487

# Cumulative distribution function
pctp(3, a = 1, b = 0.5, gama = 5)
#> [1] 0.9626118

# Quantile function (inverse CDF)
qctp(c(0.25, 0.5, 0.75), a = 1, b = 0.5, gama = 5)
#> [1] 0 0 1

# Random generation
set.seed(123)
x <- rctp(30, a = 1, b = 0.5, gama = 5)
cat("Sample mean:", mean(x), "\nSample variance:", var(x), "\n")
#> Sample mean: 0.9333333 
#> Sample variance: 3.443678

2.2 Checking for Overdispersion

# Generate overdispersed data
set.seed(456)
x_over <- rctp(20, a = 1.2, b = 0.3, gama = 6)
head(x_over)
#> [1] 0 0 1 1 1 0
# Calculate dispersion index
dispersion_index <- var(x_over) / mean(x_over)
cat("Dispersion Index:", dispersion_index, "\n")
#> Dispersion Index: 0.6842105

if (dispersion_index > 1) {
  cat("Data is OVERDISPERSED\n")
} else if (dispersion_index < 1) {
  cat("Data is UNDERDISPERSED\n")
} else {
  cat("Data is EQUIDISPERSED\n")
}
#> Data is UNDERDISPERSED

2.3 Fitting the CTP Model

# Fit CTP model to moderate-sized data
set.seed(456)
x_over <- rctp(200, a = 1.2, b = 0.3, gama = 6)
fit_ctp <- ctp.fit(x_over)
print(fit_ctp)
CTP Distribution - Maximum Likelihood Estimates
===============================================

Parameter Estimates:
     Estimate Std.Error
a      1.4593    0.3585
b      0.0011    0.0000
gama   8.5912    3.1878

Goodness-of-Fit Statistics:
  Log-Likelihood: -178.1003
  AIC: 362.2006
  BIC: 372.0956
  Pearson Chi-sq: 19.6799
  Wald Chi-sq: 189.8432

Convergence: YES

2.4 Diagnostic Plots

plot(fit_ctp, type = "frequency")
plot(fit_ctp, type = "cdf")
plot(fit_ctp, type = "qq")

3. Zero-Modified CTP Distribution

3.1 When to Use ZM-CTP

Use the Zero-Modified CTP when:

3.2 Mathematical Definition

The ZM-CTP probability mass function is:

\[P(X=0) = \omega + (1-\omega) P_{CTP}(0)\] \[P(X=k) = (1-\omega) P_{CTP}(k), \quad k > 0\]

where \(0 < \omega < 1\) is the zero-modification parameter.

Interpretation: - \(\omega > 0.5\): Zero-inflation - \(\omega < 0.5\): Zero-deflation - \(\omega \approx 0\): Standard CTP is sufficient

3.3 Example: Zero-Inflated Data

# Generate zero-inflated data

x_zi <- rzictp(300, a = 1, b = 0.5, gama = 6, omega = 0.3)
cat("Proportion of zeros:", mean(x_zi == 0), "\n")
cat("Expected P(X=0) under CTP:", dctp(0, 1, 0.5, 6), "\n")
fit_ctp <- ctp.fit(x_zi)
fit_zm <- zictp.fit(x_zi)
summary(fit_zm)
cat("Standard CTP AIC:", fit_ctp$AIC, "\n")
cat("ZM-CTP AIC:", fit_zm$AIC, "\n")
cat("Omega estimate:", fit_zm$estimates["omega"], "\n")
Proportion of zeros: 0.7966667 
Expected P(X=0) under CTP: 0.7570218 

=== Zero-Modified CTP Distribution Fit Summary ===


Zero-Modified CTP Distribution - Maximum Likelihood Estimates
=============================================================

Parameter Estimates:
      Estimate Std.Error
a       2.5646        NA
b       0.0017        NA
gama   16.8737        NA
omega   0.4656        NA

Goodness-of-Fit Statistics:
  Log-Likelihood: -216.5545
  AIC: 441.1090
  BIC: 455.9241
  Pearson Chi-sq: 2.7242
  Wald Chi-sq: 287.6122

Convergence: YES 

--- Moments ---
  Mean: 0.3271 (empirical: 0.3267)
  Variance: 0.6467 (empirical: 0.6220)
  P(X=0): 0.7967 (empirical: 0.7967)

--- Observed vs Expected Frequencies ---
  x Freq    expected
1 0  239 239.0001656
2 1   40  38.7105351
3 2   11  13.7597381
4 3    5   5.0633781
5 4    4   1.9722958
6 5    1   0.8143692

Standard CTP AIC: 440.03 
ZM-CTP AIC: 441.109 
Omega estimate: 0.4656324

5. Comparison with cpd Package

5.1 The Problem with Large Sample Sizes

From Oladoja (2021):

“Data sets previously utilized while fitting this model have sample sizes that are less than 2000. In this study, the CTP was fitted to different over-dispersed and under-dispersed count data sets with both small and large sample sizes.”

“The ‘cpd’ package in R… estimated the parameter b as 0 for cases where the sample size n is large.”

5.2 Demonstration

# Install cpd if needed
# install.packages("cpd")

library(zmctp)

# Generate data where cpd struggles
set.seed(100)
x_problem <- rzictp(2000, a = 1, b = 0.001, gama = 8, omega = 0.2)

# Compare results
cpd_fit <- cpd::fitctp(x_problem, astart=1, bstart=0.001, gammastart=8)
zmctp_fit <- zictp.fit(x_problem)
cat("cpd b estimate:", cpd_fit$coefficients[2], "\n")
cat("zmctp b estimate:", zmctp_fit$estimates["b"], "\n")
cat("zmctp omega estimate:", zmctp_fit$estimates["omega"], "\n")

cpd b estimate: 8.072796e-07 
zmctp b estimate: 0.0007942037 
zmctp omega estimate: 0.04831814

6. Model Selection Guide

6.1 Decision Tree

Is your data count data?
  ├─ NO → Use appropriate model for your data type
└─ YES → Calculate Dispersion Index (DI = variance/mean)
├─ DI ≈ 1 → Poisson
├─ DI > 1 (Over-dispersed)
│   ├─ Excess zeros? → Try ZIP, ZINB, ZM-CTP
│   ├─ Excess non-zeros? → Try CTP, NB
│   └─ Large n & b→0 in cpd? → Use zmctp!
  └─ DI < 1 (Under-dispersed) → Try CTP

7. Properties and Theoretical Results

7.1 Skewness

The CTP distribution has positive skewness:

\[\mu_3 = \frac{(a^2 + b^2)[4b^2 + (\gamma - 1)^2] + [b^2 + (\gamma - 1 - a)^2]}{(\gamma - 2a - 1)^3(\gamma - 2a - 2)(\gamma - 2a - 3)}\]

Since both numerator and denominator are always positive, the distribution is right-skewed.

7.2 Mode

The distribution is uni-modal at:

\[y = \left\lfloor \frac{(a-1)^2 + b^2}{\gamma - 2a + 1} \right\rfloor\]

## 7.3 Convergence

The CTP distribution converges to: - Poisson when \(\gamma\) and \(a^2 + b^2 \rightarrow \infty\) at the same rate - Normal when \(\gamma\) and \(\sqrt{a^2 + b^2}\) have the same order of convergence # 9. Conclusion

The zmctp package provides a robust implementation of the CTP distribution that:

✅ Handles both over-dispersion and under-dispersion
✅ Works with large sample sizes (n > 2000)
✅ Addresses the b → 0 problem in existing implementations
✅ Provides zero-modified variants for zero-inflated data
✅ Includes comprehensive diagnostics and visualization

9.1 Key Findings from Oladoja (2021)

“The CTP produced similar fit to the NB model while correcting for over-dispersion caused by excess zeros but is a better fit than the NB when over-dispersion is caused by excessive non-zero values in a data set.”

“The CTP model behaved better with under-dispersed data sets than the Poisson model.”

“The package ‘cpd’ in R estimates b as 0 for zero-inflated data sets while the program utilized in this study works well.”

References

Böhning, D., Dietz, E., & Schlattmann, P. (1999). The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society: Series A, 162(2), 195-209.

Deb, P., & Trivedi, P. K. (1997). Demand for medical care by the elderly: a finite mixture approach. Journal of Applied Econometrics, 12(3), 313-336.

Oladoja, R. O. (2021). Complex Tri-Parametric Pearson Distribution and Its Applications. M.Sc. Thesis, Kwara State University, Malete, Nigeria.

Rodríguez-Avi, J., Conde-Sánchez, A., & Sáez-Castillo, A. J. (2003). A new class of discrete distributions with complex parameters. Statistical Papers, 44(1), 67-88. https://doi.org/10.1007/s00362-002-0134-7

Session Info

sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] zmctp_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.39     R6_2.6.1          fastmap_1.2.0     xfun_0.53        
#>  [5] cachem_1.1.0      knitr_1.51        htmltools_0.5.9   rmarkdown_2.30   
#>  [9] lifecycle_1.0.4   cli_3.6.5         sass_0.4.10       jquerylib_0.1.4  
#> [13] compiler_4.5.1    rstudioapi_0.17.1 tools_4.5.1       evaluate_1.0.5   
#> [17] bslib_0.9.0       yaml_2.3.12       otel_0.2.0        jsonlite_2.0.0   
#> [21] rlang_1.1.6