The zmctp package extends the Complex Tri-parametric Pearson (CTP) distribution with zero-modified versions for overdispersed count data. This package addresses limitations of existing implementations when the parameter \(b\) approaches zero, providing distribution functions, maximum likelihood estimation, and diagnostic tools based on Rodríguez-Avi et al. (2003).
In count data analysis, the Poisson regression model is the standard approach when successive events occur independently at the same rate. A unique feature of the Poisson distribution is equi-dispersion: the mean and variance are equal. However, real-world data often violates this assumption:
When overdispersion occurs, the Negative Binomial (NB) model is commonly recommended. For cases with excess zeros, Zero-Inflated models (ZIP, ZINB) are used. However, these models have limitations, particularly when handling complex dispersion patterns.
The CTP distribution, introduced by Rodríguez-Avi et al. (2003), belongs to the Pearson discrete distribution family and is derived from the Gaussian Hypergeometric Probability Distribution (GHPD). Unlike Poisson-based mixtures, the CTP provides a fundamentally different approach to modeling count data.
The CTP distribution has probability mass function:
\[f(x|a,b,\gamma) = f_0 \frac{\Gamma(a+ib+x) \Gamma(a-ib+x)}{\Gamma(\gamma+x) \cdot x!}, \quad x = 0, 1, 2, ...\]
where: - \(a \in \mathbb{R}\) (real parameter, can be negative) - \(b \geq 0\) (imaginary parameter) - \(\gamma > 2a + 2\) (shape parameter, ensures variance exists) - \(i\) is the imaginary unit - \(f_0\) is the normalizing constant:
\[f_0 = \frac{\Gamma(\gamma-a-ib)\Gamma(\gamma-a+ib)}{\Gamma(\gamma)\Gamma(\gamma-2a)}\]
The distribution has closed-form expressions for its moments:
Mean: \[\mu = \frac{a^2 + b^2}{\gamma - 2a - 1}\]
Variance: \[\sigma^2 = \frac{\mu(\mu + \gamma - 1)}{\gamma - 2a - 2}\]
Dispersion: - Equidispersed if \(a = -\frac{\mu + 1}{2}\) - Underdispersed if \(a < -\frac{\mu + 1}{2}\) - Overdispersed if \(a > -\frac{\mu + 1}{2}\)
Notably, the CTP is overdispersed whenever \(a \geq 0\).
The cpd package (available on CRAN) implements the CTP
distribution but has a significant limitation: when fitting data
with large sample sizes or zero-inflation, it often estimates \(b \approx 0\), which reduces the
model to a simpler form and loses flexibility.
This package addresses these limitations through:
The package provides standard distribution functions following R conventions:
# Probability mass function
dctp(0:5, a = 1, b = 0.5, gama = 5)
#> [1] 0.698789363 0.174697341 0.061871975 0.027253132 0.013839481 0.007765487
# Cumulative distribution function
pctp(3, a = 1, b = 0.5, gama = 5)
#> [1] 0.9626118
# Quantile function (inverse CDF)
qctp(c(0.25, 0.5, 0.75), a = 1, b = 0.5, gama = 5)
#> [1] 0 0 1
# Random generation
set.seed(123)
x <- rctp(30, a = 1, b = 0.5, gama = 5)
cat("Sample mean:", mean(x), "\nSample variance:", var(x), "\n")
#> Sample mean: 0.9333333
#> Sample variance: 3.443678# Generate overdispersed data
set.seed(456)
x_over <- rctp(20, a = 1.2, b = 0.3, gama = 6)
head(x_over)
#> [1] 0 0 1 1 1 0
# Calculate dispersion index
dispersion_index <- var(x_over) / mean(x_over)
cat("Dispersion Index:", dispersion_index, "\n")
#> Dispersion Index: 0.6842105
if (dispersion_index > 1) {
cat("Data is OVERDISPERSED\n")
} else if (dispersion_index < 1) {
cat("Data is UNDERDISPERSED\n")
} else {
cat("Data is EQUIDISPERSED\n")
}
#> Data is UNDERDISPERSED# Fit CTP model to moderate-sized data
set.seed(456)
x_over <- rctp(200, a = 1.2, b = 0.3, gama = 6)
fit_ctp <- ctp.fit(x_over)
print(fit_ctp)CTP Distribution - Maximum Likelihood Estimates
===============================================
Parameter Estimates:
Estimate Std.Error
a 1.4593 0.3585
b 0.0011 0.0000
gama 8.5912 3.1878
Goodness-of-Fit Statistics:
Log-Likelihood: -178.1003
AIC: 362.2006
BIC: 372.0956
Pearson Chi-sq: 19.6799
Wald Chi-sq: 189.8432
Convergence: YES
Use the Zero-Modified CTP when:
The ZM-CTP probability mass function is:
\[P(X=0) = \omega + (1-\omega) P_{CTP}(0)\] \[P(X=k) = (1-\omega) P_{CTP}(k), \quad k > 0\]
where \(0 < \omega < 1\) is the zero-modification parameter.
Interpretation: - \(\omega > 0.5\): Zero-inflation - \(\omega < 0.5\): Zero-deflation - \(\omega \approx 0\): Standard CTP is sufficient
# Generate zero-inflated data
x_zi <- rzictp(300, a = 1, b = 0.5, gama = 6, omega = 0.3)
cat("Proportion of zeros:", mean(x_zi == 0), "\n")
cat("Expected P(X=0) under CTP:", dctp(0, 1, 0.5, 6), "\n")
fit_ctp <- ctp.fit(x_zi)
fit_zm <- zictp.fit(x_zi)
summary(fit_zm)
cat("Standard CTP AIC:", fit_ctp$AIC, "\n")
cat("ZM-CTP AIC:", fit_zm$AIC, "\n")
cat("Omega estimate:", fit_zm$estimates["omega"], "\n")Proportion of zeros: 0.7966667
Expected P(X=0) under CTP: 0.7570218
=== Zero-Modified CTP Distribution Fit Summary ===
Zero-Modified CTP Distribution - Maximum Likelihood Estimates
=============================================================
Parameter Estimates:
Estimate Std.Error
a 2.5646 NA
b 0.0017 NA
gama 16.8737 NA
omega 0.4656 NA
Goodness-of-Fit Statistics:
Log-Likelihood: -216.5545
AIC: 441.1090
BIC: 455.9241
Pearson Chi-sq: 2.7242
Wald Chi-sq: 287.6122
Convergence: YES
--- Moments ---
Mean: 0.3271 (empirical: 0.3267)
Variance: 0.6467 (empirical: 0.6220)
P(X=0): 0.7967 (empirical: 0.7967)
--- Observed vs Expected Frequencies ---
x Freq expected
1 0 239 239.0001656
2 1 40 38.7105351
3 2 11 13.7597381
4 3 5 5.0633781
5 4 4 1.9722958
6 5 1 0.8143692
Standard CTP AIC: 440.03
ZM-CTP AIC: 441.109
Omega estimate: 0.4656324
From Oladoja (2021):
“Data sets previously utilized while fitting this model have sample sizes that are less than 2000. In this study, the CTP was fitted to different over-dispersed and under-dispersed count data sets with both small and large sample sizes.”
“The ‘cpd’ package in R… estimated the parameter b as 0 for cases where the sample size n is large.”
# Install cpd if needed
# install.packages("cpd")
library(zmctp)
# Generate data where cpd struggles
set.seed(100)
x_problem <- rzictp(2000, a = 1, b = 0.001, gama = 8, omega = 0.2)
# Compare results
cpd_fit <- cpd::fitctp(x_problem, astart=1, bstart=0.001, gammastart=8)
zmctp_fit <- zictp.fit(x_problem)
cat("cpd b estimate:", cpd_fit$coefficients[2], "\n")
cat("zmctp b estimate:", zmctp_fit$estimates["b"], "\n")
cat("zmctp omega estimate:", zmctp_fit$estimates["omega"], "\n")
cpd b estimate: 8.072796e-07
zmctp b estimate: 0.0007942037
zmctp omega estimate: 0.04831814
Is your data count data?
├─ NO → Use appropriate model for your data type
└─ YES → Calculate Dispersion Index (DI = variance/mean)
├─ DI ≈ 1 → Poisson
├─ DI > 1 (Over-dispersed)
│ ├─ Excess zeros? → Try ZIP, ZINB, ZM-CTP
│ ├─ Excess non-zeros? → Try CTP, NB
│ └─ Large n & b→0 in cpd? → Use zmctp!
└─ DI < 1 (Under-dispersed) → Try CTP
The CTP distribution has positive skewness:
\[\mu_3 = \frac{(a^2 + b^2)[4b^2 + (\gamma - 1)^2] + [b^2 + (\gamma - 1 - a)^2]}{(\gamma - 2a - 1)^3(\gamma - 2a - 2)(\gamma - 2a - 3)}\]
Since both numerator and denominator are always positive, the distribution is right-skewed.
The distribution is uni-modal at:
\[y = \left\lfloor \frac{(a-1)^2 + b^2}{\gamma - 2a + 1} \right\rfloor\]
## 7.3 Convergence
The CTP distribution converges to: - Poisson when \(\gamma\) and \(a^2 + b^2 \rightarrow \infty\) at the same rate - Normal when \(\gamma\) and \(\sqrt{a^2 + b^2}\) have the same order of convergence # 9. Conclusion
The zmctp package provides a robust implementation of the CTP distribution that:
✅ Handles both over-dispersion and under-dispersion
✅ Works with large sample sizes (n > 2000)
✅ Addresses the b → 0 problem in existing implementations
✅ Provides zero-modified variants for zero-inflated data
✅ Includes comprehensive diagnostics and visualization
“The CTP produced similar fit to the NB model while correcting for over-dispersion caused by excess zeros but is a better fit than the NB when over-dispersion is caused by excessive non-zero values in a data set.”
“The CTP model behaved better with under-dispersed data sets than the Poisson model.”
“The package ‘cpd’ in R estimates b as 0 for zero-inflated data sets while the program utilized in this study works well.”
Böhning, D., Dietz, E., & Schlattmann, P. (1999). The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society: Series A, 162(2), 195-209.
Deb, P., & Trivedi, P. K. (1997). Demand for medical care by the elderly: a finite mixture approach. Journal of Applied Econometrics, 12(3), 313-336.
Oladoja, R. O. (2021). Complex Tri-Parametric Pearson Distribution and Its Applications. M.Sc. Thesis, Kwara State University, Malete, Nigeria.
Rodríguez-Avi, J., Conde-Sánchez, A., & Sáez-Castillo, A. J. (2003). A new class of discrete distributions with complex parameters. Statistical Papers, 44(1), 67-88. https://doi.org/10.1007/s00362-002-0134-7
sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: America/Chicago
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] zmctp_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0 xfun_0.53
#> [5] cachem_1.1.0 knitr_1.51 htmltools_0.5.9 rmarkdown_2.30
#> [9] lifecycle_1.0.4 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
#> [13] compiler_4.5.1 rstudioapi_0.17.1 tools_4.5.1 evaluate_1.0.5
#> [17] bslib_0.9.0 yaml_2.3.12 otel_0.2.0 jsonlite_2.0.0
#> [21] rlang_1.1.6