--- title: "Nonparametric Analysis of Doubly Truncated Data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Nonparametric Analysis of Doubly Truncated Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 5, fig.height = 3, fig.align = "center" ) ``` ```{r setup} library(survdt) ``` ## Doubly truncated data Double truncation occurs when a data sample only includes subjects whose event time lies within a random truncation interval. For example, the `survdt` package includes the doubly truncated dataset `aids`, which holds records of AIDS incubation times from a retrospective sample of patients with transfusion-acquired HIV. The inclusion criteria were that the patient was diagnosed with AIDS between the discovery of the disease in 1982 and the end of the study in 1986, with time measured in months since HIV infection. This sampling scheme was chosen so that the precise HIV infection time could be ascertained retrospectively from a blood transfusion registry. The event time of interest is the AIDS incubation time `incu`, with the truncation times contained in `ltrunc` and `rtrunc` respectively. See `?survdt::aids` for more details. The `survdt` package has a special `Survdt` data structure designed to store doubly truncated data. Below we create a `Survdt` object using the `aids` dataset and plot a simple visualization of the sample. See `?survdt::Survdt` for more details. ```{r} y <- Survdt(aids$incu, aids$ltrunc, aids$rtrunc) head(y) plot(y) ``` Each row of this plot contains a subject's truncation interval as a gray rectangle and the subject's event time in black, with rows sorted by event time. ## Nonparametric maximum likelihood estimation The nonparametric maximum likelihood estimator (NPMLE) of the event time cumulative distribution function (cdf) is a right-continuous step function with point masses at the observed event times. Think of it like an empirical distribution function that is adjusted for double truncation bias. It is estimated iteratively via an EM algorithm. The underlying assumptions for the NPMLE are discussed in [Assumptions required for the NPMLE]. The `survdt` package includes the function `npmle()` to compute the NPMLE and its standard errors. The first input argument should be a `Survdt` object or expression, which is evaluated using the variables found in the `data` argument (if provided). See `?survdt::npmle` for more details. Below we fit the NPMLE on the `aids` data and plot both the estimated survival function and estimated cumulative hazard function, along with pointwise 95% confidence intervals. Other plot options include the estimated selection (non-truncation) probability over time and the estimated inverse probability weights. See `?survdt::plot.npmle` for more details. ```{r} npfit <- npmle(Survdt(incu, ltrunc, rtrunc), data = aids, cumhaz = TRUE) npfit plot(npfit) ``` ```{r} plot(npfit, target = "cumhaz") ``` These confidence intervals are computed efficiently through closed-form standard errors, i.e. without relying on bootstrap or other resampling methods, in contrast to some other R packages for double truncation. ### Testing for event time differences across multiple groups The `aids` data has the factor variable `age_gp` which categorizes each subject based on age at HIV infection. The three age groups are children (age $\leq 4$), adults (age $4-59$), and elderly (age $>59$). The group sizes are: ```{r} summary(aids$age_gp) ``` In order to test for differences in incubation times across the three age groups, the `survdt` package provides two general options. First, the data can be analyzed through a Cox proportional hazards model using methods that correct for double truncation bias. See `vignette("cox-regression-quasiindep", package = "survdt")` and `?survdt::coxph_indtrunc` for more details on how to do this with the `survdt` package. Second, the data can be analyzed nonparametrically to detect any difference in the incubation time distributions without assuming proportional hazards. The `survdt` package implements this in the function `test_samplediff_indtrunc()`. The first input argument should be a formula with a `Survdt` response and variables defining the groups, i.e. strata, to test across on the right hand side. The test can be based on the supremum difference between survival functions or between cumulative hazards, both estimated by the NPMLE. The p-value is computed by simulating from the estimated asymptotic null distribution of the test statistic. See `?survdt::test_samplediff_indtrunc` for more details. Below we test for differences in the AIDS incubation time distribution across the three age groups. We limit the range of the test to at most 70 months since HIV infection to avoid the high variability associated with the tail of the distribution (see the plot in [Nonparametric maximum likelihood estimation]). ```{r} agetest <- test_samplediff_indtrunc(Survdt(incu, ltrunc, rtrunc) ~ age_gp, data = aids, restr_times = c(NA, 70)) agetest ``` We find highly significant evidence that the incubation time distribution differs across age groups. The test result can be visualized through the following plot, which compares the group-specific survival function estimates with the unstratified estimates. Under the null hypothesis of equality across strata, the stratified estimates should all lie within the 95% uniform confidence band, denoted by dashed lines. ```{r, fig.height = 3.5} plot(agetest) ``` Plotting the test result is recommended in general, since it provides a better understanding of the estimated effect size and power of the test. ### Assumptions required for the NPMLE The NPMLE requires two key conditions. First, all possible event times must have a positive probability of being observed. This can be determined from the study design, as well as by comparing the observed event times to the range of possible event times based on external knowledge. Second, the truncation times must be independent of the event time within the observable data region, commonly known as quasi-independence. See [Testing for quasi-independent truncation] for more information on how to test this assumption with the `survdt` package. ### Testing for quasi-independent truncation The `survdt` package includes the function `test_quasiindep_ctau()` to test for quasi-independence between the event and truncation times using the conditional Kendall's tau rank correlation. The first input argument should be a `Survdt` object or expression, which is evaluated using the variables found in the `data` argument (if provided). It computes the (event time, left truncation time) and (event time, right truncation time) rank correlations (`tau`) accounting for double truncation, as well as a p-value for testing whether both are zero. See `?survdt::test_quasiindep_ctau` for more details. Below we apply this quasi-independence test to the `aids` data. ```{r} test_quasiindep_ctau(Survdt(incu, ltrunc, rtrunc), data = aids) ``` Based on the small estimated correlations and moderate p-value, we do not find strong evidence against quasi-independence in the `aids` data. ### Testing for ignorable sampling bias A unique feature of doubly truncated data, compared to one-sided truncation, is that it is possible in some cases for there to be no effective sampling bias induced by truncation. This occurs when the sampling probability is constant across all event times. If the sampling bias is ignorable, standard methods which do not correct for double truncation bias may be used to analyze the data. The `survdt` package includes the function `test_ignorability_indtrunc` to test for ignorable sampling bias nonparametrically through the NPMLE. The first input argument can be a `Survdt` object/expression or the fitted NPMLE object returned by `npmle()`. The test can be based on the supremum difference between the time-specific and overall selection probabilities or the supremum difference between the NPMLE CDF and the empirical CDF (not adjusted for double truncation). The p-value is computed by simulating from the estimated asymptotic null distribution of the test statistic. See `?survdt::test_ignorability_indtrunc` for more details. Below we apply these tests to the `aids` data. The output from `test_ignorability_indtrunc` has a `plot` method to help visualize the deviation of the NPMLE from its reference value using a uniform confidence band. For the CDF-based test, we limit the range of the test to at most 80 months since HIV infection to avoid the high variability associated with the tail of the distribution (see the plot in [Nonparametric maximum likelihood estimation]). ```{r, fig.height=3.5} igtest_prob <- test_ignorability_indtrunc(npfit, test_type = "sel_prob") igtest_prob plot(igtest_prob) igtest_cdf <- test_ignorability_indtrunc(npfit, test_type = "cdf", restr_times = c(NA, 80)) igtest_cdf plot(igtest_cdf) ``` Both tests show highly significant evidence against ignorable sampling bias in the `aids` data. Thus this data should be analyzed with methods that directly account for double truncation bias. To illustrate the utility of plotting the test results, we run the CDF-based test without restricting the time range for the test statistic. ```{r, fig.height=3.5} igtest_cdf2 <- test_ignorability_indtrunc(npfit, test_type = "cdf") igtest_cdf2 plot(igtest_cdf2) ``` This unrestricted test returns a much larger p-value, and after plotting the result it is evident that this is due to the large variance of the NPMLE cdf at long incubation times. Thus the unrestricted CDF test has low power to detect what is clearly a large deviation from ignorability. This issue can occur when the estimated selection probabilities approach zero within the event time range. The plot below confirms that this is the case for the `aids` data. ```{r} plot(npfit, target = "sel_prob") ```