Getting Started with contentValidity

library(contentValidity)

Background

When developing a new questionnaire, scale, or test, researchers typically ask a panel of subject-matter experts to rate each candidate item for relevance to the construct being measured. The expert ratings are then summarized into content validity indices that quantify how well the items represent the intended construct.

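The contentValidity package implements the standard set of content validity indices used in nursing, education, psychology, and health sciences research: the item-level content validity index (I-CVI), modified kappa, Aiken’s V, Gwet’s AC1 and AC2, the scale-level indices S-CVI/Ave and S-CVI/UA, and Lawshe’s content validity ratio (CVR).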

The example dataset

The package ships with cvi_example, a simulated set of expert ratings for a 10-item depression screening instrument, with 6 expert raters using a 4-point relevance scale (1 = not relevant, 4 = highly relevant).

data(cvi_example)
head(cvi_example)
#>         item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
#> expert1     4     3     3     2     3     4     3     4     2      4
#> expert2     4     4     3     3     2     4     3     4     3      4
#> expert3     4     4     4     3     3     4     2     4     2      3
#> expert4     4     4     3     4     3     3     3     4     3      4
#> expert5     4     3     4     3     4     4     3     4     3      4
#> expert6     4     4     3     3     2     4     4     3     2      4

Item-level analysis

The simplest place to start is icvi(), which gives the proportion of experts rating each item as 3 or 4:

icvi(cvi_example)
#>     item1     item2     item3     item4     item5     item6     item7     item8 
#> 1.0000000 1.0000000 1.0000000 0.8333333 0.6666667 1.0000000 0.8333333 1.0000000 
#>     item9    item10 
#> 0.5000000 1.0000000

Following Polit and Beck (2006), an I-CVI of 0.78 or higher is considered excellent for panels of six or more experts. Items 5 and 9 in the example (0.67 and 0.50) fall below that benchmark and would be flagged for revision.
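As a cross-check, the same proportions can be reproduced in base R without the package. This is a minimal sketch: the `ratings` matrix simply re-enters the values printed above, and `icvi_manual` and `flagged` are illustrative names, not package objects.

```r
# Re-enter the example ratings shown above (rows = experts, columns = items).
ratings <- matrix(
  c(4, 3, 3, 2, 3, 4, 3, 4, 2, 4,
    4, 4, 3, 3, 2, 4, 3, 4, 3, 4,
    4, 4, 4, 3, 3, 4, 2, 4, 2, 3,
    4, 4, 3, 4, 3, 3, 3, 4, 3, 4,
    4, 3, 4, 3, 4, 4, 3, 4, 3, 4,
    4, 4, 3, 3, 2, 4, 4, 3, 2, 4),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("expert", 1:6), paste0("item", 1:10))
)

# I-CVI: share of experts giving each item a rating of 3 or 4.
icvi_manual <- colMeans(ratings >= 3)

# Items falling below the 0.78 benchmark.
flagged <- names(icvi_manual)[icvi_manual < 0.78]

print(round(icvi_manual, 2))
print(flagged)
```

The flagged items are item5 and item9, matching the reading of the icvi() output.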

Plain I-CVI doesn’t correct for chance agreement. With small panels, a high I-CVI can be partly luck. Modified kappa addresses this:

mod_kappa(cvi_example)
#>     item1     item2     item3     item4     item5     item6     item7     item8 
#> 1.0000000 1.0000000 1.0000000 0.8160920 0.5646259 1.0000000 0.8160920 1.0000000 
#>     item9    item10 
#> 0.2727273 1.0000000

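Notice that item 9 drops sharply (0.50 → 0.27). With six raters, exactly three “relevant” ratings is the single most likely outcome under chance (probability 20/64 ≈ 0.31), so its I-CVI receives the largest correction.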
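The correction can be written out by hand using the formula from Polit, Beck, and Owen (2007): the chance probability is the binomial probability that exactly A of N experts say “relevant” when each does so with probability 0.5. The helper name `mod_kappa_manual` is hypothetical and is only meant to illustrate the arithmetic.

```r
# Modified kappa for one item, given N experts of whom A rated it relevant.
mod_kappa_manual <- function(A, N) {
  icvi <- A / N
  # Probability that exactly A of N experts say "relevant" by chance,
  # assuming each expert has a 0.5 chance of doing so.
  p_chance <- choose(N, A) * 0.5^N
  (icvi - p_chance) / (1 - p_chance)
}

# Agreement counts of 6, 5, 4, and 3 out of 6 cover every item in the example.
kappa_vals <- mod_kappa_manual(A = c(6, 5, 4, 3), N = 6)
print(round(kappa_vals, 4))
```

The four results (1.0000, 0.8161, 0.5646, 0.2727) match the values printed for items 1, 4, 5, and 9.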

Aiken’s V uses the full rating scale rather than dichotomizing relevant/not-relevant. A “4” contributes more than a “3”:

aiken_v(cvi_example, lo = 1, hi = 4)
#>     item1     item2     item3     item4     item5     item6     item7     item8 
#> 1.0000000 0.8888889 0.7777778 0.6666667 0.6111111 0.9444444 0.6666667 0.9444444 
#>     item9    item10 
#> 0.5000000 0.9444444
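Aiken’s V for a single item is the sum of each rating’s distance from the scale minimum, divided by the largest possible sum. The sketch below applies that formula to items 4 and 9 from the example; `aiken_v_manual` is an illustrative helper, not a package function.

```r
# Aiken's V for one item: x is the vector of expert ratings,
# lo and hi are the lowest and highest points of the rating scale.
aiken_v_manual <- function(x, lo, hi) {
  sum(x - lo) / (length(x) * (hi - lo))
}

item4 <- c(2, 3, 3, 4, 3, 3)  # ratings of item 4 by experts 1 to 6
item9 <- c(2, 3, 2, 3, 3, 2)  # ratings of item 9 by experts 1 to 6

v4 <- aiken_v_manual(item4, lo = 1, hi = 4)
v9 <- aiken_v_manual(item9, lo = 1, hi = 4)
print(round(c(item4 = v4, item9 = v9), 4))
```

Item 4 has an I-CVI of 0.83 but a V of only 0.67, because most of its “relevant” ratings are 3s rather than 4s.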

Scale-level analysis

Two scale-level indices summarize content validity across all items:

scvi_ave(cvi_example)   # average of I-CVIs
#> [1] 0.8833333
scvi_ua(cvi_example)    # proportion of items with universal agreement
#> [1] 0.6

Polit and Beck (2006) recommend reporting both. S-CVI/Ave ≥ 0.90 indicates excellent overall content validity, so the example’s 0.88 falls just short; S-CVI/UA gives a stricter view of how many items achieved unanimous endorsement (6 of 10 here).
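Both scale-level indices are simple summaries of the item-level results. A minimal base-R sketch, starting from the number of experts who rated each example item 3 or 4 (variable names are illustrative):

```r
n_experts <- 6
# Number of experts rating each of the 10 example items as 3 or 4.
n_relevant <- c(6, 6, 6, 5, 4, 6, 5, 6, 3, 6)

icvi_vals <- n_relevant / n_experts

# S-CVI/Ave: mean of the item-level I-CVIs.
scvi_ave_manual <- mean(icvi_vals)

# S-CVI/UA: share of items that every expert rated as relevant.
scvi_ua_manual <- mean(n_relevant == n_experts)

print(round(c(ave = scvi_ave_manual, ua = scvi_ua_manual), 4))
```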

All indices at once

content_validity() is the workhorse function for routine analysis. It returns the complete set of item-level and scale-level indices in one tidy structure:

result <- content_validity(cvi_example)
result
#> Content Validity Analysis
#> -------------------------
#> Experts: 6
#> Items:   10
#> 
#> Item-level indices:
#>    item   icvi mod_kappa aiken_v gwet_ac1 gwet_ac2
#>   item1 1.0000    1.0000  1.0000   1.0000   1.0000
#>   item2 1.0000    1.0000  0.8889   1.0000   0.8964
#>   item3 1.0000    1.0000  0.7778   1.0000   0.8964
#>   item4 0.8333    0.8161  0.6667   0.5385   0.8286
#>   item5 0.6667    0.5646  0.6111   0.0400   0.6940
#>   item6 1.0000    1.0000  0.9444   1.0000   0.9494
#>   item7 0.8333    0.8161  0.6667   0.5385   0.8286
#>   item8 1.0000    1.0000  0.9444   1.0000   0.9494
#>   item9 0.5000    0.2727  0.5000  -0.2000   0.8714
#>  item10 1.0000    1.0000  0.9444   1.0000   0.9494
#> 
#> Scale-level indices (overall):
#>   scvi_ave    scvi_ua mean_kappa   mean_ac1   mean_ac2 
#>     0.8833     0.6000     0.8470     0.6917     0.8864

The result is an object you can subset, just like a list:

result$items
#>      item      icvi mod_kappa   aiken_v   gwet_ac1  gwet_ac2
#> 1   item1 1.0000000 1.0000000 1.0000000  1.0000000 1.0000000
#> 2   item2 1.0000000 1.0000000 0.8888889  1.0000000 0.8964029
#> 3   item3 1.0000000 1.0000000 0.7777778  1.0000000 0.8964029
#> 4   item4 0.8333333 0.8160920 0.6666667  0.5384615 0.8285714
#> 5   item5 0.6666667 0.5646259 0.6111111  0.0400000 0.6940000
#> 6   item6 1.0000000 1.0000000 0.9444444  1.0000000 0.9494382
#> 7   item7 0.8333333 0.8160920 0.6666667  0.5384615 0.8285714
#> 8   item8 1.0000000 1.0000000 0.9444444  1.0000000 0.9494382
#> 9   item9 0.5000000 0.2727273 0.5000000 -0.2000000 0.8714286
#> 10 item10 1.0000000 1.0000000 0.9444444  1.0000000 0.9494382
result$scale
#>   scvi_ave    scvi_ua mean_kappa   mean_ac1   mean_ac2 
#>  0.8833333  0.6000000  0.8469537  0.6916923  0.8863692

Publication-ready tables

apa_table() formats the result for journal manuscripts:

apa_table(result)
#>      Item I-CVI Modified Kappa Kappa Interpretation Aiken's V Gwet's AC1
#> 1   item1  1.00           1.00            Excellent      1.00       1.00
#> 2   item2  1.00           1.00            Excellent      0.89       1.00
#> 3   item3  1.00           1.00            Excellent      0.78       1.00
#> 4   item4  0.83           0.82            Excellent      0.67       0.54
#> 5   item5  0.67           0.56                 Fair      0.61       0.04
#> 6   item6  1.00           1.00            Excellent      0.94       1.00
#> 7   item7  0.83           0.82            Excellent      0.67       0.54
#> 8   item8  1.00           1.00            Excellent      0.94       1.00
#> 9   item9  0.50           0.27                 Poor      0.50      -0.20
#> 10 item10  1.00           1.00            Excellent      0.94       1.00
#>    Gwet's AC2
#> 1        1.00
#> 2        0.90
#> 3        0.90
#> 4        0.83
#> 5        0.69
#> 6        0.95
#> 7        0.83
#> 8        0.95
#> 9        0.87
#> 10       0.95

For R Markdown output (HTML, PDF, Word), use the appropriate format argument. The function returns a knitr::kable() object that renders correctly in your document:

apa_table(result, format = "markdown")
Content validity indices (N = 6 experts, 10 items; S-CVI/Ave = 0.88, S-CVI/UA = 0.60).

| Item   | I-CVI | Modified Kappa | Kappa Interpretation | Aiken’s V | Gwet’s AC1 | Gwet’s AC2 |
|:-------|------:|---------------:|:---------------------|----------:|-----------:|-----------:|
| item1  |  1.00 |           1.00 | Excellent            |      1.00 |       1.00 |       1.00 |
| item2  |  1.00 |           1.00 | Excellent            |      0.89 |       1.00 |       0.90 |
| item3  |  1.00 |           1.00 | Excellent            |      0.78 |       1.00 |       0.90 |
| item4  |  0.83 |           0.82 | Excellent            |      0.67 |       0.54 |       0.83 |
| item5  |  0.67 |           0.56 | Fair                 |      0.61 |       0.04 |       0.69 |
| item6  |  1.00 |           1.00 | Excellent            |      0.94 |       1.00 |       0.95 |
| item7  |  0.83 |           0.82 | Excellent            |      0.67 |       0.54 |       0.83 |
| item8  |  1.00 |           1.00 | Excellent            |      0.94 |       1.00 |       0.95 |
| item9  |  0.50 |           0.27 | Poor                 |      0.50 |      -0.20 |       0.87 |
| item10 |  1.00 |           1.00 | Excellent            |      0.94 |       1.00 |       0.95 |

Lawshe’s CVR

CVR uses a different rating convention: each expert classifies items as essential, useful but not essential, or not necessary. Use Lawshe-style coding (1 = essential, 2 = useful, 3 = not necessary) and call cvr() directly:

# 10 experts rating 3 items on Lawshe's scale
lawshe_ratings <- matrix(
  c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2,    # 8 of 10 essential
    1, 1, 1, 2, 2, 2, 2, 3, 3, 3,    # 3 of 10 essential
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1),   # 10 of 10 essential
  nrow = 10,
  dimnames = list(NULL, paste0("item", 1:3))
)

cvr(lawshe_ratings)
#> item1 item2 item3 
#>   0.6  -0.4   1.0
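Lawshe’s formula is CVR = (n_e − N/2) / (N/2), where n_e is the number of experts rating the item “essential” and N is the panel size. A short sketch using the counts from the example matrix (variable names are illustrative):

```r
n_experts <- 10
# Number of experts rating each item "essential" (code 1) in the example.
n_essential <- c(item1 = 8, item2 = 3, item3 = 10)

# CVR ranges from -1 (no one says essential) to 1 (everyone does);
# 0 means exactly half the panel rated the item essential.
cvr_manual <- (n_essential - n_experts / 2) / (n_experts / 2)
print(cvr_manual)
```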

Compare each item’s CVR to the critical value for the panel size, using the corrected Wilson, Pan, and Schumsky (2012) thresholds:

cvr_critical(n_experts = 10)        # one-tailed alpha = 0.05
#> [1] 0.8
cvr_critical(n_experts = 10, alpha = 0.01)
#> [1] 1

In this example, only item 3 (CVR = 1.0) reaches the critical value of 0.8 at α = 0.05. Item 1 (CVR = 0.6) and item 2 (CVR = −0.4) fall short and would be revised or dropped.
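The two critical values printed above are consistent with an exact one-tailed binomial test of the null hypothesis that each expert is equally likely to answer “essential” or not. The sketch below follows that logic. `cvr_crit_exact` is a hypothetical helper, and whether cvr_critical() computes its thresholds this way internally is an assumption.

```r
# Smallest CVR that is significant at level alpha (one-tailed) for n experts,
# under the null that each expert says "essential" with probability 0.5.
cvr_crit_exact <- function(n, alpha = 0.05) {
  k <- 0:n
  # P(X >= k) for X ~ Binomial(n, 0.5).
  upper_tail <- pbinom(k - 1, size = n, prob = 0.5, lower.tail = FALSE)
  # Smallest number of "essential" votes whose tail probability is <= alpha.
  k_min <- min(k[upper_tail <= alpha])
  (k_min - n / 2) / (n / 2)
}

crit_05 <- cvr_crit_exact(10, alpha = 0.05)
crit_01 <- cvr_crit_exact(10, alpha = 0.01)
print(c(crit_05, crit_01))
```

With 10 experts, 9 “essential” votes are needed at α = 0.05 (CVR = 0.8) and all 10 at α = 0.01 (CVR = 1).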

What’s new in v0.2.0

Bootstrap confidence intervals

All six relevance-scale indices and Lawshe’s CVR now accept an optional ci = TRUE argument that returns bootstrap confidence intervals alongside the point estimate. By default the interval is the percentile bootstrap (Efron & Tibshirani, 1993); ci_method = "bca" requests the bias-corrected and accelerated interval (DiCiccio & Efron, 1996), which is preferable when the bootstrap distribution is skewed, as is common for I-CVI near 1.0. The default is 2000 replicates, configurable via n_boot. The resampling unit is the expert (row), not the item (column), matching the standard inferential frame for inter-rater reliability analyses (Gwet, 2014).

icvi(cvi_example, ci = TRUE, n_boot = 1000, seed = 1)
#>      item      icvi  ci_lower  ci_upper  ci_method conf_level n_boot
#> 1   item1 1.0000000 1.0000000 1.0000000 percentile       0.95   1000
#> 2   item2 1.0000000 1.0000000 1.0000000 percentile       0.95   1000
#> 3   item3 1.0000000 1.0000000 1.0000000 percentile       0.95   1000
#> 4   item4 0.8333333 0.5000000 1.0000000 percentile       0.95   1000
#> 5   item5 0.6666667 0.3333333 1.0000000 percentile       0.95   1000
#> 6   item6 1.0000000 1.0000000 1.0000000 percentile       0.95   1000
#> 7   item7 0.8333333 0.5000000 1.0000000 percentile       0.95   1000
#> 8   item8 1.0000000 1.0000000 1.0000000 percentile       0.95   1000
#> 9   item9 0.5000000 0.1666667 0.8333333 percentile       0.95   1000
#> 10 item10 1.0000000 1.0000000 1.0000000 percentile       0.95   1000
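To make the resampling scheme concrete, here is a minimal percentile-bootstrap sketch for a single item in base R. It resamples experts with replacement, as described above, but it is not the package’s implementation, and its random-number stream will not necessarily reproduce the intervals printed above.

```r
set.seed(1)

# 1 = expert rated item 4 as relevant (3 or 4), 0 = not relevant.
relevant <- as.integer(c(2, 3, 3, 4, 3, 3) >= 3)

# Resample the six experts with replacement and recompute the I-CVI each time.
boot_icvi <- replicate(1000, mean(sample(relevant, replace = TRUE)))

# Percentile interval: 2.5th and 97.5th percentiles of the replicates.
ci <- quantile(boot_icvi, probs = c(0.025, 0.975))
print(ci)
```

With only six experts the bootstrap distribution takes at most seven distinct values, which is why the intervals are wide and land on multiples of 1/6.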

Gwet’s AC1 and AC2

Two new chance-corrected agreement coefficients are available: gwet_ac1() for binary classification (dichotomized at the relevance threshold) and gwet_ac2() for the full ordinal scale with a weight matrix. Both use Gwet’s marginal-adjusted chance-correction, which differs from Polit’s modified kappa (fixed p = 0.5 null) and gives substantively different answers when the prevalence of “relevant” ratings is far from 0.5 — the common case in content-validity work.

gwet_ac1(cvi_example)
#>      item1      item2      item3      item4      item5      item6      item7 
#>  1.0000000  1.0000000  1.0000000  0.5384615  0.0400000  1.0000000  0.5384615 
#>      item8      item9     item10 
#>  1.0000000 -0.2000000  1.0000000
gwet_ac2(cvi_example, categories = 1:4)
#>     item1     item2     item3     item4     item5     item6     item7     item8 
#> 1.0000000 0.8964029 0.8964029 0.8285714 0.6940000 0.9494382 0.8285714 0.9494382 
#>     item9    item10 
#> 0.8714286 0.9494382

For AC2, always pass the full theoretical rating scale via categories (e.g., 1:4 for a standard 4-point relevance scale). If omitted, the function infers categories from the observed ratings, which can silently collapse the weight matrix and give incorrect results when extreme categories are unused.

The implementation matches irrCAC::gwet.ac1.raw() (by Kilem Gwet, the original author of AC1/AC2) bit-for-bit on the same inputs.
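For the dichotomized case, the AC1 values printed above can be reproduced per item from the agreement count alone. The sketch treats each item as a single subject rated by N experts, takes observed agreement as the share of expert pairs giving the same verdict, and uses Gwet’s two-category chance term 2π(1 − π). `gwet_ac1_item` is a hypothetical helper written for illustration, not the package’s code.

```r
# AC1 for one item, given N experts of whom A rated it relevant.
gwet_ac1_item <- function(A, N) {
  pi_rel <- A / N
  # Observed agreement: proportion of expert pairs giving the same verdict.
  p_obs <- (A * (A - 1) + (N - A) * (N - A - 1)) / (N * (N - 1))
  # Gwet's chance-agreement term for two categories.
  p_chance <- 2 * pi_rel * (1 - pi_rel)
  (p_obs - p_chance) / (1 - p_chance)
}

# Agreement counts of 6, 5, 4, and 3 out of 6 cover every item in the example.
ac1_vals <- gwet_ac1_item(A = c(6, 5, 4, 3), N = 6)
print(round(ac1_vals, 4))
```

The results (1.0000, 0.5385, 0.0400, −0.2000) match the gwet_ac1() output for items 1, 4, 5, and 9.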

Sample-size planning

cv_sample_size_icvi() answers “how many expert raters do I need to estimate I-CVI within a given confidence-interval half-width?” — a question that has been answered only by rule-of-thumb in the content-validity literature (Lynn, 1986; Polit & Beck, 2006).

# Anticipating I-CVI ≈ 0.85 with target half-width ≤ 0.10
cv_sample_size_icvi(expected = 0.85, half_width = 0.10)
#> [1] 49

# Sensitivity table across plausible expected I-CVI values
sapply(seq(0.70, 0.95, by = 0.05), function(p) {
  cv_sample_size_icvi(expected = p, half_width = 0.10)
})
#> [1] 81 73 62 49 35 19

A useful caveat: for realistic targets the function typically recommends 20 or more experts (19 to 81 in the sensitivity table above), well above Lynn’s rule-of-thumb minimum of 6. This is worth flagging in study protocols and grant applications.
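The sample sizes printed above agree with the familiar normal-approximation (Wald) formula n = z² p(1 − p) / h², rounded up. The sketch below reproduces them under that assumption. `n_wald` is a hypothetical helper, and the function may use a different interval method internally.

```r
# Experts needed so that a Wald interval for a proportion has the
# requested half-width, assuming the anticipated I-CVI is `expected`.
n_wald <- function(expected, half_width, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(z^2 * expected * (1 - expected) / half_width^2)
}

n_single <- n_wald(expected = 0.85, half_width = 0.10)
n_table <- sapply(seq(0.70, 0.95, by = 0.05), n_wald, half_width = 0.10)

print(n_single)
print(n_table)
```

The required panel shrinks as the anticipated I-CVI approaches 1 because p(1 − p) shrinks, which is also where the normal approximation is least reliable.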

Multi-dimensional / subscale analysis

For instruments structured into subscales (e.g., a depression scale with cognitive, somatic, and behavioral domains), content_validity() now accepts a subscale argument that maps items to subscales and computes scale-level indices per subscale in addition to the overall scale.

# Treat items 1-5 as subscale "Cognitive" and 6-10 as "Somatic"
result_multi <- content_validity(
  cvi_example,
  subscale = c(rep("Cognitive", 5), rep("Somatic", 5))
)
result_multi$subscales
#>    subscale n_items  scvi_ave scvi_ua mean_kappa  mean_ac1  mean_ac2
#> 1 Cognitive       5 0.9000000     0.6  0.8761436 0.7156923 0.8630754
#> 2   Somatic       5 0.8666667     0.6  0.8177638 0.6676923 0.9096629

The items data frame also carries the subscale assignment, which makes it easy to filter or facet downstream analyses.
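The per-subscale figures are ordinary grouped summaries of the item-level indices. A base-R sketch using the example I-CVIs and the same item-to-subscale mapping (variable names are illustrative):

```r
# I-CVI for the 10 example items (experts rating 3 or 4, out of 6).
icvi_vals <- c(6, 6, 6, 5, 4, 6, 5, 6, 3, 6) / 6
subscale <- rep(c("Cognitive", "Somatic"), each = 5)

# S-CVI/Ave per subscale: mean I-CVI within each group.
scvi_ave_by_sub <- tapply(icvi_vals, subscale, mean)

# S-CVI/UA per subscale: share of items with unanimous endorsement.
scvi_ua_by_sub <- tapply(icvi_vals == 1, subscale, mean)

print(round(scvi_ave_by_sub, 4))
print(scvi_ua_by_sub)
```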

Visualization

plot.content_validity() produces a scatter of I-CVI against an agreement index (modified kappa by default; choose gwet_ac1, gwet_ac2, or aiken_v via y_index). Reference lines mark the adequacy region and items outside it are highlighted in red and labeled.

plot(result_multi, y_index = "gwet_ac2")

By default, items are flagged (“Below I-CVI or AC2 threshold”) if they fail either criterion. This is the conservative “needs any review” default. When the plot is presenting one index specifically, you may prefer to flag only items that fail on that axis:

# Flag only items below the AC2 threshold (ignores I-CVI verdict)
plot(result_multi, y_index = "gwet_ac2", flag_logic = "y_index")


# Flag only items below the I-CVI threshold (ignores AC2 verdict)
plot(result_multi, y_index = "gwet_ac2", flag_logic = "icvi")

The legend always names the criterion that drives the flag, so the plot stays unambiguous about why an item is highlighted.

Per-index interpretation in APA tables

apa_table() accepts interpretation_index to choose which agreement index drives the verdict column (“Excellent” / “Good” / etc.). The interpretation column is positioned immediately adjacent to its source column to avoid confusion when the table contains multiple indices.

apa_table(result_multi, interpretation_index = "gwet_ac2")
#>      Item I-CVI Modified Kappa Aiken's V Gwet's AC1 Gwet's AC2
#> 1   item1  1.00           1.00      1.00       1.00       1.00
#> 2   item2  1.00           1.00      0.89       1.00       0.90
#> 3   item3  1.00           1.00      0.78       1.00       0.90
#> 4   item4  0.83           0.82      0.67       0.54       0.83
#> 5   item5  0.67           0.56      0.61       0.04       0.69
#> 6   item6  1.00           1.00      0.94       1.00       0.95
#> 7   item7  0.83           0.82      0.67       0.54       0.83
#> 8   item8  1.00           1.00      0.94       1.00       0.95
#> 9   item9  0.50           0.27      0.50      -0.20       0.87
#> 10 item10  1.00           1.00      0.94       1.00       0.95
#>    AC2 Interpretation
#> 1           Very good
#> 2           Very good
#> 3           Very good
#> 4           Very good
#> 5                Good
#> 6           Very good
#> 7           Very good
#> 8           Very good
#> 9           Very good
#> 10          Very good
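The verdict labels in this table are consistent with Altman’s (1991) agreement bands (Poor up to 0.20, Fair up to 0.40, Moderate up to 0.60, Good up to 0.80, Very good above that). The sketch below applies those bands with cut(). That apa_table() uses exactly these cut points is an assumption based on the labels shown and the reference list.

```r
# AC2 values for four of the example items, copied from the output above.
ac2 <- c(item1 = 1.0000000, item4 = 0.8285714,
         item5 = 0.6940000, item9 = 0.8714286)

# Right-closed bands: (0.6, 0.8] is "Good", (0.8, 1] is "Very good".
bands <- cut(
  ac2,
  breaks = c(-Inf, 0.20, 0.40, 0.60, 0.80, 1.00),
  labels = c("Poor", "Fair", "Moderate", "Good", "Very good")
)

verdict <- setNames(as.character(bands), names(ac2))
print(verdict)
```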

Citing the package

If you use contentValidity in published research, please run:

citation("contentValidity")

to get a current citation block in BibTeX or plain-text form.

References

Aiken, L. R. (1985). Three coefficients for analyzing the reliability and validity of ratings. Educational and Psychological Measurement, 45(1), 131–142. https://doi.org/10.1177/0013164485451012

Altman, D. G. (1991). Practical statistics for medical research. Chapman and Hall.

DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189–228. https://doi.org/10.1214/ss/1032280214

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman and Hall.

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. https://doi.org/10.1348/000711006X126600

Gwet, K. L. (2014). Handbook of inter-rater reliability (4th ed.). Advanced Analytics, LLC.

Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563–575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x

Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research, 35(6), 382–385. https://doi.org/10.1097/00006199-198611000-00017

Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion. Statistics in Medicine, 17(8), 857–872.

Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and recommendations. Research in Nursing & Health, 29(5), 489–497. https://doi.org/10.1002/nur.20147

Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Research in Nursing & Health, 30(4), 459–467. https://doi.org/10.1002/nur.20199

Wilson, F. R., Pan, W., & Schumsky, D. A. (2012). Recalculation of the critical values for Lawshe’s content validity ratio. Measurement and Evaluation in Counseling and Development, 45(3), 197–210. https://doi.org/10.1177/0748175612440286

Wongpakaran, N., Wongpakaran, T., Wedding, D., & Gwet, K. L. (2013). A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients. BMC Medical Research Methodology, 13(1), 61. https://doi.org/10.1186/1471-2288-13-61