---
title: "Introduction to the Cochran-Mantel-Haenszel Test"
author: "Paul W. Egeler, M.S., GStat"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{Introduction to the Cochran-Mantel-Haenszel Test}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

The **Cochran-Mantel-Haenszel test** (CMH) is an inferential test for the association between two binary variables, while controlling for a third confounding nominal variable [@Cochran1954; @Mantel1959]. Essentially, the CMH test examines the *weighted* association of a set of 2 $\times$ 2 tables. A common odds ratio relating to the test statistic can also be generated [@Mantel1959]. The CMH test is a common technique in the field of biostatistics, where it is often used for case-control studies.

This introduction briefly describes some of the terminology and concepts surrounding stratified tables. Examples are given which show some basic techniques for working with multidimensional tables in `R`. Functionality of the `samplesizeCMH` package is highlighted where it may augment the analysis.

## Partial and Marginal Tables

Consider a contingency table comparing $X$ and $Y$ at some fixed level of $Z$. The cross-section of the three-way table examining only one level of $Z$ is called a *partial table*. On the other hand, the combined counts of $X$ and $Y$ across all levels of $Z$, *id est* a simple two-way contingency table ignoring $Z$, produce the *marginal table*. These concepts are described in depth by @Agresti [section 2.7.1].

### Example

We will use the `Titanic`{.r} dataset in the `datasets`{.r} package to illustrate. This dataset is a four-dimensional table which includes the *Class* (1^st^, 2^nd^, 3^rd^, Crew), *Sex* (Male, Female), *Age* (Child, Adult), and *Survival* (No, Yes) of the passengers of the 1912 maritime disaster. Use `help("Titanic", "datasets")`{.r} to find more information.

```{r load-data}
data(Titanic, package = "datasets")
str(Titanic)
```

For this illustration, we will remove the *age* dimension, transforming the four-dimensional table into a three-dimensional table. Let $X$ = sex, $Y$ = survival, and $Z$ = class. This dimensionality reduction is accomplished using the `margin.table()`{.r} function in the `base` package.

```{r partial-tables}
# Keep Sex (dimension 2), Survived (4), and Class (1), summing over Age
partial_tables <- margin.table(Titanic, c(2, 4, 1))
partial_tables
```

Each of the tables above is a partial table: *survival by sex at a fixed level of class*. The tables can be flattened for easier viewing using the `ftable()`{.r} function in the `stats`{.r} package (not shown).

The code below shows the marginal table of survival by sex, ignoring class. Again the dimensionality is reduced using the `margin.table()`{.r} function.

```{r marginal-table}
# Keep only Sex (dimension 2) and Survived (4), summing over Class and Age
marginal_table <- margin.table(Titanic, c(2, 4))
marginal_table
```

As an aside, we may get the table, row, or column proportions using the `prop.table()`{.r} function. Because the `Titanic`{.r} dataset is a multidimensional table, it must first be transformed into a two-dimensional table using `margin.table()`{.r} (as was performed above). Failure to do so will produce unexpected results.

```{r prop-table}
# Table proportions
prop.table(marginal_table)

# Row proportions
prop.table(marginal_table, 1)

# Column proportions
prop.table(marginal_table, 2)
```

## Conditional, Marginal, and Common Odds Ratios

In comparing variables $X$ and $Y$ at a fixed level $j$ of $Z$, we may use a *conditional odds ratio*, described by @Agresti [section 2.7.4], to represent the point estimate of association between the two variables. We will denote it as $\theta_{XY(j)}$. The *marginal odds ratio* would then refer to the odds ratio of $X$ and $Y$ generated by the marginal table. It follows that the marginal odds ratio would be denoted by $\theta_{XY}$.

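
To make the two definitions concrete, here is a minimal base-`R` sketch that computes both quantities as cross-product ratios of cell counts. The helper `cross_product_or()` and the object names are hypothetical, introduced only for this illustration; they are not part of `samplesizeCMH` or any other package.

```{r cross-product-sketch}
# Hypothetical helper (not from any package):
# cross-product ratio n11 * n22 / (n12 * n21) of a 2 x 2 table
cross_product_or <- function(tab) {
  stopifnot(all(dim(tab) == c(2, 2)))
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

# Sex x Survived x Class, summing over Age
sex_surv_class <- margin.table(Titanic, c(2, 4, 1))

# Marginal odds ratio: collapse over Class, then take the cross-product ratio
theta_marginal <- cross_product_or(margin.table(sex_surv_class, c(1, 2)))

# Conditional odds ratios: one cross-product ratio per level of Class
theta_conditional <- apply(sex_surv_class, 3, cross_product_or)

theta_marginal
theta_conditional
```

This sketch is only meant to illustrate the definitions. It makes no attempt to handle zero cell counts, for which the ratio would be zero, infinite, or undefined.
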
An odds ratio estimate ($\hat\theta$) can be calculated from a table or matrix with the `odds.ratio()`{.r} function in the `samplesizeCMH`{.r} package. The `odds.ratio()`{.r} function accepts a table of frequencies, probabilities, or percents, as the results are algebraically equivalent.

### Demonstration of Algebraic Equivalence of Using Frequencies or Proportions in Odds Ratio Calculation

Using proportions, we see how the ratio of the row odds $o_1$ and $o_2$ is estimated.

$$
\hat{\theta}=
\frac{\hat{o}_1}{\hat{o}_2} =
\frac{\hat{\pi}_{11} / \hat{\pi}_{12}}{\hat{\pi}_{21} / \hat{\pi}_{22}} =
\frac{\hat{\pi}_{11}\hat{\pi}_{22}}{\hat{\pi}_{12}\hat{\pi}_{21}}.
$$

And since row odds estimates are related to row proportions, which are in turn related to cell counts through the following,

$$
\hat{o} =
\frac{\hat{\pi}_1}{1 - \hat{\pi}_1} =
\frac{\hat{\pi}_1}{\hat{\pi}_2} =
\frac{n_1 / n_+}{n_2 / n_+} =
\frac{n_1}{n_2},
$$

the odds estimate, defined as $\frac{\hat{\pi}_1}{\hat{\pi}_2}$, is equivalent to $\frac{n_1}{n_2}$. Therefore,

$$
\hat{\theta}=
\frac{\hat{\pi}_{11}\hat{\pi}_{22}}{\hat{\pi}_{12}\hat{\pi}_{21}} =
\frac{n_{11}n_{22}}{n_{12}n_{21}}.
$$

### Example

Let's first look at the marginal odds ratio of survival by sex using the Titanic data.

```{r marginal-OR}
library(samplesizeCMH)
odds.ratio(marginal_table)
```

The conditional odds ratios can be calculated using the partial tables.

```{r conditional-OR}
# Apply odds.ratio() to each 2 x 2 partial table (one per level of class, dimension 3)
apply(partial_tables, 3, odds.ratio)
```

Obviously this is more informative than a simple marginal odds ratio. Based on what we see above, the association between sex and survival appears to vary widely by class: women in 1^st^ class survived at a much higher rate than men, whereas 3^rd^ class women had a much smaller survival advantage over their male counterparts.

We can produce a *common (weighted) odds ratio* using `mantelhaen.test()`{.r} from the `stats`{.r} package.
Note that it differs slightly from the marginal odds ratio above since it takes into account the differential sizes of each partial table.

```{r mantelhaen}
mantelhaen.test(partial_tables)
```

## Conditional and Marginal Association/Independence

The term *conditional association* refers to the association of the $X$ and $Y$ variables conditional on the level of $Z$. Likewise, the *marginal association* refers to the overall association between $X$ and $Y$ while ignoring $Z$. The finding of conditional association does not imply marginal association, nor vice versa. Using the CMH test to control for the stratifying variable helps to avoid the well-documented phenomenon of Simpson's paradox, in which statistical significance may be found when considering the association between two variables, but no such significance is found after accounting for the stratification. Likewise, the reverse situation may arise, where no association is found between the binary variables until the third variable is introduced. Refer to @Agresti [section 2.7.3] for more information on the content of this section.

## Homogeneous Association/Independence

*Homogeneous association* holds when the conditional odds ratios between binary variables $X$ and $Y$ are equal across all $J$ levels of variable $Z$ [@Agresti, section 2.7.6], such that

$$\theta_{XY(1)}=\theta_{XY(2)}=\cdots=\theta_{XY(J)}.$$

The Breslow-Day test can be used to check the null hypothesis that all odds ratios are equal. The Cochran-Mantel-Haenszel test can be thought of as a special case of the Breslow-Day test wherein the common odds ratio is assumed to be 1 (however, the calculations are not equivalent). Using the `Titanic` data, we can perform the Breslow-Day test using `BreslowDayTest()`{.r} from the `DescTools`{.r} package.

```{r BreslowDay}
library(DescTools)
BreslowDayTest(x = partial_tables, OR = 1)
```

Note the near agreement with the output from `mantelhaen.test()`{.r} from the section above.

## References