Motivation: A number of available program packages determine the significant enrichments and/or depletions of GO categories among a class of genes of interest. Whereas a correct formulation of the problem leads to a single exact null distribution, these GO tools use a large variety of statistical tests whose denominations often do not clarify the underlying P-value computations.
Summary: We review the different formulations of the problem and the tests they lead to: the binomial, χ², equality of two probabilities, Fisher's exact and hypergeometric tests. We clarify the relationships existing between these tests, in particular the equivalence between the hypergeometric test and Fisher's exact test. We recall that the other tests are valid only for large samples, the test of equality of two probabilities and the χ²-test being equivalent. We discuss the appropriateness of one- and two-sided P-values, as well as some discreteness and conservatism issues.
Contact: isabelle.rivals@espci.fr
Supplementary information: Supplementary data are available at Bioinformatics online.

A common problem in functional genomic studies is to detect
significant enrichments and/or depletions of Gene Ontology (GO)
categories within a class of genes of interest, typically the class of
significantly differentially expressed (DE) genes. Many GO
processing tools perform this task using various statistical tests referred to as:
the binomial test, the χ²-test, the equality of two probabilities test,
Fisher's exact test and the hypergeometric test (see Table 1). The
authors of some packages claim the advantages of the test(s) they
propose, often seemingly contradicting each other. For example,

We consider a total population of genes, e.g. the genes expressed in a microarray experiment, and we are interested in the property of a gene to belong to a specific GO category. The aim is to establish whether the class of the DE genes presents an enrichment and/or a depletion of the GO category of interest with respect to the total gene population.

Let H0 denote the null hypothesis that the property for a gene to belong to the GO category of interest and that to be DE are independent, or equivalently that the DE genes are picked at random from the total gene population. We consider successively the hypergeometric, the comparison of two probabilities, and the 2 × 2 contingency table formulations of the above problem, and introduce the exact or approximate null distributions they lead to.

Notation (see Table 2): the total number of genes is denoted by n, the total number of genes belonging to the GO category of interest by n+1, and the number of DE genes by n1+; n, n+1 and n1+ are hence fixed by the experiment. The number of DE genes belonging to the GO category is denoted by n11.

The hypergeometric formulation is directly derived from the problem statement.

3.1.1 Exact null distribution. If H0 is true, the random variable N11, whose realization¹ is the observed value n11, has a hypergeometric distribution with parameters n, n1+ and n+1, which we denote by N11 ~ Hyper(n, n1+, n+1), with:

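\[
P(N_{11} = x) = \frac{\binom{n_{+1}}{x}\binom{n - n_{+1}}{n_{1+} - x}}{\binom{n}{n_{1+}}} = \frac{\binom{n_{+1}}{x}\binom{n_{+2}}{n_{12}}}{\binom{n}{n_{1+}}} \tag{1}
\]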

¹ Random variables and their realizations are denoted respectively by uppercase and lowercase letters.
a The website now proposes 3 additional tests, but they are not documented.

3.1.2 Approximate null distribution. For a large sample, N11 has approximately a binomial distribution with parameters n1+ and n+1/n: N11 ~ Bi(n1+, n+1/n). Note that if n1+ · n+1/n is also large, the binomial distribution can in turn be approximated by a Gaussian distribution.
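The quality of the binomial approximation can be checked numerically. The sketch below, in Python with only the standard library, evaluates Equation (1) and the Bi(n1+, n+1/n) approximation side by side. The function names and the table sizes (n = 20, n1+ = 7, n+1 = 6) are illustrative assumptions, not values taken from the text.

```python
from math import comb

def hyper_pmf(x, n, n_de, n_go):
    """P(N11 = x) when N11 ~ Hyper(n, n1+ = n_de, n+1 = n_go)."""
    lo, hi = max(0, n_de + n_go - n), min(n_de, n_go)
    if x < lo or x > hi:
        return 0.0  # x lies outside the support
    return comb(n_go, x) * comb(n - n_go, n_de - x) / comb(n, n_de)

def binom_pmf(x, size, prob):
    """P(X = x) when X ~ Bi(size, prob)."""
    return comb(size, x) * prob**x * (1 - prob) ** (size - x)

# Hypothetical small table, assumed for illustration only.
n, n_de, n_go = 20, 7, 6
exact = [hyper_pmf(x, n, n_de, n_go) for x in range(n_de + 1)]
approx = [binom_pmf(x, n_de, n_go / n) for x in range(n_de + 1)]

for x, (e, b) in enumerate(zip(exact, approx)):
    print(f"x={x}  hypergeometric={e:.4f}  binomial={b:.4f}")
```

With so small a population the two columns differ visibly; the gap shrinks as n grows relative to n1+.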

In a second formulation, we consider two samples: that of the DE genes, of size n1+, of which n11 genes belong to the GO category of interest, and that of the not DE genes, of size n2+, of which n21 genes belong to the GO category. The proportions of genes belonging to the GO category in the two samples are thus f1 = n11/n1+ (DE genes) and f2 = n21/n2+ (not DE genes). Let p1 and p2 denote the probabilities of belonging to the GO category in the two samples; then N11 ~ Bi(n1+, p1) and N21 ~ Bi(n2+, p2). In this formulation, the null hypothesis H0 is the equality of the two probabilities, p1 = p2 = p, i.e. there is neither enrichment nor depletion in the sample of DE genes with respect to that of the not DE genes.

3.2.1 Approximate null distribution. The case of large samples arises frequently. Then, the binomial distributions can be approximated with Gaussian distributions. Under H0, n1+ and n2+ being large, the probability p can be accurately estimated with f = (n11 + n21)/(n1+ + n2+) = n+1/n, leading to the approximately normally distributed variable:

\[
Z = \frac{F_1 - F_2}{\sqrt{F(1 - F)\left(\dfrac{1}{n_{1+}} + \dfrac{1}{n_{2+}}\right)}} \sim N(0, 1). \tag{2}
\]
This distribution is approximate for two reasons: (1) the
replacement of the binomial distributions by Gaussian distributions holds
only for large samples (both n1+ and n2+ must be large), and (2) it has
not been taken into account that, according to our problem
statement, the sum N11 + N21, the total number of genes belonging to the
GO category, is fixed and equal to n+1.
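
As a minimal sketch of Equation (2), the Python function below computes the observed value z from the four counts. The function names and the example counts are hypothetical; the normal tail probability is obtained from the standard-library complementary error function.

```python
from math import erfc, sqrt

def z_statistic(n11, n1p, n21, n2p):
    """Observed value of Z in Equation (2), using the pooled proportion f."""
    f1, f2 = n11 / n1p, n21 / n2p
    f = (n11 + n21) / (n1p + n2p)  # estimate of p under H0, equal to n+1/n
    return (f1 - f2) / sqrt(f * (1 - f) * (1 / n1p + 1 / n2p))

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))

# Hypothetical counts: 12 of 40 DE genes and 300 of 1960 not DE genes
# belong to the GO category.
z = z_statistic(12, 40, 300, 1960)
print(z, normal_upper_tail(z))
```

Swapping the two samples changes the sign of z, which is why the direction of the alternative hypothesis matters when a tail probability is read off.
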
3.2.2 Exact null distribution. Without approximating the
binomial distribution, and taking into account that N11 + N21 = n+1, we
naturally obtain N11 ~ Hyper(n, n1+, n+1) (see

A third formulation is based on Table 2 seen as a 2 × 2 contingency
table. Let again H0 denote the hypothesis that the property to belong
to the GO category of interest and that to be DE are independent.
3.3.1 Approximate null distribution. The case of a large sample
is frequently considered where, if H0 is true, the following variable
(Pearson's statistic, denoted D²) is asymptotically χ² distributed with one degree of freedom:

\[
D^2 = \sum_{i,j} \frac{\left(N_{ij} - n_{i+} n_{+j} / n\right)^2}{n_{i+} n_{+j} / n}
\]

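Conditional on the margins, the exact probability of the table under H0 is given by Fisher's formula:

\[
P(\{N_{ij} = n_{ij}\}) = \frac{n_{1+}!\, n_{2+}!\, n_{+1}!\, n_{+2}!}{n!\, n_{11}!\, n_{12}!\, n_{21}!\, n_{22}!}, \tag{5}
\]

so that

\[
P(N_{11} = x \mid N_{1+} = n_{1+}, N_{+1} = n_{+1}) = \frac{\dfrac{n_{+1}!}{x!\, n_{21}!}\,\dfrac{n_{+2}!}{n_{12}!\, n_{22}!}}{\dfrac{n!}{n_{1+}!\, n_{2+}!}} = \frac{\binom{n_{+1}}{x}\binom{n_{+2}}{n_{12}}}{\binom{n}{n_{1+}}}. \tag{6}
\]

As expected, the exact distribution of N11 under H0 is again the hypergeometric distribution, see Equation (1).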
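The identity between Fisher's factorial formula and the hypergeometric probability of Equation (1) can be verified exactly with rational arithmetic. The following Python sketch does so for one hypothetical 2 × 2 table; the function names and the counts are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def fisher_probability(n11, n12, n21, n22):
    """Fisher's formula: margin factorials over n! times the cell factorials."""
    n1p, n2p = n11 + n12, n21 + n22   # row totals (DE, not DE)
    np1, np2 = n11 + n21, n12 + n22   # column totals (in GO, not in GO)
    n = n1p + n2p
    num = factorial(n1p) * factorial(n2p) * factorial(np1) * factorial(np2)
    den = (factorial(n) * factorial(n11) * factorial(n12)
           * factorial(n21) * factorial(n22))
    return Fraction(num, den)

def hypergeometric_probability(n11, n12, n21, n22):
    """Same probability written with binomial coefficients, as in Equation (1)."""
    n1p, np1, np2 = n11 + n12, n11 + n21, n12 + n22
    return Fraction(comb(np1, n11) * comb(np2, n12), comb(np1 + np2, n1p))

table = (4, 3, 2, 11)  # hypothetical counts n11, n12, n21, n22
print(fisher_probability(*table) == hypergeometric_probability(*table))
```

Using `Fraction` avoids any floating-point rounding, so the comparison is an exact equality rather than an approximate one.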

Under H0, i.e. assuming the independence of the property to belong to the GO category of interest and of the property to be DE, or equivalently assuming p1 = p2, where p1 is the probability for a DE gene to belong to the GO category and p2 the probability for a not DE gene to belong to it, the exact distribution of N11 is the hypergeometric distribution N11 ~ Hyper(n, n1+, n+1) which, if n is large, can be approximated with the binomial distribution Bi(n1+, n+1/n). If the two samples are large, it is also possible to construct an approximately normal variable Z or its square D² = Z², the latter being hence approximately χ² distributed with one degree of freedom.

Generally, when performing the test of a null hypothesis H0 against some alternative hypothesis Ha, one has a realization x of a random variable X with known distribution under H0, the null distribution. One chooses a priori a probability α of type I error (the error of rejecting H0 when it is true) that must not be exceeded, also called the significance level, the decision to reject H0 being taken when x falls in the critical region. In this context, the P-value is the minimum significance level for which H0 would be rejected, or equivalently, it is the probability, under H0, of the minimal critical region containing x.

The choice of a critical region in order to maximize the power of the test, and hence the choice of the corresponding P-value, depends on the alternative hypothesis Ha, which may be 'enrichment' (p1 > p2, one-sided test, critical region on the right), 'depletion' (p1 < p2, one-sided test, critical region on the left), or 'enrichment or depletion' (p1 ≠ p2, two-sided test, critical regions on the left and on the right). Enrichment, depletion and enrichment or depletion are later denoted by E, D and E/D, respectively.

\[
p_{\text{one}}(n_{11}) =
\begin{cases}
P(N_{11} \ge n_{11}) & \text{if } H_a = E, \\
P(N_{11} \le n_{11}) & \text{if } H_a = D.
\end{cases} \tag{7}
\]

In the case of a discrete distribution, like the exact hypergeometric distribution or the approximate binomial distribution, it is not possible to guarantee an arbitrary value of the significance level with the rule 'reject H0 if p_one(n11) ≤ α'. Due to the discreteness, the actual significance level (or size of the test) is generally smaller than the nominal (desired) significance level α, which results in a loss of power.
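The conservatism caused by discreteness can be made concrete by computing the actual size of the rule 'reject H0 if p_one(n11) ≤ α'. The Python sketch below does this for the enrichment alternative; the function names and the table sizes (n = 20, n1+ = 7, n+1 = 6) are illustrative assumptions.

```python
from math import comb

def hyper_pmf(x, n, n_de, n_go):
    """P(N11 = x) when N11 ~ Hyper(n, n1+ = n_de, n+1 = n_go)."""
    lo, hi = max(0, n_de + n_go - n), min(n_de, n_go)
    if x < lo or x > hi:
        return 0.0
    return comb(n_go, x) * comb(n - n_go, n_de - x) / comb(n, n_de)

def p_one_enrichment(x, n, n_de, n_go):
    """One-sided P-value for Ha = E, i.e. P(N11 >= x)."""
    hi = min(n_de, n_go)
    return sum(hyper_pmf(m, n, n_de, n_go) for m in range(x, hi + 1))

def actual_size(alpha, n, n_de, n_go):
    """Probability under H0 of observing a value whose P-value is <= alpha."""
    lo, hi = max(0, n_de + n_go - n), min(n_de, n_go)
    return sum(hyper_pmf(x, n, n_de, n_go)
               for x in range(lo, hi + 1)
               if p_one_enrichment(x, n, n_de, n_go) <= alpha)

# Hypothetical table sizes, nominal level of 5%.
size = actual_size(0.05, 20, 7, 6)
print(size)
```

For these sizes the rule rejects only when n11 ≥ 5, so the actual size is about 0.007, well below the nominal 5%.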

To minimize this loss, a good remedy is the use of mid-P-values

Another remedy is randomization, with which any desired
significance level can be achieved. However in practice, randomization
having nothing to do with the data does not make much sense

If the approximately normal variable Z is considered, we have:

\[
p_{\text{one}}(z) =
\begin{cases}
P(Z > z) & \text{if } H_a = E, \\
P(Z < z) & \text{if } H_a = D.
\end{cases}
\]

If the approximately χ² distributed variable D² is used, a one-sided test cannot be performed, since both enrichment (large observed n11) and depletion (small observed n11) lead to a large value of D², i.e. there is a single critical region.

In the case of a two-sided test i.e. Ha ¼ E/D, and of a discrete null
distribution, there are several popular definitions of the P-value, see

\[
p_{\text{two}}^{\text{min lik}}(n_{11}) = \sum_{m:\; P(N_{11} = m) \le P(n_{11})} P(N_{11} = m). \tag{11}
\]
The minimum-likelihood approach is the only one we have
encountered in the GO tools of Table 1. A third approach
defines the P-value as the sum of the probabilities of the values
of N11 that are at least as or more extreme (with respect to the
mathematical expectation of N11) than the observed one

These definitions lead to equal P-values in the case of symmetric
distributions, i.e. when n1+ = n2+; else, they possibly lead to
different P-values and corresponding test results, each of them having
advantages and disadvantages, due to the discreteness and skewness
of the hypergeometric distribution. The problem is also that these
P-values do not correspond to any well-defined two-sided test. This
issue is discussed for example in

Thus, if a single simple and computationally light (see subsection 6.3) procedure were to be recommended, we would advise the doubling approach, against which there is no strong argument, and using the mid-P-value, in order to reduce the discreteness and conservatism effects:

\[
p_{\text{two}}^{\text{doubling}}(n_{11}) = 2 \cdot \min\left( P(N_{11} > n_{11}) + \tfrac{1}{2} P(N_{11} = n_{11}),\; P(N_{11} < n_{11}) + \tfrac{1}{2} P(N_{11} = n_{11}) \right). \tag{12}
\]

A mid-P-value can also be defined for the minimum-likelihood approach, as the sum of the probabilities that are smaller than the probability of the observed value n11, plus half the sum of the probabilities equal to it:

\[
p_{\text{two}}^{\text{min lik}}(n_{11}) = \sum_{m:\; P(N_{11} = m) < P(n_{11})} P(N_{11} = m) \;+\; \frac{1}{2} \sum_{m:\; P(N_{11} = m) = P(n_{11})} P(N_{11} = m). \tag{13}
\]

However, we must again emphasize that the actual probability of type I error may exceed the nominal significance level.
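The two-sided definitions discussed here, with and without the mid-P correction, can be sketched in Python as follows. The function names, the relative tolerance used to decide that two probabilities are equal, and the cap of the plain doubling P-value at 1 are assumptions of this sketch, not prescriptions of the text; both example tables are hypothetical.

```python
from math import comb

def hyper_pmf(x, n, n_de, n_go):
    """P(N11 = x) for x inside the support of Hyper(n, n_de, n_go)."""
    return comb(n_go, x) * comb(n - n_go, n_de - x) / comb(n, n_de)

def two_sided_p_values(x, n, n_de, n_go, rel_tol=1e-7):
    """Doubling and minimum-likelihood P-values and mid-P-values.

    x must lie in the support of the distribution.
    """
    lo, hi = max(0, n_de + n_go - n), min(n_de, n_go)
    pmf = {m: hyper_pmf(m, n, n_de, n_go) for m in range(lo, hi + 1)}
    p_obs = pmf[x]
    upper = sum(p for m, p in pmf.items() if m > x)  # P(N11 > x)
    lower = sum(p for m, p in pmf.items() if m < x)  # P(N11 < x)
    # Probabilities strictly smaller than, or equal to, that of the observation
    # (equality is judged up to rel_tol to absorb rounding errors).
    smaller = sum(p for p in pmf.values() if p < p_obs * (1 - rel_tol))
    equal = sum(p for p in pmf.values() if abs(p - p_obs) <= p_obs * rel_tol)
    return {
        "doubling": min(1.0, 2 * min(upper + p_obs, lower + p_obs)),
        "doubling_mid": 2 * min(upper + p_obs / 2, lower + p_obs / 2),
        "min_lik": smaller + equal,
        "min_lik_mid": smaller + equal / 2,
    }

sym = two_sided_p_values(5, 20, 10, 6)   # n1+ = n2+ = 10: symmetric null
asym = two_sided_p_values(4, 20, 7, 6)   # n1+ = 7, n2+ = 13: skewed null
print(sym)
print(asym)
```

In the symmetric case the doubling and minimum-likelihood P-values coincide, whereas in the skewed case they differ, which is the behaviour described in the text.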

If the approximately normal variable Z is considered (a continuous and symmetrically distributed variable), we have:

\[
p_{\text{two}}(z) = 2 \cdot \min\left[ P(Z > z),\; P(Z < z) \right]. \tag{14}
\]

If the approximately χ² distributed variable D² is considered, the P-value is computed as:

\[
p_{\text{two}}(d^2) = P(D^2 > d^2) = p_{\text{two}}(z). \tag{15}
\]

affected by the condition, i.e. the genes belonging to this GO category are DE (either over- or under-expressed). Such a GO category is likely to be over-represented among the DE genes, i.e. an enrichment is expected. Thus, detecting an enrichment is desirable. On the other hand, consider a GO category such that the normal expression of the corresponding genes is necessary for the condition to develop, i.e. the genes belonging to this GO category are not DE. Such a GO category is likely to be under-represented among the DE genes, i.e. a depletion is expected. Thus, detecting a depletion is also desirable, even if there is a risk of detecting the depletion of a GO category corresponding to genes whose normal expression is necessary to the mere survival of the species.

Thus, both enrichments and depletions of GO categories are potentially of interest. Hence, unless there is a specific reason not to consider enrichment or depletion, the adequate alternative hypothesis is Ha = E/D, i.e. two-sided tests are appropriate.

To summarize, there is a single exact null distribution of N11, the
hypergeometric distribution, but different exact tests (exact in the
sense that they are based on the exact null distribution), one- or
two-sided, and with several definitions of the P-value in the latter
case. These tests can equally be called hypergeometric or Fisher's
exact tests². Thus, it is not justified to claim, as Masseroli et

The available GO tools often do not explicitly state which
P-value is computed. For example, BINGO calls the test it performs
'hypergeometric test'

As discussed in section 4.3, two-sided tests are usually most appropriate. Be it with the doubling or the minimum-likelihood approach to the P-value, the discreteness and conservatism effects can be efficiently dealt with using mid-P-values, a possibility that is not offered by any of the GO tools of Table 1.

Consider a dataset consisting of tissues in a pathological condition
and of normal tissues, and a GO category whose genes are directly
2As a matter of fact,

\[
p_{\text{one}}(4) = P(N_{11} \ge 4) = P(4) + P(5) + P(6) = 7.04 \times 10^{-2} + 7.04 \times 10^{-3} + 1.81 \times 10^{-4} = 7.77 \times 10^{-2}.
\]

The corresponding mid-P-value is:

\[
p_{\text{one}}(4) = P(N_{11} > 4) + P(4)/2 = 7.04 \times 10^{-3} + 1.81 \times 10^{-4} + 7.04 \times 10^{-2}/2 = 4.24 \times 10^{-2}.
\]

There is a substantial difference between the P-value and the mid-P-value. With a significance level α = 5%, the mid-P-value leads to rejecting H0, whereas the P-value does not: the use of a mid-P-value corresponds to a less conservative test. However, the actual significance level is no longer guaranteed to be smaller than the nominal significance level of 5%.

The two-sided minimum-likelihood P-value equals 1.22 × 10⁻¹. As for the one-sided test, there is a substantial difference between the two values. Also, with a significance level α = 5%, a two-sided test does not reject H0.
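The probabilities printed in this example are reproduced by a hypergeometric distribution with n = 20 and margins of 7 and 6 genes. These sizes are an assumption inferred from the printed values (the table of the example is not reproduced here), and which of the two margins is the DE class cannot be told apart, because the distribution is symmetric in the two margins. Under that assumption, a short Python check of the P-value, the mid-P-value and the two-sided minimum-likelihood P-value reads:

```python
from math import comb

# Assumed table sizes (inferred, not stated in the text): n = 20, margins 7 and 6.
N, N_DE, N_GO = 20, 7, 6

def pmf(x):
    """P(N11 = x) under the assumed Hyper(20, 7, 6) null distribution."""
    return comb(N_GO, x) * comb(N - N_GO, N_DE - x) / comb(N, N_DE)

support = range(max(0, N_DE + N_GO - N), min(N_DE, N_GO) + 1)
observed = 4

# One-sided enrichment P-value, P(N11 >= 4).
p_value = sum(pmf(m) for m in support if m >= observed)
# One-sided mid-P-value, P(N11 > 4) + P(4)/2.
mid_p_value = p_value - pmf(observed) / 2
# Two-sided minimum-likelihood P-value (no ties other than the observed value
# occur for these sizes, so a plain comparison is sufficient here).
min_lik = sum(pmf(m) for m in support if pmf(m) <= pmf(observed))

print(f"{p_value:.3g} {mid_p_value:.3g} {min_lik:.3g}")
```

The three values round to 7.77 × 10⁻², 4.24 × 10⁻² and 1.22 × 10⁻¹, in agreement with the figures of the example.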

The exact two-sided doubling P-value obtained with the hypergeometric distribution is p_two^doubling(n11) = 3.95 × 10⁻², and the two-sided mid-P-value is p_two^doubling(n11) = 2.66 × 10⁻². With the minimum-likelihood approach, p_two^min lik(n11) = 2.39 × 10⁻², and the two-sided mid-P-value is p_two^min lik(n11) = 1.74 × 10⁻². Note that, the null distribution being asymmetric, there is a noticeable difference between the two approaches and, though the sample is quite large, between the P-values and the corresponding mid-P-values.

The approximate binomial test leads to a doubling P-value of 4.54 × 10⁻², to a doubling mid-P-value of 3.11 × 10⁻², to a minimum-likelihood P-value of 2.75 × 10⁻², and to a minimum-likelihood mid-P-value of 2.03 × 10⁻². Note that, though the sample is not small, there is quite a difference with the exact distribution.

The approximate test of equality of two probabilities leads to the value of an approximately normal statistic z = 2.45, and to a two-sided P-value of p_two(z) = 1.42 × 10⁻². This value is even less accurate than that obtained with the binomial approximation, because the DE sample is too small (n1+ = 40).

The χ²-test indeed leads to a statistic value d² = 6.015 = z², and hence to the same two-sided P-value.
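The equality between the χ² statistic and the squared normal statistic, and the resulting equality of the two-sided P-values, can be checked numerically. In the Python sketch below, the 2 × 2 table is hypothetical (the counts of this example are not reproduced here), the function names are assumptions, and the χ² tail probability with one degree of freedom is obtained from the identity P(D² > d²) = erfc(sqrt(d²/2)).

```python
from math import erfc, sqrt

def pearson_d2(n11, n12, n21, n22):
    """Pearson's statistic of a 2 x 2 table, without continuity correction."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    n = n1p + n2p
    cells = ((n11, n1p, np1), (n12, n1p, np2), (n21, n2p, np1), (n22, n2p, np2))
    return sum((obs - row * col / n) ** 2 / (row * col / n)
               for obs, row, col in cells)

def z_statistic(n11, n12, n21, n22):
    """Statistic of the test of equality of two probabilities (pooled variance)."""
    n1p, n2p = n11 + n12, n21 + n22
    f = (n11 + n21) / (n1p + n2p)
    return (n11 / n1p - n21 / n2p) / sqrt(f * (1 - f) * (1 / n1p + 1 / n2p))

def p_two_from_z(z):
    """2 * min(P(Z > z), P(Z < z)) for a standard normal Z."""
    return erfc(abs(z) / sqrt(2))

def p_two_from_d2(d2):
    """P(D2 > d2) for a chi-square variable with one degree of freedom."""
    return erfc(sqrt(d2 / 2))

table = (12, 28, 300, 1660)  # hypothetical counts n11, n12, n21, n22
z, d2 = z_statistic(*table), pearson_d2(*table)
print(d2, z ** 2)                 # the two statistics agree up to rounding
print(p_two_from_d2(6.015))       # P-value for the statistic reported above
```

Applied to the value 6.015 reported above, the last line gives approximately 1.42 × 10⁻², in agreement with the text.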

In the case of larger samples, obtained with mouse or human pangenomic microarrays, typically with n of the order of 25 000: The approximate binomial test leads to (mid-) P-values that are very close to those of the exact hypergeometric test. However, with today's computing means, there is no decisive advantage in performing this approximation (see next section).

The approximate test of equality of two probabilities becomes closer to the exact one only if the number of DE genes is large, which is not necessarily the case. There is thus no reason to use this test.

This is hence also true for the equivalent χ²-test.

All the exact tests can be implemented 'by hand' with the hypergeometric cumulative distribution function 'phyper' and the probability function 'dhyper', and the binomial approximations with 'pbinom' and 'dbinom'³.

The default implementation of the exact test with R provides the two-sided minimum-likelihood P-value. The corresponding instruction is 'fisher.test(c)', where the matrix c is the 2 × 2 contingency table [n11 n12; n21 n22]. The one-sided enrichment test is obtained with 'fisher.test(c, alternative = "greater")', the one-sided depletion test with 'fisher.test(c, alternative = "less")'.

In order to evaluate the computation time of the two-sided tests, let us consider the case of a microarray with n = 25 000 genes, n1+ = 1000 DE genes, and 500 different GO categories. We take n+1 uniformly distributed in [0, n], and n11 uniformly distributed in [max(0, n+1 + n1+ − n), min(n1+, n+1)]. With R 2.1.0 running under Mac OS X on a 2 GHz two-processor Macintosh (PowerPC 970 2.2), we obtain the following total elapsed times (mean and standard error on 20 runs) for the doubling approach:
hypergeometric doubling P-values, computed with the functions 'dhyper' and 'phyper': 0.17 ± 0.02 s, and 0.20 ± 0.02 s for the mid-P-values;
binomial doubling P-values, computed with the functions 'dbinom' and 'pbinom': 0.16 ± 0.02 s, and 0.19 ± 0.02 s for the mid-P-values.

Hence, the gain in time obtained by using the binomial approximation to the hypergeometric distribution is negligible.

For the minimum-likelihood approach, the R function 'fisher.test' (which does not only compute a P-value) is much slower than a computation 'by hand':
hypergeometric minimum-likelihood P-values, computed with the function 'fisher.test': 17.15 ± 0.21 s;
hypergeometric minimum-likelihood P-values, computed with the functions 'dhyper' and 'phyper': 1.83 ± 0.04 s, and 2.10 ± 0.05 s for the mid-P-values.

The computation time is hence an argument in favor of the doubling approach to the two-sided P-value.

The correct statement of the enrichment and/or depletion testing
problem leads to a unique exact null distribution of the number of
DE genes belonging to the GO category of interest, given the total
gene number and the total number of genes belonging to the GO
category. This distribution is the hypergeometric one, whose values
are equivalently given by Fisher's formula for a 2 × 2 contingency
table. Since both enrichments and depletions of GO categories
3The code of the R functions can be found at the R project site
https://svn.rproject.org/R/trunk/src/nmath/. The best known and most complete software
for contingency table methods in general is StatXact

Funding to pay the Open Access publication charges for this article was provided by the CNRS and the city of Paris.

Conflict of Interest: none declared.