Statistics of Gene Expression Assay: Five Steps to Successful Interpretation

Over some 20 years of microarray technology, a number of methods for evaluating gene expression data have been developed and tested. Current users have at hand an assortment of standard procedures and software (e.g. dChip [14], RMA [10], PLIER [9], Bayesian approaches, linear models, various clustering methods, etc.). However, data often show nonstandard features and require an individualized approach. Preliminary steps, such as quality analysis, normalization and selection of an appropriate statistical method, determine the success of the data evaluation and, ultimately, the success of the experiment. The basic steps of microarray data analysis can be summarized in five points:

  1. Quality control – verification of global quality parameters, identification of spurious signals and abnormal samples
  2. Dispersion analysis – review of compatibility and consistency of microarrays, elimination of abnormal arrays, preparation of the analysis plan
  3. Normalization – correction of background, correction for differences in amplification coefficients, identification of abnormal signals
  4. Selection of appropriate methods and execution of analysis:
    • Evaluation of statistical significance
    • Clustering and classification
    • Biological interpretation, pathway analysis
  5. Critical evaluation of the results.
Each step is crucial for obtaining comprehensive and reliable results from a microarray experiment.


Quality Control

The laboratories that carry out hybridization of samples on microarrays generally provide quality control of DNA samples, quality control of experimental protocols and other verifications of the quality of generated signals. Nevertheless, abnormal arrays are often observed, even if all verification variables are within acceptable bounds. To make sure all arrays are suitable for statistical analysis, we first evaluate the global quality parameters:

  • control genes, noise and background levels;
  • box plots;
  • percentages of present genes (if available);
  • frequency distributions (histograms);  
  • MvA plots;
  • Pearson or Spearman rank correlation;
  • principal component analysis.

The significance of aberrant features is evaluated and anomalous values corrected, if possible. In case of a serious deficiency, the array is excluded from further evaluations.

Verification of the control genes is usually the first step. Significant under-expression of several control genes may disqualify an array from analysis. Sometimes, however, we encounter data with one or several inadequate controls, while all the other control parameters and the results of the dispersion analysis are within the norm. Such arrays are usually accepted.

The box plot provides initial information about the global parameters of the data: the median, the 25th and 75th percentiles and the minimum and maximum values. Extreme differences in medians and excessive spread are usually indicators of problematic data.

Frequency distributions show how many probe sets fall within a given expression interval. Bimodal distributions, a wide main body of the distribution function, long and elevated tails and strong local maxima in the tails may indicate problems with the data. We expect replicates to have similar distributions; in control-experiment comparisons, however, strong tails and local maxima in the tails may reflect a strong influence of the imposed experimental conditions.

Large random noise affects mainly low-level signals and makes analysis of the significance of observed differences in the low-level range unreliable. Finally, poor correlation among replicates indicates incompatible arrays, while in control-experiment comparisons it may be a sign of a substantial difference in the biological state of the cells caused by the imposed conditions.
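As an illustration of the global parameters listed above, the following minimal sketch computes a box-plot summary for one array, the M and A values used in an MvA plot, and the Spearman rank correlation between two arrays. It uses only the Python standard library; the function names are hypothetical and are not taken from any of the software packages cited in this text.

```python
import math
import statistics

def box_summary(signals):
    """Five-number summary that a box plot displays for one array."""
    q1, median, q3 = statistics.quantiles(signals, n=4)
    return {"min": min(signals), "q1": q1, "median": median,
            "q3": q3, "max": max(signals)}

def mva(x, y):
    """M (log2 ratio) and A (mean log2 intensity) for two positive signals."""
    lx, ly = math.log2(x), math.log2(y)
    return lx - ly, (lx + ly) / 2

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling box_summary on every array and spearman on every pair of arrays gives a quick table in which extreme medians, excessive spread or poorly correlated replicates stand out.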

Dispersion Analysis

Laboratory controls and global quality controls still do not ensure that a given array is suitable for statistical analysis. In some cases, DNA samples are affected by unnoticed changes in the biological state of the cells, changes in laboratory procedures or the quality of reagents, etc. Although the data of a given array may be of good quality, a difference in the random variability of replicates or an abnormal number of outliers, for example, may disqualify the array from further consideration. Dispersion analysis displays the dispersion pattern of gene expression signals in two-way comparisons and evaluates and verifies the following:

  • distribution of random variability;
  • number and distribution of outliers;
  • concordance with normal distribution.

Dispersion analysis is carried out first among the replicates and then between the control and experiment arrays. It is the most thorough and most detailed verification of sample compatibility.

Pair-wise dispersion analysis is usually done using the consecutive sampling program [1]. The program sorts data according to the mean value of two signals, defines consecutive statistical samples, examines their properties and calculates the standard deviation. In particular, it:

  • verifies the normality of the frequency distribution of consecutive samples (Kolmogorov-Smirnov normality test, p < 0.05);
  • verifies the consistency of data by verifying the identity SD(Ydiff) = SD(Y1) + SD(Y2), where SD(Ydiff) is the standard deviation calculated from the differences in expression and SD(Y1) and SD(Y2) are the standard deviations calculated from the expression values of arrays 1 and 2 in a given consecutive sample.

Additional subprograms calculate skewness and kurtosis and count the number of genes outside the interval corresponding to 1.96 standard deviations.
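The consistency check and the outlier count can be sketched as follows. This is a minimal illustration under stated assumptions, not the consecutive sampling program itself: the function names and the default sample size of 25 genes are chosen here for the example, and the Kolmogorov-Smirnov step is left out because it needs a statistics library beyond the Python standard library. One way to read the identity: within a consecutive sample the mean of the two signals is nearly constant, so the two arrays deviate from it in opposite directions and their standard deviations add up to that of the difference.

```python
import statistics

def consecutive_samples(y1, y2, n=25):
    """Sort gene pairs by their mean signal and cut them into samples of n.
    A final incomplete sample is dropped."""
    pairs = sorted(zip(y1, y2), key=lambda p: (p[0] + p[1]) / 2)
    return [pairs[i:i + n] for i in range(0, len(pairs) - n + 1, n)]

def consistency_ratio(sample):
    """SD(Ydiff) / (SD(Y1) + SD(Y2)) for one consecutive sample.
    The ratio never exceeds 1 and is close to 1 when the identity holds."""
    a = [p[0] for p in sample]
    b = [p[1] for p in sample]
    diff = [x - y for x, y in sample]
    return statistics.stdev(diff) / (statistics.stdev(a) + statistics.stdev(b))

def fraction_outside(sample, k=1.96):
    """Share of genes whose difference lies more than k SD from the mean difference."""
    diff = [x - y for x, y in sample]
    m, sd = statistics.mean(diff), statistics.stdev(diff)
    return sum(abs(d - m) > k * sd for d in diff) / len(diff)
```

For normally distributed differences, fraction_outside should stay near 5% in each sample; samples with a much larger share, or with a consistency ratio well below 1, deserve a closer look.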

Normalization

No two arrays are exactly equal. Differences in sample preparation, hybridization and labeling processes, signal detection and processing all contribute to the observed variability of data. Under ideal conditions, dispersion plots of arrays A and B are symmetrical with respect to the 45° axis, with standard deviations well approximated by a linear standard deviation function with an intercept typically between 0.5 and 2.5 and a coefficient of proportionality from 0.1 to 0.3 (arrays normalized to 100% of the overall mean). We recognize four types of deviations in profiles:

  • difference in background signals;
  • amplification coefficients are constant or vary as the same function of expression, but differ in magnitude;
  • some amplification coefficients are functions of expression, leading to deviations from linearity;
  • abnormal differences in random variability.

While the first three divergences can be rectified by careful normalization and/or readjustment of data, there is no correction for excessive random variability.

Background correction is usually needed in the case of Illumina data. Affymetrix signals obtained using both perfect match and mismatch probes correct for nonspecific hybridization and other background signals individually for each probe set and do not require correction. We determine the background constant by calculating the asymptotic signal value as an ordered sequence of signals approaches its minimum.

To correct for constant differences between amplification coefficients, we multiply each probe set signal of a given array by a normalization constant equal to 100 divided by the average signal across the array. As a result, the average values of the normalized signals of all arrays are equal to 100, and normalized signals can be viewed as percentages of the mean. In the case of nonlinearity, we use quantile normalization.
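Both normalizations can be sketched in a few lines. The code below is an illustration only, with hypothetical function names; the quantile normalization shown is the textbook variant (each array receives the mean of the sorted signals across arrays) and does not average tied values.

```python
import statistics

def normalize_to_mean_100(signals):
    """Scale one array so that its average signal equals 100."""
    factor = 100 / statistics.mean(signals)
    return [s * factor for s in signals]

def quantile_normalize(arrays):
    """arrays: list of equal-length signal lists, one list per array.
    Every array is mapped onto the same reference distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest signal across arrays.
    reference = [statistics.mean(s[k] for s in sorted_arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for position, gene in enumerate(order):
            out[gene] = reference[position]
        result.append(out)
    return result
```

For example, normalize_to_mean_100([1, 2, 3]) returns [50.0, 100.0, 150.0], so each value reads directly as a percentage of the array mean.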

Selection of Appropriate Statistical Methods

In the case of normal data, selection of the statistical method depends mainly on the purpose of the investigation and the number of replicates and, under certain conditions, on the level of random fluctuations.

In the case of problematic data, we first eliminate arrays with data that cannot be corrected and then identify incompatible samples. When incompatible samples are found, we avoid integral methods such as RMA, PLIER or dChip. We classify the samples into compatible groups, if possible, and carry out pair-wise comparisons within groups using the consecutive sampling method. Coincidence analysis is then used to select candidate genes.

Evaluation of statistical significance

For the reader’s information, we describe below our standard procedures for evaluating the statistical significance of observed differences between gene expressions. These procedures are used when no special methods are requested by the client or imposed by the quality of the data or inconsistencies among the arrays.

No replicates
To assess differences in gene expression between experiment and control, we use the method of consecutive sampling [1-4]. Briefly, in two-array comparisons, the genes are ordered according to the mean signal intensity of a given gene on the control and experiment arrays and grouped in bins containing n consecutive genes (usually n = 25). The standard deviation is then calculated for each bin, and the characteristic standard deviation function in linear approximation is determined by regression. Genes are ranked according to the differences in gene expression measured in the number of standard deviations; the k genes with the largest distance are selected as candidate genes. The rank of each gene is characterized by a corresponding Kp coefficient of the standard deviation function. The significance of the observed differences of expression is estimated by comparing Kp coefficients to the standard form [4]. Affymetrix signals are evaluated using the MAS5 algorithm, unless several control-experiment pairs are available.
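The binning, regression and ranking steps can be sketched as follows. This is a simplified illustration rather than the published procedure [1-4]: it assumes that the standard deviation of each bin is taken over the experiment-minus-control differences, it fits the line by ordinary least squares, and it does not compute Kp coefficients. The function names are hypothetical.

```python
import statistics

def sd_function(control, experiment, n=25):
    """Fit SD(x) = intercept + slope * x, where x is the mean signal of a bin
    of n consecutive genes. Needs at least two complete bins."""
    pairs = sorted(zip(control, experiment), key=lambda p: (p[0] + p[1]) / 2)
    xs, sds = [], []
    for i in range(0, len(pairs) - n + 1, n):
        current = pairs[i:i + n]
        xs.append(statistics.mean((c + e) / 2 for c, e in current))
        # Assumption: the bin SD is computed from the differences.
        sds.append(statistics.stdev([e - c for c, e in current]))
    mx, my = statistics.mean(xs), statistics.mean(sds)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, sds))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def rank_genes(control, experiment, intercept, slope, k=10):
    """Indices of the k genes with the largest difference in SD units."""
    scored = []
    for gene, (c, e) in enumerate(zip(control, experiment)):
        # Guard against a non-positive SD at very low signal levels.
        sd = max(intercept + slope * (c + e) / 2, 1e-9)
        scored.append((abs(e - c) / sd, gene))
    scored.sort(reverse=True)
    return [gene for _, gene in scored[:k]]
```

Because the standard deviation is evaluated at each gene's own signal level, a difference of 50 units counts for much more at a mean of 100 than at a mean of 5000.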

Two replicates
Arrays are analyzed in all possible pair-wise combinations using the method of consecutive sampling [1-4]. Depending on the design of a given experiment and the condition of the data, we then select two or more probability intervals. The genes outside a specified probability interval are identified in all possible pair-wise combinations and tested for coincidence. The genes that satisfy a threshold number of coincidences are selected as candidate genes. The false positive rate is estimated by comparison with the results of Monte Carlo simulations. Affymetrix signals are evaluated using the PLIER algorithm.
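A minimal sketch of the coincidence test and of a Monte Carlo estimate of false positives is given below. It assumes that each pair-wise comparison has already produced the set of genes lying outside the chosen probability interval. The random model, which draws gene sets of the same sizes independently for each comparison, is a simplification introduced for this example, since real pair-wise comparisons share arrays and are not independent. All names are hypothetical.

```python
import random
from collections import Counter

def coincident_genes(outlier_sets, threshold):
    """Genes flagged as outliers in at least `threshold` pair-wise comparisons."""
    counts = Counter(gene for s in outlier_sets for gene in s)
    return sorted(gene for gene, c in counts.items() if c >= threshold)

def expected_false_positives(n_genes, set_sizes, threshold, trials=1000, seed=0):
    """Average number of genes that reach the threshold purely by chance,
    when outlier sets of the given sizes are drawn at random."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sets = [set(rng.sample(range(n_genes), size)) for size in set_sizes]
        total += len(coincident_genes(sets, threshold))
    return total / trials
```

Comparing the number of genes returned by coincident_genes with the value from expected_false_positives, for the same set sizes and threshold, gives a rough estimate of the share of candidates expected by chance.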

Three to five replicates
The statistical significance of observed differences is estimated using the Cyber-T software [5]. The Bayesian method is complemented by the nonparametric method of consecutive sampling and coincidence test and/or by Mann-Whitney nonparametric statistics. Affymetrix signals are evaluated using the PLIER [9] algorithm.

Six or more replicates
Candidate genes are selected using the Cyber-T software [5], and the level of significance is compared to that of the Mann-Whitney nonparametric test. The probability of false positives is estimated using the Allison model.
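For orientation, the Mann-Whitney test for a single gene can be sketched with the standard library alone. The version below computes the U statistic and an exact two-sided p-value by enumerating all ways of splitting the pooled ranks, which is feasible only for small groups (924 splits for six against six replicates). It is an illustration, not the implementation used in practice; Cyber-T and the Allison model are not reproduced here.

```python
from itertools import combinations

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_exact(x, y):
    """Return (U for group x, exact two-sided permutation p-value)."""
    ranks = average_ranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    center = n1 * n2 / 2  # expected U when both groups are alike

    def u_from(rank_sum):
        return rank_sum - n1 * (n1 + 1) / 2

    u_obs = u_from(sum(ranks[:n1]))
    extreme = total = 0
    for idx in combinations(range(n1 + n2), n1):
        u = u_from(sum(ranks[i] for i in idx))
        total += 1
        # Count splits at least as far from the center as the observed one.
        if abs(u - center) >= abs(u_obs - center) - 1e-12:
            extreme += 1
    return u_obs, extreme / total
```

With six replicates per group, the smallest attainable two-sided p-value is 2/924, about 0.002, which is reached only when the two groups do not overlap at all.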


Clustering and Classification

Gene clustering and classification are used to identify groups of co-expressed genes, to discover clusters of samples with similar expression profiles and to attribute unknown samples to recognized classes. We regularly use:

  • hierarchical clustering [12, p.738];
  • K-means [12, p.755];
  • self-organizing maps [13];
  • principal component analysis [12, p.459].

A particular method is selected according to the purpose of the investigation, analysis needs and quality of data. Differential analysis and appropriate filtering always precede clustering and classification routines.

Enrichment, Functions, Pathways

To assist researchers in interpreting the results of statistical analysis, we perform complementary analysis aimed at identifying relevant biological functions, reactions and pathways. Depending on the client's preference, we perform either gene set enrichment analysis (GSEA) or pathway analysis (GenMAPP, GO-Elite) or both. Probe set annotations are scanned for specified relevant keywords. Interpretation analysis helps in identifying co-expressed genes that participate in the regulation of specific biological processes.
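The keyword scan of probe set annotations is simple enough to sketch directly. The example assumes that annotations are available as a mapping from probe set identifier to free-text description; the function name and this data layout are assumptions made for illustration.

```python
def scan_annotations(annotations, keywords):
    """annotations: dict mapping probe set id -> annotation text.
    Returns, for each probe set with a match, the keywords found in it
    (case-insensitive substring search)."""
    lowered = [k.lower() for k in keywords]
    hits = {}
    for probe_set, text in annotations.items():
        text_lower = text.lower()
        found = [k for k in lowered if k in text_lower]
        if found:
            hits[probe_set] = found
    return hits
```

Substring matching is deliberately permissive, so a keyword such as "kinase" also matches "kinases"; stricter word-boundary matching may be preferable for short keywords.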



December 09, 2011   ©2009; 2011 GenexAnalysis

Figures (images not reproduced): microglia progenitor; analysis of gene expression matrix; quality control: Spearman correlation; pair-wise correlations: MvA plot; expression analysis: dispersion plot; consecutive sampling analysis: standard deviation functions; gene expression correlation; hierarchical clustering: heat map; probability intervals, candidate genes.