Analysis of microarray data: overview



Quality control

Quality control comprises an overview of the control genes and an evaluation of various quality parameters, in particular:

  • expression levels of the control genes, noise and background levels;
  • box plots;
  • frequency distributions (histograms);
  • MvA plots;
  • Pearson and Spearman rank correlations;
  • Principal component analysis (PCA).

The quality control parameters are measured and compared. The significance of aberrant features is evaluated, and an array with serious deficiencies is excluded from further evaluation. (See also Statistics of Gene Expression: Quality Control.)
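As an illustration only, the following Python sketch computes two of the quantities listed above for a pair of arrays: the M and A values behind an MvA plot, and the Pearson and Spearman rank correlations. The function names and the intensity values are hypothetical, and the sketch assumes strictly positive intensities; it is not the implementation used in our pipeline.

```python
import math
from statistics import mean

def mva(x, y):
    """Return (M, A) for two arrays of positive intensities.
    M = log2(x) - log2(y); A = (log2(x) + log2(y)) / 2."""
    m = [math.log2(a) - math.log2(b) for a, b in zip(x, y)]
    a_vals = [(math.log2(a) + math.log2(b)) / 2 for a, b in zip(x, y)]
    return m, a_vals

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical intensities for the same five genes on two arrays.
array_a = [120.0, 450.0, 980.0, 2300.0, 15000.0]
array_b = [110.0, 500.0, 900.0, 2500.0, 14000.0]

m_values, a_values = mva(array_a, array_b)
r_pearson = pearson([math.log2(v) for v in array_a],
                    [math.log2(v) for v in array_b])
r_spearman = spearman(array_a, array_b)
```

In an MvA plot, M is drawn against A; for two compatible arrays the points should scatter around M = 0 over the whole range of A.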


Dispersion analysis

Dispersion analysis compares gene expression signals in two-way comparisons. It evaluates and verifies the following:

  • distribution and consistency of random variability;
  • linearity;
  • numbers and distribution of outliers;
  • concordance with normal distribution.

Dispersion analysis is the most thorough and most detailed verification of the compatibility of arrays. It enables the statistician to detect and correct or eliminate abnormal data and/or abnormal arrays and prepare a plan of analysis. (See also Statistics of Gene Expression: Dispersion Analysis.)
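One simple way to count outliers in a two-way comparison can be sketched as follows. This is an assumption made for illustration, not the procedure described in the referenced page: it takes the log2 ratios of two arrays, estimates their spread robustly from the median absolute deviation (MAD), and flags genes that lie more than k robust standard deviations from the median. The function name, the threshold k = 3 and the sample data are all hypothetical.

```python
import math
from statistics import median

def robust_outliers(x, y, k=3.0):
    """Return indices of genes whose log2 ratio lies more than k robust
    standard deviations away from the median log2 ratio."""
    d = [math.log2(a) - math.log2(b) for a, b in zip(x, y)]
    center = median(d)
    mad = median([abs(v - center) for v in d])
    sigma = 1.4826 * mad  # scales the MAD to the SD of a normal distribution
    if sigma == 0:
        return []  # no measurable spread, so nothing can be flagged
    return [i for i, v in enumerate(d) if abs(v - center) > k * sigma]

# Hypothetical intensities; the last gene differs 16-fold between arrays.
x = [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 1600.0]
y = [100.0, 110.0, 90.0, 105.0, 95.0, 102.0, 98.0, 100.0]
flagged = robust_outliers(x, y)
```

The MAD is used here instead of the ordinary standard deviation because the outliers themselves would otherwise inflate the estimate of random variability.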



Normalization

Normalization balances differences in the amplification coefficients or functions that relate the abundance of RNA to the intensity of the light detected by the scanner. It corrects typical deficiencies observed in the relationship between gene expression and detected signal by:

  • background correction;
  • equalizing magnitude of amplification coefficients, when coefficients are independent of expression value (global normalization);
  • equalizing amplification coefficients that are functions of expression value (local normalization).

Disparity in the level of random fluctuations cannot be corrected by normalization. If the disproportion is considerable, a particular array may be excluded from further consideration. (See also Statistics of Gene Expression: Normalization.)
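The first two corrections in the list above can be sketched in a few lines of Python. The sketch rests on assumptions that the text does not specify: background is a single constant per array, corrected signals are clipped at a small positive floor, and global normalization scales every array to a common median. All names and values are hypothetical. Local (intensity-dependent) normalization usually requires a smoothing fit and is not shown.

```python
from statistics import median

def background_correct(signal, background, floor=1.0):
    """Subtract a constant background; clip at a positive floor so that
    later log transforms remain defined."""
    return [max(s - background, floor) for s in signal]

def global_normalize(arrays, target=None):
    """Multiply each array by one factor so that all arrays share the same
    median. By default the target is the mean of the array medians."""
    medians = [median(a) for a in arrays]
    if target is None:
        target = sum(medians) / len(medians)
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]

# Hypothetical raw signals: the second array is amplified twofold.
raw = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
normalized = global_normalize(raw)

corrected = background_correct([50.0, 120.0, 30.0], background=40.0)
```

A single scaling factor per array is appropriate only when the amplification coefficient does not depend on the expression value, which is the condition stated for global normalization.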


Methods of evaluation of statistical significance

The methods used to evaluate the statistical significance of observed differences in expression depend on the experiment design, the purpose of the experiment, and the number and quality of replicates. (See also Statistics of Gene Expression: Evaluation of statistical significance.) Raw expression signals are summarized using the PLIER algorithm [9] or RMA [10]. Under typical conditions, we proceed according to the following scheme:

  1. No replicates

The experiment/control comparison is analyzed using the nonparametric method of consecutive sampling [1]. The significance of observed differences is estimated by comparing Kp coefficients of a given characteristic standard deviation function to the standard form [1,4].

  2. Two replicates

Pair-wise comparisons are analyzed using the method of consecutive sampling [1]. Depending on the design of a given experiment and the condition of the data, we select two or more probability intervals. Genes outside a specified probability interval are tested for coincidence [2,3]. Genes that satisfy a threshold number of coincidences are selected as candidate genes. The number of false positives is estimated by comparison with results obtained from Monte Carlo simulations.

  3. Three to five replicates

The statistical significance of observed differences is estimated using the Cyber-T software [5]. The Bayesian statistic is complemented by the nonparametric method of consecutive sampling and the coincidence test.

  4. Six or more replicates

Candidate genes are selected using the Cyber-T test [5], and the resulting p-values are compared with those of the nonparametric Mann-Whitney test. The false discovery rate is estimated using the Benjamini-Hochberg procedure.
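The Benjamini-Hochberg step named for six or more replicates can be illustrated with a short Python sketch. It converts a list of per-gene p-values into adjusted p-values: each p-value is multiplied by n/rank, and monotonicity is then enforced from the largest p-value downward. The function name and the example p-values are hypothetical, and the sketch makes no claim about how the p-values themselves were obtained.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that an adjusted
    # value never exceeds the adjusted value of a larger p-value.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(p)

# Genes kept at an estimated false discovery rate of 5%.
selected = [i for i, value in enumerate(q) if value <= 0.05]
```

Selecting genes whose adjusted value is at or below a chosen level, such as 0.05, controls the expected proportion of false positives among the selected genes at that level.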



Clustering and Classification

Clustering and classification are used to identify groups of co-expressed genes, to discover clusters of samples with similar expression profiles and to attribute unknown samples to recognized classes. We regularly use:

A particular method is selected according to the purpose of the investigation, the needs of the analysis and the quality of the data. Differential analysis and appropriate filtering always precede clustering and classification routines. (See also Statistics of Gene Expression: Clustering and Classification.)



Enrichment, Functions, Pathways

To assist researchers in interpreting the results of statistical analysis, we perform enrichment analysis and extract relevant information from databases containing data on biological functions, reactions and pathways. According to the client's preference we perform gene set enrichment analysis (GSEA), pathway analysis, or both. Annotations are scanned for specified relevant keywords. Interpretation analysis helps to identify co-expressed genes that participate in the regulation of specific biological processes. (See also Statistics of Gene Expression: Enrichment.)
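The simplest form of enrichment testing is an over-representation test, sketched below in Python. It is offered as an assumption for illustration and is not GSEA itself, which works on a ranked gene list. Given the total number of genes, the size of an annotated gene set, the number of selected candidate genes and their overlap with the set, it returns the hypergeometric probability of an overlap at least that large arising by chance. The function name and the counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_in_set, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_total genes, of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_selected)
    denom = comb(n_total, n_selected)
    tail = sum(comb(n_in_set, k) * comb(n_total - n_in_set, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / denom

# Hypothetical counts: 10 genes measured, 3 annotated to a pathway,
# 3 candidate genes selected, all 3 of them in the pathway.
p_enriched = hypergeom_enrichment_p(10, 3, 3, 3)
```

When many gene sets are tested, the resulting p-values should themselves be corrected for multiple testing.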



Consultations in experiment design include analysis of pilot data, if available, recommendation of a microarray platform, suggestions for the experiment setup (number of control arrays, number of experiment replicates, etc.) and recommendations for the plan of analysis. Assistance in the preparation of manuscripts includes a description of the methods of analysis and deposition of the data into GEO. Excel Visual Basic programs for handling Excel data tables, keyword searches and statistics are available upon request.



December 09, 2011   ©2009; 2011 GenexAnalysis; revised April 2013
