Note: Performing your original search, The Wavelet Transformation for Temporal Gene Expression Analysis , in PubMed Central will retrieve 8 citations. | |||||||||||
Copyright © 2007 J. Z. Song et al. The Wavelet-Based Cluster Analysis for Temporal Gene
Expression Data *J. Z. Song: Email: songj88@umd.edu Recommended by Ahmed H. Tewfik Received December 4, 2005; Revised October 1, 2006; Accepted March 4, 2007. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. | |||||||||||
Abstract A
variety of high-throughput methods have made it possible to generate
detailed temporal expression data for a single gene or large numbers of
genes. Common methods for analysis of these large data sets can be
problematic. One challenge is the comparison of temporal expression
data obtained from different growth conditions where the patterns of
expression may be shifted in time. We propose the use of wavelet
analysis to transform the data obtained under different growth
conditions to permit comparison of expression patterns from experiments
that have time shifts or delays. We demonstrate this approach using
detailed temporal data for a single bacterial gene obtained under 72
different growth conditions. This general strategy can be applied in
the analysis of data sets of thousands of genes under different
conditions. | |||||||||||
High-throughput gene expression techniques, such as oligonucleotide and cDNA microarrays, SAGE (series analysis gene expression), and promoter arrays [1–5], make it possible to obtain large amounts of time series gene expression data in different organisms under various conditions. These large datasets prove to be invaluable for determining coordinately regulated genes and the underlying regulatory networks among genes. Temporal gene expression patterns have been used to define cell cycle regulated genes and metabolic and genetic networks [6–10]. However, how to extract expression patterns in temporal gene expression data represents a challenging analytical problem particularly when comparing data obtained under different growth conditions. Because high-throughput gene expression technologies involve thousands of genes (or variables), reducing the dimensionality of the data can be a crucial issue for identifying coordinately regulated gene or inferring gene regulation networks. The current solutions include clustering coregulated genes from thousands of genes by similar expression profiles via unsupervised analysis [11–13], and Bayesian networks modeling [14]. Each method has its own merits and shortcomings. In temporal gene expression analysis, a main challenge is to extract the continuous representation of all genes through the time course of the experiment. Aligning gene expression time series profiles based on dynamic time warping [15], hidden Markov model [16], local clustering [17] and fitting time series data with cubic splines [18–20] have been used. However, a significant challenge remains in the comparisons of high-throughput temporal expression profiles obtained from same genes in different experimental conditions where patterns may be shifted in time. The current analysis methods do not specifically address the issue of time delays between experiments or conditions. Many mathematical and statistical methods have been developed for identifying underlying patterns in complex data with varieties of applications, such as signal classification in speech processing, electrocardiography and sleep research. These methods cluster points in multidimensional space, and are routinely used in gene expression analysis. For example they have been used to identify genes whose expression correlated with the cell cycle [21–23]. These methods are readily applicable to many datasets. However, these strategies have limitations when comparisons of temporal data between different conditions are being carried out. Over the past few years, the wavelet has become an essential tool in genome analysis [24–27]. In this study, we propose the use of wavelet transformation as a method to characterize structure at multiple positions and length scales. Wavelet transforms are capable of providing the time and frequency information simultaneously, hence giving a time-frequency representation of the temporal gene expression signals, the wavelet transformed data can be further analyzed by cluster analysis. We demonstrate this approach with temporal expression profiles for a single gene under 72 growth conditions. Clustering of the data after wavelet transformation overcomes the problem of temporal shifts in expression patterns observed under different experimental conditions. | |||||||||||
2.1. Gene expression data Temporal gene expression profiles were obtained using promoter
fusion technique. Briefly, the promoters of interest are clones
into a promoterless luxCDABE operon on a plasmid vector
pMS402 [28]. Promoter activity correlates with light
production generated by the luxCDABE gene products.
Therefore, the activity of the promoter fused upstream
luxCDABE is directly measured as light production after
the fusion construct is introduced into the bacterium. The
promoter regions of the Pseudomonas aeruginosa rpoS gene
was amplified from P. aeruginosa PAO1 chromosomal DNA by
PCR using oligonucleotide primers [28]. The PCR amplified promoter region were then cloned into the
XhoI-BamHI sites of pMS402 upstream of the
promoterless luxCDABE genes and transformed into PAO1 by
electroporation. PCR, DNA manipulation and transformation were
performed following general procedures. Overnight cultures of the
reporter strain were diluted 1 : 200 in a 96-well microtiter plate
and the promoter activity (CPS) and optical density at 620 nm
(OD620) were measured every 30 minutes for 24 hours in a
victor2 multilabel counter. The details of the 72 growth
conditions will be described elsewhere.2.2. Expression data wavelet transformation and
clustering analysis To overcome the gene expression profile shift issue (time delay)
among different conditions, we first used continuous wavelet
analysis to transform all expression data by wavelet transform; it
decomposes temporal gene expression data in both time and
frequency domains. In wavelet transform we take a real/complex
valued continuous time function with two main properties, (1) it will integrate to zero; (2) it is square integrable. This function
is called the mother wavelet. The CWT or continuous wavelet
transform of a function f (t) with respect to a wavelet ψ(t)
is defined as
| |||||||||||
3.1. The variation of gene expression profile A large data set was generated from a unique gene expression
experiment where activity of the promoter of the rpoS
gene in P. aeruginosa was measured under 72 growth
conditions. For each condition, measurements were obtained at 48
time points. Figure 1 shows the expression profile
variation of this gene in different experimental conditions.
Because the strength of expression of the rpoS gene
varies among conditions, that is, the expression pattern may be
similar although the magnitude of expression may vary, we
normalized each expression profile with its maximum, so all
expression level is in the range between 0 and 1. As expected, the
gene expression profile varies significantly in different
experiments and conditions. As shown in Figure 2, the
time point of maximum expression of the rpoS shifts among
conditions, that is, with clear expression profile shift or time
delay phenomena. To further evaluate the variation of the
rpoS promoter activity over 72 conditions, we determined
the mean and standard deviation of the gene in each condition. The
fluctuation of the mean and standard deviation of expression
levels of the rpoS, as shown in Figure 3, highlights the variation of expression level and expression
strength among conditions. These results clearly show the
expression profiles and levels are condition-specific, that is,
the regulation of the rpoS gene varies in different conditions.3.2. The wavelet transformation of gene
expression profile Wavelet transformation is an analysis method that uses both time
and the frequency domains. It allows a time series to be viewed in
multiple resolutions, and each resolution reflects a different
frequency. The wavelet technique takes averages and differences of
a signal and breaks the signal down into spectrum. In the gene
expression analysis, we assume that any gene expression level is a
comprehensive result of gene effects and condition effects, that
is, the expression profile shift or time delay is caused by the
conditions which dictate the activation order and expression
strength of the rpoS gene. The profile shifts or time
delays certainly make comparison of expression patterns among
conditions problematic. Overcoming this time delay, the wavelet
transform addresses it by using dB2 (Figure 1) mother
function that can be scaled. If the signal and wavelet are in a
good match, then the correlation between the signal and the
wavelet is high, resulting in a large coefficient. The
coefficients of wavelet transformation indicate correlation
intensities between wavelet function and expression profile if the
expression signal level is between 0 and 1. When the wavelet is
highly compressed it extracts the localized high-frequency details
of the expression signal. When the wavelet is fully diluted, the
length of the wavelet is more comparable to the length of the gene
expression signal and therefore it extracts the low frequency
trends of the signal. In order to overcome the issue in temporal
gene expression data analysis we take an approach using wavelet
transformation. The transformation results of the gene rpoS
over 72 conditions, as shown in Figure 4, demonstrate
the squared coefficients with a bell-shaped curve, the curves of
the different conditions vary in skew and kurtosis which represent
the difference of expression profiles. If expression profiles are
similar, the bell-shaped curve will be very similar and close;
otherwise, they will disperse. The wavelet analysis is able to overcome the profile
shift problem, meanwhile, it is worth noting that the
analysis loses time series information.3.3. Clustering analysis and evaluation To evaluate the behavior of gene expression under different
culture conditions, expression profiles are typically compared
using cluster analysis. This provides a comparison of patterns of
expression such that those with similar patterns of expression
will fall close together on the hierarchical tree while those with
dissimilar patterns will be far apart. To evaluate the effect of
wavelet transformation, we clustered the data before and after
transformation using average linkage method. The hierarchical
cluster trees are shown in Figures 5 and 6.
Note that the numerical values used in the two figures differ and
consequently the distance measures are not directly comparable.We would predict that genes with similar expression profiles before wavelet transformation would cluster together in both Figures 5 and 6. Wavelet transformation would not make the expression patterns dissimilar. To illustrate this we have plotted the expression data for two conditions (C1T23 and C2T23) that cluster closely in Figure 7. We can see that the activity profiles of the rpoS promoter are very similar in these two conditions (Figure 7(a)) and likewise the power plots of their wavelet transformation are also similar (Figure 7(b)). As expected they also clustered together in Figure 5. To illustrate the effect of the wavelet transformation, we highlight the expression of two conditions (C1T7 and C2T7) that cluster close together after wavelet transformation (as in Figure 5) but not before it (as in Figure 4). We would predict that these will have similar expression patterns but with a time shift between the experimental conditions. This is clearly illustrated in Figure 8(a). This temporal shift is sufficient to prevent close clustering of these conditions in Figure 4. By contrast, the profiles appear very similar after wavelet transformation (Figure 8(b)) and the two conditions cluster close together in Figure 5. In this experiment, the growth medium used in C1T7 and C2T2 was the same and the expression profile would be expected to match however experimental variables leading to different initial conditions. The results indicate that wavelet transformation can extract expression pattern information and overcome difficulties that arise because of temporal delays in patterns of expressions between conditions or experiments. | |||||||||||
To deeply understand gene temporal expression behavior and interactions in cells is a fundamental task in functional genomics. While methods for obtaining high-throughput temporal gene expression data are readily available, methods and strategies for analysis of these complex data sets are still emerging. Because the unique feature of temporal gene expression data is autocorrelation between successive points, the immediate goals are to extract and to compare the fundamental patterns of gene expression inherent in the data. Most of the current methods are based on certain distances between expressed genes or variables (conditions), such as hierarchical clustering, self-organizing maps, relevance network, principal components analysis and machine learning. Application of clustering analysis directly to the expression data ignores some basic features of temporal expression data and more over can be complicated by temporal shifts or time delays between experiments. These temporal shifts arise not because of intrinsic features of the expression pattern but because of differences in initial conditions between experiments. These are often unavoidable experimental variables. Dynamic time warping is a discrete method similar to sequence alignment algorithms [5] that can be used to align time series data. It involves many degrees of freedom and the time points can “stop” or go “backwards” in the alignment. Overfitting can also be a problem with this method. The cubic spline is a powerful technique for data fitting, interpolation, extrapolation, and visualization [20], and permits the principled estimation of unobserved time-points and dataset alignment. Each temporal gene expression profile is modeled as a cubic spline (piecewise polynomial) that is estimated from the observed expression data. It constrains the spline coefficients of genes in the same cluster to have same or similar expression patterns. The splines are piecewise-smooth polynomials that can be used to represent functions over large intervals, where it would be impractical to use a single approximating polynomial. As for the clustering analysis with the cubic splines, especially in large scale of temporal gene expression data, further research and comparison are needed. In this paper, we firstly transformed temporal gene expression data with continuous wavelet analysis and then we did hierarchical clustering analysis. Average linkage method was used because it proceeds by first finding pairs of expression profiles that are most similar, joining them, calculating the (sometimes weighted) average between the members of the joined cluster, recalculating the pairwise distance, and treating the average profile as one profile, and repeating the procedure until all profiles are joined. Average linkage clustering can be conducted using all-pairwise-sample average of differences or using cluster average differences. The latter is also known as centroid clustering, but centroids can be calculated using methods other than simple averages. It is worth noting that wavelet analysis and the Fourier transformation (FT) are two widely used methods in signal processing. In its original form, the FT assumes that the expression signal exists for all time. This for practical purposes is not a realistic assumption in temporal gene expression and does not give any information about how the expression signal changes with respect to time. This is not a problem when the gene expression signal being analyzed is stationary, that is when the statistical properties of the expression signal are not changing with time. All gene expression signals, however, are nonstationary. It is especially necessary to identify and locate the changing frequency characteristics of the gene expression signals. An alternative FT, which is called the short-time Fourier transform (STFT), is a time-dependent or windowed-Fourier transformation. It attempts to analyze nonstationary signals by dividing the whole signal into shorter data frames, but one of the limitations of the STFT is that the timeframe for analysis is fixed. Wavelet transformation is a measure of similarity between the basis functions (wavelets) and gene expression profiles, and the calculated CWT coefficients refer to the closeness of the gene expression profile to the wavelet at the current scale. The flexible approach uses a scalable window. The advantages of the method are a compressed window for analyzing high-frequency details and a diluted window for uncovering low-frequency trends within the signal. Wavelets are also well localized in frequency, although not as well as sinusoids. Since wavelet analysis incorporates the concept of scale into the wavelet equation it is suited to resolve the transient nature of gene expression data. Then choosing appropriate scales and the number of scales are imminent issues. Scale is the inverse of frequency. Once the mother wavelet is chosen, the computation will start from high frequencies and proceed towards low frequencies. This first value of scale will correspond to the most compressed wavelet. As the value of scale is increased, the wavelet will dilate. Smaller scales (high frequencies) have better scale resolution which corresponds to poorer frequency resolution. Similarly, large scales have better frequency resolution. From the results presented here, it is apparent that wavelets are better suited to the analysis of transient gene expression signals, since they are well localized in time, whereas sinusoids extend over all time. We also need to emphasize that although the wavelet analysis overcomes the time delay or profile shift, the transformation will lose temporal information if we need it, so the analysis is application dependent. In summary, the paper presents an alternative way to extract expression patterns in temporal gene expression data with continuous wavelet analysis. It has been demonstrated that the application of wavelet transformation to gene temporal expression data is feasible. We anticipate that the wavelet analysis and transformation could be used in large scale temporal gene expression research and single cell experiments. It is of particular value in comparison of temporal expression profiles obtained under different conditions or from different experiments. The pattern recognition is of important value on monitoring simultaneously the expression patterns of thousands of genes during cellular differentiation and responses. | |||||||||||
ACKNOWLEDGMENTS The authors thank members of the Surette lab for helpful discussions. This work was supported by the Canadian Institutes of Health Research, Genome Canada through the University of Saskatchewan. M.G.S. is an Alberta Heritage Foundation for Medical Research Senior Scholar and Canada Research Chair in Microbial Gene Expression. | |||||||||||
References 1. Kalir
S, McClure J, Pabbaraju K, et al. Ordering genes in a flagella pathway
by analysis of expression kinetics from living bacteria. Science. 2001;292(5524):2080–2083. [PubMed] 2. Weeraratna AT. Serial analysis of gene expression (SAGE): advances, analysis and applications to pigment cell research. Pigment Cell Research. 2003;16(3):183–189. [PubMed] 3. Schulze A, Downward J. Navigating gene expression using microarrays—a technology review. Nature Cell Biology. 2001;3(8):E190–E195. [PubMed] 4. Heller MJ. DNA microarray technology: devices, systems, and applications. Annual Review of Biomedical Engineering. 2002;4:129–153. 5. Southern EM. DNA microarrays: history and overview. Methods in Molecular Biology. 2001;170:1–15. [PubMed] 6. Tong AHY, Lesage G, Bader GD, et al. Global mapping of the yeast genetic interaction network. Science. 2004;303(5659):808–813. [PubMed] 7. Fedoroff N, Fontana W. Genetic networks: small numbers of big molecules. Science. 2002;297(5584):1129–1131. [PubMed] 8. Bundschuh R, Hayot F, Jayaprakash C. Fluctuations and slow variables in genetic networks. Biophysical Journal. 2003;84(3):1606–1615. [PubMed] 9. Spellman PT, Sherlock G, Zhang MQ, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998;9(12):3273–3297. [PubMed] 10. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997;278(5338):680–686. [PubMed] 11. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998;95(25):14863–14868. [PubMed] 12. Banerjee N, Zhang MQ. Functional genomics as applied to mapping transcription regulatory networks. Current Opinion in Microbiology. 2002;5(3):313–317. [PubMed] 13. Törönen P, Kolehmainen M, Wong G, Castrén E. Analysis of gene expression data using self-organizing maps. FEBS Letters. 1999;451(2):142–146. [PubMed] 14. Friedman N, Linial M, Nachman I, Pe'er D. Using Bayesian networks to analyze expression data. Journal of Computational Biology. 2000;7(3-4):601–620. [PubMed] 15. Aach J, Church GM. Aligning gene expression time series with time warping algorithms. Bioinformatics. 2001;17(6):495–508. [PubMed] 16. Schliep A, Schönhuth A, Steinhoff C. Using hidden Markov models to analyze gene expression time course data. Bioinformatics. 2003;19(supplement 1):i255–i263. [PubMed] 17. Qian
J, Dolled-Filhart M, Lin J, Yu H, Gerstein M. Beyond synexpression
relationships: local clustering of time-shifted and inverted gene
expression profiles identifies new, biologically relevant interactions.
Journal of Molecular Biology. 2001;314(5):1053–1066. [PubMed] 18. Bar-Joseph
Z, Gerber G, Simon I, Gifford DK, Jaakkola TS. Comparing the continuous
representation of time-series expression profiles to identify
differentially expressed genes. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(18):10146–10151. [PubMed] 19. Bar-Joseph Z. Analyzing time series gene expression data. Bioinformatics. 2004;20(16):2493–2503. [PubMed] 20. Bar-Joseph Z, Gerber GK, Gifford DK, Jaakkola TS, Simon I. Continuous representations of time-series gene expression data. Journal of Computational Biology. 2003;10(3-4):341–356. [PubMed] 21. Hampson S, Kibler D, Baldi P. Distribution patterns of over-represented k-mers in non-coding yeast DNA. Bioinformatics. 2002;18(4):513–528. [PubMed] 22. Futcher B. Transcriptional regulatory networks and the yeast cell cycle. Current Opinion in Cell Biology. 2002;14(6):676–683. [PubMed] 23. Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell. 1998;2(1):65–73. [PubMed] 24. Liò P. Wavelets in bioinformatics and computational biology: state of art and perspectives. Bioinformatics. 2003;19(1):2–9. [PubMed] 25. Song JZ, Ware T, Liu S-L, Surette M. Comparative genomics via wavelet analysis for closely related bacteria. EURASIP Journal on Applied Signal Processing. 2004;2004(1):5–12. 26. Song JZ, Ware A, Liu S-L. Wavelet to predict bacterial ori and ter: a tendency towards a physical balance. BMC Genomics. 2003;4(1):17. [PubMed] 27. Liò P, Vannucci M. Finding pathogenicity islands and gene transfer events in genome data. Bioinformatics. 2000;16(10):932–940. [PubMed] 28. Duan KM, Dammel C, Stein J, Rabin H, Surette M. Modulation of Pseudomonas aeruginosa gene expression by host microflora through interspecies communication. Molecular Microbiology. 2003;50(5):1477–1491. [PubMed] 29. Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansaa Science Bulletin. 1958;38:1409–1438. | |||||||||||