The Wavelet-Based Cluster Analysis for Temporal Gene
Expression Data

doi:10.1155/2007/39382

Note: Performing your original search, The Wavelet Transformation for Temporal Gene Expression Analysis , in PubMed Central will retrieve 8 citations.

Journal List > EURASIP J Bioinform Syst Biol > v.2007; 2007

EURASIP J Bioinform Syst Biol. 2007; 2007: 39382.

Published online 2007 May 15. doi: 10.1155/2007/39382.

The Wavelet-Based Cluster Analysis for Temporal Gene Expression Data

J. Z. Song,¹^* K. M. Duan,² T. Ware,³ and M. Surette²

¹Department of Animal and Avian Science, 2413 Animal Science Center, University of Maryland, College Park, MD 20742, USA

²Department of Microbiology and Infectious Diseases, and Department of Biochemistry and Molecular Biology, Health Sciences Centre, University of Calgary, Calgary, AB, Canada T2N 4N1

³Department of Mathematics, University of Calgary, Calgary, AB, Canada T2N 4N1

*J. Z. Song: Email: songj88@umd.edu

Recommended by Ahmed H. Tewfik

Received December 4, 2005; Revised October 1, 2006; Accepted March 4, 2007.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

A variety of high-throughput methods have made it possible to generate detailed temporal expression data for a single gene or large numbers of genes. Common methods for analysis of these large data sets can be problematic. One challenge is the comparison of temporal expression data obtained from different growth conditions where the patterns of expression may be shifted in time. We propose the use of wavelet analysis to transform the data obtained under different growth conditions to permit comparison of expression patterns from experiments that have time shifts or delays. We demonstrate this approach using detailed temporal data for a single bacterial gene obtained under 72 different growth conditions. This general strategy can be applied in the analysis of data sets of thousands of genes under different conditions.

1. INTRODUCTION

High-throughput gene expression techniques, such as oligonucleotide and cDNA microarrays, SAGE (series analysis gene expression), and promoter arrays [1–5], make it possible to obtain large amounts of time series gene expression data in different organisms under various conditions. These large datasets prove to be invaluable for determining coordinately regulated genes and the underlying regulatory networks among genes. Temporal gene expression patterns have been used to define cell cycle regulated genes and metabolic and genetic networks [6–10]. However, how to extract expression patterns in temporal gene expression data represents a challenging analytical problem particularly when comparing data obtained under different growth conditions.

Because high-throughput gene expression technologies involve thousands of genes (or variables), reducing the dimensionality of the data can be a crucial issue for identifying coordinately regulated gene or inferring gene regulation networks. The current solutions include clustering coregulated genes from thousands of genes by similar expression profiles via unsupervised analysis [11–13], and Bayesian networks modeling [14]. Each method has its own merits and shortcomings. In temporal gene expression analysis, a main challenge is to extract the continuous representation of all genes through the time course of the experiment. Aligning gene expression time series profiles based on dynamic time warping [15], hidden Markov model [16], local clustering [17] and fitting time series data with cubic splines [18–20] have been used. However, a significant challenge remains in the comparisons of high-throughput temporal expression profiles obtained from same genes in different experimental conditions where patterns may be shifted in time. The current analysis methods do not specifically address the issue of time delays between experiments or conditions.

Many mathematical and statistical methods have been developed for identifying underlying patterns in complex data with varieties of applications, such as signal classification in speech processing, electrocardiography and sleep research. These methods cluster points in multidimensional space, and are routinely used in gene expression analysis. For example they have been used to identify genes whose expression correlated with the cell cycle [21–23]. These methods are readily applicable to many datasets. However, these strategies have limitations when comparisons of temporal data between different conditions are being carried out. Over the past few years, the wavelet has become an essential tool in genome analysis [24–27]. In this study, we propose the use of wavelet transformation as a method to characterize structure at multiple positions and length scales. Wavelet transforms are capable of providing the time and frequency information simultaneously, hence giving a time-frequency representation of the temporal gene expression signals, the wavelet transformed data can be further analyzed by cluster analysis. We demonstrate this approach with temporal expression profiles for a single gene under 72 growth conditions. Clustering of the data after wavelet transformation overcomes the problem of temporal shifts in expression patterns observed under different experimental conditions.

2. MATERIALS AND METHODS

2.1. Gene expression data

Temporal gene expression profiles were obtained using promoter fusion technique. Briefly, the promoters of interest are clones into a promoterless luxCDABE operon on a plasmid vector pMS402 [28]. Promoter activity correlates with light production generated by the luxCDABE gene products. Therefore, the activity of the promoter fused upstream luxCDABE is directly measured as light production after the fusion construct is introduced into the bacterium. The promoter regions of the Pseudomonas aeruginosa rpoS gene was amplified from P. aeruginosa PAO1 chromosomal DNA by PCR using oligonucleotide primers [28]. The PCR amplified promoter region were then cloned into the XhoI-BamHI sites of pMS402 upstream of the promoterless luxCDABE genes and transformed into PAO1 by electroporation. PCR, DNA manipulation and transformation were performed following general procedures. Overnight cultures of the reporter strain were diluted 1 : 200 in a 96-well microtiter plate and the promoter activity (CPS) and optical density at 620 nm (OD₆₂₀) were measured every 30 minutes for 24 hours in a victor² multilabel counter. The details of the 72 growth conditions will be described elsewhere.

2.2. Expression data wavelet transformation and clustering analysis

To overcome the gene expression profile shift issue (time delay) among different conditions, we first used continuous wavelet analysis to transform all expression data by wavelet transform; it decomposes temporal gene expression data in both time and frequency domains. In wavelet transform we take a real/complex valued continuous time function with two main properties, (1) it will integrate to zero; (2) it is square integrable. This function is called the mother wavelet. The CWT or continuous wavelet transform of a function f (t) with respect to a wavelet ψ(t) is defined as

(1)

Here, a and b are real. W(a, b) is the transform coefficient of f (t) for given a, b. Thus the wavelet transform is a function of two variables. For a given b, ψ_a,b(t) is a shift of ψ_a,0(t) by an amount b along time axis. The variable b represents time shift or translation. Since a determines the amount of time scaling or dilation, it is referred to as scale or dilation variable. If a > 1, there is stretching of ψ(t) along the time axis whereas if 0 < a < 1 there is a contraction of ψ(t). Each wavelet coefficient W(a, b) is a measure of the correlation of the input waveform with a translated and dilated version of the mother wavelet. By investigating the wavelet transform over different bases, we adopted dB2 as the mother function (see Figure 1). The output of the transform shows the correlation between the signal and the wavelet as a function of time across a range of scales. To avoid negative coefficients and in order to display differences clearly, we define

(2)

Based on the squared coefficient S(a), we clustered the 72 conditions with the average linkage method [29]. The distance between two clusters is defined by

(3)

If d(x, y) = |x − y|², then

(4)

The combinational formula is

(5)

In average linkage the distance between two clusters is the average distance between pairs of observations, one in each cluster. Average linkage tends to join clusters with small variances, and is slightly biased toward producing clusters with the same variance. All calculation was done by SAS and Matlab.

Figure 1

Mother wavelet (dB2).

3. RESULT AND ANALYSIS

3.1. The variation of gene expression profile

A large data set was generated from a unique gene expression experiment where activity of the promoter of the rpoS gene in P. aeruginosa was measured under 72 growth conditions. For each condition, measurements were obtained at 48 time points. Figure 1 shows the expression profile variation of this gene in different experimental conditions. Because the strength of expression of the rpoS gene varies among conditions, that is, the expression pattern may be similar although the magnitude of expression may vary, we normalized each expression profile with its maximum, so all expression level is in the range between 0 and 1. As expected, the gene expression profile varies significantly in different experiments and conditions. As shown in Figure 2, the time point of maximum expression of the rpoS shifts among conditions, that is, with clear expression profile shift or time delay phenomena. To further evaluate the variation of the rpoS promoter activity over 72 conditions, we determined the mean and standard deviation of the gene in each condition. The fluctuation of the mean and standard deviation of expression levels of the rpoS, as shown in Figure 3, highlights the variation of expression level and expression strength among conditions. These results clearly show the expression profiles and levels are condition-specific, that is, the regulation of the rpoS gene varies in different conditions.

Figure 2

The rpoS gene expression profiles in 72 conditions and 48 time points. Because the strength of expression of the rpoS promoter varies among conditions, the expression levels were normalized for each condition with its maximum so that the expression level (more ...)

Figure 3

The fluctuation of standard deviation of rpoS promoter activity in 72 different conditions and 48 time points. The blue line is mean and the error bar is standard deviation of gene in each condition.

3.2. The wavelet transformation of gene expression profile

Wavelet transformation is an analysis method that uses both time and the frequency domains. It allows a time series to be viewed in multiple resolutions, and each resolution reflects a different frequency. The wavelet technique takes averages and differences of a signal and breaks the signal down into spectrum. In the gene expression analysis, we assume that any gene expression level is a comprehensive result of gene effects and condition effects, that is, the expression profile shift or time delay is caused by the conditions which dictate the activation order and expression strength of the rpoS gene. The profile shifts or time delays certainly make comparison of expression patterns among conditions problematic. Overcoming this time delay, the wavelet transform addresses it by using dB2 (Figure 1) mother function that can be scaled. If the signal and wavelet are in a good match, then the correlation between the signal and the wavelet is high, resulting in a large coefficient. The coefficients of wavelet transformation indicate correlation intensities between wavelet function and expression profile if the expression signal level is between 0 and 1. When the wavelet is highly compressed it extracts the localized high-frequency details of the expression signal. When the wavelet is fully diluted, the length of the wavelet is more comparable to the length of the gene expression signal and therefore it extracts the low frequency trends of the signal. In order to overcome the issue in temporal gene expression data analysis we take an approach using wavelet transformation. The transformation results of the gene rpoS over 72 conditions, as shown in Figure 4, demonstrate the squared coefficients with a bell-shaped curve, the curves of the different conditions vary in skew and kurtosis which represent the difference of expression profiles. If expression profiles are similar, the bell-shaped curve will be very similar and close; otherwise, they will disperse. The wavelet analysis is able to overcome the profile shift problem, meanwhile, it is worth noting that the analysis loses time series information.

Figure 4

The power plot of the wavelet transformation of the rpoS gene promoter activity profiles obtained under 72 conditions. The mother wavelet id dB2 and the coefficients of wavelet transformation were squared.

3.3. Clustering analysis and evaluation

To evaluate the behavior of gene expression under different culture conditions, expression profiles are typically compared using cluster analysis. This provides a comparison of patterns of expression such that those with similar patterns of expression will fall close together on the hierarchical tree while those with dissimilar patterns will be far apart. To evaluate the effect of wavelet transformation, we clustered the data before and after transformation using average linkage method. The hierarchical cluster trees are shown in Figures 5 and 6. Note that the numerical values used in the two figures differ and consequently the distance measures are not directly comparable.

Figure 5

The cluster tree of 72 conditions of the rpoS gene expression before wavelet transformation based on the 48 time points measurements.

Figure 6

The cluster tree of 72 conditions of the rpoS gene expression after wavelet transformation based on the 48 time points measurements.

We would predict that genes with similar expression profiles before wavelet transformation would cluster together in both Figures 5 and 6. Wavelet transformation would not make the expression patterns dissimilar. To illustrate this we have plotted the expression data for two conditions (C1T23 and C2T23) that cluster closely in Figure 7. We can see that the activity profiles of the rpoS promoter are very similar in these two conditions (Figure 7(a)) and likewise the power plots of their wavelet transformation are also similar (Figure 7(b)). As expected they also clustered together in Figure 5.

Figure 7

(a) The expression profiles of the rpoS in conditions C1T23 and C2T23 and (b) the power of the wavelet transform.

To illustrate the effect of the wavelet transformation, we highlight the expression of two conditions (C1T7 and C2T7) that cluster close together after wavelet transformation (as in Figure 5) but not before it (as in Figure 4). We would predict that these will have similar expression patterns but with a time shift between the experimental conditions. This is clearly illustrated in Figure 8(a). This temporal shift is sufficient to prevent close clustering of these conditions in Figure 4. By contrast, the profiles appear very similar after wavelet transformation (Figure 8(b)) and the two conditions cluster close together in Figure 5. In this experiment, the growth medium used in C1T7 and C2T2 was the same and the expression profile would be expected to match however experimental variables leading to different initial conditions. The results indicate that wavelet transformation can extract expression pattern information and overcome difficulties that arise because of temporal delays in patterns of expressions between conditions or experiments.

Figure 8

(a) The expression profiles of the rpoS in conditions C1T7 and C2T7 and (b) the power of the wavelet transform.

4. DISCUSSION

To deeply understand gene temporal expression behavior and interactions in cells is a fundamental task in functional genomics. While methods for obtaining high-throughput temporal gene expression data are readily available, methods and strategies for analysis of these complex data sets are still emerging. Because the unique feature of temporal gene expression data is autocorrelation between successive points, the immediate goals are to extract and to compare the fundamental patterns of gene expression inherent in the data. Most of the current methods are based on certain distances between expressed genes or variables (conditions), such as hierarchical clustering, self-organizing maps, relevance network, principal components analysis and machine learning. Application of clustering analysis directly to the expression data ignores some basic features of temporal expression data and more over can be complicated by temporal shifts or time delays between experiments. These temporal shifts arise not because of intrinsic features of the expression pattern but because of differences in initial conditions between experiments. These are often unavoidable experimental variables. Dynamic time warping is a discrete method similar to sequence alignment algorithms [5] that can be used to align time series data. It involves many degrees of freedom and the time points can “stop” or go “backwards” in the alignment. Overfitting can also be a problem with this method. The cubic spline is a powerful technique for data fitting, interpolation, extrapolation, and visualization [20], and permits the principled estimation of unobserved time-points and dataset alignment. Each temporal gene expression profile is modeled as a cubic spline (piecewise polynomial) that is estimated from the observed expression data. It constrains the spline coefficients of genes in the same cluster to have same or similar expression patterns. The splines are piecewise-smooth polynomials that can be used to represent functions over large intervals, where it would be impractical to use a single approximating polynomial. As for the clustering analysis with the cubic splines, especially in large scale of temporal gene expression data, further research and comparison are needed.

In this paper, we firstly transformed temporal gene expression data with continuous wavelet analysis and then we did hierarchical clustering analysis. Average linkage method was used because it proceeds by first finding pairs of expression profiles that are most similar, joining them, calculating the (sometimes weighted) average between the members of the joined cluster, recalculating the pairwise distance, and treating the average profile as one profile, and repeating the procedure until all profiles are joined. Average linkage clustering can be conducted using all-pairwise-sample average of differences or using cluster average differences. The latter is also known as centroid clustering, but centroids can be calculated using methods other than simple averages.

It is worth noting that wavelet analysis and the Fourier transformation (FT) are two widely used methods in signal processing. In its original form, the FT assumes that the expression signal exists for all time. This for practical purposes is not a realistic assumption in temporal gene expression and does not give any information about how the expression signal changes with respect to time. This is not a problem when the gene expression signal being analyzed is stationary, that is when the statistical properties of the expression signal are not changing with time. All gene expression signals, however, are nonstationary. It is especially necessary to identify and locate the changing frequency characteristics of the gene expression signals. An alternative FT, which is called the short-time Fourier transform (STFT), is a time-dependent or windowed-Fourier transformation. It attempts to analyze nonstationary signals by dividing the whole signal into shorter data frames, but one of the limitations of the STFT is that the timeframe for analysis is fixed. Wavelet transformation is a measure of similarity between the basis functions (wavelets) and gene expression profiles, and the calculated CWT coefficients refer to the closeness of the gene expression profile to the wavelet at the current scale. The flexible approach uses a scalable window. The advantages of the method are a compressed window for analyzing high-frequency details and a diluted window for uncovering low-frequency trends within the signal. Wavelets are also well localized in frequency, although not as well as sinusoids. Since wavelet analysis incorporates the concept of scale into the wavelet equation it is suited to resolve the transient nature of gene expression data.

Then choosing appropriate scales and the number of scales are imminent issues. Scale is the inverse of frequency. Once the mother wavelet is chosen, the computation will start from high frequencies and proceed towards low frequencies. This first value of scale will correspond to the most compressed wavelet. As the value of scale is increased, the wavelet will dilate. Smaller scales (high frequencies) have better scale resolution which corresponds to poorer frequency resolution. Similarly, large scales have better frequency resolution. From the results presented here, it is apparent that wavelets are better suited to the analysis of transient gene expression signals, since they are well localized in time, whereas sinusoids extend over all time. We also need to emphasize that although the wavelet analysis overcomes the time delay or profile shift, the transformation will lose temporal information if we need it, so the analysis is application dependent.

In summary, the paper presents an alternative way to extract expression patterns in temporal gene expression data with continuous wavelet analysis. It has been demonstrated that the application of wavelet transformation to gene temporal expression data is feasible. We anticipate that the wavelet analysis and transformation could be used in large scale temporal gene expression research and single cell experiments. It is of particular value in comparison of temporal expression profiles obtained under different conditions or from different experiments. The pattern recognition is of important value on monitoring simultaneously the expression patterns of thousands of genes during cellular differentiation and responses.

ACKNOWLEDGMENTS

The authors thank members of the Surette lab for helpful discussions. This work was supported by the Canadian Institutes of Health Research, Genome Canada through the University of Saskatchewan. M.G.S. is an Alberta Heritage Foundation for Medical Research Senior Scholar and Canada Research Chair in Microbial Gene Expression.

References

Kalir S, McClure J, Pabbaraju K, et al. Ordering genes in a flagella pathway by analysis of expression kinetics from living bacteria. Science. 2001;292(5524):2080–2083. [PubMed]

Weeraratna AT. Serial analysis of gene expression (SAGE): advances, analysis and applications to pigment cell research. Pigment Cell Research. 2003;16(3):183–189. [PubMed]

Schulze A, Downward J. Navigating gene expression using microarrays—a technology review. Nature Cell Biology. 2001;3(8):E190–E195. [PubMed]

Heller MJ. DNA microarray technology: devices, systems, and applications. Annual Review of Biomedical Engineering. 2002;4:129–153.

Southern EM. DNA microarrays: history and overview. Methods in Molecular Biology. 2001;170:1–15. [PubMed]

Tong AHY, Lesage G, Bader GD, et al. Global mapping of the yeast genetic interaction network. Science. 2004;303(5659):808–813. [PubMed]

Fedoroff N, Fontana W. Genetic networks: small numbers of big molecules. Science. 2002;297(5584):1129–1131. [PubMed]

Bundschuh R, Hayot F, Jayaprakash C. Fluctuations and slow variables in genetic networks. Biophysical Journal. 2003;84(3):1606–1615. [PubMed]

Spellman PT, Sherlock G, Zhang MQ, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998;9(12):3273–3297. [PubMed]

10.

DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997;278(5338):680–686. [PubMed]

11.

Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998;95(25):14863–14868. [PubMed]

12.

Banerjee N, Zhang MQ. Functional genomics as applied to mapping transcription regulatory networks. Current Opinion in Microbiology. 2002;5(3):313–317. [PubMed]

13.

Törönen P, Kolehmainen M, Wong G, Castrén E. Analysis of gene expression data using self-organizing maps. FEBS Letters. 1999;451(2):142–146. [PubMed]

14.

Friedman N, Linial M, Nachman I, Pe'er D. Using Bayesian networks to analyze expression data. Journal of Computational Biology. 2000;7(3-4):601–620. [PubMed]

15.

Aach J, Church GM. Aligning gene expression time series with time warping algorithms. Bioinformatics. 2001;17(6):495–508. [PubMed]

16.

Schliep A, Schönhuth A, Steinhoff C. Using hidden Markov models to analyze gene expression time course data. Bioinformatics. 2003;19(supplement 1):i255–i263. [PubMed]

17.

Qian J, Dolled-Filhart M, Lin J, Yu H, Gerstein M. Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. Journal of Molecular Biology. 2001;314(5):1053–1066. [PubMed]

18.

Bar-Joseph Z, Gerber G, Simon I, Gifford DK, Jaakkola TS. Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(18):10146–10151. [PubMed]

19.

Bar-Joseph Z. Analyzing time series gene expression data. Bioinformatics. 2004;20(16):2493–2503. [PubMed]

20.

Bar-Joseph Z, Gerber GK, Gifford DK, Jaakkola TS, Simon I. Continuous representations of time-series gene expression data. Journal of Computational Biology. 2003;10(3-4):341–356. [PubMed]

21.

Hampson S, Kibler D, Baldi P. Distribution patterns of over-represented k-mers in non-coding yeast DNA. Bioinformatics. 2002;18(4):513–528. [PubMed]

22.

Futcher B. Transcriptional regulatory networks and the yeast cell cycle. Current Opinion in Cell Biology. 2002;14(6):676–683. [PubMed]

23.

Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell. 1998;2(1):65–73. [PubMed]

24.

Liò P. Wavelets in bioinformatics and computational biology: state of art and perspectives. Bioinformatics. 2003;19(1):2–9. [PubMed]

25.

Song JZ, Ware T, Liu S-L, Surette M. Comparative genomics via wavelet analysis for closely related bacteria. EURASIP Journal on Applied Signal Processing. 2004;2004(1):5–12.

26.

Song JZ, Ware A, Liu S-L. Wavelet to predict bacterial ori and ter: a tendency towards a physical balance. BMC Genomics. 2003;4(1):17. [PubMed]

27.

Liò P, Vannucci M. Finding pathogenicity islands and gene transfer events in genome data. Bioinformatics. 2000;16(10):932–940. [PubMed]

28.

Duan KM, Dammel C, Stein J, Rabin H, Surette M. Modulation of Pseudomonas aeruginosa gene expression by host microflora through interspecies communication. Molecular Microbiology. 2003;50(5):1477–1491. [PubMed]

29.

Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansaa Science Bulletin. 1958;38:1409–1438.