Contents
The microarray database currently supports two methods for analysis of data across multiple arrays, hierarchical clustering and Self Organizing Maps. Both are implemented using the XCluster software. Details about the input and output file formats are available.
Systematic methods for organizing gene expression data require a means of measuring quantitatively if two expression profiles are similar to each other. In this regard it is useful to consider the values that make up the expression profile for a single gene as a series of coordinates, which define a vector, and to consider the data for a microarray experiment as a matrix, where the genes define the rows, and the arrays, or experiments define the columns. Once we consider these data as vectors, we can use standard mathematical techniques to measure their similarity. One distance metric that can thus be used is the Pearson Correlation, which is essentially a measure as to how similar the directions in which two expression vectors point are. The Pearson Correlation treats the vectors as if they were the same (unit) length, and is thus insensitive to the amplitude of changes that may be seen in the expression profiles. A second distance measure that can be used is the Euclidean distance, which measures the absolute distance between two points in space, which in this case are defined by two expression vectors. The Euclidean distance thus takes into account both the direction and the magnitude of the vectors.
Let Gi equal the (log-transformed) primary data for gene G in condition i. For any two genes X and Y observed over a series of N conditions, a similarity score can be computed as follows:
Equation 1 | ||
Equation 2 | where : |
In equation 1 S(X,Y) is identical to the Pearson Correlation if Goffset is set to the mean of the observations. Values of Goffset which are not the average over observations on G are used when there is an assumed unchanged or reference state represented by the value of Goffset, against which changes are to be analyzed. When clustering with an uncentered metric, Goffset is set to 0, corresponding to a fluorescence ratio of 1.0., whereas with a centered metric, Goffset is set to the mean of the observations, thus using the Standard Pearson Correlation.
Hierarchical clustering in the database is done by Average Linkage, as described in Eisen et al (which contains the formulae for both centered and uncentered Pearson correlation), and works as follows (the example is written as if clustering genes, but is equally applicable as when clustering experiments):
The expression data for each gene are treated as a vector, with as many dimensions as there are experiments. Using the specified distance metric (Pearson correlation or Euclidean distance), the 'distance' between each gene's expression vector, and every other gene's expression vector is then calculated. When using Pearson correlation, two expression vectors which point in exactly the same direction will have a correlation of 1, whereas those which point in opposite directions will have a correlation of -1 (ie are anti-correlated). Those vectors at 90° to each other are uncorrelated, and will have a correlation of zero. If using Euclidean distance as the distance metric, those vectors that are identical (ie define the same point in space) will have a Euclidean distance of zero between them. Vectors pointing in different directions, or having different lengths (even if they point in the same direction) will define points with a distance larger than 0 between them.
The two expression patterns that have either the highest Pearson correlation, or the smallest Euclidean distance between them (depending on which metric you're using), will be 'joined' into a node. A new expression vector, for which each coordinate is the average of the corresponding coordinates of the contributing expression vectors will be made. The 'distance' of this vector to all other unjoined vectors is then calculated.
As the algorithm proceeds, nodes are created that contain more and more expression vectors, until only one node, containing all vectors, remains. During node joining, the history of which genes/nodes are joined, and with what correlation, is recorded in the .gtr file. This file can then be used to construct the tree that is often seen next to clustered expression data.
When joining two nodes, or a node and a gene, a decision can be made about the way in which those nodes are joined, as nodes can be rotated around their roots. There are (something like) 2n-2 possible topologically equivalent trees, when clustering n genes, considering the two alternative rotations of every single node. While there may be an optimal (or several optimal) trees, depending on how you measure whether the tree is optimal, XCluster, used by the microarray database, does nothing to systematically solve the problem of finding an optimal tree (see Bar-Joseph et al for a more systematic algorithm). It does however employ a simple node-flipping algorithm, such that when two nodes are joined, it rotates them such that the two most similar outermost members of the nodes are placed adjacent to each other. While there is no evidence that this produces a 'better' tree, anecdotal accounts suggest that the tree produces expression data that looks smoother, or less discontinuous.
Soms were devised by Tuevo Kohonen, and first used by Tamayo et al to analyze gene expression data. The description here pertains to the SOM implementation in XCluster, which is used by the database, which may or may not be identical to Tamayo et al's implementation.
For a Self Organizing Map you must specify the x and y dimensions of the map. The x and y dimensions will determine the number of partitions that you segregate your data into, eg. if you choose 3 x 3, you will get 9 partitions. Internally 9 vectors will be initialized, one for each partition. The dimensionality of the vectors is determined by the number of experiments that there are in your datafile. For instance if you have 10 experiments, then the vectors will be 10 dimensional, ie have 10 coordinates. Each of these coordinates is randomly initialized. In addition these partitions have a relationship to each other, in that they sit in a 2-dimensional grid, as defined by the user (eg 3 x 3). In this grid, it could be said that one partition is physically closer to another partition than it is to a 3rd partition. eg the top left partition in a 3 x 3 grid is closer to the middle left partition, than it is to the bottom right partition. This concept is important when we consider how the SOM is refined. At the beginning, the relationship of the partitions in physical space has no bearing on the relationships between the vectors that associate with them. Thus the map is unorganized.
A gene from the list of genes in your data file is then picked at random, and its expression pattern is compared to each of the vectors that were initialized randomly in the first step, one vector for each partition. The vector to which the expression vector of the picked gene is most similar is then modified, so that it more closely resembles the expression vector of that gene. In addition the vectors that belong to the partitions that are physically closest (in the 2 dimensional grid) to the partition whose vector was just modified are also modified, so that they too resemble the gene's expression vector a little more closely. This process (picking a gene at random; seeing to which vector it is most similar; modifying that vector; modifying the vectors of the partitions that are close to the partition which had the most similar vector) is repeated 100,000 times.
As the iteration number increases, the amount by which the vectors are modified decreases. Also the definition of which partitions are close to each other changes. eg in the first iteration all partitions may be considered close to each other, but after 50,000 iterations a partition may be considered close to another, only if it is less than half of the width of the entire grid away. At the end no partitions may be considered close to each other. Essentially a circle is drawn on the 2 dimensional grid, with its center being in the center of the partition whose vector was most similar to the picked gene's expression pattern. Any partitions that fall within that circle are considered 'close', and so their vectors are modified. The radius of this circle decreases with each iteration. Hence with each iteration fewer vectors are modified by smaller amounts, so that the map eventually settles down. It is important to note that algorithm leads to the vectors of neighbouring partitions being somewhat similar to each other, and vectors of partitions that are physically distant are dissimilar to each other. Thus the map has become organized.
After the 100,000 iterations have occured, the genes are then partitioned. This is simply done by assigning a gene to the partition whose associated vector is most similar to its expression vector. A gene may currently only be put in one partition (though this could be changed), and a partition may theoretically contain all or none of the genes. In practice though this rarely occurs, unless there are few genes, or few partitions.
The contents of each partition is then clustered by hierarchical clustering.