Contents
- Description
- Partitioning Data
- Distance Metrics and Centering
- Correlated Genes
- Image Generation Options
- Visualizing Clustered Microarray Data
Related Help Documents
- Data Selection for Analysis: Explanation of the steps involved in selecting and filtering data for clustering
- Analysis Methods: Information about the algorithms used for hierarchical clustering and Self-Organizing Maps (SOMs)
- File Formats: Information about preclustering (.pcl), clustered data table (.cdt), gene tree (.gtr) and array tree (.atr) files generated in the process of clustering data
- Cluster and TreeView Manual from Michael Eisen at Lawrence Berkeley Laboratories
Self-organizing maps (or SOMs) were developed by Teuvo Kohonen in the 1980s and have since been used to analyze microarray data. The implementation of SOMs in PUMAdb allows the data to be separated into partitions that are arranged in a two-dimensional grid. For example, Figure One shows a 3 x 3 map with nine partitions. Notice that each partition is more similar to its neighbors than it is to the partitions that are farther away. The implementation of SOMs in PUMAdb has several steps: initialization of the SOM, refinement of the SOM, partitioning of the genes, and hierarchical clustering of each partition.
To use SOMs, you must first specify the x and y dimensions of the map. This determines not only the number of partitions you use (a 2 x 3 SOM has six partitions) but also, because each partition is affected by its neighbors, how the results come out: the number of neighbors a partition has influences the final map. For example, a 1 x 6 SOM will quite likely look different from a 2 x 3 SOM. PUMAdb currently has no facility to optimize the number of partitions and their geometry; see Robert Tibshirani, Guenther Walther and Trevor Hastie, "Estimating the number of clusters in a dataset via the Gap statistic," for further discussion of this topic.
To initialize the SOM, a random vector is assigned to each partition.
These vectors are pseudorandom rather than truly random: unless you
select the Randomize seed option, the same vectors will be generated
each time you analyze the same dataset. Selecting that option and
re-running the analysis can be a useful way to determine how stable
your SOMs are. The dimensionality of the "seeding" vector is
determined by the number of experiments you are using. For example,
if you are analyzing data from 12 experiments, you can imagine that
each gene's expression pattern can be represented as a 12-dimensional
vector. At this step, the vectors in each partition have no
relationship to each other. In other words, the map is unorganized at
the start.
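The initialization step can be sketched in a few lines of code. This is an illustrative sketch, not PUMAdb's actual code; the function name and the use of normally distributed values are assumptions:

```python
import numpy as np

def init_som(x_dim, y_dim, n_experiments, randomize_seed=False):
    """Assign a random seed vector, one coordinate per experiment, to each
    partition of an x_dim-by-y_dim SOM. With randomize_seed=False a fixed
    seed is used, so re-running on the same dataset reproduces the same
    starting map, mirroring the behavior described above."""
    rng = np.random.default_rng(None if randomize_seed else 0)
    return rng.standard_normal((y_dim, x_dim, n_experiments))

som = init_som(3, 3, 12)   # a 3 x 3 map for a 12-experiment dataset
print(som.shape)           # (3, 3, 12): nine partitions, 12-dimensional vectors
```

At this point the vectors bear no relationship to one another; the refinement step is what organizes the map.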
Once the SOM is initialized, a gene from your list of genes is picked
at random and its pattern is compared to each of the vectors in the
partitions. Once the closest match is identified, the matching "seed"
vector is modified slightly, to make it more closely resemble your
gene vector. In addition, neighboring partitions are also modified
(to a lesser degree) to make them more closely resemble your gene
vector. This process (randomly selecting a gene, finding its closest
match, modifying the matching and neighboring seed vectors) is
repeated 100,000 times.
As the iteration number increases, the amount by which the matching
vector, as well as the neighboring vectors, is moved decreases. In
addition, the definition of a neighboring partition becomes more
stringent. You can visualize this as a circle drawn around a
partition. Partitions within this circle are neighbors. At the
beginning of the refinement process, the circle is very large. As the
number of iterations increases, the diameter of the circle around each
partition decreases, so that each partition has fewer and fewer
neighbors and subsequently has less and less effect on the values of
the vectors that represent other partitions.
Eventually, the map will settle down to one that changes very little
with each iteration. A partition will be more similar to its neighbors
than it will be to more distant partitions. In this way, the map
becomes organized.
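The refinement loop described above can be sketched as follows. This is a simplified illustration; the function name, the decay schedules for the learning rate and neighborhood radius, and the exponential neighbor weighting are all assumptions, not PUMAdb's documented implementation:

```python
import numpy as np

def train_som(som, genes, n_iter=100_000, seed=0):
    """Refine the SOM: pick a gene at random, find the best-matching
    partition, and pull that partition's seed vector (and, more weakly,
    its neighbors') toward the gene's expression vector. Both the
    learning rate and the neighborhood radius shrink as the iteration
    number grows; the linear decay schedules here are assumptions.

    som: (rows, cols, n_experiments); genes: (n_genes, n_experiments)."""
    rng = np.random.default_rng(seed)
    rows, cols, _ = som.shape
    yy, xx = np.mgrid[0:rows, 0:cols]
    max_radius = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = 1.0 - t / n_iter                # decays from 1 toward 0
        alpha = 0.05 * frac                    # learning rate shrinks over time
        radius = max(max_radius * frac, 0.5)   # the "circle" of neighbors shrinks too
        g = genes[rng.integers(len(genes))]    # a randomly selected gene
        dists = np.linalg.norm(som - g, axis=2)
        by, bx = np.unravel_index(np.argmin(dists), dists.shape)
        # partitions inside the circle are neighbors; influence falls off
        # with grid distance from the best-matching partition
        grid_dist = np.sqrt((yy - by) ** 2 + (xx - bx) ** 2)
        weight = np.where(grid_dist <= radius, np.exp(-grid_dist), 0.0)
        som = som + alpha * weight[:, :, None] * (g - som)
    return som

rng = np.random.default_rng(1)
genes = rng.standard_normal((50, 6))           # toy data: 50 genes, 6 experiments
trained = train_som(rng.standard_normal((3, 3, 6)), genes, n_iter=2000)
```

After enough iterations, neighboring partition vectors come to resemble one another more than they resemble distant ones, which is exactly the organization described above.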
After the 100,000 iterations have occurred, the genes from your data
set are assigned to a partition. This is accomplished simply by
identifying the partition whose vector most closely resembles the
vector for each gene. In PUMAdb's implementation of SOMs, a gene is
assigned to one and only one partition. There is no guarantee that
genes are spread evenly across the map: in principle a single
partition could receive all of the genes, or none at all, so some
partitions may be empty or nearly empty, particularly when there are
few genes or few experiments. Each partition is then hierarchically
clustered, and PUMAdb displays the SOM result as depicted in Figure One.
Figure One: A three-by-three SOM of microarray
data. Note that each partition more closely resembles nearby
partitions than it does distant partitions. Also note that genes are
assigned to each partition unequally: some partitions have more genes
than others.
As shown in Figure One, each partition has an x- and y-coordinate; in this
case, both range from 0 to 2. The number of genes assigned to each
partition is shown at the top. For example, the first partition (0,0)
has 136 genes assigned to it, while the middle partition (1,1) has only
eight.
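The partitioning step, assigning each gene to the partition whose vector it most closely resembles, can be sketched like this (the helper name is hypothetical):

```python
import numpy as np

def assign_partitions(som, genes):
    """Assign each gene to exactly one partition: the one whose seed vector
    is closest (by Euclidean distance) to the gene's expression vector."""
    rows, cols, n_exp = som.shape
    flat = som.reshape(rows * cols, n_exp)
    # distance from every gene to every partition's vector
    d = np.linalg.norm(genes[:, None, :] - flat[None, :, :], axis=2)
    best = d.argmin(axis=1)
    return [divmod(int(i), cols) for i in best]   # (row, col) per gene

# Toy example: two distinctive partition vectors, two genes
som = np.zeros((2, 2, 3))
som[0, 0] = [1, 1, 1]
som[1, 1] = [-1, -1, -1]
genes = np.array([[0.9, 1.1, 1.0], [-1.2, -0.8, -1.0]])
print(assign_partitions(som, genes))   # [(0, 0), (1, 1)]
```

Tallying the resulting (row, col) pairs gives the per-partition gene counts displayed at the top of each partition in Figure One.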
For a description of the different methods and tools for looking at
your SOMs, refer to the Visualizing Clustered
Microarray Data section.

The most obvious method for comparing sets of microarray data is to compare the data profiles. A gene expression profile can be imagined as a vector of n dimensions, where n is the number of microarray measurements for that gene and whose coordinates are the results of each measurement of that gene. By comparing these vectors, we can detect which genes show similar data profiles across a series of experiments. Similarly, experiments can be thought of as vectors of m dimensions, where m is the number of genes and each coordinate is the measurement of a single gene on that microarray. Comparison of the array vectors will show which arrays showed the most similar behavior.
Once we consider these data as vectors, we can use standard mathematical techniques to measure their similarity. PUMAdb uses two distance metrics to measure the similarity between vectors: Pearson correlation and Euclidean distance. The Pearson correlation treats the vectors as if they were the same (unit) length, and is thus insensitive to the amplitude of changes that may be seen in the expression profiles. The Euclidean distance measures the absolute distance between two points in space, which in this case are defined by two vectors. Note that Euclidean distance is affected by both the direction and the amplitude of the vectors, so that two genes that are coordinately expressed might not be seen as similar if one has a much higher signal than the other.
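The contrast between the two metrics can be illustrated with two "genes" that share the same pattern but differ in amplitude. A small sketch (the function names are ours, not PUMAdb's):

```python
import math

def pearson(x, y):
    """Textbook (centered) Pearson correlation of two expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return num / den

def euclidean(x, y):
    """Absolute distance between the two vectors as points in space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

a = [1.0, 2.0, 3.0, 4.0]
b = [3.0, 6.0, 9.0, 12.0]   # same pattern, three times the amplitude
print(pearson(a, b))    # ~1.0: the shapes match, amplitude is ignored
print(euclidean(a, b))  # ~10.95: amplitude pushes the points far apart
```

Under Pearson correlation these two genes are essentially identical; under Euclidean distance they are far apart, exactly the difference described above.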
Since the Pearson correlation is the metric most often used, this document covers it in greater detail than Euclidean distance. The Pearson correlation can be computed as follows:
Equation 1:

    S(X,Y) = (1/N) * sum over i of [ (X_i - X_offset) / Phi_X ] * [ (Y_i - Y_offset) / Phi_Y ]

Equation 2:

    Phi_G = sqrt( (1/N) * sum over i of (G_i - G_offset)^2 )

where N is the number of observations, X_i and Y_i are the i-th observations for genes X and Y, and G in Equation 2 stands for either X or Y.
In Equation 1, S(X,Y) is identical to the textbook Pearson correlation if G_offset in Equation 2 is set to the mean of the observations. We refer to this as the centered Pearson correlation metric. Alternatively, G_offset may be set to other values. For example, it is set to zero (corresponding to a fluorescence ratio of 1.0) when calculating the uncentered Pearson correlation metric.
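Equations 1 and 2 can be translated directly into code. The helper below is an illustrative sketch; the function and parameter names are ours:

```python
import math

def pearson_s(x, y, offset=None):
    """S(X,Y) from Equations 1 and 2. With offset=None, each G_offset is
    the mean of that gene's observations (the centered metric); with
    offset=0.0 you get the uncentered metric, where zero corresponds to
    a fluorescence ratio of 1.0."""
    n = len(x)
    xo = sum(x) / n if offset is None else offset
    yo = sum(y) / n if offset is None else offset
    phi_x = math.sqrt(sum((v - xo) ** 2 for v in x) / n)
    phi_y = math.sqrt(sum((v - yo) ** 2 for v in y) / n)
    return sum(((a - xo) / phi_x) * ((b - yo) / phi_y) for a, b in zip(x, y)) / n

g1 = [1.0, 2.0, 3.0]
g2 = [4.0, 5.0, 6.0]                  # same pattern as g1, shifted by a constant
print(pearson_s(g1, g2))              # centered: ~1.0, the shift is ignored
print(pearson_s(g1, g2, offset=0.0))  # uncentered: < 1.0, the shift matters
```

The example shows why the choice of G_offset matters: the centered metric sees two identically shaped profiles, while the uncentered metric penalizes the constant offset between them.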
Centering data before clustering and choosing whether to use a centered metric during clustering will have significant effects on how your data cluster or how your clustered data look. For more information about how to center data prior to clustering, refer to the Data Selection for Analysis help document. Figure 2 shows an example of the same data clustered three times using different combinations of centering and distance metrics. For this example, a small dataset was duplicated, but a constant value was added to the second copy of the data.
In the first panel of Figure 2, the data are uncentered and an uncentered metric was used during clustering. Note that because the constant was added, the same genes very rarely cluster together. In the second panel, the data were centered before clustering with an uncentered metric. Since the average of all data for a gene was subtracted from each value, the centered data are exactly the same for each member of a pair of duplicated genes. Centering the data has completely eliminated the effect of the constant value that was added. In fact, you cannot tell by visual inspection which member of a pair has the added constant. In the third panel, uncentered data were clustered using a centered metric. The duplicated genes still cluster together (just as they would if the data had been centered), but the visual display shows you which genes have the increased values.
Figure 2: Examples of clustering with different methods of centering.
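The effect seen in the second panel, where centering completely removes an added constant, is easy to reproduce numerically. A minimal sketch, assuming log-ratio values:

```python
gene = [0.5, -1.2, 2.0, 0.3]
dup = [v + 3.0 for v in gene]      # duplicated gene with a constant added

def center(profile):
    """Subtract a profile's own mean from each of its values."""
    m = sum(profile) / len(profile)
    return [v - m for v in profile]

# After centering, the two profiles are indistinguishable:
# the added constant has been removed entirely.
print(all(abs(a - b) < 1e-9
          for a, b in zip(center(gene), center(dup))))   # True
```

This is why, in the second panel, visual inspection cannot tell you which member of each pair carries the added constant.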
You can make a list of genes with well-correlated data profiles within your dataset. Choose an upper limit on the number of genes to display, a correlation threshold, and the method of correlation (either centered or uncentered). Unless you have used gene filters (for example, "Only use genes with greater than 80% good data") during Data Selection for Analysis, be aware that genes with only a single data point will appear well correlated with all other genes. The list of correlated genes is available for download and carries a ".stdCor" extension.
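A correlated-gene search of this kind can be sketched as follows. The function and its parameters are hypothetical illustrations of the idea, not PUMAdb's server-side implementation:

```python
import math

def correlated_genes(names, data, query, threshold=0.8, limit=50):
    """Rank genes whose centered correlation to `query` meets `threshold`,
    best first, capped at `limit` entries."""
    def centered(v):
        m = sum(v) / len(v)
        return [x - m for x in v]
    q = centered(query)
    qn = math.sqrt(sum(x * x for x in q))
    hits = []
    for name, row in zip(names, data):
        r = centered(row)
        rn = math.sqrt(sum(x * x for x in r))
        if qn == 0 or rn == 0:
            continue  # flat profiles carry no pattern to correlate against
        corr = sum(a * b for a, b in zip(q, r)) / (qn * rn)
        if corr >= threshold:
            hits.append((name, corr))
    hits.sort(key=lambda h: -h[1])
    return hits[:limit]

profiles = {"geneA": [1.0, 2.0, 3.0],
            "geneB": [3.0, 2.0, 1.0],    # anticorrelated: excluded
            "geneC": [2.1, 4.0, 6.2]}    # similar shape: included
print(correlated_genes(list(profiles), list(profiles.values()), [1.0, 2.0, 3.0]))
```

Note that genes meeting the threshold are returned in order of decreasing correlation, matching the idea of an upper limit on the number of genes displayed.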
Entering a higher number for the contrast will result in a less sensitive coloring scheme in the resulting cluster image: only very highly expressed or repressed genes will be brightly colored. Entering a lower number will result in a more sensitive coloring scheme, in which even modest differences produce brightly colored spots.
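One common way to implement such a contrast setting is to scale values by the contrast before mapping them to color intensity; values at or beyond the contrast saturate at full brightness. This is a sketch of the general idea, not PUMAdb's exact formula:

```python
def color_intensity(log_ratio, contrast):
    """Map a log expression ratio to a 0-255 color intensity.
    Ratios at or beyond +/- contrast saturate at full brightness,
    so a higher contrast value means fewer spots appear bright."""
    frac = min(abs(log_ratio) / contrast, 1.0)
    return round(255 * frac)

print(color_intensity(1.0, contrast=2.0))  # 128: only halfway to saturation
print(color_intensity(1.0, contrast=1.0))  # 255: the same ratio now saturates
```

The same log ratio produces a dimmer spot at the higher contrast setting, which is exactly the loss of sensitivity described above.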
Missing data or data that did not meet filtering criteria will be made gray in the cluster image. You can select the shade of gray you wish to display.
You can choose to view your clustered data with either the red/green color scheme or a blue/yellow color scheme; the latter can be easier to see for people who are red/green colorblind.
You can view the images of the spots on the arrays alongside your clustered data. This can be convenient for assessing data quality, but retrieving the spot images can make your job take longer.
After the data have been clustered, you will be presented with an image
along with a list of links. There are several options for visualizing and
exploring your clustered data. An example of the image is shown in
Figure 3.
By clicking on the image itself, you can explore your data, zoom in,
zoom out and more.
Spot images are simply square images with even signals that represent
the actual spots on the array. It is often more convenient and
intuitive to examine these idealized images rather than the spots
themselves. Figure 4 shows an example of these images.
You can view your clustering results as images of the actual spots, as
illustrated in Figure 5.
Figure 6 shows how you can see both the images and the spots in your
clustering results.
When several instances of the same clone (SUID) are present on the same
slide and data points are collapsed by SUID, these data points are
averaged. Subsequently, when spot images are displayed during
clustering, the image from one of the spots is picked at
random. Naturally, this image might look quite different from the
averaged cluster data.
If you plan to look at spot images during clustering, the
"Include SUID/LUID/SPOT in the UID column." box must be checked
during data retrieval. If it is left unchecked, the spot images cannot
be properly assigned to cluster images when multiple instances of
the same clone (SUID) are present on the slides: a single spot image is
picked at random and used for all instances. An example is shown in Figure 7.