PUMAdb : Singular Value Decomposition Help


Description

Singular value decomposition (SVD) is a data-driven mathematical framework that determines unique orthogonal (or uncorrelated) gene expression patterns and corresponding array expression patterns ("eigengenes" and "eigenarrays," respectively) in the data. Some of these patterns may correlate with independent biological processes and can be used to classify genes and arrays. Others may instead be associated with experimental artifacts, such as variations in the day of hybridization, array print size or scanner calibration. SVD can therefore help you identify patterns that exist in your data and aid your analysis. SVD is related to the Karhunen-Loève expansion in the field of pattern recognition and to principal component analysis (PCA) in the field of statistics.
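As a minimal sketch of the decomposition described above, the following NumPy example factors a small expression matrix and checks that the resulting patterns are orthonormal. The random matrix and the variable names are illustrative assumptions; this is not the code the database tool runs.

```python
import numpy as np

# Hypothetical expression matrix: 6 genes (rows) x 4 arrays (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))

# Thin SVD: data = U @ diag(s) @ Vt, with singular values sorted
# from largest to smallest.
u, s, vt = np.linalg.svd(data, full_matrices=False)

# Naming assumption: each eigengene is a row of Vt (a pattern across
# arrays) and each eigenarray is a column of U (a pattern across genes).
eigengenes = vt
eigenarrays = u

# The patterns are orthonormal, and together with the singular values
# they reproduce the original data.
gene_patterns_orthonormal = np.allclose(eigengenes @ eigengenes.T, np.eye(4))
array_patterns_orthonormal = np.allclose(eigenarrays.T @ eigenarrays, np.eye(4))
reconstructed = eigenarrays @ np.diag(s) @ eigengenes
```

The first row of `eigengenes` pairs with the first column of `eigenarrays` and the largest singular value, the second with the second, and so on.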

How to use the database SVD tool

  • Dataset Preparation


  • Use a .pcl file in your repository

    You can use SVD by clicking on the "SVD" button for any .pcl file in your repository. There are help documents provided for both the repository and file formats.

    To use SVD as it is implemented in the database, you must first place a preclustering (.pcl) file in your repository. Although the SVD of a dataset is independent of the order of the genes and arrays in the data, a meaningful order may help you correlate dominant eigengenes with experimental artifacts that are superimposed on the data or with biological processes that are present within the data. For this reason, it is often advantageous to order your arrays by using an experiment set or array list (as you might already have done for a time series), or to cluster genes and/or arrays using the database clustering pipeline and then retrieve the data in that order.

  • Centered or non-centered data

    Centering your data around the average expression of each gene is mathematically equivalent to filtering out an eigengene whose expression is constant across all arrays. The ability to do this depends upon obtaining such an eigengene in the decomposition. Whether to center your data prior to using SVD depends on your experiment. If your data would be appropriately centered prior to clustering (for example, if you are clustering data from a survey of tumor types in comparison to a common reference), then you will probably want to center the data, cluster, and perform SVD on the resulting .pcl file. If your experiment should not be centered (for example, if you are using a biologically meaningful control reference, such as time zero), then do not center prior to clustering or SVD.

    Usually, SVD of a non-centered dataset yields an almost constant eigengene as the most dominant eigengene in the data. In these cases, SVD of the non-centered data provides some justification for centering the data. SVD of the non-centered dataset also gives results very similar to those of the centered dataset. The main difference is that in the decomposition of the non-centered dataset the almost constant eigengene is one of the most significant eigengenes, while in the centered dataset it may be one of the least significant eigengenes (compare Figures 1 and 2 below).



    Figure 1: Raster Display of the Eigengenes (Left) and Bar Chart Display of the
    Probabilities of Eigenexpression (Right) of Non-Centered Yeast Cell Cycle Data



    Figure 2: Raster Display of the Eigengenes (Left) and Bar Chart Display of the
    Probabilities of Eigenexpression (Right) of Centered Human Sarcoma Tumor Data
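The relationship between centering and the constant eigengene can be checked numerically. The sketch below uses synthetic data (an assumption, not the yeast or sarcoma datasets shown in the figures) in which each gene has its own offset plus a small array-dependent signal. It compares the SVD before and after subtracting each gene's average.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_arrays = 200, 8

# Synthetic, hypothetical data: a per-gene offset that is constant
# across arrays, plus a weaker array-dependent signal.
offsets = rng.normal(scale=3.0, size=(n_genes, 1))
signal = rng.normal(scale=0.5, size=(n_genes, n_arrays))
raw = offsets + signal

# Center each gene (row) around its average expression across arrays.
centered = raw - raw.mean(axis=1, keepdims=True)

_, s_raw, vt_raw = np.linalg.svd(raw, full_matrices=False)
_, s_cen, vt_cen = np.linalg.svd(centered, full_matrices=False)

# Unit-length pattern that is constant across all arrays.
constant = np.ones(n_arrays) / np.sqrt(n_arrays)

# Non-centered data: the most dominant eigengene is almost constant,
# so its overlap with the constant pattern is close to 1.
overlap_raw = abs(vt_raw[0] @ constant)

# Centered data: the constant pattern is still an eigengene, but it is
# now the least significant one, with a singular value of about zero.
overlap_cen_last = abs(vt_cen[-1] @ constant)
smallest_cen = s_cen[-1]
```

In this synthetic case the constant pattern moves from the first position to the last, which mirrors the difference between the non-centered and centered decompositions described above.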


    Once a .pcl file has been prepared and saved, you can use SVD by clicking on the "SVD" button in your repository.

  • Replacing missing values

    Because SVD cannot be performed on datasets with missing values, the first step is to decide whether to discard all genes with missing values or to estimate the missing values. It is extremely common for a gene to be missing data for at least one array, so you may want to replace some of the missing data in order to retain a larger number of potentially interesting genes. The database software allows you to replace each missing value with the average of all the available data for that gene. You can limit the amount of data that may be estimated, so that genes with too much missing data are discarded from the SVD analysis. Estimated values are not retained in the resulting .pcl file, so no "false" data is propagated.
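The replacement strategy described above can be sketched as follows. The function name, the `max_missing_fraction` parameter and the use of NaN to mark missing values are assumptions made for illustration; the database tool's own threshold setting and internals may differ.

```python
import numpy as np

def fill_missing_with_gene_mean(data, max_missing_fraction=0.2):
    """Discard genes with too much missing data, then fill the rest.

    Rows are genes, columns are arrays, and NaN marks a missing value.
    Returns the filled matrix and a boolean mask of the genes kept.
    """
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    observed_count = (~missing).sum(axis=1)

    # Keep a gene only if its missing fraction is within the limit and
    # it has at least one observed value to average.
    keep = (missing.mean(axis=1) <= max_missing_fraction) & (observed_count > 0)

    kept = data[keep].copy()
    kept_missing = missing[keep]

    # Average of the observed values for each kept gene.
    observed_sum = np.where(kept_missing, 0.0, kept).sum(axis=1)
    gene_means = observed_sum / observed_count[keep]

    # Replace each missing entry with the average of its own gene.
    rows, cols = np.nonzero(kept_missing)
    kept[rows, cols] = gene_means[rows]
    return kept, keep

# Hypothetical example: 3 genes x 4 arrays.
example = np.array([
    [1.0, np.nan, 3.0, 2.0],     # one missing value: filled
    [np.nan, np.nan, np.nan, 4.0],  # mostly missing: discarded
    [0.5, 0.5, 0.5, 0.5],        # complete: unchanged
])
filled, kept_genes = fill_missing_with_gene_mean(example, max_missing_fraction=0.25)
```

Because the function returns a new matrix, the original data is left untouched, in the same spirit as the tool not writing estimated values back to the .pcl file.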

  • Ordering Genes and Arrays

    The SVD of a dataset is independent of the order of the genes and arrays in the data. However, a meaningful array order may help you correlate the dominant eigengenes with experimental artifacts that are superimposed on the data or with biological processes that are manifested in it. Similarly, a meaningful gene order may help you correlate the corresponding eigenarrays with the corresponding cellular states. For time series data, you may want to order the arrays according to the time points that they sample. For tumor data, you may want to cluster both genes and arrays before performing SVD.
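The order independence can be illustrated with a short check. In this sketch (random data and variable names are assumptions), shuffling genes and arrays leaves the singular values unchanged, and the eigengenes are the same patterns with their entries rearranged to follow the new array order.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(50, 6))  # hypothetical: 50 genes x 6 arrays

# Shuffle both the gene (row) order and the array (column) order.
gene_order = rng.permutation(50)
array_order = rng.permutation(6)
reordered = data[gene_order][:, array_order]

_, s, vt = np.linalg.svd(data, full_matrices=False)
_, s_re, vt_re = np.linalg.svd(reordered, full_matrices=False)

# The significance of each pattern does not depend on the ordering.
same_singular_values = np.allclose(s, s_re)

# Each eigengene is the same pattern with its entries permuted like the
# arrays. Signs are compared in absolute value because an SVD fixes
# each pattern only up to an overall sign.
same_eigengenes = np.allclose(np.abs(vt[:, array_order]), np.abs(vt_re), atol=1e-6)
```

Ordering therefore affects only how easy the raster display is to read, not the decomposition itself.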

  • SVD display

    The software displays the eigengenes matrix (the left-most matrix in Figure 3) in a red and green raster display alongside a bar chart display of their corresponding probabilities of eigenexpression.

    Figure 3: Database tool for viewing and using SVD in gene expression data analysis.

    Each row in the Eigengenes Raster Display represents an eigengene pattern of expression. The uppermost row in the eigengenes matrix is the first eigengene, which is the one that contributes the most to the entire dataset. From this display, there are several options that are explained below:

    Each row in the bar chart (on the right side of Figure 3) represents the probability of eigenexpression of the corresponding eigengene (and eigenarray). For example, the first (uppermost) bar in the chart is the probability of eigenexpression of the first eigengene (and also of the first eigenarray). There is more information about probability of eigenexpression and entropy later in this document.
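As a rough illustration of what the bar chart plots, the sketch below assumes the commonly used definitions: the probability of eigenexpression of an eigengene is its squared singular value divided by the sum of all squared singular values, and the entropy is the Shannon entropy of those probabilities, normalized to lie between 0 and 1. The function names and the example values are hypothetical.

```python
import numpy as np

def eigenexpression_probabilities(singular_values):
    """Share of the overall expression captured by each eigengene."""
    squared = np.asarray(singular_values, dtype=float) ** 2
    return squared / squared.sum()

def normalized_entropy(probabilities):
    """Shannon entropy divided by log(number of eigengenes).

    Assumes at least two eigengenes. Zero probabilities contribute
    nothing to the sum and are skipped.
    """
    p = np.asarray(probabilities, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(p.size))

# Hypothetical singular values, largest first.
singular_values = np.array([3.0, 2.0, 1.0])
probabilities = eigenexpression_probabilities(singular_values)
entropy = normalized_entropy(probabilities)
```

Under these assumptions, an entropy near 0 means a single eigengene dominates the dataset, while an entropy near 1 means the expression is spread evenly across all eigengenes.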