PUMAdb : Data selection for Analysis Help

Contents

Description
Gene Selection Options

Retrieving Genes
Collapsing Data for Genes
Describing genes in output files
Selecting biological annotation
Labelling arrays in output files

Data Filtering Options

Retrieving Data
Filtering by spot flag
Filtering by results criteria
Collapsing and Averaging Replicate Experiment Sets
Image presentation options

Gene Filtering Options

Transform single-channel data
Filter genes based on data distribution
Center data
Zero-transform genes for time course experiments
Filter genes based on data values
Filter genes and arrays based on the amount of data passing the spot filter criteria

Viewing Clustering Results
Clustering and Image Generation Options
Browsing, Viewing, and Downloading Clustered Data

Related Help Documents

Data Selection: Explanation of the program used to select hybridizations (arrays) for viewing or analyzing data
Analysis Methods: Information about the algorithms used for hierarchical clustering and Self-Organizing Maps (SOMs)
File Formats: Information about preclustering (.pcl), clustered data table (.cdt), gene tree (.gtr) and array tree (.atr) files generated in the process of clustering data

Description

The Data Selection for Analysis tool is available only after you have selected a set of hybridized arrays using either the Basic Search or the Advanced Search programs. Once a set has been selected, Data Selection for Analysis allows you to select genes or spots to cluster, and to filter data based on a variety of parameters. This tool can be used to generate a preclustering (.pcl) file, or the files needed for viewing a cluster with TreeView. In addition, Data Selection for Analysis will lead you to tools that will let you view clustered data via the web.

Data Selection for Analysis is split into three large steps:

Gene Selection & Annotation allows you to choose the genes or spots to retrieve for analysis, how to represent and annotate the genes and how to describe the hybridized arrays you've selected.
Data Filtering Options lets you choose which data column to retrieve, and filter the retrieved data based on the values of any of the data associated with the results.
Gene Filtering Options allows you to filter genes based on their data as well as to transform (center) data.

Gene Selection Options

Although we use the word 'gene,' it really refers to any DNA sample spotted on the microarrays. A 'gene' might be a PCR product representing an entire section of a gene, a portion of a gene, a clone associated with a gene, an intergenic region or anything at all.

This section allows you to specify which genes are of interest to you, decide how to collapse your data, choose how to identify genes in your output file, select biological annotation, and choose a way to label the arrays you're using.

Specify genes or clones for which to retrieve results:
Use one of the following three options to decide which genes on your arrays to retrieve data for. Only genes that have at least one piece of data will be included in the final .pcl file - see Choose the data column to retrieve, below.
- Use all genes/clones on arrays
  You can select all the genes/clones in the experiments you have selected.
- Select a list of genes
  This option selects the genes listed in a genelist file, and is available if you are an owner of a "loader" account. Shared standard files are available for many organisms. In addition, you may create your own precompiled list of genes. To do this, use the "genelists" directory that was created automatically in your loader account together with the account. Then create a tab-delimited text file whose first column contains the sequence NAME, SUID, LUID, or SPOT of each gene. (Example sequence names are YPR119W for yeast and HPY1808 for H. pylori. For cloned organisms (human, mouse, fly), clone IDs are used, e.g. IMAGE:1542757.) Names are case-sensitive; for example, the Plasmodb_ID 'PFC0885c' requires the trailing lowercase 'c'. Your files will appear in the pull-down menu under 'Select a list of genes.' Your file may contain additional columns for your own information, but the database will not read them. The one exception is when you check the "or keep annotation from genelist (if using one)" button in the "Biological Data To Select" section: if this radio button is checked, the second column is retained as annotation. The first line (header) of the genelist file should contain the appropriate label for the data in the file (NAME, SUID, LUID, or SPOT).
- Enter gene names
  You may enter gene names, one per line. All the genes you enter that have data in the chosen experiments will be selected. Use the systematic names of the database (e.g. clone IDs or ORF names, as appropriate), or the gene_name or other organism identifier (for example, plasmodb_id). All names are case sensitive. Examples of the systematic names appearing on the first selected array are provided, for guidance.
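As an illustration of the genelist layout described above, the following Python sketch checks a tab-delimited genelist before you upload it. It is not part of PUMAdb: the function name `read_genelist`, the second-column header `NOTE`, and the annotation text are hypothetical, and the only rules it encodes are the ones stated above (a header label of NAME, SUID, LUID, or SPOT; identifiers in the first column; an optional second column kept as annotation; case preserved).

```python
# Hypothetical helper for checking a genelist file locally; not a PUMAdb tool.
VALID_HEADERS = {"NAME", "SUID", "LUID", "SPOT"}

def read_genelist(text, keep_annotation=False):
    """Parse tab-delimited genelist text with a header row.

    Returns (id_type, entries), where entries is a list of
    (identifier, annotation) tuples. The annotation is None unless
    keep_annotation is True and a second column is present.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("genelist is empty")
    id_type = lines[0].split("\t")[0].strip()
    if id_type not in VALID_HEADERS:
        raise ValueError(f"unexpected header label: {id_type!r}")
    entries = []
    for line in lines[1:]:
        columns = line.split("\t")
        identifier = columns[0].strip()  # case is preserved: names are case-sensitive
        annotation = None
        if keep_annotation and len(columns) > 1:
            annotation = columns[1].strip()  # only the second column is retained
        entries.append((identifier, annotation))
    return id_type, entries

# Made-up example content: header row, then one gene per line.
example = (
    "NAME\tNOTE\tEXTRA\n"
    "YPR119W\tnote A\tnot read\n"
    "PFC0885c\tnote B\tnot read\n"
)
id_type, entries = read_genelist(example, keep_annotation=True)
print(id_type, entries)
```

Columns beyond the second are dropped here to mirror the statement that the database does not read them.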
Decide how to collapse/average data
When a single gene is represented more than once on an array, you can choose how to handle the different spots. When you retrieve by SUID, you will average the results from sequences with the same identifier in the database (the same SUID). On the other hand, if you retrieve data by LUID, you will only average data for spots that were derived from the same original microtiter well sample in the laboratory (those having the same LUID). You can also retrieve data by spot, which works only if all your arrays are from the same print. In this case, no averaging will be performed.
A new method of collapsing is now available that enables collapsing by ID/Annotation specific to the gene features for the organism, such as Gene Name, ORF Name, Cluster Id, etc. (This is similar to the existing Synthetic Genes, but not identical, because Synthetic Genes averages values that were already averaged (by SUID), while collapsing by ID/Annotation averages only once.) For the purpose of discussion, using the screenshot example below, we will collapse by ORF Name.

In the example above we have selected experiments from a yeast (SC) print, and have chosen to collapse and average the data by ORF Name. This will group and average all the SUIDs for each ORF Name into a single row in the PCL file. If any remaining SUIDs do not map to ORF Names, you have the option of discarding them or retaining them. Each collapsed ORF Name will list the number of SUIDs it represents in the annotation column of the PCL file. The checkbox "Genelist contains this field" is used only if you selected "Genelist" in the prior step, instead of "All". If the genelist in this example contained ORF Names, you would need to check the box to indicate that the genelist identifiers match the collapse field. Note: if you average by "gene" then you will not be able to create spot images during the next step in the pipeline. Retrieval of additional biological annotation is described below.
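The difference between averaging once and averaging values that were already averaged by SUID can be seen in a small Python sketch. This is an illustration only, not PUMAdb's code: the spot records, ORF names, and SUID numbers are invented, and it assumes that "averaging once" means taking a single mean over all values that map to the same ORF Name.

```python
from collections import defaultdict
from statistics import mean

# Invented records: (ORF name, SUID, retrieved value for one array).
spots = [
    ("ORF_A", 101, 1.0),
    ("ORF_A", 101, 2.0),
    ("ORF_A", 102, 6.0),
    ("ORF_B", 201, -1.0),
]

def average_once(records):
    """Collapse by ORF name with a single mean over all values (assumed behavior).

    Returns {orf: (mean value, number of distinct SUIDs collapsed)}.
    """
    values = defaultdict(list)
    suids = defaultdict(set)
    for orf, suid, value in records:
        values[orf].append(value)
        suids[orf].add(suid)
    return {orf: (mean(vals), len(suids[orf])) for orf, vals in values.items()}

def average_of_averages(records):
    """Average within each SUID first, then average the SUID means per ORF."""
    by_suid = defaultdict(list)
    for orf, suid, value in records:
        by_suid[(orf, suid)].append(value)
    per_orf = defaultdict(list)
    for (orf, _suid), vals in by_suid.items():
        per_orf[orf].append(mean(vals))
    return {orf: mean(vals) for orf, vals in per_orf.items()}

print(average_once(spots)["ORF_A"])         # (3.0, 2)
print(average_of_averages(spots)["ORF_A"])  # 3.75
```

The two results differ whenever the SUIDs for one ORF are represented by unequal numbers of spots, as SUID 101 (two spots) and SUID 102 (one spot) are here.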
Choose the contents of the UID column of the output file
If you like, you can label each row of data with the Sequence ID (SUID), the laboratory's microtiter well ID (LUID) and the spot identifier (SPOT). This information will be produced in your output preclustering file. For more information, see the File Format Help page.
This option is not available if you are collapsing by gene (see above).
Selecting biological annotation
Annotations can be concatenated within the second column of your retrieved results. Available annotations will vary, depending on the organism whose sample is being assayed. In the example above, "GENE NAME" is selected. You can select multiple types of biological data from the multiple-select menu, or you may check the "retain annotation from genelist (if using one)" button if you are using your own precompiled list of genes. For organism-specific details, please refer to the meta-data for your organism and the Organism Annotation Tables within the database schema. Note: If you are collapsing by "gene" (see above), requests for individual reporter-specific annotations (oligo sequence, tiling coordinates, sequence description) are ignored.
Choose a label for each array/hybridization
You can label each hybridized array with either the experiment name or the slide name in the output preclustering file. For more information, see the File Format Help page.

'Download Preclustering File' allows you to download the raw data to your machine for analysis using your own methods.
'Clustering and Image Generation' allows you to view the results after setting some final clustering and image generation options.

Data Analysis

PUMAdb allows you to perform some data analysis on your preclustering file, using either of two methods: hierarchical clustering or Self-Organizing Maps (SOMs).

Clustering Options

You have to define the following options when clustering hierarchically:

Whether to cluster genes, and if so whether to use a centered, or a non-centered metric.
The centered vs non-centered metric only applies if you are using the Pearson Correlation (see below). It will not make a difference if using the Euclidean distance.
Whether to cluster experiments
The same considerations apply for experiments as described for genes above.
Whether to use the Pearson Correlation or the Euclidean distance
These are the metrics used to measure the similarity of expression between genes.
Whether to Hierarchically Cluster, or make a Self Organizing Map.
If you choose 'Self Organizing Map Cluster', be sure to specify x and y dimensions. Your settings for hierarchical clustering described above will still be used when each partition of the SOM is clustered.
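The following Python sketch shows why the centered versus non-centered choice matters only for the Pearson Correlation. It assumes the conventional definitions used by common clustering software (the centered form subtracts each vector's mean before correlating; the non-centered form treats the mean as zero); see the Analysis Methods help document for what PUMAdb actually computes. The function names and the two example vectors are hypothetical, and missing values are not handled.

```python
import math

def pearson(x, y, centered=True):
    """Pearson correlation of two equal-length vectors.

    centered=True subtracts each vector's mean first (standard Pearson);
    centered=False assumes a mean of zero (the non-centered variant).
    """
    if centered:
        mean_x = sum(x) / len(x)
        mean_y = sum(y) / len(y)
    else:
        mean_x = mean_y = 0.0
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def euclidean(x, y):
    """Euclidean distance; there is no centered or non-centered variant."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two made-up expression profiles with the same shape, offset by a constant.
gene_a = [1.0, 2.0, 3.0]
gene_b = [11.0, 12.0, 13.0]

print(pearson(gene_a, gene_b, centered=True))   # identical shape: correlation of 1
print(pearson(gene_a, gene_b, centered=False))  # the offset lowers the value below 1
print(euclidean(gene_a, gene_b))                # depends on the absolute difference
```

Under these definitions, the centered metric ignores a constant offset between two profiles, the non-centered metric penalizes it, and the Euclidean distance depends on the absolute values.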

If you choose to generate a file of sorted correlations, the default correlation value is 0.8. Click 'Submit' when you have chosen the appropriate options.

Image Generation Options

Here are a couple of tips that will help you reduce the time it takes to analyze the experiments you selected.

Selecting 'Show spot images' will slow down the analysis.
Broken-up images load faster and can be navigated more quickly than unbroken images.

Browsing, Viewing, and Downloading Clustered Data

To interactively browse the clustered data, click the red and green image in the lower left-hand corner of the window. This takes you to the 'Hierarchical Cluster View' where you can focus on specific gene sub-clusters.

The map on the left contains the entire cluster, and its size can be changed by entering new parameters in the upper left-hand corner.
Clicking on this map changes the view of the graph on the right, which contains the experiment names as columns and gene names as rows.

You can view the clustered data in the following ways.

'View broken images' displays a .gif of the clustered genes based on the average retrieved value.
'View broken spot images' displays a .gif of the clustered genes. The spots of the experiment are displayed in a way that allows you to see the variation within the spot.
'View joint broken images' places both the above .gifs in the same window. If you don't see the broken spot image, scroll left to bring it onto your screen.
Clicking on 'pcl' at the bottom of the screen allows you to view the preclustering file.

The other links at the bottom of the screen download files to your machine.

'cdt' downloads the complete tree view datafile.
'gtr' downloads the genetree view datafile, which describes the tree of clustered genes.
If you chose an experiment clustering option on the previous page, you will also have the option to click on 'atr' to download the arraytree file.