PUMAdb : Synthetic Gene Collapse Help

What is a synthetic gene?

A "synthetic gene" is a group of "reporters" (clones, oligos, ORFs, etc.), together with some arithmetic method of combining their expression vectors. This might be used to simplify a clustering analysis; for instance, if proliferation genes cluster strongly in your data but are not interesting biologically, you might average their vectors, call this average the "proliferation cluster," and remove the actual data from the data set. Alternatively, one might average the expression of genes commonly upregulated in pancreatic cancers, in the hope of reinforcing a common trend predictive of survival.

The database supports re-annotation of .pcl files based on synthetic gene definitions, and/or simple, weighted averaging of sets of vectors in .pcl files. Thus, a synthetic gene might be defined as a list of all human clones mapping to proliferation genes, perhaps with some down-weighted if they are redundant with other clones. If a subset were found to be generally anti-correlated with the others, they could be assigned a negative weight; their vectors would then be inverted before averaging. If proliferation genes were of interest in an analysis, the synthetic gene vector could be added to the data set. On the other hand, if the proliferation genes were uninteresting and/or were found to drive clustering, their actual vectors could be removed from the data and replaced with the single synthetic vector, as a place-holder.
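
To make the effect of weights concrete, here is a minimal Python sketch of a weighted average in which a negative weight inverts a vector before it is combined with the others. It is an illustration only: the function name, the clone IDs, and in particular the normalization (dividing by the sum of absolute weights and skipping missing values) are assumptions, not a description of the formula the database actually uses.

```python
def weighted_average(vectors, weights):
    """Combine expression vectors into one synthetic vector.

    ASSUMPTION: each value is multiplied by its weight and the column sum
    is divided by the sum of absolute weights; missing values (None) are
    skipped. The database's real formula may differ.
    """
    n_cols = len(next(iter(vectors.values())))
    result = []
    for col in range(n_cols):
        total = 0.0
        norm = 0.0
        for name, vec in vectors.items():
            value = vec[col]
            if value is None:
                continue
            w = weights.get(name, 1.0)  # unlisted reporters default to weight 1
            total += w * value          # a negative weight inverts the value
            norm += abs(w)
        result.append(total / norm if norm else None)
    return result


# Hypothetical clone IDs and log-ratio values, for illustration only.
vectors = {
    "CLONE_A": [1.0, 2.0, -1.0],
    "CLONE_B": [3.0, 0.0, -3.0],
    "CLONE_C": [-2.0, -1.0, 2.0],  # anti-correlated with A and B
}
weights = {"CLONE_A": 1.0, "CLONE_B": 1.0, "CLONE_C": -1.0}

synthetic = weighted_average(vectors, weights)
print(synthetic)  # [2.0, 1.0, -2.0]
```

Because CLONE_C carries a weight of -1, its inverted values reinforce the trend of the other two clones instead of cancelling it out.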

Synthetic genes can also be used to capture the behavior of a class of genes, such as all proteases, or all genes in a given cytoband. This application will frequently combine genes that are entirely unrelated, but in some circumstances and for some purposes may be illuminating.

Available synthetic genes

At this time, the potential uses of synthetic genes are very much under investigation. The curated synthetic gene lists are in a state of flux and may change at any time. Synthetic genes are currently supported only for human and mouse data.

Selecting data

Synthetic gene operations take place after data retrieval and, optionally, before filtering based on expression patterns. Data must first be retrieved and the .pcl file placed in your repository. From the repository, click the "Synth" icon for the .pcl file of interest to select synthetic genes for averaging.

All operations ("collapse" or annotation - see below) are based on the row identifier in the .pcl file, which must exactly match the identifier used in the synthetic gene definition. Essentially this means that you must not select the "include UIDs" option in data retrieval, so that the row identifiers are the clone IDs or systematic names of the reporters.

Any number of sets of synthetic genes may be selected from the options presented, as well as any number of individual genelists from your loader genelists directory. An actual reporter (clone, oligo, etc.) may be included in multiple synthetic gene definitions, optionally with different weights in each.

Collapsing synthetic genes in data

If the "Average expression vectors" option is selected, all the data for all reporters (clone, oligo, etc.) included in each synthetic gene definition will be averaged together to produce a single, new expression vector for each synthetic gene selected. Before averaging, you must also decide whether to retain all original data, discard all original data, or discard any reporter included in a synthetic gene. The averaged synthetic gene vectors will be added to the .pcl file, original vectors will be removed according to your selection, and the .pcl file may then be filtered and/or clustered.
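
The three choices for the original data can be sketched as follows. This is a simplified Python illustration, not the database's code: the `collapse` function, the `keep` option names, and the use of a plain unweighted mean are all assumptions made for the example.

```python
from statistics import mean


def collapse(rows, synthetic_genes, keep="all"):
    """Add one averaged vector per synthetic gene to a table of rows.

    rows:            dict mapping row identifier -> list of expression values
    synthetic_genes: dict mapping synthetic gene name -> list of reporter IDs
    keep (hypothetical option names):
        "all"         retain all original data
        "none"        discard all original data
        "non_members" discard any reporter included in a synthetic gene
    """
    if keep not in {"all", "none", "non_members"}:
        raise ValueError(f"unknown keep option: {keep!r}")

    members = {rid for ids in synthetic_genes.values() for rid in ids}
    if keep == "all":
        out = dict(rows)
    elif keep == "none":
        out = {}
    else:
        out = {rid: vec for rid, vec in rows.items() if rid not in members}

    for name, ids in synthetic_genes.items():
        present = [rows[rid] for rid in ids if rid in rows]
        if present:  # skip definitions with no reporters in this file
            out[name] = [mean(column) for column in zip(*present)]
    return out


rows = {"A": [1.0, 3.0], "B": [3.0, 5.0], "C": [0.0, 0.0]}
genes = {"proliferation": ["A", "B"]}
collapsed = collapse(rows, genes, keep="non_members")
print(collapsed)  # {'C': [0.0, 0.0], 'proliferation': [2.0, 4.0]}
```

With `keep="non_members"`, reporters A and B are replaced by the single "proliferation" vector while the unrelated reporter C is left in place.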

The resulting averaged synthetic gene can be composed of a variable number of component expression vectors. If you wish to restrict the resulting file to averaged vectors built from at least a minimum number of component vectors, select the checkbox "averaged vector must be composed of >= N constituent expression vectors (a minimum threshold for reliability)", where N is the minimum number of component vectors required. This feature is primarily for array designs in which each composite sequence (i.e. an ORF or a "gene") is represented by multiple distinct reporters on the array.
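
The minimum-component threshold amounts to counting how many reporters from each definition are actually present in the file. A small Python sketch, with a hypothetical function name and made-up reporter IDs:

```python
def apply_min_components(definitions, row_ids, min_components):
    """Keep only synthetic genes with at least min_components reporters
    present among row_ids.

    Returns a dict mapping each surviving synthetic gene name to the list
    of its reporters that were found in the file.
    """
    present = set(row_ids)
    kept = {}
    for name, reporters in definitions.items():
        found = [r for r in reporters if r in present]
        if len(found) >= min_components:
            kept[name] = found
    return kept


definitions = {"GENE_X": ["r1", "r2", "r3"], "GENE_Y": ["r4", "r5"]}
row_ids = ["r1", "r2", "r4"]  # reporters actually present in the .pcl file

reliable = apply_min_components(definitions, row_ids, min_components=2)
print(reliable)  # {'GENE_X': ['r1', 'r2']}
```

Here GENE_Y is dropped because only one of its reporters is present, so its "average" would rest on a single vector.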

Adding synthetic gene annotations to data

If the "Add synthetic gene annotations" option is selected, the data in your .pcl file will not be altered, but the annotation for each row will be changed. The name (or annotation, if any) for each selected synthetic gene to which the reporter belongs will be prepended to the existing annotation. Optionally, you may un-select the "Retain original annotations" option, in which case any annotations present in the file before this operation will be discarded.
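
As an illustration of the annotation behavior, the following Python sketch prepends synthetic gene names to each row's annotation. The function name, the `" | "` separator, and the treatment of rows that belong to no synthetic gene (annotation left empty when originals are not retained) are assumptions; the actual output format may differ.

```python
def annotate(annotations, synthetic_genes, retain_original=True, sep=" | "):
    """Prepend synthetic gene names to row annotations.

    annotations:     dict mapping row identifier -> existing annotation text
    synthetic_genes: dict mapping synthetic gene name -> set of reporter IDs
    sep:             separator between labels (an assumption for this sketch)
    """
    result = {}
    for row_id, original in annotations.items():
        # Names of every selected synthetic gene that contains this reporter.
        parts = [name for name, ids in synthetic_genes.items() if row_id in ids]
        if retain_original and original:
            parts.append(original)
        result[row_id] = sep.join(parts)
    return result


# Hypothetical identifiers and annotations.
annotations = {"IMAGE:1": "CDC2", "IMAGE:2": "ACTB"}
genes = {"proliferation": {"IMAGE:1"}}

kept = annotate(annotations, genes)
replaced = annotate(annotations, genes, retain_original=False)
print(kept)      # {'IMAGE:1': 'proliferation | CDC2', 'IMAGE:2': 'ACTB'}
print(replaced)  # {'IMAGE:1': 'proliferation', 'IMAGE:2': ''}
```

In both cases the expression data themselves are untouched; only the annotation text changes.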

Defining your own synthetic genes

Any genelist may be used as a synthetic gene. It must include identifiers (clone IDs, systematic names, etc.) that are included in the .pcl file to be collapsed. Thus, only genelists of type "NAME" are allowed, and data must be retrieved without the "include UIDs" option. The genelist should contain a header row, with the column names "NAME" and "WEIGHT" (other columns will be ignored). If there is no header row, it will be assumed that the first column contains IDs, and the second contains weights. Weights must be numbers between -1 and 1, inclusive, excluding zero. Anything else will be interpreted as 1. (So if your genelists contain annotations in the second column, the annotations will be ignored and all weights will be set to 1.) The genelist filename will be used for the name/annotation of the synthetic gene.
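
The genelist rules above can be sketched in Python as follows. This assumes a tab-delimited file and case-sensitive header names, neither of which is stated above; `read_genelist` and `parse_weight` are hypothetical helper names, not part of the database.

```python
import csv
import io


def parse_weight(text):
    """Return a valid weight, or 1.0 for anything outside the allowed range."""
    try:
        weight = float(text)
    except (TypeError, ValueError):
        return 1.0  # annotations, blanks, or missing cells
    if -1.0 <= weight <= 1.0 and weight != 0.0:
        return weight
    return 1.0      # zero or out-of-range values


def read_genelist(text):
    """Parse genelist text into a dict mapping reporter ID -> weight.

    ASSUMPTION: columns are tab-delimited and a header row is recognized
    by the presence of a column named exactly "NAME".
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if not rows:
        return {}
    header = [cell.strip() for cell in rows[0]]
    if "NAME" in header:
        name_col = header.index("NAME")
        weight_col = header.index("WEIGHT") if "WEIGHT" in header else None
        rows = rows[1:]
    else:
        name_col, weight_col = 0, 1  # no header: IDs first, weights second
    weights = {}
    for row in rows:
        if len(row) <= name_col or not row[name_col].strip():
            continue
        has_weight = weight_col is not None and len(row) > weight_col
        raw = row[weight_col] if has_weight else None
        weights[row[name_col].strip()] = parse_weight(raw)
    return weights


with_header = "NAME\tWEIGHT\tNOTE\nIMAGE:1\t0.5\tx\nIMAGE:2\t-1\ty\nIMAGE:3\t0\tz\nIMAGE:4\t2\tw\n"
no_header = "IMAGE:1\tcell cycle\nIMAGE:2\t-0.5\n"

print(read_genelist(with_header))
# {'IMAGE:1': 0.5, 'IMAGE:2': -1.0, 'IMAGE:3': 1.0, 'IMAGE:4': 1.0}
print(read_genelist(no_header))
# {'IMAGE:1': 1.0, 'IMAGE:2': -0.5}
```

Note how the zero weight, the out-of-range weight, and the text annotation in the second column all fall back to a weight of 1.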