PUMAdb : Synthetic Gene Collapse Help

Contents

What is a synthetic gene?
Available synthetic genes
Selecting data
Collapsing synthetic genes in data
Adding synthetic gene annotations to data
Defining your own synthetic genes

What is a synthetic gene?

A "synthetic gene" is a group of "reporters" (clones, oligos, ORFs, etc.), together with some arithmetic method of combining their expression vectors. This might be used to simplify a clustering analysis; for instance, if proliferation genes cluster strongly in your data but are not interesting biologically, you might average their vectors, call this average the "proliferation cluster," and remove the actual data from the data set. Alternately, one might average the expression of genes commonly upregulated in pancreatic cancers, in hopes of reinforcing a common trend predictive of survival.

The database supports re-annotation of .pcl files based on synthetic gene definitions, and/or simple, weighted averaging of sets of vectors in .pcl files. Thus, a synthetic gene might be defined as a list of all human clones mapping to proliferation genes, perhaps with some down-weighted if they are redundant with other clones. If a subset were found to be generally anti-correlated with the others, they could be assigned a negative weight; their vectors would then be inverted before averaging. If proliferation genes were of interest in an analysis, the synthetic gene vector could be added to the data set. On the other hand, if the proliferation genes were uninteresting and/or were found to drive clustering, their actual vectors could be removed from the data and replaced with the single synthetic vector, as a place-holder.
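
The weighted averaging described above can be sketched in a few lines of Python. This is only an illustration, not the database's actual code: the function name weighted_average, the clone names, and the choice to divide by the sum of absolute weights and to skip missing values (None) are all assumptions, since the exact formula is not documented here.

```python
def weighted_average(vectors, weights):
    """Combine equal-length expression vectors into one synthetic vector.

    Assumption: each column is sum(w * x) / sum(|w|), taken over the
    reporters that have a value in that column. None marks a missing value.
    A negative weight inverts a reporter's vector before it is averaged.
    """
    n_cols = len(next(iter(vectors.values())))
    result = []
    for col in range(n_cols):
        numerator = 0.0
        denominator = 0.0
        for reporter, vector in vectors.items():
            value = vector[col]
            if value is None:
                continue  # missing measurement: leave it out of this column
            weight = weights.get(reporter, 1.0)  # unlisted reporters get 1
            numerator += weight * value
            denominator += abs(weight)
        result.append(numerator / denominator if denominator else None)
    return result


# Hypothetical clones: cloneB is down-weighted, cloneC is anti-correlated.
vectors = {
    "cloneA": [1.0, 2.0, None],
    "cloneB": [3.0, 0.0, 1.0],
    "cloneC": [-2.0, -1.0, -1.0],
}
weights = {"cloneA": 1.0, "cloneB": 0.5, "cloneC": -1.0}
synthetic = weighted_average(vectors, weights)
```

In this sketch the negative weight on cloneC flips its values, so it reinforces the other two clones instead of cancelling them.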

Synthetic genes can also be used to capture the behavior of a class of genes, such as all proteases, or all genes in a given cytoband. This application will frequently combine genes that are entirely unrelated, but in some circumstances and for some purposes may be illuminating.

Available synthetic genes

At this time, the potential uses of synthetic genes are very much under investigation. The curated synthetic gene lists are in a state of flux and may change at any time. Synthetic genes are currently supported only for human and mouse data.

UniGene clusters were used to construct lists of human and mouse clones that map to the same gene. Lists are named by gene symbol or UniGene cluster ID, and annotated with the gene name.
Curated UniGene lists are generated from an analysis of co-expression of clones mapped to the same UniGene clusters, using public human data in the database. The algorithm used is exploratory and subject to change. Lists contain only clones that were observed to correlate well across multiple data sets. If more than one group of clones (two or more) were observed to correlate well, but not with other clones mapped to the same cluster, the lists are identified with the gene name or UniGene cluster ID, followed by "__*1*", "__*2*", etc. Curated UniGene lists are available only for human clones.
LocusLink synthetic genes were constructed using UniGene's mapping of human and mouse clones to LocusLink loci. Only clones with a unique mapping to a single locus are included. The lists are named by LocusLink identifier (a number), for compatibility with GeneXPress software; gene symbols and names are included in the annotations.
Golden Path mappings were used to construct lists of all human clones and oligos in the database according to their genomic positions. Lists are available for chromosome arms, cytobands, and 5 Mb tiles, for human and mouse clones.
Transcript statistics were drawn or computed from non-redundant full-length transcript data from H-Inv DB. Lists were generated for approximately 5-percentile ranges of full transcript length, coding sequence length, 3' and 5' UTR lengths, number of exons, and GC content. Full-length transcripts were mapped to UniGene clusters, and then to clones/oligos. Full-length transcripts mapped to multiple UniGene clusters were discarded, because in general we cannot identify the splice (or other) variant to which each clone maps.
Cancer Modules were derived from the LocusLink-based modules from Segal et al. (2004). These are sets of genes which have been observed to be well correlated across many microarray data sets, derived from observed clusters, common GO annotations, individual pathways, etc. Each module was translated to the list of clones in the database which were uniquely mapped to a single LocusLink ID in the original module. Some of the modules have been assigned concise descriptions. Additional information about each module is available at the supplemental website for the manuscript.
Tentative synthetic gene lists are also available for human clones characteristically expressed in various tissues, cancer types, and biological processes. These lists are subject to rapid change as we investigate their definitions and use. The suffix "_sw" appended to the name of one of these lists indicates that the constituent clones were weighted +/- 1 based on a simple heuristic.
Any genelist in your loader genelist directory can also be used for synthetic gene averaging - see below.

Selecting data

Synthetic gene operations take place after data retrieval and, optionally, before filtering based on expression patterns. Data must be retrieved, and the .pcl file placed in your repository. From the repository, click on the "Synth" icon for a .pcl file of interest to select synthetic genes for averaging.

All operations ("collapse" or annotation - see below) are based on the row identifier in the .pcl file, which must exactly match the identifier used in the synthetic gene definition. Essentially this means that you must not select the "include UIDs" option in data retrieval, so that the row identifiers are the clone IDs or systematic names of the reporters.
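
Because matching is exact, it can be worth checking how many identifiers in a definition actually occur in a file before collapsing. The following Python sketch shows one way to do that on identifiers already read into lists; the function name match_report and the example IDs are hypothetical.

```python
def match_report(row_ids, definition_ids):
    """Split a synthetic gene's identifiers into those found in the file
    and those that are absent. Matching is exact and case-sensitive."""
    rows = set(row_ids)
    matched = [i for i in definition_ids if i in rows]
    missing = [i for i in definition_ids if i not in rows]
    return matched, missing


# Hypothetical row identifiers from a .pcl file and a definition.
row_ids = ["IMAGE:100", "IMAGE:200", "IMAGE:300"]
definition = ["IMAGE:100", "IMAGE:300", "IMAGE:999"]
matched, missing = match_report(row_ids, definition)
```

If nearly every identifier ends up in the missing list, the file was probably retrieved with a different kind of row identifier than the definition uses.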

Any number of sets of synthetic genes may be selected from the options presented, as well as any number of individual genelists from your loader genelists directory. An actual reporter (clone, oligo, etc.) may be included in multiple synthetic gene definitions, optionally with different weights in each.

Collapsing synthetic genes in data

If the "Average expression vectors" option is selected, all the data for all reporters (clone, oligo, etc.) included in each synthetic gene definition will be averaged together to produce a single, new expression vector for each synthetic gene selected. Before averaging, you must also decide whether to retain all original data, discard all original data, or discard any reporter included in a synthetic gene. The averaged synthetic gene vectors will be added to the .pcl file, original vectors will be removed according to your selection, and the .pcl file may then be filtered and/or clustered.

The resulting averaged synthetic gene can be composed of a variable number of component expression vectors. If you wish to constrain the resulting file to those vectors composed of a minimum number of precursor components, select the checkbox "averaged vector must be composed of >= N constituent expression vectors (a minimum threshold for reliability)", where N is the minimum number of component vectors required. This feature is primarily for those array designs where each composite sequence (i.e. ORF or a "gene") is represented by multiple distinct reporters on the array.
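
The collapse step, with its three retention choices and the minimum-component threshold, might look roughly like the Python sketch below. Everything named here (collapse, the keep values "all", "none" and "unused", min_components) is hypothetical. The sketch also makes two assumptions that the text does not settle: averages are normalized by the sum of absolute weights, and a reporter counts as "included in a synthetic gene" even when that gene falls below the threshold and is not emitted.

```python
def collapse(rows, synthetic_genes, keep="all", min_components=1):
    """rows: reporter ID -> expression vector (list of floats).
    synthetic_genes: gene name -> {reporter ID: non-zero weight}.
    keep: "all" retains every original row, "none" discards them all,
    "unused" discards only reporters that belong to a synthetic gene.
    """
    if keep not in ("all", "none", "unused"):
        raise ValueError("keep must be 'all', 'none' or 'unused'")
    n_cols = len(next(iter(rows.values())))
    averaged = {}
    used = set()
    for name, members in synthetic_genes.items():
        # Only reporters actually present in the data can contribute.
        present = {rid: w for rid, w in members.items() if rid in rows}
        used.update(present)
        if len(present) < max(min_components, 1):
            continue  # too few components: no averaged vector is emitted
        total = sum(abs(w) for w in present.values())
        averaged[name] = [
            sum(w * rows[rid][col] for rid, w in present.items()) / total
            for col in range(n_cols)
        ]
    if keep == "all":
        result = dict(rows)
    elif keep == "unused":
        result = {rid: v for rid, v in rows.items() if rid not in used}
    else:
        result = {}
    result.update(averaged)
    return result


rows = {"c1": [1.0, 3.0], "c2": [3.0, 5.0], "c3": [0.0, 0.0], "c4": [9.0, 9.0]}
genes = {"prolif": {"c1": 1.0, "c2": 1.0}, "lonely": {"c3": 1.0}}
collapsed = collapse(rows, genes, keep="unused", min_components=2)
```

Here "lonely" has a single component and is dropped by the threshold of 2, while c4, which belongs to no synthetic gene, survives alongside the averaged "prolif" vector.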

Adding synthetic gene annotations to data

If the "Add synthetic gene annotations" option is selected, the data in your .pcl file will not be altered, but the annotation for each row will be changed. The name (or annotation, if any) for each selected synthetic gene to which the reporter belongs will be prepended to the existing annotation. Optionally, you may un-select the "Retain original annotations" option, in which case any annotations present in the file before this operation will be discarded.
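
As a rough illustration of the annotation option, the Python sketch below prepends synthetic gene names to each row's annotation. The function name annotate, the "; " separator, and the example names are assumptions; the separator the database actually writes is not specified here.

```python
def annotate(annotations, synthetic_genes, retain_original=True, sep="; "):
    """annotations: reporter ID -> existing annotation text.
    synthetic_genes: gene name -> set of member reporter IDs.
    Returns new annotations; expression data are never touched.
    """
    result = {}
    for rid, original in annotations.items():
        # Names of every selected synthetic gene this reporter belongs to.
        labels = [name for name, members in synthetic_genes.items()
                  if rid in members]
        if retain_original and original:
            labels.append(original)  # original text goes after the names
        result[rid] = sep.join(labels)
    return result


annotations = {"c1": "cyclin B1", "c2": "", "c3": "albumin"}
genes = {"proliferation": {"c1", "c2"}, "chr5q": {"c1"}}
annotated = annotate(annotations, genes)
```

With retain_original=False, a reporter that belongs to no selected synthetic gene ends up with an empty annotation in this sketch.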

Defining your own synthetic genes

Any genelist may be used as a synthetic gene. It must include identifiers (clone IDs, systematic names, etc.) that are included in the .pcl file to be collapsed. Thus, only genelists of type "NAME" are allowed, and data must be retrieved without the "include UIDs" option. The genelist should contain a header row, with the column names "NAME" and "WEIGHT" (other columns will be ignored). If there is no header row, it will be assumed that the first column contains IDs, and the second contains weights. Weights must be numbers between -1 and 1, inclusive, excluding zero. Anything else will be interpreted as 1. (So if your genelists contain annotations in the second column, the annotations will be ignored and all weights will be set to 1.) The genelist filename will be used for the name/annotation of the synthetic gene.
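
The header and weight rules above can be summarized in a short Python sketch. It assumes a tab-delimited genelist and treats the first row as a header only if one of its cells is exactly "NAME"; both are assumptions, as are the function names parse_weight and parse_genelist and the example IDs.

```python
def parse_weight(text):
    """Return a weight in [-1, 1] excluding 0; anything else becomes 1."""
    try:
        weight = float(text)
    except (TypeError, ValueError):
        return 1.0  # missing cell or non-numeric text such as an annotation
    if weight == 0 or not -1.0 <= weight <= 1.0:
        return 1.0  # zero, out of range, or NaN
    return weight


def parse_genelist(text):
    """Parse genelist text into {identifier: weight}."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return {}
    header = [cell.strip() for cell in rows[0]]
    if "NAME" in header:
        id_col = header.index("NAME")
        weight_col = header.index("WEIGHT") if "WEIGHT" in header else None
        rows = rows[1:]  # other columns are ignored
    else:
        id_col, weight_col = 0, 1  # no header: IDs first, weights second
    weights = {}
    for cells in rows:
        if id_col >= len(cells) or not cells[id_col].strip():
            continue
        raw = None
        if weight_col is not None and weight_col < len(cells):
            raw = cells[weight_col]
        weights[cells[id_col].strip()] = parse_weight(raw)
    return weights


with_header = parse_genelist(
    "NAME\tDESC\tWEIGHT\n"
    "IMAGE:1\tfoo\t-1\n"
    "IMAGE:2\tbar\t0\n"
    "IMAGE:3\tbaz\t2.5\n"
    "IMAGE:4\tqux\t0.5\n"
)
no_header = parse_genelist("IMAGE:1\tsome annotation\nIMAGE:2\t-0.25\n")
```

In the headerless example the annotation in the second column is not a number, so IMAGE:1 receives a weight of 1, matching the rule stated above.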