PUMAdb : The Yeast Annotation Pipeline

Contents

Description
Annotation Pipeline
Algorithm for Locating Nearest Features

Description

This help document describes the process by which annotations are assigned to Saccharomyces cerevisiae.

Annotation Pipeline

Annotations are assigned to Saccharomyces cerevisiae (SC) in a two-step process. The first step is to download the most recent GFF file from the Saccharomyces Genome Database (SGD) and parse out ORFs and other gene features, extracting the feature type, map coordinates, strand, and gene name for each. GO terms are then associated with the gene features, based on data from the Gene Ontology database.
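
As a rough illustration of the first step, the sketch below pulls feature type, coordinates, strand, and a name out of GFF lines. It is not the pipeline's actual parser: the function name parse_gff, the Feature record, the default wanted_types, and the choice of the "gene", "Name", and "ID" attribute keys are all assumptions made for the example.

```python
from typing import Iterable, Iterator, NamedTuple
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str    # chromosome / reference sequence
    ftype: str    # feature type (GFF column 3)
    start: int    # 1-based start coordinate
    end: int      # 1-based inclusive end coordinate
    strand: str   # "+" or "-"
    name: str     # gene name, falling back to Name, then ID


def parse_gff(lines: Iterable[str],
              wanted_types=("gene",)) -> Iterator[Feature]:
    """Yield one Feature per GFF row whose type is in wanted_types.

    wanted_types is a hypothetical filter; adjust it to the feature
    types actually present in the file being parsed.
    """
    for line in lines:
        if line.startswith("##FASTA"):
            break                      # sequence section follows; stop
        if not line.strip() or line.startswith("#"):
            continue                   # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue                   # not a well-formed feature row
        seqid, _src, ftype, start, end, _score, strand, _phase, attrs = cols
        if ftype not in wanted_types:
            continue
        # Column 9 is a ";"-separated list of key=value pairs.
        attr = dict(p.split("=", 1) for p in attrs.split(";") if "=" in p)
        # Assumed preference order for the display name.
        name = unquote(attr.get("gene") or attr.get("Name") or attr.get("ID", ""))
        yield Feature(seqid, ftype, int(start), int(end), strand, name)


# Illustrative rows in GFF3 layout.
sample = [
    "##gff-version 3\n",
    "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W;Name=YAL069W\n",
    "chrI\tSGD\tCDS\t335\t649\t.\t+\t0\tParent=YAL069W_mRNA\n",
    "chrI\tSGD\tgene\t1807\t2169\t.\t-\t.\tID=YAL068C;Name=YAL068C;gene=PAU8\n",
]
features = list(parse_gff(sample))
```

Associating GO terms would then be a lookup keyed on the feature name or ID; that part is omitted because the source does not describe how the Gene Ontology data is stored.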

The second step is to BLAST the yeast oligos against the genome and the transcriptome. Oligos from yeast expression arrays are BLASTed against ORF coding and RNA genomic sequences, and the top feature is assigned to the oligo (the other hits are recorded as well). Oligos from ChIP-on-Chip arrays are BLASTed against the yeast genome, and the two nearest features are identified using the algorithm below. We use an e-value threshold of 1e-9, and BLAST against the plus strand only for expression oligos and against both strands for ChIP oligos.
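
The hit selection described above can be sketched as a filter over BLAST results. This example assumes the standard 12-column tabular BLAST output and ranks hits by bit score; the source does not say which output format is used or how the "top" hit is ranked, so both are assumptions, as are the names Hit, parse_hit, and rank_hits. The strand restriction is applied here as a post-filter (in tabular output a minus-strand hit has subject start greater than subject end), whereas the pipeline may well restrict the strand in the BLAST search itself.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str       # oligo identifier
    subject: str     # matched sequence identifier
    sstart: int      # subject start
    send: int        # subject end
    evalue: float
    bitscore: float


def parse_hit(line: str) -> Hit:
    """Parse one row of 12-column tabular BLAST output (assumed format)."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], int(f[8]), int(f[9]), float(f[10]), float(f[11]))


def rank_hits(hits: List[Hit], max_evalue: float = 1e-9,
              plus_only: bool = True) -> List[Hit]:
    """Keep hits at or below the e-value threshold, best first.

    plus_only=True mimics the expression-oligo case (plus strand only);
    plus_only=False mimics the ChIP-oligo case (both strands).
    Ranking by bit score is an assumption of this sketch.
    """
    kept = [
        h for h in hits
        if h.evalue <= max_evalue and (not plus_only or h.sstart <= h.send)
    ]
    return sorted(kept, key=lambda h: (-h.bitscore, h.evalue))


# Usage: the first element is the "top" hit, the rest are also recorded.
rows = [
    "oligo1\tYAL001C\t100.0\t70\t0\t0\t1\t70\t200\t269\t1e-30\t130.0",
    "oligo1\tYBR002W\t95.0\t70\t3\t0\t1\t70\t500\t431\t1e-20\t110.0",
    "oligo1\tYCR003W\t80.0\t40\t8\t0\t1\t40\t10\t49\t1e-3\t40.0",
]
hits = [parse_hit(r) for r in rows]
expression_hits = rank_hits(hits, plus_only=True)
chip_hits = rank_hits(hits, plus_only=False)
```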

Algorithm for Locating Nearest Features

ChIP oligos are BLASTed against the genome and a window around the coordinates of the top hit is examined for features. The top two features are recorded with the genomic BLAST hit, and the top feature is linked to the corresponding oligo. The algorithm used to identify the top features is from Agilent; we have slightly modified it. Here are the steps taken with each BLAST result.

Establish a window of 10,000 bases in each direction around the hit and collect all the features that lie completely within the window. For each feature we calculate its distance to the transcription start site (TSS) and place it in one of three buckets, "inside", "left", or "right", depending on its position relative to the hit.
Candidates which are contained in or overlap the probe are placed in an "inside" bucket. All others are placed into "left" and "right" buckets.
If there are any "inside" candidates, the left and right buckets are examined and any candidate which does not overlap any of the inside candidates is removed. Agilent states that "this will allow a probe to be labeled as reporting for a promoter region if two genes overlap in genomic space." If only inside candidates remain, choose the one with the shortest TSS distance.
Next we move all remaining downstream features from the left and right buckets into a single "downstream" bucket and record the downstream feature with the shortest TSS distance.
Look for divergent promoters in the left and right buckets. Divergent promoters are on opposite strands, coding in opposite directions. If more than one pair is found, identify the pair with the shortest distance between its two members. Pick the feature in that pair with the shortest TSS distance. If we recorded a downstream feature in the previous step, compare its TSS distance with the promoter's and choose the feature with the shorter one. Otherwise, choose the promoter.
If we have promoters and downstreams, and the downstream's TSS distance is shorter than the promoter's, choose the downstream. (This was not explicitly stated in Agilent's description, but yielded better matching results.)
If there are left and/or right promoters, choose the one with the shortest TSS distance.
If there are no promoters but there are insides, choose the inside with the shortest TSS distance.
If there are downstreams, choose the one with the shortest TSS distance.
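
The windowing and bucketing at the start of the steps above can be sketched as follows. This covers only the first two steps, not the full selection logic, and it rests on assumptions the text does not spell out: the TSS is taken to be the feature's start on the plus strand and its end on the minus strand, and the TSS distance is measured from the midpoint of the hit. The names Feature, tss, and bin_features are made up for the example.

```python
from typing import Dict, List, NamedTuple, Tuple

WINDOW = 10_000  # bases examined on each side of the hit


class Feature(NamedTuple):
    name: str
    start: int    # leftmost coordinate
    end: int      # rightmost coordinate
    strand: str   # "+" or "-"


def tss(f: Feature) -> int:
    """Assumed TSS: the 5' end of the feature on its own strand."""
    return f.start if f.strand == "+" else f.end


def bin_features(features: List[Feature], hit_start: int, hit_end: int,
                 window: int = WINDOW) -> Dict[str, List[Tuple[int, Feature]]]:
    """Bucket features around a hit as "inside", "left", or "right".

    Each bucket holds (tss_distance, feature) pairs sorted by distance,
    so the first entry is the candidate with the shortest TSS distance.
    """
    lo, hi = hit_start - window, hit_end + window
    mid = (hit_start + hit_end) // 2   # assumption: measure from hit midpoint
    buckets: Dict[str, List[Tuple[int, Feature]]] = {
        "inside": [], "left": [], "right": [],
    }
    for f in features:
        if f.start < lo or f.end > hi:
            continue                   # not completely within the window
        if f.end < hit_start:
            side = "left"
        elif f.start > hit_end:
            side = "right"
        else:
            side = "inside"            # contained in or overlapping the hit
        buckets[side].append((abs(tss(f) - mid), f))
    for entries in buckets.values():
        entries.sort(key=lambda pair: pair[0])
    return buckets


# Usage with invented coordinates: a hit at 5000-5060.
candidates = [
    Feature("A", 1000, 2000, "-"),
    Feature("B", 5050, 6000, "+"),
    Feature("C", 7000, 8000, "+"),
    Feature("D", 14000, 16000, "+"),   # extends past the window, so dropped
    Feature("E", 9000, 9500, "-"),
]
buckets = bin_features(candidates, 5000, 5060)
```

The later steps (pruning left and right candidates, collecting downstream features, and finding divergent promoter pairs) would operate on these buckets; they are left out because they depend on details of Agilent's description that this document only summarizes.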