PUMAdb : Retrieving Public Data

Contents

Description
Retrieving all public data for an organism
Retrieving all public data for a publication
File formats

Description

Once data stored in the database have been used in a publication, the full raw data are made freely available to the public. In addition, at an experimenter's discretion, unpublished data may also be made publicly available. Search interfaces for selecting the particular experiments whose raw data you wish to download are described elsewhere (Advanced Results Search | Basic Search). This document describes how to download raw data for one or many experiments, such as all public data for an organism or all public data for a publication. It also describes the formats of the downloaded files.

Retrieving all public data for an organism

There is an ftp site where the public data for each organism are in separate directories, with one file per experiment. The base address for this ftp site is:

ftp://gen-ftp.princeton.edu/puma/organisms

Under this directory is one directory per organism, each named with the two-letter code that PUMAdb uses to indicate the organism. A list of the organisms and their two-letter codes is available, as is a simple interface to the directories that shows each organism and the number of available experiments.

Within each organism's directory is one file per public experiment. These files are gzipped, so they will need to be unzipped prior to use (using WinZip, StuffIt Expander, or gunzip, depending on your platform). Further details of the file format may be found below.
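If you prefer to script the unzipping step, the following is a minimal sketch in Python using only the standard library. The function names gunzip_file and gunzip_directory are hypothetical helpers written for this illustration, not part of PUMAdb, and the sketch assumes only that the downloaded files end in .gz.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_original=True):
    """Decompress one .gz file next to the original and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"not a .gz file: {gz_path}")
    out_path = gz_path.with_suffix("")  # strip only the trailing .gz
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_original:
        gz_path.unlink()
    return out_path


def gunzip_directory(directory):
    """Decompress every *.gz file in a directory and return the output paths."""
    return [gunzip_file(p) for p in sorted(Path(directory).glob("*.gz"))]
```

Calling gunzip_directory on the folder that holds your downloads leaves the compressed originals in place and writes the decompressed files beside them.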

Retrieval by web client
The exact method you use to retrieve all the files from an organism's directory depends on your ftp client of choice. If you are simply using a web browser, such as Netscape or Internet Explorer, clicking on each file and downloading them one at a time will work. However, we recommend an alternative method for downloading many experiments, as using a web browser will be tedious and time-consuming.
Retrieval by graphical ftp client
A graphical ftp client, such as Fetch on the Macintosh or FTP Explorer on the PC, will allow you to connect to an ftp site, select one, several, or all files in a directory, and then download them. This will likely save you a lot of time. The example below uses Fetch on Mac OS X:
First, log in, using your email address as the password.

Then select all files and click "Get...".
Retrieval using a command line ftp client
A command line ftp client can be used to easily retrieve the entire contents of a directory from an ftp site. Users on a Unix system can typically run ftp from the command line. The following example is taken from the command line in Mac OS X:

Note that the -i switch means that when you retrieve the files (using mget *gz), ftp does not ask you to confirm each one; it just gets them all. Command line ftp is likely to be the quickest way of retrieving all the files.
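The same bulk retrieval can also be scripted. The sketch below uses Python's standard ftplib module and is an illustration, not a PUMAdb tool. It assumes that the server accepts anonymous logins with an email address as the password and that the public files end in .gz. The function names and the placeholder email address are hypothetical. The host and base directory are taken from the ftp address given earlier.

```python
from ftplib import FTP
from pathlib import Path

HOST = "gen-ftp.princeton.edu"   # host part of the base ftp address
BASE_DIR = "/puma/organisms"     # path part of the base ftp address


def download_gz_files(ftp, remote_dir, local_dir):
    """Fetch every *.gz file in remote_dir over an already open connection."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    ftp.cwd(remote_dir)
    saved = []
    for name in ftp.nlst():
        if not name.endswith(".gz"):
            continue  # skip anything that is not a gzipped experiment file
        target = local_dir / name.rsplit("/", 1)[-1]
        with open(target, "wb") as handle:
            ftp.retrbinary(f"RETR {name}", handle.write)
        saved.append(target)
    return saved


def fetch_organism(code, local_dir, email="anonymous@example.org"):
    """Log in anonymously and download all public files for one organism.

    `code` is the organism's two-letter directory name; the default email
    is a placeholder and should be replaced with your own address.
    """
    with FTP(HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        return download_gz_files(ftp, f"{BASE_DIR}/{code}", local_dir)
```

Like mget with the -i switch, this downloads every matching file without asking for confirmation. Separating download_gz_files from the login step lets you reuse a single connection for several organism directories.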

Retrieving all public data for a publication

PUMAdb provides a simple interface that lists all publications whose data reside in PUMAdb. Clicking the Data in PUMAdb icon for a publication shows a list of the experiment sets included in that paper. There you will find a link to download all the files for an experiment set as a single gzipped tar file.

File formats

Currently PUMAdb uses its own, somewhat ad hoc file format for raw data files. To indicate the organization of experiment sets within a publication, and of experiments within a set, PUMAdb produces .meta files. These are not strict XML, though they look somewhat like it. In future we will use the MAGE-ML standard, which is currently being defined.

A meta file for a publication, named publication_XX.meta, where XX is PUMAdb's internal publication number, looks something like:


<publication>
!Citation=Spellman PT et al.(1998) Mol Biol Cell 9:3273-97
!Title=Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
!PubMedID=9843569
	<experiment_set>
		!Name=Spellman et al : Alpha-factor block-release
		!ExptSetNo=526
		!Description=These data are from a time series, where yeast were arrested in alpha-factor, then the alpha-factor was washed out, and the cells were release into fresh medium. Samples were taken every 7 minutes as the cells went through the cell cycle.
	</experiment_set>
	<experiment_set>
		!Name=Spellman et al : cdc15 block-release
		!ExptSetNo=528
		!Description=Yeast cells were blocked in telophase using a cdc15-2 temperature senstive mutant at restrictive temperature. The culture was then shifted to permissive temperature (25oC), and released into the cell cycle. Sample were then taken during the course of almost three full cell cycles.
	</experiment_set>
	<experiment_set>
		!Name=Spellman et al : Elutriation time course
		!ExptSetNo=529
		!Description=Small G1 daughter yeast cells were isolated by centrigugal elutriation. They were then released into YEP ethanol, and followed through one cell cycle, with samples being taken every 30 minutes.
	</experiment_set>
	<experiment_set>
		!Name=Spellman et al : Cyclin overexpression
		!ExptSetNo=530
		!Description=Yeast cell were arrested either in G1 (for CLN3 overexpression) or in G2/M (for CLB2 overexpression). The cyclin was then induced, and samples were taken.
	</experiment_set>
</publication>

Each individual item, such as an experiment set or a publication, is enclosed by a pair of tags that indicate its start and end. Comments about an item begin with an exclamation point, followed by the name of the type of information, an equals sign, and the value.
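Because .meta files are not strict XML, a standard XML parser will not read them. The following is a minimal sketch of a line-based reader in Python, assuming the layout shown in the publication example: each tag sits on a line of its own, and every other non-blank line has the form !Name=value. The function parse_meta and the dictionary layout it returns are hypothetical, chosen for this illustration.

```python
import re

# A line holding exactly one opening or closing tag, e.g. <experiment_set>
TAG_RE = re.compile(r"^<(/?)(\w+)>$")


def parse_meta(text):
    """Parse .meta text into nested dicts with 'tag', 'fields', 'children'."""
    root = {"tag": None, "fields": {}, "children": []}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        tag = TAG_RE.match(line)
        if tag:
            closing, name = tag.groups()
            if closing:
                if len(stack) == 1 or stack[-1]["tag"] != name:
                    raise ValueError(f"unexpected closing tag: {line}")
                stack.pop()
            else:
                node = {"tag": name, "fields": {}, "children": []}
                stack[-1]["children"].append(node)
                stack.append(node)
        elif line.startswith("!"):
            # Split on the first '=' only, so values may contain '='
            key, _, value = line[1:].partition("=")
            stack[-1]["fields"][key.strip()] = value.strip()
        else:
            raise ValueError(f"unrecognized line: {line}")
    if len(stack) != 1:
        raise ValueError(f"unclosed tag: {stack[-1]['tag']}")
    return root["children"]
```

Applied to the example file, parse_meta returns a one-element list for the publication, whose "children" entry holds one dictionary per experiment set, with values such as ExptSetNo available under "fields".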

The organization of the .meta file for an experiment set is very similar. Each individual experiment file begins with a series of comment lines, e.g.:


!Exptid=29
!Experiment Name=alpha factor release sample016
!Organism=Saccharomyces cerevisiae
!Category=Cell-cycle
!Subcategory=Alpha factor block-release
!Experimenter=Paul Spellman
!Contact email=spellman@genomics.stanford.edu
!Contact Address1=School of Medicine
!Contact Address2=Department of Genetics
!State=CA
!Country=USA
!Postal Code=94305
!SlideName=y744n101
!Printname=y744
!Tip Configuration=Standard 4-tip
!Columns per Sector=44
!Rows per Sector=44
!Column Spacing=135
!Row Spacing=135
!Channel 1 Description=asynchronous control (prep3)
!Channel 2 Description=16
!Scanning Software=ScanAlyze
!Software version=2.03
!Scanning parameters=

These lines are followed by the data themselves. The data include all the raw data from the image scan, the biological annotation attributed to each spot, and tracking information about the microtiter plates from which the samples were printed. The data are tab-delimited. For definitions of the various columns, please see the table specifications for the RESULT and PLATESAMPLE tables, as well as the relevant annotation tables.
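As an illustration of this layout, the following Python sketch separates the leading comment lines from the tab-delimited data in an unzipped experiment file. It assumes that all comment lines come before the first data line. It makes no assumption about whether the first data line is a row of column names, so every row is returned as a plain list of strings. The function name parse_experiment is hypothetical.

```python
def parse_experiment(text):
    """Split an experiment file into (header_fields, data_rows).

    header_fields maps each leading '!Name=value' comment to its value;
    data_rows is a list of rows, each a list of tab-separated strings.
    """
    fields = {}
    rows = []
    in_header = True
    for line in text.splitlines():
        if in_header and line.startswith("!"):
            # Split on the first '=' only; empty values are kept as ""
            key, _, value = line[1:].partition("=")
            fields[key.strip()] = value.strip()
        elif not line.strip():
            continue  # ignore blank lines anywhere in the file
        else:
            in_header = False
            rows.append(line.split("\t"))
    return fields, rows
```

With the comment block shown above, fields["Exptid"] would be "29" and fields["Scanning parameters"] would be an empty string. How to interpret the individual columns is left to the RESULT and PLATESAMPLE table specifications.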