####################################################
README for http://polya.umdnj.edu/polyadb/download/
Last updated: September 13, 2004
E-mail: hz5@njit.edu
####################################################

This directory contains the following files:
============================================

README -- this file, an explanation of the PolyA_DB files.

EST_cDNA_RefSeq -- a tab-delimited file containing all EST/cDNA/RefSeq sequences remaining after the filtering procedures (see Zhang et al., Nucleic Acids Res. 2004).
Columns are:
    1)gbid: GenBank accession of the EST/cDNA/RefSeq.
    2)llid: LocusLink ID of the gene to which the EST/cDNA/RefSeq is mapped.
    3)length: length of the EST/cDNA/RefSeq.
    4)tail: tail of the EST/cDNA/RefSeq, one of the following:
        A: poly(A).
        T: poly(T).
        a: potential internal priming poly(A).
        t: potential internal priming poly(T).
        N: no tail.
    6)supportsite: poly(A) site ID if the EST/cDNA/RefSeq supports the existence of a poly(A) site, NULL otherwise.

EST_Lib_Info -- a tab-delimited file containing EST library information, derived from the NCBI UniGene database.
Columns are:
    1)libid: EST library ID.
    2)title: EST library title.
    3)tissue: EST library tissue information.
    4)organ: EST library organ information.
    5)vector: vector information.

EST_Lib_Map -- a tab-delimited file containing the mapping of ESTs to libraries, derived from the NCBI UniGene database.
Columns are:
    1)gbid: EST GenBank ID.
    2)libid: EST library ID.

exon -- a tab-delimited file containing exon mapping information.
Columns are:
    1)gbid: EST/cDNA/RefSeq GenBank ID.
    2)rankorder: rank of the exon, counted from 5' to 3' of the gene, starting from 1.
    3)contig: contig ID, NT_######.
    6)efrom: exon 'from' coordinate on the EST/cDNA/RefSeq.
    7)eto: exon 'to' coordinate on the EST/cDNA/RefSeq.
    8)gfrom: exon 'from' coordinate mapped on the contig.
    9)gto: exon 'to' coordinate mapped on the contig.

gene -- a tab-delimited file containing information on genes, as represented in LocusLink.
Columns are:
    1)llid: LocusLink ID.
    2)contig: ID of the contig on which the gene is located, NT_######.
    3)strand: strand of the gene, 1 (+) or -1 (-).
    4)symbol: official symbol of the gene.
    5)name: gene name.
    6)unigene: UniGene ID.
    7)org: organism, hs (Homo sapiens) or mm (Mus musculus).
    8)gbf: gene boundary 'from', defined by the RefSeq of this gene; the coordinate is relative to the contig on which the gene is located.
    9)gbt: gene boundary 'to', defined by the RefSeq of this gene; the coordinate is relative to the contig on which the gene is located.

RefSeq -- a tab-delimited file containing information on RefSeqs.
Columns are:
    1)RefSeq ID: GenBank accession of the RefSeq.
    2)length: transcript length.
    3)protein: GenBank accession of the corresponding protein product.
    4)start: start codon position; the coordinate is on the contig to which this RefSeq is aligned (see file "exon").
    5)stop: stop codon position; the coordinate is on the contig to which this RefSeq is aligned (see file "exon").

hs_mm_ortholog -- a tab-delimited file containing human and mouse ortholog pairs, derived from the NCBI HomoloGene database.
Columns are:
    1)llhs: LocusLink ID of the Homo sapiens ortholog gene.
    2)llmm: LocusLink ID of the Mus musculus ortholog gene.

PAS -- a tab-delimited file containing information on poly(A) hexamer signals.
Columns are:
    1)siteid: poly(A) site ID, in the format p.###.*, where p is the letter p, ### is the LocusLink ID of the corresponding gene, and * is a rank number (all poly(A) sites of the gene are ranked from 5' to 3' of the transcript).
    2)pas: poly(A) hexamer signal ID, as specified in signal.gz:
        signalid  ann
        1         AAUAAA
        2         AUUAAA
        3         AGUAAA
        4         UAUAAA
        5         CAUAAA
        6         GAUAAA
        7         AAUAUA
        8         AAUACA
        9         AAUAGA
        10        ACUAAA
        11        AAGAAA
        12        AAUGAA
    3)position: coordinate of the 3'-most position of the hexamer (the end toward the poly(A) site), numbered from -1 to -40 relative to the poly(A) site.

polyAsite -- a tab-delimited file containing information on the poly(A) sites identified.
Columns are:
    1)siteid: poly(A) site ID, in the format p.###.*, where p is the letter p, ### is the LocusLink ID of the corresponding gene, and * is a rank number (all poly(A) sites of the gene are ranked from 5' to 3' of the transcript).
    2)llid: LocusLink ID of the corresponding gene.
    3)contig: contig ID, NT_######.
    4)rankorder: rank number of the site (all poly(A) sites of the gene are ranked from 5' to 3' of the transcript).
    5)position: coordinate of the poly(A) site on the contig.
    6)supportest: number of ESTs/cDNAs/RefSeqs supporting the existence of this poly(A) site.
    7)cleavage: number of imprecise/heterogeneous cleavage sites for this poly(A) site.

signal -- a tab-delimited file containing poly(A) hexamer signals and their IDs.
Columns are:
    1)signalid: ID of the poly(A) hexamer signal.
    2)ann: annotation, the hexamer sequence.

References:
===========

Zhang, H., Hu, J., Recce, M. and Tian, B. (2004) PolyA_DB: a database for mammalian mRNA polyadenylation. Nucleic Acids Research (submitted).

Beaudoing, E., Freier, S., Wyatt, J.R., Claverie, J.M. and Gautheret, D. (2000) Patterns of variant polyadenylation signal usage in human genes. Genome Res, 10, 1001-1010.

Burset, M., Seledtsov, I.A. and Solovyev, V.V. (2001) SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Res, 29, 255-259.
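
Example: reading the PAS file
=============================

The following Python sketch is not part of the PolyA_DB distribution. It illustrates one way to read rows in the PAS layout described above: split the siteid (p.###.*) into its LocusLink ID and rank, and translate the signal ID into its hexamer using the table listed under PAS. It assumes that the file has already been decompressed, has no header row, uses the column order given above, and that both parts of the siteid are plain integers. The names parse_site_id and read_pas, and the sample row in the demo, are made up for illustration.

```python
import csv
import io
import re

# Signal IDs and hexamers, copied from the table in the PAS section.
PAS_SIGNALS = {
    1: "AAUAAA", 2: "AUUAAA", 3: "AGUAAA", 4: "UAUAAA",
    5: "CAUAAA", 6: "GAUAAA", 7: "AAUAUA", 8: "AAUACA",
    9: "AAUAGA", 10: "ACUAAA", 11: "AAGAAA", 12: "AAUGAA",
}

# Assumption: the LocusLink ID and the rank are both plain digits.
SITE_ID_RE = re.compile(r"^p\.(\d+)\.(\d+)$")


def parse_site_id(siteid):
    """Split 'p.<llid>.<rank>' into a (llid, rank) pair of integers."""
    match = SITE_ID_RE.match(siteid.strip())
    if match is None:
        raise ValueError("unexpected poly(A) site ID: %r" % siteid)
    return int(match.group(1)), int(match.group(2))


def read_pas(handle):
    """Yield one dict per row of a PAS-style file (siteid, pas, position).

    Assumption: no header row; blank or short lines are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue
        siteid, pas, position = row[0], row[1], row[2]
        llid, rank = parse_site_id(siteid)
        yield {
            "siteid": siteid,
            "llid": llid,
            "rank": rank,
            # None if the signal ID is not in the table above.
            "hexamer": PAS_SIGNALS.get(int(pas)),
            # -1 to -40 relative to the poly(A) site.
            "position": int(position),
        }


if __name__ == "__main__":
    # Made-up sample row, not taken from the database.
    sample = io.StringIO("p.1234.2\t1\t-21\n")
    for record in read_pas(sample):
        print(record)
```

To read a real file, pass an open file handle instead of the StringIO object, for example open("PAS", newline=""). The same parse_site_id function applies to the siteid column of the polyAsite file.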