PolyA_SVM: prediction of mRNA polyadenylation sites by Support Vector Machine

INTRODUCTION:

This program takes a file containing DNA/RNA sequences in the FASTA format as input, and makes predictions for putative mRNA polyadenylation sites [poly(A) sites] and/or generates results indicating the occurrences of different cis-elements.


NOTE:

Prediction mode

1. Predicts poly(A) sites with the site range, site position, and PolyA_SVM scores.

2. The program makes no prediction for sequences shorter than 120 nt. Such sequences are too short for prediction, so no result is given.

3a. Multiple positive regions are reported on separate lines.
3b. + indicates a positive hit and - indicates a negative hit. Negative hits are filtered out of the output unless a location is specified, in which case '-' is shown.
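
The 120 nt minimum in note 2 can be checked before submitting a file. The following is a minimal sketch, not the program's own code; the function names read_fasta and split_by_length are hypothetical, and it assumes that whitespace inside a sequence is only line wrapping.

```python
MIN_LENGTH = 120  # minimum sequence length stated in note 2


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Assumption: internal whitespace is only line wrapping, so drop it.
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_by_length(records, min_length=MIN_LENGTH):
    """Separate records long enough for prediction from those that are not."""
    usable = [(h, s) for h, s in records if len(s) >= min_length]
    skipped = [h for h, s in records if len(s) < min_length]
    return usable, skipped


example = ">long\n" + "ACGT" * 40 + "\n>short\nACGTACGT\n"
usable, skipped = split_by_length(read_fasta(example))
print([h for h, _ in usable], skipped)
```

Sequences listed in skipped would receive no result in Prediction mode.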

Finding mode

1. Finding mode generates sequences of symbols corresponding to the matching results for all cis-elements. The following symbols are used for similarity to the consensus:

+ highly similar, over the 75th percentile of all possible positive scores.
| very similar, between the 50th-75th percentiles of all possible positive scores.
: similar, between the 25th-50th percentiles of all possible positive scores.
. somewhat similar, 0-25th percentiles of all possible positive scores.
- negative score.
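
The symbol table above can be read as a lookup from a score to a percentile band. The sketch below illustrates one way to do this and is not the program's implementation. It assumes that the list of all possible positive scores for a cis-element is available (here a made-up list), that the percentile is the fraction of those scores strictly below the given score, and that a score of zero is reported like a negative score.

```python
from bisect import bisect_left


def similarity_symbol(score, positive_scores):
    """Map a score to one of the symbols + | : . -

    positive_scores must be sorted in ascending order and hold every
    possible positive score for the element (hypothetical input).
    """
    if score <= 0:
        # Assumption: zero is treated the same as a negative score.
        return "-"
    # Fraction of possible positive scores strictly below this score.
    rank = bisect_left(positive_scores, score) / len(positive_scores)
    if rank > 0.75:
        return "+"
    if rank > 0.50:
        return "|"
    if rank > 0.25:
        return ":"
    return "."


scores = list(range(1, 9))  # made-up set of possible positive scores
line = "".join(similarity_symbol(s, scores) for s in [8, 6, 4, 1, -2])
print(line)  # prints +|:.-
```

How scores that fall exactly on a band boundary are assigned is not stated in this document; the comparison operators above are one possible choice.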

Prediction and Finding mode

1. Generates both prediction and finding data.

Parameters

HPR size: The size of the High Probability Region (HPR). A sequence region of this size is classified as an HPR if the product of the probabilities within it is greater than the threshold set by the PolyA_SVM cutoff score. The default value of 32 is optimal in combination with a PolyA_SVM cutoff score of 6.
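
The HPR rule can be illustrated with a sliding window. This is a sketch under stated assumptions rather than the program's code: it assumes one probability per position (a hypothetical input list), and it assumes that a cutoff score c corresponds to the probability 2^-c, so that "product greater than 2^-c" becomes "sum of -log2(p) less than c". The function name find_hprs is hypothetical.

```python
import math


def find_hprs(probabilities, hpr_size=32, cutoff=6.0):
    """Return (start, end, score) for every window that qualifies as an HPR.

    start and end are 0-based, end exclusive. score is the sum of
    -log2(p) over the window; a smaller score means a larger product.
    """
    # Assumption: every probability lies in (0, 1].
    costs = [-math.log2(p) for p in probabilities]
    hits = []
    for start in range(len(costs) - hpr_size + 1):
        score = sum(costs[start:start + hpr_size])
        # product(p) > 2**-cutoff  is equivalent to  score < cutoff
        if score < cutoff:
            hits.append((start, start + hpr_size, score))
    return hits


# Made-up per-position probabilities: a high-probability stretch of 32
# positions flanked by low-probability positions.
probs = [0.5] * 10 + [0.9] * 32 + [0.5] * 10
hits = find_hprs(probs)
print([start for start, _, _ in hits])  # prints [9, 10, 11]
```

Working in log space avoids multiplying many small numbers together and makes the effect of the cutoff visible directly: lowering the cutoff leaves fewer qualifying windows.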

Location: When a location is specified, the region from -100 to +100 nt around that position is used to make a single prediction.
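
As an illustration of the Location window, the sketch below extracts the region around a position. It assumes 1-based positions and clips the window at the sequence ends; how the program itself handles positions within 100 nt of an end is not stated here. The function name location_window is hypothetical.

```python
FLANK = 100  # nucleotides on each side of the specified location


def location_window(sequence, position, flank=FLANK):
    """Return (start, end, subsequence) around a 1-based position.

    start and end are 1-based and inclusive.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    # Assumption: the window is clipped where it would run past an end.
    start = max(1, position - flank)
    end = min(len(sequence), position + flank)
    return start, end, sequence[start - 1:end]


seq = "ACGT" * 100  # made-up 400 nt sequence
start, end, window = location_window(seq, 150)
print(start, end, len(window))  # prints 50 250 201
```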

Range: When a range is specified, the scoring matrix is calculated for that range plus 100 flanking nucleotides.
NOTE: The range must be at least as large as the HPR size for a single calculation, and the sequence must be at least 200 + HPR size nucleotides long.
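
The two size constraints in the note can be checked before a run. The following sketch assumes 1-based inclusive range coordinates and reads the figure of 200 as 100 flanking nucleotides on each side; the function name check_range and the messages are hypothetical.

```python
def check_range(seq_length, range_start, range_end, hpr_size=32, flank=100):
    """Return a list of problems with a requested range (empty if none)."""
    problems = []
    if not 1 <= range_start <= range_end <= seq_length:
        problems.append("range does not lie within the sequence")
    range_size = range_end - range_start + 1
    if range_size < hpr_size:
        problems.append("range is smaller than the HPR size")
    # Assumption: 200 = 2 * flank, i.e. 100 nt on each side of the range.
    if seq_length < 2 * flank + hpr_size:
        problems.append("sequence is shorter than 200 + HPR size")
    return problems


print(check_range(300, 101, 140))       # prints []
print(len(check_range(200, 101, 110)))  # prints 2
```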

Model: Models are trained with SVM Train on 500 positive and 500 negative sequences from each organism. Positive sequences are extracted from known poly(A) site sequences, and negative sequences are generated from a first-order Markov chain using the Position-Specific Scoring Matrices (PSSMs) of each organism. Models are constructed using human cis-elements.
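
To make the term concrete, the sketch below samples a sequence from a first-order Markov chain, in which each base depends only on the base before it. It is an illustration only: the initial and transition probabilities are invented, and how the organism-specific PSSMs enter the actual procedure is not described in this document.

```python
import random

# Invented probabilities, for illustration only.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def draw(rng, distribution):
    """Draw one base according to a {base: probability} mapping."""
    bases = list(distribution)
    weights = [distribution[b] for b in bases]
    return rng.choices(bases, weights=weights, k=1)[0]


def markov_sequence(length, rng):
    """Sample a sequence in which each base depends on the previous one."""
    if length <= 0:
        return ""
    sequence = [draw(rng, INITIAL)]
    while len(sequence) < length:
        sequence.append(draw(rng, TRANSITIONS[sequence[-1]]))
    return "".join(sequence)


negative = markov_sequence(300, random.Random(0))
print(len(negative), sorted(set(negative)))
```

Passing a seeded random.Random instance keeps the sampled sequence reproducible between runs.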

Complement Sequence: The reverse complement strand, read 5'->3', is generated as follows:
The input sequence is complemented base by base (A->T, C->G, T->A, G->C) and the resulting sequence is then reversed prior to prediction. Predicted locations are given relative to the 5'->3' direction of the complement strand.
For example, for a sequence from 1 to 300, a site is predicted at 100. For the complement strand the position would correspond to position 200.
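
The complement-then-reverse procedure can be sketched in a few lines. This is an illustration, not the program's code; it handles only the four substitutions listed above (in either case) and leaves any other character, such as N or U, unchanged, which is an assumption.

```python
# Substitution table for A->T, C->G, T->A, G->C (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement each base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))  # prints CGTT
```

Applying the function twice returns the original sequence, which is a quick check that the substitution table is symmetric.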

PolyA_SVM Cutoff: The default value of 6 is optimal and corresponds to a probability of 2^-6. Smaller cutoff scores represent more stringent predictions.


SAMPLE INPUT


>lcl|p.79958.2
GGTTAAGAATCAGGGGTCCAAGAGAGACCCCAGTCCCTCAATAAAGCCACAAGAGCCCAAAAAAGCTGGTTTTTTTCCTG GTGAATTTCTCTGGTGCCCTCACTCTGCTCGGAAATCCATCCCACCCACCTCTGTCCCTCCAAGGGCAGCCTCTCTAACT GGCTCCTAGCAGGGAATTCCAGGAAGCCTCCTGGTCTTCTAGAATCCTGGCAACCTTACAATTCCTCTCGGCATTTGTCA CTTCCATCTCAGCTAATGCACCCACCAGCTCAAACACACCAATAAAGCTTTTGTTACATCTCACTACTAATTTTATATAA ATATATAAAAGTGAATAGCCCCATCCTCTCTTGACCTCAGGGGCAGAGTCTTTGAAGTACCTTCTTCCCTGCTATTAAGC ACTGGGTGGATCTGACTGCCCAGAGAGGGCACTGCCTTTCCCCAAGGTCACACAGCAATGAAGGGAGACAGATGCAGGAG AGGGGACCTAGATGGGCAGGCAGCCCTGCACCGTCTCCTTGGAGTCCTGAGGGGCCCGGGCCTCGGCTGCATGGGAGAAC TGAAAGAAAGGAGAGGGGGTTGTGGGGCCTGCTGGACACA
>lcl|p.79858.3
CTTATACAGTAACTAGTTCAGAGTTGGTGTTGCTAAGTATTTGCTGAATGAATGACTAAACCTAGGGAAAAATCCTATTA AGAACACGAATGCTTTTTATCTTAAACAGCATACATTTCACGCAGAGATTAATTCAGTTGGGGAAGCAGTAGAATAACAA GTTGCCAGGAGAGGATAAGTTGTGTGCATCACTGTTCATCTATAAAATTTTTCTCTTTCTTCTTAAAGGAGACTGTAATC TAATTTCACTAGACGAATACTGGAAAAATGAAAAATAAAGGAATTTCTGAAATAAGGAAATAGAATCCTCCATGCATACT TTTTGAGTCCCCTAGCTGTTGCATCTTTCTTTATGGATACCCCTACATTTAAATAATATTTTAGGTAAGTCACCAACAGT TTAAAAATATGGTTTGTTCTTCTGCAGTAACTATAGAATAGTATTTAATTTAATAGAAATGGCCAGCCATCCTTTAATAG AAGGGTCACCAGAGTAATCCTCCCAAGGCTCCTTGGGATGTGTCCTGCCTAAGAACTACTGGGAATTAAAGGGTTGTATG AAGTATAGTCTGTATTACAGGTGCACACTTGTTTCTTTTG
>p.60:fly
CGTCTCCGTCGCGAGTGCCCTGGTGAGAACTGCGGCGCCGGCGTCTTCATGGCTGCCCACGAAGATCGTCACTACTGCGG CAAGTGCAACCTGACCTTTGTCTTCAGCAAACCAGAGGAAAAGTAATTTTGCTACATAAGATCATGTACGTTTCCAGAAA TCAAATAAAGGTAGTAATTGAATAATAAATTCAATCGCTGAAATTTTCCTTTTTTTTTATTGTTAGTTAGCACTAGCGTG GTTAGTAATTGTTTAGCAAAACACTAAAGTTTCATTTAGGGTGTATGGTTTATTTGTGAAAATATTGAATACATTTTGTA TACTCGATTTTTACTCGACCTTGAGAAATGCAATTTGTTTATGGTTAACCACTCAAACGTTAACTATGTCGACGTTAAGG
>p.61:fly
GCCCATCCCATCGGACTCCACCCGCAGGAAGGGCGGTCGCCGTGGTCGTCGTCTGTAGATGGCAGTATCTGGAAAGCAGT AGTCTATGTTTGCGGTCGAAATACAATACTGCATTTGTGTATGCGATAAGAAAGCTTTTCTGTTCGTGTGCATAGGTGCA CTGTAATAAACCAGGAAAATACGATGTAAATGACTACAAGACATTTTTGTGTGTGCATTGGGTTCGGTTTGGGGTCTGCA AGTGCTGCAATATTTGATTTATAAGGACATTGGCGAAATGTTTACAAAAATTGCACATTACAAGCCGAAGAGTTGCCACC CTGGCGTGCCCAACGCGGTGGTAGCCCGATTGCGCCATCAAAGTAACTATGCGTCGATAACTATCGCATTACCAGCTGTG


Legend
+, Highly similar, >75%
|, very similar, between 50-75%
:, similar, between 25-50%
., somewhat similar, between 0-25%
-, negative score


AUE1:AUE2:AUE3:AUE4:
CUE1:CUE2:
CDE1:CDE2:CDE3:CDE4:
ADE1:ADE2:ADE3:ADE4:ADE5:



REFERENCES:

Cheng, Y., Miura, R.M., and Tian, B. Accurate prediction of mRNA polyadenylation site by support vector machine. Submitted to Bioinformatics.
Hu, J., Lutz, C.S., Wilusz, J., and Tian, B. (2005). Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation. RNA 11: 1485-1493.
Chang, C.-C. and Lin, C.-J. (2005). LIBSVM: a Library for Support Vector Machines (www.csie.ntu.edu.tw/~cjlin/libsvm).

CONTACT INFO:

Please contact Yiming Cheng (yc34@njit.edu), Michael Tsai (tsaimi@umdnj.edu), or Bin Tian (btian@umdnj.edu) with comments or suggestions.
