Help


This page describes the PROVEAN web tools, including their input and output formats.

For details on PROVEAN scores, cutoffs, and performance, please refer to the About page.


  1. PROVEAN Protein
    1. Input Format
    2. Output Format
    3. Procedure
  2. PROVEAN Protein Batch
    1. Input Format
    2. Output Format
    3. Procedure
  3. PROVEAN Genome Variants
    1. Input Format
    2. Output Format
    3. Procedure

1a. PROVEAN Protein - Input Format

Input: a protein sequence and amino acid variants

The query protein sequence is accepted in FASTA format.

Each amino acid variation can be described in one of the following two ways.

(1) Comma-separated values: <position>,<reference amino acids>,<variant amino acids>

  • Position:
    Reference position, with the 1st amino acid having position 1.
  • Reference amino acids:
    One or more amino acids in the query protein sequence. The value in the "Position" field refers to the position of the first amino acid in this field. At least one amino acid is required.
  • Variant amino acids:
    Amino acids for the variant protein sequence. The amino acids in the "Reference amino acids" field are replaced by the amino acids in this field. A deletion can be described by using a period ("."), which means no amino acids (empty).

(2) HGVS (Human Genome Variation Society) recommendations with 1-letter amino acid code (details)

Examples:

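Query sequence:

Amino acid | M | E | E | P | Q | S | D | P | S | V
Position   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10

Type                           | Format: comma-separated values | Format: HGVS notation | Meaning                                                                  | Variant Sequence
-------------------------------|--------------------------------|-----------------------|--------------------------------------------------------------------------|-----------------
Single Amino Acid Substitution | 5,Q,A                          | Q5A                   | Q at position 5 is changed to A                                          | MEEPASDPSV
Deletion                       | 4,P,.                          | P4del                 | P at position 4 is deleted                                               | MEEQSDPSV
Deletion                       | 4,PQS,.                        | P4_S6del              | A deletion of three amino acids, from P at position 4 to S at position 6 | MEEDPSV
Insertion                      | 7,D,DVA                        | D7_P8insVA            | VA is inserted between positions 7 and 8                                 | MEEPQSDVAPSV
Insertion                      | 6,S,SPQS or 3,E,EPQS           | P4_S6dup              | PQS is duplicated                                                        | MEEPQSPQSDPSV
Insertion-deletion (Indel)     | 7,DP,VA                        | D7_P8delinsVA         | DP is replaced by VA                                                     | MEEPQSVASV
Insertion-deletion (Indel)     | 3,E,QS                         | E3delinsQS            | E is replaced by QS                                                      | MEQSPQSDPSV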
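The comma-separated format can be read as a simple "replace the reference residues with the variant residues" rule. The following is a minimal sketch of that rule, not part of PROVEAN itself; the function name `apply_variant` is hypothetical and is only meant to help check that a variant description produces the intended variant sequence.

```python
def apply_variant(sequence: str, variant: str) -> str:
    """Apply one '<position>,<reference>,<variant>' edit to a protein sequence."""
    position_text, reference, replacement = (
        field.strip() for field in variant.split(",")
    )
    position = int(position_text)
    if position < 1:
        raise ValueError("positions are 1-based")
    if not reference:
        raise ValueError("at least one reference amino acid is required")
    start = position - 1  # convert the 1-based position to a 0-based index
    if sequence[start:start + len(reference)] != reference:
        raise ValueError(
            f"reference {reference!r} does not match the query at position {position}"
        )
    if replacement == ".":  # a period means "no amino acids" (deletion)
        replacement = ""
    return sequence[:start] + replacement + sequence[start + len(reference):]


query = "MEEPQSDPSV"
print(apply_variant(query, "5,Q,A"))    # MEEPASDPSV
print(apply_variant(query, "4,PQS,."))  # MEEDPSV
print(apply_variant(query, "7,D,DVA"))  # MEEPQSDVAPSV
```

Checking the reference residues against the query before editing catches off-by-one position errors, which are easy to make with 1-based coordinates.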


1b. PROVEAN Protein - Output Format

Protein sequence variation prediction results are represented in tab-separated columns. The number of subject sequences used for the prediction is also shown.

Column headers:

  • Variant - amino acid variant provided by the user
  • PROVEAN score - PROVEAN score (details)
  • Prediction - deleterious or neutral (using default cutoff at -2.5, details of cutoff)
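The Prediction column follows from the score and the cutoff. As a rough sketch, assuming that a score equal to or below the cutoff is called deleterious (the handling of a score exactly at the cutoff is an assumption here, as are the function name and the example scores, which are made up for illustration):

```python
DEFAULT_CUTOFF = -2.5  # default cutoff stated in the column description


def predict(score: float, cutoff: float = DEFAULT_CUTOFF) -> str:
    """Label a PROVEAN score as 'deleterious' or 'neutral'."""
    # Assumption: scores at or below the cutoff are deleterious.
    return "deleterious" if score <= cutoff else "neutral"


# Hypothetical scores, not real PROVEAN output.
for score in (-4.2, -2.5, -1.0):
    print(score, predict(score))
```

Passing a different `cutoff` shows how the same score would be labeled under a stricter or looser threshold.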

1c. PROVEAN Protein - Procedure

This tool provides the same function as the stand-alone PROVEAN software. Given a protein sequence, its homologs are searched for in the NCBI nr database using BLAST and clustered using CD-HIT. Based on the selected homologs (supporting sequences), the PROVEAN scores are computed for each of the variants provided. The supporting sequence information is stored so that the BLAST and CD-HIT runs are bypassed for later submissions with the same query protein sequence. This approach usually reduces the run time from several minutes to a few seconds.


2a. PROVEAN Protein Batch - Input Format

Input: a list of protein sequence variants.

Each variant is represented in comma-separated (or space-separated) values as follows:
<protein ID>,<position>,<reference amino acids>,<variant amino acids>,<comment(optional)>

  • Protein ID:
    Ensembl Protein ID, NCBI RefSeq ID, or UniProt Accession ID (for human)
    Ensembl Protein ID (for mouse)
  • Position:
    Reference position, with the 1st amino acid having position 1.
  • Reference amino acids:
    One or more amino acids in the query protein sequence. The value in the "Position" field refers to the position of the first amino acid in this field. At least one amino acid is required.
  • Variant amino acids:
    Amino acids for the variant protein sequence. The amino acids in the "Reference amino acids" field are replaced by the amino acids in this field. A deletion can be described by using a period ("."), which means no amino acids (empty).

Example 1 (Single amino acid substitution): ENSP00000224605,55,D,G,user comment
Example 2 (Amino acid deletion): P13569 508 F . cystic fibrosis
Example 3 (Amino acid insertion): ENSP00000359240 59 Q QA
Example 4 (Multiple amino acid substitution): NP_000483.3 508 FG DS
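When preparing a batch file programmatically, it can help to check each line before submission. The sketch below is one way to split a line into its fields; it is not the server's own parser, and the names `ProteinVariant` and `parse_batch_line` are hypothetical. It splits on commas or whitespace at most four times so that a comment containing spaces stays in one piece.

```python
import re
from typing import NamedTuple


class ProteinVariant(NamedTuple):
    protein_id: str
    position: int
    reference: str
    variant: str
    comment: str


def parse_batch_line(line: str) -> ProteinVariant:
    """Split '<protein ID>,<position>,<reference>,<variant>,<comment>' into fields."""
    # Commas (with optional surrounding spaces) or runs of whitespace separate
    # the fields; maxsplit=4 keeps the optional comment intact.
    fields = re.split(r"\s*,\s*|\s+", line.strip(), maxsplit=4)
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}: {line!r}")
    protein_id, position, reference, variant = fields[:4]
    comment = fields[4] if len(fields) == 5 else ""
    return ProteinVariant(protein_id, int(position), reference, variant, comment)


print(parse_batch_line("ENSP00000224605,55,D,G,user comment"))
print(parse_batch_line("P13569 508 F . cystic fibrosis"))
```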


2b. PROVEAN Protein Batch - Output Format

The results are represented in tab-separated columns. The column headers and their meanings are shown below.

VARIATION

  • ROW_NO. - sequential number
  • INPUT - protein variant provided by the user
  • PROTEIN_ID - protein ID
  • POSITION - position of amino acid residue affected
  • RESIDUE_REF - reference amino acid residue
  • RESIDUE_ALT - variant amino acid residue

PROVEAN PREDICTION

  • SCORE - PROVEAN score (details)
  • PREDICTION (cutoff=-2.5) - deleterious or neutral (details of cutoff)
  • #SEQ - number of sequences used for prediction
  • #CLUSTER - number of clusters used for prediction

SIFT PREDICTION

  • SCORE - SIFT score
  • PREDICTION (cutoff=0.05) - tolerated or damaging
  • MEDIAN_INFO - median sequence information used to measure the diversity of the sequences used for prediction
  • #SEQ - number of sequences used for prediction

2c. PROVEAN Protein Batch - Procedure

The score for each amino acid variant is retrieved from a precomputed score database, which contains scores for 20 single amino acid substitutions and a single amino acid deletion at every amino acid position for all protein sequences in our model organisms (currently, human and mouse).

If the precomputed score is not available in the database, the precomputed homologous protein identifiers for the query protein are retrieved from a database to bypass the BLAST search and clustering, and the score is computed based on the homologs. This score is then stored in the database for future requests.
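The two-step lookup described above can be pictured as a cache with a fallback. The following is only an illustrative sketch: plain dictionaries stand in for the score and homolog databases, and `compute_score` is a hypothetical placeholder for the actual PROVEAN scoring step, which is not reproduced here.

```python
from typing import Callable, Dict, Sequence, Tuple

# (protein ID, position, reference amino acids, variant amino acids)
VariantKey = Tuple[str, int, str, str]


def lookup_score(
    key: VariantKey,
    score_db: Dict[VariantKey, float],
    homolog_db: Dict[str, Sequence[str]],
    compute_score: Callable[[VariantKey, Sequence[str]], float],
) -> float:
    """Return a stored score, or compute it from stored homologs and save it."""
    if key in score_db:  # step 1: precomputed score is available
        return score_db[key]
    homologs = homolog_db[key[0]]  # step 2: reuse stored homolog identifiers
    score = compute_score(key, homologs)
    score_db[key] = score  # keep the new score for future requests
    return score
```

With this shape, a variant that is requested twice triggers the scoring step only once; the second request is answered from `score_db`.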


3a. PROVEAN Genome Variants - Input Format

Input: a list of genome variants

Each genome variant is represented in comma-separated values as follows:
<chromosome>,<position>,<reference allele>,<variant allele>,<comment(optional)>

  • Chromosome:
    Chromosome name (1, 2, 3, ..., 22, X, or Y).
  • Position:
    Reference position, with the 1st base having position 1.
  • Reference allele:
    One or more nucleotides in the reference genome. The value in the "Position" field refers to the position of the first base in this field. At least one base is required.
  • Variant allele:
    One or more nucleotides for non-reference allele. The bases in the "Reference allele" field are replaced by the bases in this field. At least one base is required.

Example 1 (SNP): 1,100382265,C,G,some comments
Example 2 (Deletion): 7,117199646,CTTT,C
Example 3 (Insertion): 1,43217995,G,GCCA
Example 4 (MNP): 10,102762471,AG,CC
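A quick client-side check of genome variant lines can catch malformed input before submission. The sketch below validates the fields listed above and labels each variant by comparing allele lengths. The labeling rule is inferred from the four examples, the restriction to uppercase A, C, G, and T is an assumption, and the names `GenomeVariant`, `parse_genome_variant`, and `variant_kind` are hypothetical.

```python
from typing import NamedTuple

CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


class GenomeVariant(NamedTuple):
    chromosome: str
    position: int
    reference: str
    variant: str
    comment: str


def parse_genome_variant(line: str) -> GenomeVariant:
    """Split '<chromosome>,<position>,<reference>,<variant>,<comment>' and validate."""
    fields = [field.strip() for field in line.split(",", 4)]
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields: {line!r}")
    chromosome, position_text, reference, variant = fields[:4]
    comment = fields[4] if len(fields) == 5 else ""
    if chromosome not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    position = int(position_text)
    if position < 1:
        raise ValueError("positions are 1-based")
    for allele in (reference, variant):
        # Assumption: alleles are non-empty strings of uppercase A, C, G, T.
        if not allele or set(allele) - set("ACGT"):
            raise ValueError(f"invalid allele: {allele!r}")
    return GenomeVariant(chromosome, position, reference, variant, comment)


def variant_kind(v: GenomeVariant) -> str:
    """Label a variant by allele lengths (rule inferred from the examples)."""
    if len(v.reference) == len(v.variant):
        return "SNP" if len(v.reference) == 1 else "MNP"
    return "deletion" if len(v.reference) > len(v.variant) else "insertion"


print(variant_kind(parse_genome_variant("7,117199646,CTTT,C")))  # deletion
```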


3b. PROVEAN Genome Variants - Output Format

The genome variant results are represented in tab-separated columns. Three result files are available.

  1. Full version - For each variant all protein isoforms are shown.
  2. Condensed version - For each variant only the longest protein isoform is shown.
  3. Summary - Each variant is tabulated based on effects on protein translation (e.g. nonsynonymous, frameshift) and predicted effects on function (e.g. deleterious or tolerated).

Column headers for the full or condensed versions of output files:

VARIATION

  • ROW_NO. - sequential number
  • INPUT - genome variant provided by the user
  • PROTEIN_ID - Ensembl protein ID
  • LENGTH - length of the protein
  • STRAND - '1':forward, '-1':reverse
  • CODON_CHANGE - codon change including flanking codons
  • POS - position of amino acid residue affected
  • RESIDUE_REF - reference amino acid residue
  • RESIDUE_ALT - variant amino acid residue
  • TYPE - synonymous | single AA change (nonsynonymous) | frameshift | ...

PROVEAN PREDICTION

  • SCORE - PROVEAN score (details)
  • PREDICTION (cutoff=-2.5) - deleterious or neutral (details of cutoff)
  • #SEQ - number of sequences used for prediction
  • #CLUSTER - number of clusters used for prediction

SIFT PREDICTION

  • SCORE - SIFT score
  • PREDICTION (cutoff=0.05) - tolerated or damaging
  • MEDIAN_INFO - median sequence information used to measure the diversity of the sequences used for prediction
  • #SEQ - number of sequences used for prediction

ANNOTATION

  • dbSNP_ID - NCBI dbSNP ID
  • additional OPTIONAL OUTPUT
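The relationship between the full and condensed versions can be illustrated with the INPUT and LENGTH columns. The sketch below derives a condensed-style table from full-version rows by keeping the longest isoform for each input variant. It is an illustration only: the rows are represented as dictionaries keyed by column header, ties are resolved by keeping the first row encountered (an assumption), and the protein IDs and lengths in the usage example are made up.

```python
from typing import Dict, Iterable, List


def condense(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep, for each INPUT variant, the row with the largest LENGTH."""
    longest: Dict[str, Dict[str, str]] = {}
    for row in rows:
        best = longest.get(row["INPUT"])
        # Strict '>' keeps the first row when two isoforms have equal length.
        if best is None or int(row["LENGTH"]) > int(best["LENGTH"]):
            longest[row["INPUT"]] = row
    return list(longest.values())


# Hypothetical rows; IDs and lengths are placeholders, not real annotations.
full_rows = [
    {"INPUT": "1,100382265,C,G", "PROTEIN_ID": "ISOFORM_A", "LENGTH": "350"},
    {"INPUT": "1,100382265,C,G", "PROTEIN_ID": "ISOFORM_B", "LENGTH": "512"},
    {"INPUT": "7,117199646,CTTT,C", "PROTEIN_ID": "ISOFORM_C", "LENGTH": "1480"},
]
for row in condense(full_rows):
    print(row["INPUT"], row["PROTEIN_ID"], row["LENGTH"])
```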

3c. PROVEAN Genome Variants - Procedure

Each genomic variant is classified as a coding or non-coding variant based on the reference genome sequence and the Ensembl gene annotation. The coding variants are further classified as amino acid substitutions, insertions, deletions, nonsense mutations, or frameshifts. For amino acid substitutions, insertions, and deletions, PROVEAN scores are retrieved or computed in the same way as in the PROVEAN Protein Batch function.

The tool also provides accessory information for each genomic variant, including mappings to NCBI dbSNP reference accessions and gene annotation obtained from Ensembl BioMart, such as gene description, PFAM domain, and Gene Ontology.