Help

This page describes the PROVEAN web tools, including their input and output formats.

For details of PROVEAN scores, the cutoff, and performance, please refer to the About page.

PROVEAN Protein

Input Format
Output Format
How this works

PROVEAN Protein Batch

Input Format
Output Format
How this works

PROVEAN Genome Variants

Input Format
Output Format
How this works

1a. PROVEAN Protein - Input Format

Input: a protein sequence and amino acid variants

Query protein sequence is accepted in FASTA format.

Each amino acid variation can be described in one of the following two ways.

(1) Comma-separated values: <position>,<reference amino acids>,<variant amino acids>

Position:
Reference position, with the 1st amino acid having position 1.
Reference amino acids:
One or more amino acids in the query protein sequence. The value in the "Position" field refers to the position of the first amino acid in this field. At least one amino acid is required.
Variant amino acids:
Amino acids for the variant protein sequence. The amino acids in the "Reference amino acids" field are replaced by the amino acids in this field. A deletion can be described by using a period ("."), which means no amino acids (empty).

(2) HGVS (Human Genome Variation Society) recommendations with 1-letter amino acid code (details)

Examples:

Query sequence:	M	E	E	P	Q	S	D	P	S	V
Position:	1	2	3	4	5	6	7	8	9	10

Type	Comma-separated values	HGVS notation	Meaning	Variant Sequence
Single Amino Acid Substitution	5,Q,A	Q5A	Q at position 5 is changed to A	MEEPASDPSV
Deletion	4,P,.	P4del	P at position 4 is deleted	MEEQSDPSV
Deletion	4,PQS,.	P4_S6del	A deletion of three amino acids, from P at position 4 to S at position 6	MEEDPSV
Insertion	7,D,DVA	D7_P8insVA	VA is inserted between positions 7 and 8	MEEPQSDVAPSV
Insertion	6,S,SPQS or 3,E,EPQS	P4_S6dup	PQS is duplicated	MEEPQSPQSDPSV
Insertion-deletion (Indel)	7,DP,VA	D7_P8delinsVA	DP is replaced by VA	MEEPQSVASV
Insertion-deletion (Indel)	3,E,QS	E3delinsQS	E is replaced by QS	MEQSPQSDPSV
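
The comma-separated notation can be read as a replace operation on the query sequence. The following Python sketch is an illustration only and is not part of PROVEAN; the function name apply_variant is hypothetical. It checks that the reference amino acids match the query at the given 1-based position and then substitutes the variant amino acids, treating "." as empty.

```python
def apply_variant(sequence: str, variant: str) -> str:
    """Apply a '<position>,<reference>,<variant>' description to a sequence."""
    position_text, reference, replacement = (f.strip() for f in variant.split(","))
    start = int(position_text) - 1  # positions are 1-based
    if not reference:
        raise ValueError("at least one reference amino acid is required")
    if start < 0 or sequence[start:start + len(reference)] != reference:
        raise ValueError(f"reference {reference!r} not found at position {position_text}")
    if replacement == ".":  # a period means no amino acids (deletion)
        replacement = ""
    return sequence[:start] + replacement + sequence[start + len(reference):]


query = "MEEPQSDPSV"
for description in ["5,Q,A", "4,PQS,.", "7,D,DVA", "7,DP,VA"]:
    print(description, apply_variant(query, description))
```

Applied to the rows of the table above, this reproduces the sequences in the "Variant Sequence" column. A mismatch between the reference amino acids and the query raises an error instead of silently producing a wrong sequence.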

1b. PROVEAN Protein - Output Format

Prediction results for the protein sequence variants are presented in tab-separated columns. The number of subject sequences used for the prediction is also shown.

Column headers:

Variant - amino acid variant provided by the user
PROVEAN score - PROVEAN score (details)
Prediction - deleterious or neutral (using the default cutoff of -2.5; details of cutoff)
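
The Prediction column follows from comparing the score with the cutoff. Below is a minimal Python sketch, assuming that a score equal to or below the cutoff is called deleterious and anything above it neutral. The function name predict is hypothetical, and the handling of a score exactly at the cutoff is an assumption that should be checked against the About page.

```python
DEFAULT_CUTOFF = -2.5  # default cutoff stated in the column description


def predict(score: float, cutoff: float = DEFAULT_CUTOFF) -> str:
    # Assumption: scores at or below the cutoff are called deleterious.
    return "deleterious" if score <= cutoff else "neutral"


for score in (-4.1, -2.5, 0.3):
    print(score, predict(score))
```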

1c. PROVEAN Protein - Procedure

This tool provides the same function as the stand-alone PROVEAN software. Given a protein sequence, homologs are searched for in the NCBI nr database using BLAST and clustered using CD-HIT. Based on the selected homologs (supporting sequences), a PROVEAN score is computed for each of the variants provided. The supporting sequence information is stored so that the BLAST and CD-HIT runs are bypassed for later submissions with the same query protein sequence. This approach usually reduces the run time from several minutes to a few seconds.

2a. PROVEAN Protein Batch - Input Format

Input: a list of protein sequence variants.

Each variant is represented in comma-separated (or space-separated) values as follows:
<protein ID>,<position>,<reference amino acids>,<variant amino acids>,<comment(optional)>

Protein ID:
Ensembl Protein ID, NCBI RefSeq ID, or UniProt Accession ID (for human)
Ensembl Protein ID (for mouse)
Position:
Reference position, with the 1st amino acid having position 1.
Reference amino acids:
One or more amino acids in the query protein sequence. The value in the "Position" field refers to the position of the first amino acid in this field. At least one amino acid is required.
Variant amino acids:
Amino acids for the variant protein sequence. The amino acids in the "Reference amino acids" field are replaced by the amino acids in this field. A deletion can be described by using a period ("."), which means no amino acids (empty).

Example 1 (Single amino acid substitution): ENSP00000224605,55,D,G,user comment
Example 2 (Amino acid deletion): P13569 508 F . cystic fibrosis
Example 3 (Amino acid insertion): ENSP00000359240 59 Q QA
Example 4 (Multiple amino acid substitution): NP_000483.3 508 FG DS
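
The examples above mix comma-separated and space-separated lines, and a comment may itself contain spaces. The following Python sketch shows one way to read such lines; it is an illustration under these assumptions and not the parser used by the server. The names ProteinVariant and parse_batch_line are hypothetical.

```python
import re
from typing import NamedTuple


class ProteinVariant(NamedTuple):
    protein_id: str
    position: int
    reference: str
    variant: str
    comment: str


def parse_batch_line(line: str) -> ProteinVariant:
    """Parse one comma- or space-separated PROVEAN Protein Batch input line."""
    # Split on a comma or a run of whitespace, at most four times, so that
    # everything after the fourth field is kept together as the comment.
    fields = re.split(r"\s*,\s*|\s+", line.strip(), maxsplit=4)
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields: {line!r}")
    protein_id, position, reference, variant = fields[:4]
    comment = fields[4] if len(fields) == 5 else ""
    return ProteinVariant(protein_id, int(position), reference, variant, comment)


for text in ("ENSP00000224605,55,D,G,user comment",
             "P13569 508 F . cystic fibrosis",
             "ENSP00000359240 59 Q QA"):
    print(parse_batch_line(text))
```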

2b. PROVEAN Protein Batch - Output Format

The results are represented in tab-separated columns. The column headers and their meanings are shown below.

VARIATION

ROW_NO. - sequential number
INPUT - protein variant provided by the user
PROTEIN_ID - protein ID
POSITION - position of amino acid residue affected
RESIDUE_REF - reference amino acid residue
RESIDUE_ALT - variant amino acid residue

PROVEAN PREDICTION

SCORE - PROVEAN score (details)
PREDICTION (cutoff=-2.5) - deleterious or neutral (details of cutoff)
#SEQ - number of sequences used for prediction
#CLUSTER - number of clusters used for prediction

SIFT PREDICTION

SCORE - SIFT score
PREDICTION (cutoff=0.05) - tolerated or damaging
MEDIAN_INFO - median sequence information used to measure the diversity of the sequences used for prediction
#SEQ - number of sequences used for prediction

2c. PROVEAN Protein Batch - Procedure

For each amino acid variant, its score is retrieved from a precomputed score database, which contains scores for 20 single amino acid substitutions and a single amino acid deletion at every amino acid position for all protein sequences in our model organisms (currently, human and mouse).

If the precomputed score is not available in the database, the precomputed homologous protein identifiers for the query protein are retrieved from a database to bypass the BLAST search and clustering, and the score is computed based on the homologs. This score is stored in the database for future requests.
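
The two-step lookup described above can be sketched in Python as follows. This is a simplified illustration, not the server's code: the dictionaries stand in for the score and homolog databases, the protein ID and homolog names are made-up data, and compute_score is a hypothetical placeholder that returns a dummy value, not a real PROVEAN score.

```python
precomputed_scores = {}  # (protein_id, position, ref, alt) -> score
precomputed_homologs = {"ENSP_EXAMPLE": ["homolog_1", "homolog_2"]}  # made-up data


def compute_score(homologs, position, reference, variant):
    """Hypothetical placeholder for the PROVEAN scoring step."""
    return -1.0 * len(homologs)  # dummy value, not a real score


def get_score(protein_id, position, reference, variant):
    key = (protein_id, position, reference, variant)
    if key in precomputed_scores:
        return precomputed_scores[key]  # step 1: precomputed score database
    # Step 2: reuse precomputed homologs, bypassing BLAST and clustering.
    homologs = precomputed_homologs[protein_id]
    score = compute_score(homologs, position, reference, variant)
    precomputed_scores[key] = score  # stored for future requests
    return score


print(get_score("ENSP_EXAMPLE", 59, "Q", "QA"))
```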

3a. PROVEAN Genome Variants - Input Format

Input: a list of genome variants

Each genome variant is represented in comma-separated values as follows:
<chromosome>,<position>,<reference allele>,<variant allele>,<comment(optional)>

Chromosome:
Chromosome name (1, 2, 3, ..., 22, X, or Y).
Position:
Reference position, with the 1st base having position 1.
Reference allele:
One or more nucleotides in the reference genome. The value in the "Position" field refers to the position of the first base in this field. At least one base is required.
Variant allele:
One or more nucleotides for non-reference allele. The bases in the "Reference allele" field are replaced by the bases in this field. At least one base is required.

Example 1 (SNP): 1,100382265,C,G,some comments
Example 2 (Deletion): 7,117199646,CTTT,C
Example 3 (Insertion): 1,43217995,G,GCCA
Example 4 (MNP): 10,102762471,AG,CC
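
The following Python sketch reads a genome variant line and labels it in the same way as the four examples above. It is an illustration only; the names GenomeVariant, parse_genome_variant, and variant_type are hypothetical, and labeling a variant purely by comparing allele lengths is an assumption drawn from the examples, not a description of how the server classifies variants.

```python
from typing import NamedTuple

CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


class GenomeVariant(NamedTuple):
    chromosome: str
    position: int
    reference: str
    variant: str
    comment: str


def parse_genome_variant(line: str) -> GenomeVariant:
    """Parse '<chromosome>,<position>,<reference>,<variant>[,<comment>]'."""
    fields = [f.strip() for f in line.strip().split(",", 4)]
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields: {line!r}")
    chromosome, position, reference, variant = fields[:4]
    if chromosome not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    if not reference or not variant:
        raise ValueError("at least one base is required in each allele")
    comment = fields[4] if len(fields) == 5 else ""
    return GenomeVariant(chromosome, int(position), reference, variant, comment)


def variant_type(v: GenomeVariant) -> str:
    # Assumption: the label depends only on the allele lengths.
    if len(v.reference) == len(v.variant):
        return "SNP" if len(v.reference) == 1 else "MNP"
    return "Deletion" if len(v.reference) > len(v.variant) else "Insertion"


for text in ("1,100382265,C,G,some comments", "7,117199646,CTTT,C"):
    parsed = parse_genome_variant(text)
    print(parsed, variant_type(parsed))
```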

3b. PROVEAN Genome Variants - Output Format

The genome variant results are presented in tab-separated columns. Three result files are available.

Full version - For each variant all protein isoforms are shown.
Condensed version - For each variant only the longest protein isoform is shown.
Summary - Each variant is tabulated based on effects on protein translation (e.g. nonsynonymous, frameshift) and predicted effects on function (e.g. deleterious or tolerated).
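
The relationship between the full and condensed versions can be sketched in Python as follows, using the INPUT, PROTEIN_ID, and LENGTH columns of the output. This is an assumption about how a user could derive a condensed view from the full file, not the server's own procedure. The function name condense and the rows are made-up, and keeping the first row when two isoforms have equal length is an arbitrary choice.

```python
def condense(rows):
    """Keep, for each INPUT value, the row with the largest LENGTH."""
    longest = {}
    for row in rows:
        current = longest.get(row["INPUT"])
        if current is None or int(row["LENGTH"]) > int(current["LENGTH"]):
            longest[row["INPUT"]] = row
    return list(longest.values())


# Made-up rows for illustration; real files contain more columns.
full = [
    {"INPUT": "1,100382265,C,G", "PROTEIN_ID": "ISOFORM_A", "LENGTH": "310"},
    {"INPUT": "1,100382265,C,G", "PROTEIN_ID": "ISOFORM_B", "LENGTH": "452"},
    {"INPUT": "7,117199646,CTTT,C", "PROTEIN_ID": "ISOFORM_C", "LENGTH": "1480"},
]
for row in condense(full):
    print(row["INPUT"], row["PROTEIN_ID"])
```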

Column headers for the full or condensed versions of output files:

VARIATION

ROW_NO. - sequential number
INPUT - genome variant provided by the user
PROTEIN_ID - Ensembl protein ID
LENGTH - length of the protein
STRAND - '1':forward, '-1':reverse
CODON_CHANGE - codon change including flanking codons
POS - position of amino acid residue affected
RESIDUE_REF - reference amino acid residue
RESIDUE_ALT - variant amino acid residue
TYPE - synonymous | single AA change (nonsynonymous) | frameshift | ...

PROVEAN PREDICTION

SCORE - PROVEAN score (details)
PREDICTION (cutoff=-2.5) - deleterious or neutral (details of cutoff)
#SEQ - number of sequences used for prediction
#CLUSTER - number of clusters used for prediction

SIFT PREDICTION

SCORE - SIFT score
PREDICTION (cutoff=0.05) - tolerated or damaging
MEDIAN_INFO - median sequence information used to measure the diversity of the sequences used for prediction
#SEQ - number of sequences used for prediction

ANNOTATION

dbSNP_ID - NCBI dbSNP ID
additional OPTIONAL OUTPUT

3c. PROVEAN Genome Variants - Procedure

Each genomic variant is classified as a coding or non-coding variant based on the reference genome sequence and the Ensembl gene annotation. The coding variants are further classified as amino acid substitutions, insertions, deletions, nonsense mutations, or frameshifts. For amino acid substitutions, insertions, and deletions, PROVEAN scores are retrieved or computed in the same way as in the PROVEAN Protein Batch function.

The tool also provides accessory information for each genomic variant, including mappings to NCBI dbSNP reference accessions and gene annotation obtained from Ensembl BioMart, such as gene description, PFAM domain, and Gene Ontology.