Protein Recommendations

Duplication Variant


Definitions

Duplication
a sequence change where, compared to a reference sequence, a copy of one or more amino acids is inserted directly C-terminal of the original copy of that sequence.

Description

Format: “prefix”“amino_acid(s)+position(s)_duplicated”“dup”, e.g. p.(Cys76_Glu79dup)

“prefix” = reference sequence used = p.
“amino_acid(s)+position(s)_duplicated” = amino acid with position or range (first amino acid with position to last amino acid with position) duplicated = Cys76_Glu79
“dup” = type of change is a duplication = dup
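
The format above can be split into its parts mechanically. The following is a minimal sketch, not an official parser: the names `ProteinDup` and `parse_dup` are hypothetical, the pattern only checks the shape of a three-letter code (it does not verify that the code is a real amino acid), and the position checks assume a range must run from a lower to a higher position.

```python
import re
from typing import NamedTuple, Tuple


class ProteinDup(NamedTuple):
    predicted: bool          # True when the description is in parentheses
    first: Tuple[str, int]   # (amino acid, position) of the first residue
    last: Tuple[str, int]    # same as `first` for a single-residue duplication


# "p." + optional "(" + residue [+ "_" + residue] + "dup" + optional ")"
_DUP = re.compile(
    r"p\.(\(?)([A-Z][a-z]{2})(\d+)(?:_([A-Z][a-z]{2})(\d+))?dup(\)?)"
)


def parse_dup(description: str) -> ProteinDup:
    """Split a protein duplication description into its components."""
    m = _DUP.fullmatch(description)
    if m is None:
        raise ValueError(f"not a protein duplication: {description!r}")
    open_p, aa1, pos1, aa2, pos2, close_p = m.groups()
    if bool(open_p) != bool(close_p):
        raise ValueError("unbalanced parentheses")
    first = (aa1, int(pos1))
    last = (aa2, int(pos2)) if aa2 else first
    # Assumption: a range needs two different positions in ascending order.
    if aa2 and last[1] <= first[1]:
        raise ValueError("range must run from a lower to a higher position")
    return ProteinDup(bool(open_p), first, last)


print(parse_dup("p.(Cys76_Glu79dup)"))
```

For `p.(Cys76_Glu79dup)` the sketch reports a predicted change covering Cys76 to Glu79; a reversed range such as `p.Glu79_Cys76dup` is rejected.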


Note

  • the only reference sequence prefix accepted is “p.” (protein).
  • predicted consequences, i.e. without experimental evidence (no RNA or protein sequence analysed), should be given in parentheses, e.g. p.(Arg727_Ser783dup).
  • the “amino_acids+positions_duplicated” should contain two different positions, e.g. Cys76_Glu79, not Cys76_Cys76.
  • the “positions_duplicated” should be listed from N- to C-terminal, e.g. Cys76_Glu79, not Glu79_Cys76.
  • by definition, duplication may only be used when the additional copy is directly C-terminal-flanking the original copy (a “tandem duplication”).
  • when there is no evidence that the extra copy of a sequence detected is in tandem with (directly C-terminal-flanking) the original copy, the change cannot be described as a duplication; it should be described as an insertion (see Insertion).
  • for all descriptions the most C-terminal position possible of the reference sequence is arbitrarily assigned to have been changed (3’ rule).
    • the 3’ rule also applies to changes in single amino acid stretches and tandem repeats.
  • variants should be described on the protein level and not incorporate knowledge regarding the change at the DNA level
  • under discussion, see Proposal for complex variants
    { } (curly braces) can be used to list any change in the duplicated sequence (“positions_duplicated”) which is different when compared to the source, e.g. p.Cys76_Glu94dup{Ala88Glu}
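
The 3’ rule in the notes above can be illustrated with a short sketch. The function name `shift_dup_c_terminal` and the tandem-repeat test sequence are assumptions made for illustration. The idea is that a duplicated range can be moved one position towards the C-terminus, without changing the resulting protein, whenever the residue following the range equals the first residue of the range.

```python
import re


def residues(seq: str) -> list:
    """Split a string of three-letter codes into a list of residues."""
    return re.findall(r"[A-Z][a-z]{2}", seq)


def shift_dup_c_terminal(seq: list, start: int, end: int) -> tuple:
    """Move a duplicated range (1-based, inclusive) as far C-terminal as possible."""
    # seq[end] is the residue right after the range; seq[start - 1] is the
    # first residue of the range. While they match, shifting the range by one
    # position yields the same variant sequence.
    while end < len(seq) and seq[end] == seq[start - 1]:
        start += 1
        end += 1
    return start, end


# Single amino acid stretch: duplicating Ser5 is reported at Ser6.
stretch = residues("MetGlyAlaArgSerSerHis")
print(shift_dup_c_terminal(stretch, 5, 5))

# Hypothetical tandem repeat (GlyAla)x3: the range moves to the last unit.
repeat = residues("MetGlyAlaGlyAlaGlyAlaHis")
print(shift_dup_c_terminal(repeat, 2, 3))
```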

Examples

  • p.Ala3dup (one amino acid)
    a duplication of amino acid Ala3 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaAlaArgSerSerHis
  • p.(Ala3dup)
    the predicted consequence at the protein level is a duplication of amino acid Ala3 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaAlaArgSerSerHis
  • p.Ala3_Ser5dup (several amino acids)
    a duplication of amino acids Ala3 to Ser5 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaArgSerAlaArgSerSerHis
  • p.Ser6dup
    a duplication of amino acid Ser6 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaArgSerSerSerHis
    NOTE: for duplications in single amino acid stretches or tandem repeats, the most C-terminal residue is arbitrarily assigned to have been duplicated
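
The examples above can be reproduced by applying each duplication to the reference sequence. This is a minimal sketch; `apply_dup` is a hypothetical helper that takes 1-based, inclusive positions and places the copy directly after the original.

```python
import re


def apply_dup(seq: str, start: int, end: int) -> str:
    """Return `seq` with residues start..end (1-based, inclusive) duplicated in tandem."""
    res = re.findall(r"[A-Z][a-z]{2}", seq)
    if not 1 <= start <= end <= len(res):
        raise ValueError("positions outside the sequence")
    # original up to and including `end`, then the copy, then the remainder
    return "".join(res[:end] + res[start - 1:end] + res[end:])


reference = "MetGlyAlaArgSerSerHis"
print(apply_dup(reference, 3, 3))  # p.Ala3dup
print(apply_dup(reference, 3, 5))  # p.Ala3_Ser5dup
print(apply_dup(reference, 6, 6))  # p.Ser6dup
```

Note that `apply_dup(reference, 5, 5)` gives the same sequence as `apply_dup(reference, 6, 6)`, which is why the most C-terminal residue is the one reported.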

Q&A

Why do we not describe a duplication as an insertion?

Although duplications are basically a special type of insertion, there are several reasons why the recommendation is to describe duplications differently:
  • the description is simpler and shorter,
  • it is clear and prevents confusion regarding the position when an insertion is incorrectly reported like "Ala22insGly".

How should I describe the change "MetArgThrGlySerSerHisGlnTrpPhe" to "MetArgThrGlySerSerHisGlySerSerGlnTrpPhe"? The fact that the inserted sequence (GlySerSer) is present in the original sequence suggests it derives from a duplicative event.

The variant should be described as an insertion: p.His7_Gln8insGly4_Ser6. A description using "dup" is not correct since, by definition, a duplication should directly flank the original copy on its C-terminal side (in tandem). Note that the description given still makes it clear that the sequence inserted between p.His7 and p.Gln8 is probably derived from nearby, i.e. position p.Gly4 to p.Ser6, and thus likely derived from a duplicative event.
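
The distinction between a tandem duplication and an insertion can be checked by comparing the two protein sequences. The sketch below is an illustration under stated assumptions, not a reference implementation: `describe_insertion` is a hypothetical name, it handles only a single simple gain of residues, and it writes an insertion with the inserted residues spelled out rather than in the reference-range form (Gly4_Ser6) used in the answer above.

```python
import re


def residues(seq: str) -> list:
    return re.findall(r"[A-Z][a-z]{2}", seq)


def describe_insertion(ref: str, var: str) -> str:
    """Describe a simple gain of residues as 'dup' (tandem) or 'ins'."""
    r, v = residues(ref), residues(var)
    n = len(v) - len(r)
    if n <= 0:
        raise ValueError("variant must be longer than the reference")
    # Longest shared N-terminal stretch; this places the change as far
    # C-terminal as possible.
    i = 0
    while i < len(r) and r[i] == v[i]:
        i += 1
    inserted = v[i:i + n]
    if v[i + n:] != r[i:]:
        raise ValueError("not a single simple insertion")
    # Tandem duplication: the new residues copy the n residues just before them.
    if i >= n and r[i - n:i] == inserted:
        first, last = f"{r[i - n]}{i - n + 1}", f"{r[i - 1]}{i}"
        return f"p.{first}dup" if n == 1 else f"p.{first}_{last}dup"
    if i == 0 or i == len(r):
        raise ValueError("changes at a terminus are outside this sketch")
    return f"p.{r[i - 1]}{i}_{r[i]}{i + 1}ins{''.join(inserted)}"


print(describe_insertion("MetArgThrGlySerSerHisGlnTrpPhe",
                         "MetArgThrGlySerSerHisGlySerSerGlnTrpPhe"))
```

Because GlySerSer is inserted after His7 and not directly after Ser6, the sketch reports an insertion between His7 and Gln8.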

What do you mean by "variants should be described on the protein level and not incorporate knowledge regarding the change at the DNA level"?

It means that protein variant descriptions should be derived from comparing the variant protein sequence with the reference protein sequence. Knowledge of the underlying change at the DNA level should not be used. For example, when MetTrpSerSerSerHisAsp.. changes to MetTrpSerSerSerSerHisAsp.., this is described as p.Ser5dup. The information that at the DNA level the change is ..ATGTGGTCCAGTTCCCACGAT.. to ..ATGTGGTCCAGTAGTTCCCACGAT.., so the codon for Ser4 is duplicated, is not used; the description p.Ser4dup is not correct.
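
A small sketch may help show why the DNA-level information cannot be used. It translates the two DNA fragments quoted above with a codon table limited to the six codons that occur in them (standard genetic code), then checks that duplicating Ser3, Ser4 or Ser5 in the reference protein gives the same variant protein. The names `CODONS` and `translate` are illustrative.

```python
# Only the codons present in the two fragments (standard genetic code).
CODONS = {"ATG": "Met", "TGG": "Trp", "TCC": "Ser",
          "AGT": "Ser", "CAC": "His", "GAT": "Asp"}


def translate(dna: str) -> list:
    """Translate an in-frame DNA fragment into three-letter residues."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]


ref_protein = translate("ATGTGGTCCAGTTCCCACGAT")
var_protein = translate("ATGTGGTCCAGTAGTTCCCACGAT")

# Duplicate the residue at each Ser position (1-based) in the reference.
matching = [
    pos for pos in (3, 4, 5)
    if ref_protein[:pos] + [ref_protein[pos - 1]] + ref_protein[pos:] == var_protein
]

print(matching)               # every Ser position gives the same protein
print(f"p.Ser{max(matching)}dup")  # the most C-terminal one is reported
```

At the protein level the three candidate positions are indistinguishable, so the most C-terminal one is assigned, regardless of which codon was duplicated in the DNA.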