Protein Recommendations

Duplication Variant


A sequence change between the translation initiation (start) and termination (stop) codon where, compared to a reference sequence, a copy of one or more amino acids is inserted directly C-terminal of the original copy of that sequence.


Format: “prefix”“amino_acid(s)+position(s)_duplicated”“dup”, e.g. p.(Cys76_Glu79dup)

“prefix” = reference sequence used = p.
“amino_acid(s)+position(s)_duplicated” = amino acid with position or range (first amino acid with position to last amino acid with position) duplicated = Cys76_Glu79
“dup” = type of change is a duplication = dup
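
As an illustration of how the three parts of the format combine, here is a minimal Python sketch. The function name `format_dup` and its parameters are hypothetical and not part of the recommendation; the sketch only concatenates the pieces and does no validation.

```python
def format_dup(first, last=None, predicted=False):
    """Build a protein duplication description.

    first and last are (three_letter_code, position) tuples; last is
    omitted when a single amino acid is duplicated. predicted=True wraps
    the description in parentheses (consequence without experimental
    evidence).
    """
    code, position = first
    core = f"{code}{position}"
    if last is not None and last != first:
        core += f"_{last[0]}{last[1]}"
    core += "dup"
    return f"p.({core})" if predicted else f"p.{core}"
```

For example, `format_dup(("Cys", 76), ("Glu", 79), predicted=True)` gives `p.(Cys76_Glu79dup)`.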


  • all variants should be described at the DNA level; descriptions at the RNA and/or protein level may be given in addition.
  • prefix reference sequence accepted is “p.” (protein).
  • predicted consequences, i.e. without experimental evidence (no RNA or protein sequence analysed), should be given in parentheses, e.g. p.(Arg727_Ser783dup).
  • the “amino_acids+positions_duplicated” should contain two different positions, i.e. Cys76_Glu79, not Cys76_Cys76.
    • the “positions_duplicated” should be listed from N-terminal to C-terminal, i.e. Cys76_Glu79, not Glu79_Cys76.
  • by definition, duplication may only be used when the additional copy is directly C-terminal of the original copy (a “tandem duplication”).
    • when the extra copy is, at the protein level, not in tandem (directly C-terminal), the change should be described as an insertion (see Insertion).
    • duplications extending the amino acid sequence at the C-terminal end with one or more amino acids are described as Extension
  • for all descriptions the most C-terminal position possible of the reference sequence is arbitrarily assigned to have been changed (3’ rule)
  • duplications at the DNA or RNA level, starting N-terminal of and including the translation termination (stop) codon usually have no (predicted) effect on the protein level.
  • duplications at DNA or RNA level
    • which introduce an immediate translation termination (stop) codon at the protein level are described as a nonsense variant.
    • encoding a translation stop codon in the duplicated sequence are at the protein level described as an insertion of this sequence, not as a deletion-insertion removing the entire C-terminal amino acid sequence.
    • encoding an open reading frame which after the duplicated sequence shifts to another reading frame are described as a frameshift.
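
Some of the rules above are purely syntactic and can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `check_dup`; it tests only the prefix, the "two different positions" rule and the ordering rule. It does not verify that the three-letter codes are real amino acids, and it cannot decide whether a variant is truly a tandem duplication.

```python
import re

# Three-letter code + position, optionally followed by "_" and a second
# code + position, then "dup".
_DUP = re.compile(r"([A-Z][a-z]{2})(\d+)(?:_([A-Z][a-z]{2})(\d+))?dup")


def check_dup(description):
    """Return a list of rule violations (empty if none were found)."""
    if not description.startswith("p."):
        return ['prefix must be "p."']
    body = description[2:]
    # Parentheses mark a predicted consequence; strip them before matching.
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]
    match = _DUP.fullmatch(body)
    if match is None:
        return ["not a recognisable duplication description"]
    problems = []
    if match.group(3) is not None:
        start, end = int(match.group(2)), int(match.group(4))
        if start == end:
            problems.append("range must contain two different positions")
        elif start > end:
            problems.append("positions must run from N-terminal to C-terminal")
    return problems
```

With this sketch, `check_dup("p.(Cys76_Glu79dup)")` returns an empty list, while `check_dup("p.Cys76_Cys76dup")` and `check_dup("p.Glu79_Cys76dup")` each report one problem.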


  • p.Ala3dup (one amino acid)
    a duplication of amino acid Ala3 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaAlaArgSerSerHis
  • p.(Ala3dup)
    the predicted consequence at the protein level is a duplication of amino acid Ala3 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaAlaArgSerSerHis
  • p.Ala3_Ser5dup (several amino acids)
    a duplication of amino acids Ala3 to Ser5 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaArgSerAlaArgSerSerHis
  • p.Ser6dup
    a duplication of amino acid Ser6 in the sequence MetGlyAlaArgSerSerHis to MetGlyAlaArgSerSerSerHis
    NOTE: for duplications in single amino acid stretches or tandem repeats, the most C-terminal residue is arbitrarily assigned to have been duplicated
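
The examples above can be reproduced by comparing the reference and variant protein sequences directly. The sketch below is one possible way to do this, not an official algorithm: the function names `residues` and `describe_tandem_dup` are hypothetical, sequences are assumed to be written in three-letter codes, and only a single internal tandem duplication is recognised. Taking the longest common prefix of the two sequences places the change as far C-terminal as possible, which is what the 3’ rule asks for.

```python
def residues(sequence):
    """Split a string of three-letter codes into a list of residues."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


def describe_tandem_dup(ref, var):
    """Return a 'p.…dup' description, or None if var is not ref plus one
    internal tandem copy. ref and var are lists of three-letter codes."""
    n = len(var) - len(ref)  # number of extra residues
    if n <= 0:
        return None
    # Longest common prefix: shifts the change to the most C-terminal position.
    p = 0
    while p < len(ref) and ref[p] == var[p]:
        p += 1
    if p == len(ref):
        return None  # sequence extended at the C-terminal end: not handled here
    inserted = var[p:p + n]
    if var[p + n:] != ref[p:]:
        return None  # more than one simple insertion
    if p < n or ref[p - n:p] != inserted:
        return None  # inserted residues are not a copy of the preceding ones
    first, last = p - n + 1, p  # 1-based positions of the original copy
    if first == last:
        return f"p.{ref[last - 1]}{last}dup"
    return f"p.{ref[first - 1]}{first}_{ref[last - 1]}{last}dup"
```

Applied to the reference MetGlyAlaArgSerSerHis, the variant MetGlyAlaArgSerSerSerHis yields `p.Ser6dup` rather than `p.Ser5dup`, because the common prefix runs through both serines before the sequences diverge.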


Why do we not describe a duplication as an insertion?

Although duplications are basically a special type of insertion, there are several reasons why the recommendation is to describe duplications differently:
  • the description is simpler and shorter,
  • it is clear and prevents confusion regarding the position when an insertion is incorrectly reported like "Ala22insGly".

How should I describe the change "MetArgThrGlySerSerHisGlnTrpPhe" to "MetArgThrGlySerSerHisGlySerSerGlnTrpPhe"? The fact that the inserted sequence (GlySerSer) is present in the original sequence suggests it derives from a duplicative event.

The variant should be described as an insertion: p.His7_Gln8insGly4_Ser6. A description using "dup" is not correct since, by definition, the duplicated copy must lie directly C-terminal of the original copy (in tandem). Note that the description given still makes it clear that the sequence inserted between p.His7 and p.Gln8 is probably derived from nearby, i.e. position p.Gly4 to p.Ser6, and thus likely derived from a duplicative event.
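
The choice between "dup" and "ins" in this answer depends only on where the copy lands relative to the original. A minimal sketch of that decision follows; the function `describe_copy` is hypothetical, positions are 1-based, and the caller is assumed to have already applied the 3' rule.

```python
def describe_copy(ref, first, last, insert_after):
    """Describe a copy of ref positions first..last inserted after position
    insert_after. ref is a list of three-letter codes; positions are 1-based."""
    if first == last:
        source = f"{ref[first - 1]}{first}"
    else:
        source = f"{ref[first - 1]}{first}_{ref[last - 1]}{last}"
    # In tandem: the copy sits directly C-terminal of the original.
    if insert_after == last:
        return f"p.{source}dup"
    # Otherwise it is an insertion between two flanking residues.
    left = f"{ref[insert_after - 1]}{insert_after}"
    right = f"{ref[insert_after]}{insert_after + 1}"
    return f"p.{left}_{right}ins{source}"
```

For the reference MetArgThrGlySerSerHisGlnTrpPhe, a copy of positions 4 to 6 placed after position 7 gives `p.His7_Gln8insGly4_Ser6`; only a copy placed directly after position 6 would be `p.Gly4_Ser6dup`.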

What do you mean by "variants should be described on the protein level and not incorporate knowledge regarding the change at the DNA-level"?

It means that protein variant descriptions should be derived from comparing the variant protein sequence with the reference protein sequence. Knowledge of the underlying change at the DNA level should not be used. For example, when MetTrpSerSerSerHisAsp.. changes to MetTrpSerSerSerSerHisAsp.. this is described as p.Ser5dup. The information that at the DNA level the change is ..ATGTGGTCCAGTTCCCACGAT.. to ..ATGTGGTCCAGTAGTTCCCACGAT.., so the codon for Ser4 is duplicated, is not used; the description p.Ser4dup is not correct.
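
The difference between the two viewpoints can be made concrete with a short Python sketch. The codon table below covers only the six codons that occur in this example, and the helper names `codons` and `shared_prefix_length` are hypothetical. The sketch assumes a single-residue duplication, so the number of matching leading residues equals the position reported under the 3' rule.

```python
# Partial codon table: only the codons used in this example.
CODONS = {"ATG": "Met", "TGG": "Trp", "TCC": "Ser",
          "AGT": "Ser", "CAC": "His", "GAT": "Asp"}


def codons(dna):
    """Split an in-frame DNA string into codons."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def shared_prefix_length(a, b):
    """Number of leading items that are identical in a and b."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i


ref_dna = "ATGTGGTCCAGTTCCCACGAT"
var_dna = "ATGTGGTCCAGTAGTTCCCACGAT"

ref_protein = [CODONS[c] for c in codons(ref_dna)]
var_protein = [CODONS[c] for c in codons(var_dna)]

# Protein-level comparison: the sequences agree up to and including Ser5.
protein_pos = shared_prefix_length(ref_protein, var_protein)
# Codon-level comparison: the sequences agree only up to codon 4 (AGT).
codon_pos = shared_prefix_length(codons(ref_dna), codons(var_dna))

description = f"p.{ref_protein[protein_pos - 1]}{protein_pos}dup"
```

Here `codon_pos` is 4 while `protein_pos` is 5, and the protein-level description built from the protein comparison alone is `p.Ser5dup`.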