Since references to web sites are not yet acknowledged as citations, please mention Den Dunnen et al. (2016) HGVS recommendations for the description of sequence variants: 2016 update. Hum.Mutat. 37: 564-569 when referring to these pages. Note that although these pages mainly give examples for human (Homo sapiens), the recommendations can be applied to all species.
- all variants should be described at the most basic level, the DNA level. Descriptions at the RNA and/or protein level may be given in addition.
- descriptions should make clear whether the change was experimentally determined or theoretically deduced by giving predicted consequences in parentheses
- descriptions at RNA/protein level should describe the changes observed on that level (RNA/protein) and not try to incorporate any knowledge regarding the change at DNA-level (see Questions below)
- all variants should be described in relation to an accepted reference sequence (see Reference Sequences).
- the reference sequence file used should be public and clearly described, e.g. NC_000023.10, LRG_199, NG_012232.1, NM_004006.2, LRG_199t1, NR_002196.1, NP_003997.1, etc. (see Reference Sequences)
- when variants are not reported in relation to a genomic reference sequence from a recent genome build, the preferred reference sequence is a Locus Reference Genomic sequence (LRG)
- when no LRG is available, one should be requested (see Reference Sequences).
- the reference sequence used must contain the residue(s) described to be changed.
- a letter prefix is mandatory to indicate the type of reference sequence used. Accepted prefixes are:
- “c.” for a coding DNA reference sequence
- “g.” for a linear genomic reference sequence
- “m.” for a mitochondrial DNA reference sequence
- “n.” for a non-coding DNA reference sequence
- “o.” for a circular genomic reference sequence
- “p.” for a protein reference sequence
- “r.” for an RNA reference sequence (transcript)
- numbering of the residues (nucleotide or amino acid) in relation to the reference sequence used should follow the approved scheme (see Numbering)
- 3’rule: for all descriptions the most 3’ position possible of the reference sequence is arbitrarily assigned to have been changed
- the 3’rule also applies for changes in single residue stretches and tandem repeats (nucleotide or amino acid)
- the 3’rule applies to ALL descriptions (genome, gene, transcript and protein) of a given variant
- exception: deletion/duplication around exon/exon junctions using c., r. or n. reference sequences (see Numbering)
- descriptions at DNA, RNA and protein level are clearly different:
- prioritisation: when a description is possible according to several types, the preferred description is: (1) deletion, (2) inversion, (3) duplication, (4) conversion, (5) insertion
- when a variant can be described as a duplication or an insertion, prioritisation determines it should be described as a duplication
- descriptions removing part of a reference sequence replacing it with part of the same sequence are not allowed (e.g. NM_004006.2:c.[762_768del;767_774dup])
- only approved HGNC gene symbols should be used to describe genes
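The 3’rule for deletions in single residue stretches and tandem repeats can be sketched in a few lines. This is a minimal illustration, not an official HGVS tool: the function names `shift_deletion_3prime` and `describe_deletion` are hypothetical, positions are 1-based, and the exception for exon/exon junctions in c., r. or n. reference sequences is not handled.

```python
def shift_deletion_3prime(seq: str, start: int, end: int) -> tuple[int, int]:
    """Shift a deletion (1-based, inclusive) to its most 3' equivalent position."""
    # Deleting start..end gives the same result as deleting start+1..end+1
    # whenever the first deleted residue equals the residue right after the range.
    while end < len(seq) and seq[start - 1] == seq[end]:
        start += 1
        end += 1
    return start, end


def describe_deletion(seq: str, start: int, end: int, prefix: str = "g.") -> str:
    """Build a deletion description after applying the 3' rule (illustrative only)."""
    start, end = shift_deletion_3prime(seq, start, end)
    if start == end:
        return f"{prefix}{start}del"
    # "_" (underscore) indicates a range
    return f"{prefix}{start}_{end}del"


# Single residue stretch: any G in TGGGGCAT is reported as the last one.
print(describe_deletion("TGGGGCAT", 2, 2))
# Tandem repeat: one TG unit in ACTGTGTGA is reported as the last unit.
print(describe_deletion("ACTGTGTGA", 3, 4))
```

The same shifting logic applies to amino acid sequences, since the rule covers all description levels; only the prefix and the residue alphabet change.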
In HGVS nomenclature some characters have a specific meaning
- “+” (plus) is used in nucleotide numbering; c.123+45A>G
- “-” (minus) is used in nucleotide numbering; c.124-56C>T
- “*” (asterisk) is used in nucleotide numbering and to indicate a translation termination (stop) codon (see Standards); c.*32G>A and p.Trp41*
- “_” (underscore) is used to indicate a range; g.12345_12678del
- “[ ]” (square brackets) are used for alleles (see DNA, RNA, protein)
- “;” (semicolon) is used to separate variants and alleles; g.[123456A>G;345678G>C] or g.[123456A>G];[345678G>C]
- “,” (comma) is used to separate different transcripts/proteins derived from one allele; r.[123a>t, 122_154del]
- “:” (colon) is used to separate the reference sequence file identifier (accession.version_number) from the actual description of a variant; NC_000011.9:g.12345611G>A
- “( )” (parentheses) are used to indicate uncertainties and predicted consequences; NC_000023.9:g.(123456_234567)_(345678_456789)del, p.(Ser123Arg)
NOTE: the range of the uncertainty should be described as precisely as possible (see below)
- “?” (question mark) is used to indicate unknown positions (nucleotide or amino acid); g.(?_234567)_(345678_?)del
- “^” (caret) is used as “or”; c.(370A>C^372C>R) as back translation of p.Ser124Arg (i.e. changing the AGC codon to CGC, AGG or AGA)
- “>” (greater than) is used to describe substitution variants (DNA and RNA level); g.12345A>T, r.123a>u (see DNA, RNA)
- “=” (equals) is used to indicate a sequence was tested but found unchanged; p.(Arg234=)
- “/” (forward slash) is used to indicate mosaicism (see Example DNA substitution)
- “//” (double forward slash) is used to indicate chimerism (see Example DNA substitution)
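As a rough illustration of how the colon, the letter prefix and the parentheses fit together, the sketch below splits a description into its parts. It is not a validator or an official parser: `split_description` is a hypothetical helper, the check for parentheses is deliberately crude, and the pairing of NP_003997.1 with p.(Ser42Gly) is used for illustration only.

```python
import re

# Accepted prefixes and the reference sequence type each one indicates.
PREFIXES = {
    "c": "coding DNA",
    "g": "linear genomic",
    "m": "mitochondrial DNA",
    "n": "non-coding DNA",
    "o": "circular genomic",
    "p": "protein",
    "r": "RNA (transcript)",
}


def split_description(text: str) -> dict:
    """Split 'reference:prefix.description' into its parts (illustrative only)."""
    # ":" separates the reference sequence identifier from the description.
    reference, sep, variant = text.partition(":")
    if not sep:
        raise ValueError("missing ':' between reference sequence and description")
    match = re.fullmatch(r"([cgmnopr])\.(.+)", variant)
    if match is None:
        raise ValueError(f"missing or unknown prefix in {variant!r}")
    prefix, body = match.groups()
    return {
        "reference": reference,
        "type": PREFIXES[prefix],
        "description": body,
        # Crude check: a description wrapped in "( )" is a predicted consequence.
        "in_parentheses": body.startswith("(") and body.endswith(")"),
    }


print(split_description("NC_000011.9:g.12345611G>A"))
print(split_description("NP_003997.1:p.(Ser42Gly)"))
```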
Abbreviations in variant descriptions
Specific abbreviations are used to describe different variant types.
- Some papers and web sites use a “-” (minus) to indicate a range. Is this correct?
- No. The sign used to indicate a range is “_” (underscore) and not “-” (minus). The minus sign should only be used as a minus in the description of variants based on a coding DNA reference sequence. c.12-14del describes a deletion of nucleotide -14 in the intron directly preceding coding DNA nucleotide 12, not a deletion of nucleotides c.12 to c.14.
- Why is it recommended to use three-letter amino acid code to describe protein variants?
- Several amino acids start with the same initial letter (Ala, Arg, Asn, Asp start with A, Gln, Glu, Gly with G, Leu, Lys with L, Phe, Pro with P and Thr, Tyr with T) but in one-letter amino acid code this letter is used as abbreviation for only one. In practice this leads to many mistakes. It is therefore recommended to use three-letter amino acid code abbreviations.
- When I want to report a variant on DNA, RNA and protein level do I need to use a specific separator?
- No, the best approach is to report the variant using the format NM_004006.2:c.124A>G r.(?) p.(Ser42Gly). NOTE: always mention the reference sequence file used.
- What do you mean by “descriptions at protein level should describe the changes observed on that level and not try to incorporate any knowledge regarding the change at DNA-level”?
- To describe a variant at the protein level you simply compare the reference and variant protein sequence. You forget what happened at the DNA level. When the sequence …ATG AGC TCG AGC CTT… (encoding MetSerSerSerLeu) changes to …ATG AGC AGC CTT… (encoding MetSerSerLeu), this is described as p.(Ser4del) and not as p.(Ser3del), even though at the DNA level it is the codon for Ser3 (TCG) that is deleted. At the protein level the change lies in a stretch of three Ser residues, so the 3’rule assigns the deletion to the last one.
- Is it correct that when I apply the 3’rule for genes that are on the minus strand of a chromosome, the “g.” and “c.” variant descriptions differ regarding the nucleotide that I describe as deleted?
- Yes, when a gene is on the minus strand of a chromosome (opposite transcriptional orientation) and the change is located in a repeated sequence (mono-, di-, tri-, etc. nucleotide stretches), this is a consequence of the 3’rule. When the chromosome sequence is -TGGGGCAT- and one of the G’s is deleted (change to -TGGG_CAT-), the description based on chromosome coordinates is g.5delG. When the annotated coding DNA reference sequence is on the minus strand (ATGCCCCA), the description is c.7delC. Not only is the deleted nucleotide different (delG vs. delC), the descriptions also point to different nucleotides: g.5 vs. g.2 (the genomic position equivalent to c.7delC).
- Can I describe a deletion when I have not yet sequenced the break point?
- Yes, using the characters to indicate uncertainties, i.e. the question mark (“?”) and parentheses (“( )”), such cases can be described. Describe the range of uncertainty as precisely as possible. For details see Uncertain.
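The minus-strand question above can be checked with a short calculation. The sketch below uses the toy sequence TGGGGCAT from that answer and assumes, as the answer does, that the coding DNA reference sequence is simply the reverse complement numbered from 1. The helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def delete_at(seq: str, pos: int) -> str:
    """Return seq with the residue at 1-based position pos removed."""
    return seq[:pos - 1] + seq[pos:]


def to_other_strand(pos: int, length: int) -> int:
    """Number the same nucleotide from the 5' end of the opposite strand."""
    return length + 1 - pos


plus = "TGGGGCAT"                  # chromosome (g.) orientation
minus = reverse_complement(plus)   # coding DNA (c.) orientation in this toy case

# The nucleotide described as c.7 lies at g.2, not at g.5.
print(minus, to_other_strand(7, len(plus)))

# Both descriptions still yield the same variant sequence.
print(delete_at(plus, 5), reverse_complement(delete_at(minus, 7)))
```

Deleting g.2 or g.5 gives an identical sequence, which is why the 3’rule is needed to pick one description per strand, and why the g. and c. descriptions end up naming opposite ends of the same stretch.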