Changes in DNA, RNA and protein sequences, also called variants, mutations or polymorphisms, are described using a specific language. To prevent confusion about the meaning of such descriptions, a standard has been developed for this language: the so-called HGVS nomenclature. The standard is used worldwide, especially in human health and diagnostics. This page explains the standard briefly and in simple terms. After reading it you should understand the basics of the HGVS nomenclature and be able to use the internet to find more information on specific variants. In addition, while searching, you should be able to avoid mistakes that lead to misinterpretation of a variant description and its possible consequences. More details on all subjects are available elsewhere on the HGVS nomenclature pages.
The format of a complete variant description is reference:description, e.g. NM_004006.2:c.4375C>T.
All variants are described in relation to a reference, the so-called reference sequence (NM_004006.2 in the example). The reference is followed by a description of the variant (c.4375C>T in the example).
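The reference:description format can be illustrated with a minimal Python sketch. The function name split_variant is hypothetical and not part of any HGVS tool; the sketch only separates the two parts at the first colon and does not check that either part is valid.

```python
from typing import Tuple


def split_variant(full_description: str) -> Tuple[str, str]:
    """Split 'reference:description' into its two parts (hypothetical helper)."""
    # partition() splits at the first colon only.
    reference, separator, description = full_description.partition(":")
    if not separator or not reference or not description:
        raise ValueError(
            "expected the format reference:description, got %r" % full_description
        )
    return reference, description


reference, description = split_variant("NM_004006.2:c.4375C>T")
print(reference)    # NM_004006.2
print(description)  # c.4375C>T
```

A description given without its reference, such as c.4375C>T alone, raises an error here, which mirrors the point that such a description cannot be interpreted on its own.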
A description without a reference sequence is nearly useless: additional information is then required to guess which reference may have been used, e.g. the name of the gene containing the variant, the associated phenotype (disease) studied, the chromosome number and possibly the predicted consequences of the variant on the RNA and/or protein level. Furthermore, since reference sequences usually change over time, the date of the report can give useful information as well.
Variants are usually detected by reading the DNA sequence (sequencing). A proper report always contains the variant described on the DNA level. Often the predicted consequences on the protein level are given as well. In rare cases, not following current standards, only the predicted consequences at the protein level are reported.
Variants described on the DNA level are mostly reported in relation to a specific gene, based on a so-called “coding DNA reference sequence”. When a coding DNA reference sequence is used, the description of the variant starts with “c.” (in the example c.4375C>T). Since we now have a reliable reference sequence of the complete human genome, it is becoming more common to (also) give the description based on a “genomic reference sequence”, starting with “g.” (g.32407761G>A). In addition, the (predicted) consequences on the RNA level (starting with “r.”) and/or the protein level (starting with “p.”) may be given. NOTE: the “p.” prefix is often missing when the predicted protein consequences are reported. For details see “Reference Sequences”.
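The four prefixes mentioned above can be summarised in a small lookup, sketched here in Python. The names PREFIX_LEVELS and description_level are hypothetical, and only the prefixes discussed on this page are included.

```python
# Prefixes discussed on this page and the level they refer to.
PREFIX_LEVELS = {
    "c.": "coding DNA",
    "g.": "genomic DNA",
    "r.": "RNA",
    "p.": "protein",
}


def description_level(description: str) -> str:
    """Return the level a variant description refers to, based on its prefix."""
    # The prefix is the first two characters, e.g. 'c.' in 'c.4375C>T'.
    return PREFIX_LEVELS.get(description[:2], "unknown (prefix missing?)")


print(description_level("c.4375C>T"))      # coding DNA
print(description_level("g.32407761G>A"))  # genomic DNA
print(description_level("4375C>T"))        # unknown (prefix missing?)
```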
Reference sequences have a format like NC_000023.10, where NC_000023 is the accession number of the reference sequence and “.10” its version number. Version numbers are required since we started to use reference sequences at a time when our knowledge of the human genome was far from complete. The version number directly follows the accession number and increases over time; NC_000023.9 (March 2006) was followed by NC_000023.10 (Feb. 2009) and NC_000023.11 (Dec. 2013).
For human chromosome reference sequences, the digits of the accession number directly in front of the version number give the number of the chromosome: 1-22 for chromosomes 1 to 22, 23 for the X chromosome and 24 for the Y chromosome. In NC_000023.10 this number is “23”, so it is a reference sequence of the human X chromosome.
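As an illustration of how such a reference is read, the following Python sketch takes apart a human chromosome reference like NC_000023.10. The function name and the regular expression are assumptions made for this example; the sketch accepts only NC_ accessions numbered 1 to 24 and rejects everything else.

```python
import re
from typing import Tuple

# NC_ followed by six digits, a dot and the version number.
NC_PATTERN = re.compile(r"^(NC_(\d{6}))\.(\d+)$")


def parse_chromosome_reference(reference: str) -> Tuple[str, int, str]:
    """Return (accession, version, chromosome) for a human NC_ reference."""
    match = NC_PATTERN.match(reference)
    if match is None:
        raise ValueError("not a versioned NC_ reference: %r" % reference)
    accession = match.group(1)
    number = int(match.group(2))
    version = int(match.group(3))
    if 1 <= number <= 22:
        chromosome = str(number)
    elif number == 23:
        chromosome = "X"
    elif number == 24:
        chromosome = "Y"
    else:
        raise ValueError("not a human chromosome 1-22, X or Y: %r" % reference)
    return accession, version, chromosome


print(parse_chromosome_reference("NC_000023.10"))  # ('NC_000023', 10, 'X')
```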
In many cases the reference sequence is not given but a genome build is mentioned instead. Genome build names have two formats, either “hg” and a number (hg18, hg19, hg38) or “NCBI”/“GRCh” and a number (NCBI35, NCBI36, GRCh37, GRCh38). A genomic reference sequence including its version number (like NC_000023.10) is exact. When it is missing, one needs to know the genome build used. A complication is that genome builds are versioned as well, through so-called “patches” (e.g. p1) in which errors are corrected.
Genomic reference sequences can also be based on smaller sequences, usually covering only a specific gene or a specifically named genomic segment. The most frequently used are LRGs (Locus Reference Genomic sequences, format LRG_199) and NGs (RefSeq Gene reference sequences, format NG_012232.1).
In a human diagnostic setting the most frequently used reference is a “coding DNA reference sequence” (description starting with “c.”, e.g. c.4375C>T). Variant descriptions based on this format are very popular because they directly link to the encoded protein. Numbering starts with 1 at the first position of the protein coding region, the A of the starting ATG triplet. Numbering ends at the last position of the stop codon (TAA, TAG or TGA). Positions in front of the protein coding sequence get a minus sign (e.g. “c.-26”), those after the stop codon an asterisk (e.g. “c.*85”). When you divide the position number from a “c.” description by three and round up, you get in most cases the number of the affected amino acid residue in the protein sequence (description starting with “p.”); for c.4375C>T, 4375 divided by three gives 1458.33, so amino acid 1459 is affected. The most frequently used coding DNA reference sequences are the NMs (RefSeq gene transcript sequences, e.g. NM_004006.2) and LRGs (Locus Reference Genomic sequences, e.g. LRG_199t1).
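The divide-by-three rule can be written out as a short Python sketch. The function names are hypothetical. The sketch applies only to plain positive positions inside the protein coding region, not to positions written with a minus sign or an asterisk.

```python
def codon_number(c_position: int) -> int:
    """Amino acid (codon) number for a coding DNA position, rounding up."""
    if c_position < 1:
        raise ValueError("only positions inside the coding region are supported")
    # Positions 1-3 form codon 1, positions 4-6 codon 2, and so on.
    return (c_position + 2) // 3


def position_in_codon(c_position: int) -> int:
    """Return 1, 2 or 3: the place of the nucleotide within its codon."""
    if c_position < 1:
        raise ValueError("only positions inside the coding region are supported")
    return (c_position - 1) % 3 + 1


print(codon_number(4375))       # 1459
print(position_in_codon(4375))  # 1
```

For c.4375C>T this gives amino acid 1459, with the changed nucleotide at the first position of that codon.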
Depending on the change found, the description of a variant can have many different formats. For a detailed overview we refer to the specific pages on this website (see header “Recommendations”). Here we list and briefly explain the major variant types.
A standard variant description has the format “prefix_position(s)_change”. In the variant description c.4375C>T the prefix “c.” indicates the type of reference sequence used (“c.” indicating a coding DNA reference sequence), “4375” the position of the nucleotide(s) affected and “C>T” the change (a C changed to T).
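To make the three parts concrete, here is a minimal Python sketch that takes apart a simple substitution such as c.4375C>T. The function name and the regular expression are assumptions for this example only. It handles plain positive positions with the “c.” or “g.” prefix and a single-nucleotide change, and returns None for anything else.

```python
import re
from typing import Dict, Optional, Union

# prefix, position, original nucleotide, '>', new nucleotide
SUBSTITUTION = re.compile(r"^([cg]\.)(\d+)([ACGT])>([ACGT])$")


def parse_substitution(description: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split a simple substitution into prefix, position and change."""
    match = SUBSTITUTION.match(description)
    if match is None:
        return None  # not a simple substitution this sketch understands
    return {
        "prefix": match.group(1),
        "position": int(match.group(2)),
        "from": match.group(3),
        "to": match.group(4),
    }


print(parse_substitution("c.4375C>T"))
# {'prefix': 'c.', 'position': 4375, 'from': 'C', 'to': 'T'}
```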
All variants given are in the DMD gene and reported in relation to coding DNA reference sequence NM_004006.2.
There are more variant types, but these occur less frequently. For details see header “Recommendations”.
It should be noted that one variant, based on different reference sequences used, can be described in many different ways. Variant c.5234G>A in the DMD gene can be described based on different genomic reference sequences (e.g. NC_000023.9:g.32290917C>T, NC_000023.10:g.32380996C>T, NC_000023.11:g.32362879C>T, NG_012232.1:g.981731G>A, LRG_199:g.981731G>A) as well as different coding DNA reference sequences (e.g. LRG_199t1:c.5234G>A, NM_004006.2:c.5234G>A, NM_004009.3:c.5222G>A, NM_000109.3:c.5210G>A, NM_004007.2:c.4865G>A, NM_004010.3:c.4865G>A, NM_004011.3:c.1211G>A, NM_004012.3:c.1202G>A). These alternative descriptions are rather confusing, especially when reference sequences are not properly listed. Consequently, when databases or the internet are queried for information regarding the potential consequences of specific variants, errors are easily made.
Sometimes variants are not described using the format reference:description (NM_004006.2:c.4375C>T) explained above but using an identifier (ID) from another database. Common formats include an rs ID (from dbSNP, rs1800266), OMIM ID (from OMIM, OMIM300377:0073), LOVD ID (from LOVD, ANO5_000052), RCV ID (from ClinVar, RCV000012031), etc. In most cases the ID can be looked up in the database it comes from to find the full description of the variant in the approved HGVS format reference:description.
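As a rough aid for recognising which kind of identifier you have, the following Python sketch guesses the source database from the shape of the ID. The patterns are assumptions derived only from the examples given above and are not official format definitions of these databases; the function name is hypothetical as well.

```python
import re

# Heuristic patterns, based only on the example IDs shown on this page.
ID_PATTERNS = [
    ("HGVS", re.compile(r"^[^:\s]+:[cgrp]\.\S+$")),           # NM_004006.2:c.4375C>T
    ("dbSNP", re.compile(r"^rs\d+$")),                        # rs1800266
    ("ClinVar", re.compile(r"^RCV\d+$")),                     # RCV000012031
    ("OMIM", re.compile(r"^OMIM\d+:\d+$")),                   # OMIM300377:0073
    # Gene symbol, underscore, six digits; skip accessions such as NM_004006.
    ("LOVD", re.compile(r"^(?!N[A-Z]_)[A-Z0-9]+_\d{6}$")),    # ANO5_000052
]


def classify_identifier(identifier: str) -> str:
    """Guess where a variant identifier comes from (heuristic only)."""
    for source, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return source
    return "unknown"


for example in ["rs1800266", "OMIM300377:0073", "ANO5_000052", "RCV000012031"]:
    print(example, "->", classify_identifier(example))
```

A result of “unknown” only means the ID does not look like one of the examples; it says nothing about whether the ID exists in the database.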
When a reference sequence is not known, the best way forward is to try to get the name of the gene that is affected by the variant. All genes have an official abbreviation, the so-called gene symbol. For the Duchenne muscular dystrophy gene the gene symbol is “DMD”. The HGNC keeps a catalog of all approved gene symbols (and their aliases/synonyms). The HGNC site can be used to find the gene symbol and to check whether the name/symbol you have is the officially approved one. Searching for “dystrophin”, the name of a protein, you will see that it is an alias for the Duchenne muscular dystrophy gene, official gene symbol “DMD”. HGNC, and many other sources, can also tell you on which human chromosome a gene is located, and thus to which chromosome the variant description you have may relate.
When you are interested in what is known about a specific variant the best start is a Gene variant database, also called Locus-Specific Database (LSDB)… in preparation …