DNA Recommendations

Substitution Variant


Definitions

Substitution
a sequence change where, compared to a reference sequence, one nucleotide is replaced by one other nucleotide.

Description

Format: “prefix”“position_substituted”“reference_nucleotide”“>”“new_nucleotide”, e.g. g.123A>G

“prefix” = reference sequence used = g.
“position_substituted” = position of the nucleotide substituted = 123
“reference_nucleotide” = nucleotide at reference position = A
“>” = type of change is a substitution = >
“new_nucleotide” = substituted nucleotide = G
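
The format above can be checked mechanically. The following is a minimal sketch, not an official validator: the pattern name SUBSTITUTION and the function parse_substitution are hypothetical, and the pattern covers only a single-nucleotide substitution without a reference sequence accession (it accepts simple, intronic and UTR-style positions, and an IUPAC code as the new nucleotide).

```python
import re

# Hypothetical pattern for the simple case only:
# prefix, position, reference nucleotide, ">", new nucleotide.
SUBSTITUTION = re.compile(
    r"^(?P<prefix>[gmcn])\."
    r"(?P<position_substituted>[-*]?\d+(?:[+-]\d+)?)"
    r"(?P<reference_nucleotide>[ACGT])>"
    r"(?P<new_nucleotide>[ACGTBDHKMNRSVWY])$"
)


def parse_substitution(description):
    """Return the parts of a simple substitution, or None if it does not match."""
    match = SUBSTITUTION.match(description)
    return match.groupdict() if match else None


print(parse_substitution("g.123A>G"))
print(parse_substitution("c.76_77AG>TT"))  # not a substitution, so None
```

Descriptions that carry an accession (for example NC_000023.10:g.33038255C>A) would need the part before the colon split off first.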


Note

  • accepted reference sequence prefixes are g., m., c. and n. (genomic, mitochondrial, coding DNA and non-coding DNA).
  • changes involving two or more consecutive nucleotides are described as deletion/insertions (indels) (see Deletion/insertion (indel)).
  • nucleotides that have been tested and found not changed are described as c.123A=, g.4567T= (see SVD-WG001 (no change)).
  • the description c.76_77delinsTT is preferred over c.[76A>T;77G>T].
    NOTE: by definition this change can not be described as a substitution (like c.76_77AG>TT or c.76AG>TT)
  • it is not correct to describe “polymorphisms” as c.76A/G (see Discussions).
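
Two of the incorrect forms named in these notes are easy to spot automatically. The sketch below is an illustration under stated assumptions: the function check_description and its hint texts are hypothetical, and it recognises only the slash notation (c.76A/G) and multi-nucleotide "substitutions" (c.76_77AG>TT, c.76AG>TT).

```python
import re


def check_description(description):
    """Return a hint for two incorrect forms named in the notes, else None."""
    # "c.76A/G": slash between two nucleotides at the end of the description.
    if re.search(r"\d[ACGT]/[ACGT]$", description):
        return "use '>' instead of '/', e.g. c.76A>G"
    # "c.76_77AG>TT" or "c.76AG>TT": more than one nucleotide on either side of ">".
    if re.search(r"[ACGT]{2,}>[ACGT]{2,}$", description):
        return "describe as a deletion/insertion, e.g. c.76_77delinsTT"
    return None


for text in ("c.76A/G", "c.76_77AG>TT", "c.76A>T", "c.85=/T>C"):
    print(text, "->", check_description(text))
```

The mosaic form c.85=/T>C is not flagged, because its slash follows "=" rather than a nucleotide.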

Examples

  • NC_000023.10:g.33038255C>A
    a substitution of the C nucleotide at g.33038255 with an A
  • NG_012232.1(NM_004006.1):c.93+1G>T
    a substitution of the G nucleotide at c.93+1 (coding DNA reference sequence) with a T
  • LRG_199t1:c.79_80delinsTT or c.[79G>T;80C>T]
    the description c.79_80delinsTT is preferred over c.[79G>T;80C>T], unless either of the two variants (c.79G>T or c.80C>T) is known as a frequently occurring variant.
    NOTE: based on the definition of a substitution, i.e. one nucleotide replaced by one other nucleotide, this change cannot be described as a substitution like c.79_80GC>TT or c.79GC>TT
  • NM_004006.1:c.[145C>T;147C>G]
    two substitutions replacing codon CGC (position c.145 to c.147) with TGG
    NOTE: the variant can also be described as NM_004006.1:c.145_147delinsTGG, i.e. a deletion/insertion. The deletion/insertion format is preferred unless either of the two variants (c.145C>T or c.147C>G) is known as a frequently occurring variant.
  • LRG_199t1:c.54G>H
    a substitution of the G nucleotide at c.54 (coding DNA reference sequence) with an A, C or T (IUPAC code “H”, see Standards)
  • NM_004006.1:c.123=
    a screen was performed showing that nucleotide c.123 was a “C” as in the coding DNA reference sequence (the nucleotide was not changed). Alternative NM_004006.1:c.123C=.
  • LRG_199t1:c.85=/T>C
    a mosaic case where, at position 85, chromosomes containing a C (c.85T>C) are found in addition to the normal sequence (a T, described as “=”)
  • NM_004006.1:c.85=//T>C
    a chimeric case, i.e. the sample is a mix of cells containing c.85= and c.85T>C.
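
The IUPAC code in the c.54G>H example above stands for a set of possible nucleotides. As a small illustration, the sketch below expands such a code into the individual substitutions it covers. The table holds the standard IUPAC nucleotide ambiguity codes; the function expand_substitution is hypothetical, and leaving out the reference nucleotide from the result is an assumption of this sketch.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_substitution(position, reference, new_code, prefix="c"):
    """List the concrete substitutions implied by an IUPAC code."""
    # Assumption: a "substitution" by the reference nucleotide itself is skipped.
    return [
        f"{prefix}.{position}{reference}>{base}"
        for base in IUPAC[new_code]
        if base != reference
    ]


print(expand_substitution(54, "G", "H"))  # ['c.54G>A', 'c.54G>C', 'c.54G>T']
```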

Q&A

How can I shorten the descriptions of SNPs in a manuscript?

Publications reporting linkage or association studies often use a range of different markers/SNPs. Such publications should contain, at least once, an unequivocal description of all markers used, linking them to a reference sequence, preferably a genomic reference sequence. Once this has been done, simplified descriptions can be used, such as:
  • NM_004006.1:c.3G>T, using a GenBank coding DNA reference sequence,
  • GJB2:c.76A>C, using a HGNC-approved gene symbol as reference,
  • rs2306220:T>C, using a dbSNP-identifier as a reference,
  • DXS1219:g.CA[18];[21] (or AFM297yd1:g.CA[18];[21]), using a marker DXS1219 (AFM297yd1) as reference.

How should I describe a variant in the promoter region of a gene?

It is recommended to describe variants in the promoter region of a gene based on a genomic reference sequence, e.g. NC_000023.10:g.33357783G>A (chrX, hg19). Describing the variant in relation to a coding DNA reference sequence (for this variant NM_004006.1:c.-128354C>T or NM_000109.3:c.-401C>T) is possible but not very informative, because the description does not tell you how long the 5'UTR is. The variant can also be described using a genomic reference sequence containing the promoter region (for this variant e.g. L01538.1:g.1407C>T), but again this is not very informative. Although NC_000023.10:g.33357783G>A seems complex, it can be used in genome browsers, helping you to quickly zoom in on the region of interest.

Are polymorphisms described like NM_004006.1:c.76A/G?

No, all substitutions are described as NM_004006.1:c.76A>G. In the past, the format c.76A/G has been used to describe "polymorphic" sequence variants. Note that a description should be neutral: it should simply describe the change and not include any other information, such as predicted or known functional consequences.

Can I describe a GC to TG variant as a dinucleotide substitution (NG_012232.1:g.12GC>TG)?

No, this is not allowed. By definition a substitution changes one nucleotide into one other nucleotide. The change GAAGCCAG to GAATGCAG should be described as NG_012232.1:g.12_13delinsTG, i.e. a deletion/insertion (indel) (see Deletion-Insertion).
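
The rule in this answer can be sketched in code. The function describe_change below is hypothetical and handles only reference and observed sequences of equal length: one changed nucleotide gives a substitution, and more than one gives a deletion/insertion spanning the first to the last changed position. The first position of 9 is inferred from the example, where the changed G is at g.12.

```python
def describe_change(reference, observed, first_position, prefix="g"):
    """Describe an equal-length change as a substitution or a delins.

    Returns None when the two sequences are identical.
    """
    if len(reference) != len(observed):
        raise ValueError("this sketch only handles sequences of equal length")
    changed = [i for i, (r, o) in enumerate(zip(reference, observed)) if r != o]
    if not changed:
        return None
    start, end = changed[0], changed[-1]
    if start == end:
        # Exactly one nucleotide replaced by one other nucleotide.
        position = first_position + start
        return f"{prefix}.{position}{reference[start]}>{observed[start]}"
    # Two or more nucleotides affected: deletion/insertion over the whole span.
    first, last = first_position + start, first_position + end
    return f"{prefix}.{first}_{last}delins{observed[start:end + 1]}"


print(describe_change("GAAGCCAG", "GAATGCAG", 9))  # g.12_13delinsTG
```

Because the span runs from the first to the last changed nucleotide, two changes within one codon (CGC to TGG starting at c.145) come out as c.145_147delinsTGG. Whether to report frequently occurring variants separately instead is a judgment this sketch does not make.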

The BRCA1 coding DNA reference sequence NM_007294.3 from position c.2074 to c.2080 is ..CATGACA.. A variant frequently found in the population is ..CATAACA.. (NM_007294.3:c.2077G>A). In a patient I found the sequence ..CATATAACA.. Can I describe this variant as NM_007294.3:c.[2077G>A; 2077_2078insTA]?

The shortest description of this variant is NM_007294.3:c.2077delinsATA. However, since the variant is likely a combination of two other variants it is acceptable to describe it as NM_007294.3:c.[2077G>A; 2077_2078insTA].
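
That both descriptions lead to the same sequence can be verified by applying them to the reference fragment. This is a minimal sketch: the helpers apply_delins and apply_insertion are hypothetical and work on a short fragment whose first nucleotide is at a known position (here c.2074).

```python
def apply_delins(sequence, first_position, start, end, inserted):
    """Replace positions start..end (inclusive) with the inserted string."""
    i = start - first_position
    j = end - first_position + 1
    return sequence[:i] + inserted + sequence[j:]


def apply_insertion(sequence, first_position, left, inserted):
    """Insert a string directly after position `left`."""
    i = left - first_position + 1
    return sequence[:i] + inserted + sequence[i:]


reference = "CATGACA"  # c.2074 to c.2080
first = 2074

# c.2077delinsATA
short_form = apply_delins(reference, first, 2077, 2077, "ATA")

# c.[2077G>A;2077_2078insTA]: the substitution does not shift positions,
# so the insertion can be applied to its result with the same coordinates.
step = apply_delins(reference, first, 2077, 2077, "A")
combined = apply_insertion(step, first, 2077, "TA")

print(short_form, combined, short_form == combined)  # CATATAACA CATATAACA True
```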