reference sequences accepted are g., m., c. and n. (genomic, mitochondrial, coding DNA and non-coding DNA).
repeated sequences include both small (mono-, di-, tri-, etc., nucleotide) and large (kilobase-sized) repeats.
the format based on repeat position is preferred; descriptions including the repeat sequence quickly become too lengthy.
NOTE: for clarity, in the protein coding region do not use descriptions like NM_024312.4:c.2686 and NM_024312.4:c.1738TA, but use NM_024312.4:c.2693dupA and NM_024312.4:c.1741_1742insTATATATA instead.
the format g.123_124TG should not be used; it contains redundant information (both the unit range “123_124” and the unit sequence “TG”).
while g.123CAG[23] describes a repeated sequence containing 23 CAG units, g.123_125[23] describes a tri-nucleotide repeated sequence of 23 units which could be interrupted by other units (e.g. a rare CAA). The description g.123CAG[23] can thus only be used when the repeat was sequenced.
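The distinction between the two formats can be sketched in code. The following is a minimal illustration, not a full HGVS parser: it assumes a simplified grammar (one-letter prefix, plain positive positions, copy number in square brackets), and the names `REPEAT_RE` and `parse_repeat` are hypothetical.

```python
import re

# Assumed, simplified grammar: "<prefix>.<start>" followed by EITHER "_<end>"
# (position-based) OR a unit sequence (sequence-based), then "[copies]".
REPEAT_RE = re.compile(
    r"^(?P<kind>[gmcn])\."
    r"(?P<start>\d+)"
    r"(?:_(?P<end>\d+)|(?P<unit>[ACGT]+))"
    r"\[(?P<copies>\d+)\]$"
)


def parse_repeat(description):
    """Parse a simple repeat description into its parts (hypothetical helper)."""
    m = REPEAT_RE.match(description)
    if m is None:
        raise ValueError(f"not a simple repeat description: {description!r}")
    start = int(m["start"])
    if m["unit"] is not None:
        # Sequence-based: every unit is claimed to be identical,
        # so the repeat must have been sequenced.
        unit, unit_length, sequenced = m["unit"], len(m["unit"]), True
    else:
        # Position-based: only unit size and copy number are claimed;
        # interrupting units (e.g. a rare CAA) are not excluded.
        unit, unit_length, sequenced = None, int(m["end"]) - start + 1, False
    return {
        "kind": m["kind"],
        "start": start,
        "unit": unit,
        "unit_length": unit_length,
        "copies": int(m["copies"]),
        "requires_sequencing": sequenced,
    }


print(parse_repeat("g.123CAG[23]"))
print(parse_repeat("g.123_125[23]"))
```

Because the pattern accepts either a range or a unit sequence but never both, a description that combines them (the redundant format discouraged above) is rejected with a `ValueError`.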
for composite repeats the basic format can be used, successively listing each different repeat unit; g.456_467468_494495_503.
a repeated di-nucleotide sequence, with the first unit located from position g.123 to g.124, is present in 14 copies on one allele and 18 copies on the other allele
in literature the Fragile-X tri-nucleotide repeat is known as the CGG-repeat. However, based on a coding DNA reference sequence (GenBank NM_002024.5) and applying the 3' rule, the repeat has to be described as a GGC-repeat (see Recommendations).
an extended repeat of exactly 79 units
NOTE: c.-128GGC can only be used when the repeat has been sequenced, excluding the possibility that it is interrupted by one or more GGA-triplets.
the repeated tri-nucleotide sequence, starting at position c.-128, has an estimated size of between 600 and 800 copies.
NOTE: the repeat can be pure or a mix of GGC and GGA triplets.
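As an illustration of the position-based format for exact and estimated repeat sizes, here is a small sketch. It assumes the convention of a copy number in square brackets, with an uncertain range written as `(low_high)` inside the brackets; the function name `format_repeat` and its parameters are hypothetical.

```python
def format_repeat(prefix, start, unit_length, copies):
    """Build a position-based repeat description (hypothetical helper).

    copies is an int for an exactly determined size, or a (low, high)
    tuple for an estimated size range.
    """
    end = start + unit_length - 1
    # Coding DNA numbering has no position 0, so a unit spanning
    # c.-1/c.1 would need special handling; this sketch refuses it.
    if (start < 0) != (end < 0):
        raise ValueError("unit crosses the c.-1/c.1 boundary; not handled here")
    position = f"{start}_{end}" if unit_length > 1 else f"{start}"
    if isinstance(copies, tuple):
        low, high = copies
        count = f"[({low}_{high})]"
    else:
        count = f"[{copies}]"
    return f"{prefix}.{position}{count}"


# Exactly 79 units of a tri-nucleotide repeat starting at c.-128
print(format_repeat("c", -128, 3, 79))
# Estimated size of between 600 and 800 copies
print(format_repeat("c", -128, 3, (600, 800)))
```

Since only positions are used, the resulting descriptions make no claim about whether the repeat is pure or a mix of GGC and GGA triplets.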
based on the HTT (huntingtin) coding DNA reference sequence (LRG_763t1 (NM_002111.8)) and applying the 3' rule, the Huntington’s Disease tri-nucleotide repeat is described as a GCA (not CAG) repeat.
NOTE: the coding DNA reference sequence (LRG_763t1 (NM_002111.8)) was determined and shown to contain an allele of 21 GCA repeats
NOTE: on protein level the reference allele contains 23 Gln’s, described as p.Gln18[23] (alternatively p.Q18[23]). The difference derives from the fact that the GCA repeat is interrupted by an ACA-triplet (“CAA” coding) at position 20.
the coding DNA reference sequence (LRG_763t1 (NM_002111.8)) was determined and shown to contain a tri-nucleotide allele of 21 GCA, 1 ACA, 2 GCC, 1 ACC and 10 GCC-repeats.
NOTE: when the sequence was not determined, but the repeat estimated based on PCR fragment size, the description is c.(54_56;117_119;120_122;126_128;129_131)
a complex repeated sequence has a first unit located from position g.456 to g.457, present in 4 copies, a second unit from position g.466 to g.468, present in 9 copies, and a third unit (mono-nucleotide) starting at position g.490, present in 12 copies.
different genomic (g.) and coding DNA (c.) descriptions
NC_000001.10:g.57832719ATAAA and NM_021080.3:c.-136-75952ATTTT describe the same repeat allele in intron 3 of the DAB1 gene. NOTE: based on the 3’ rule and the transcriptional orientation of the gene (minus strand) the description of the repeat unit differs.
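Why the unit reads ATAAA on the genomic reference but ATTTT on the transcript can be checked with a short sketch: the transcript is on the minus strand, so the unit is reverse-complemented, and the 3' rule can then change where the unit starts, which makes any rotation of the reverse complement a candidate. The function names below are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def rotations(unit):
    """All cyclic rotations of a repeat unit (the same repeat, shifted frame)."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def same_repeat_on_opposite_strand(genomic_unit, transcript_unit):
    """True if the transcript unit is a rotation of the reverse complement."""
    return transcript_unit in rotations(reverse_complement(genomic_unit))


print(reverse_complement("ATAAA"))                        # TTTAT
print(same_repeat_on_opposite_strand("ATAAA", "ATTTT"))   # True
```

The check only shows that the two units are compatible; which rotation is actually reported depends on the flanking sequence and the 3' rule.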
Intron 9 of the CFTR gene ends with the sequence ...tgtgtgtgtgtttttttaacag[exon_10]. Both the TG and T stretches are variable in length (from 9 to 13 and from 5 to 9 copies, respectively). The reference sequence has 11 TG copies and 7 T's. Is it correct to describe an allele as c.1210-14TGT or, for the T stretch, as c.1210-6T?
A complex case. First note that, applying the 3' rule, it is a variable GT stretch and not a TG stretch: the first T of the T-stretch completes the last GT unit, so the T-stretch following the GT repeat is one T shorter. When the coding DNA reference sequence has TG11 followed by T7, the reference allele is described as c.1210-33GT[11]1210-11T[6] (11 GT units starting at c.1210-33, followed by 6 T's starting at c.1210-11). When only variability of the T-stretch is reported, the reference allele is described as c.1210-12[7]. To indicate the overall variability found in the population the description is c.1210-33GT[(9_13)]T[(4_8)] for the combined repeat and c.1210-12[(5_9)] for the T-stretch.
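The effect of the 3' rule on this example can be reproduced with a minimal sketch. It assumes a toy sequence built only from the stated reference counts (11 TG, 7 T, then aacag up to the exon boundary) and ignores the upstream intron sequence, which is not given here; `shift_repeat_3prime` and `intron_position` are hypothetical helper names.

```python
def shift_repeat_3prime(seq, start, unit_length, copies):
    """Shift a repeat tract as far 3' as possible.

    Returns the new 0-based start index and the repeat unit read there.
    """
    end = start + unit_length * copies  # index just past the tract
    # The tract can move one base 3' while the next base continues the
    # period, i.e. equals the base that drops off at the 5' end.
    while end < len(seq) and seq[end] == seq[start]:
        start += 1
        end += 1
    return start, seq[start:start + unit_length]


# Toy reference: TG11, T7, then "aacag" directly before exon 10 (assumption:
# nothing upstream of the TG stretch is modelled).
seq = "TG" * 11 + "T" * 7 + "AACAG"


def intron_position(index):
    """Offset relative to the first exon base; the last intron base is -1."""
    return index - len(seq)


start, unit = shift_repeat_3prime(seq, 0, 2, 11)

# Count the T's that remain after the shifted GT tract.
t_start = start + 2 * 11
t_copies = 0
while t_start + t_copies < len(seq) and seq[t_start + t_copies] == "T":
    t_copies += 1

print(unit, intron_position(start))        # GT -33
print(t_copies, intron_position(t_start))  # 6 -11
```

Under these assumptions the shifted unit is GT starting at offset -33, and 6 T's remain from offset -11, in line with the population ranges for the combined repeat (T from 4 to 8) being one lower than for the T-stretch on its own (5 to 9).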