DNA Recommendations

Repeated sequences Variant


Definitions

Repeated sequence
a sequence where, compared to a reference sequence, a segment of one or more nucleotides (the repeat unit) is present several times, one after the other.

Description

Format (repeat position): “prefix”“position_repeat_unit”“[”“copy_number”“]”, e.g. g.123_125[36]

  • “prefix” = reference sequence used = g.
  • “position_repeat_unit” = position (range) first repeat copy = 123_125
  • [ = opening symbol for allele = [
  • “copy_number” = number of repeat units = 36
  • ] = closing symbol for allele = ]

Format (sequence): “prefix”“position_repeat_start”“repeat_sequence”“[”“copy_number”“]”, e.g. g.123GGC[36]

  • “prefix” = reference sequence used = g.
  • “position_repeat_start” = position first nucleotide repeat unit = 123
  • “repeat_sequence” = nucleotide sequence repeat copy = GGC
  • [ = opening symbol for allele = [
  • “copy_number” = number of repeat units = 36
  • ] = closing symbol for allele = ]
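The two formats above can be told apart mechanically. The following is a minimal sketch, assuming a hypothetical helper `parse_repeat` and a regular expression written for this illustration only; it handles just the simple single-unit forms, without a reference sequence accession, uncertain copy numbers or composite repeats.

```python
import re

# Hypothetical pattern for illustration; not part of the recommendations.
# A position may be plain (123), upstream (-128), downstream (*45)
# or intronic (1210-12).
_POS = r"[-*]?\d+(?:[+-]\d+)?"
REPEAT_RE = re.compile(
    rf"^(?P<prefix>[gmcn])\."      # reference sequence type
    rf"(?P<start>{_POS})"          # first nucleotide of the first repeat copy
    rf"(?:_(?P<end>{_POS}))?"      # last nucleotide (position format only)
    rf"(?P<unit>[ACGT]+)?"         # repeat sequence (sequence format only)
    rf"\[(?P<copies>\d+)\]$"       # copy number between square brackets
)


def parse_repeat(description):
    """Split a simple repeat description into its named elements."""
    match = REPEAT_RE.match(description)
    if match is None:
        raise ValueError(f"not a simple repeat description: {description!r}")
    fields = match.groupdict()
    fields["copies"] = int(fields["copies"])
    # The presence of a repeat sequence distinguishes the two formats.
    fields["format"] = "sequence" if fields["unit"] else "position"
    return fields


if __name__ == "__main__":
    print(parse_repeat("g.123_125[36]"))
    print(parse_repeat("g.123GGC[36]"))
```

Positions are kept as strings because coding DNA positions such as -128 or 1210-12 are not plain integers.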

Note

  • reference sequences accepted are g., m., c. and n. (genomic, mitochondrial, coding DNA and non-coding DNA).
  • repeated sequences include both small (mono-, di-, tri-, etc., nucleotide) and large (kilobase-sized) repeats.
  • the format based on repeat position is preferred; descriptions including the repeat sequence quickly become too lengthy.
    • NOTE: for clarity, in the protein coding region do not use descriptions like NM_024312.4:c.2686[9] and NM_024312.4:c.1738TA[6]; use NM_024312.4:c.2693dupA and NM_024312.4:c.1741_1742insTATATATA instead.
  • the format g.123_124TG[4] should not be used; it contains redundant information (“123_124” and “TG”).
  • while g.123CAG[23] describes a repeated sequence containing 23 CAG units, g.123_125[23] describes a tri-nucleotide repeated sequence of 23 units which could be interrupted with other units (e.g. a rare CAA). The description g.123CAG[23] can thus only be used when the repeat was sequenced.
  • for composite repeats the basic format can be used, successively listing each different repeat unit, e.g. g.456_467[4]468_494[9]495_503[3].
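The relation between the two formats, and the redundancy of the mixed form g.123_124TG[4], can be sketched as follows. The function name `to_position_format` is hypothetical, and the sketch assumes genomic or mitochondrial descriptions with plain integer positions and a single repeat unit.

```python
import re

# Hypothetical pattern for illustration: g. or m. prefix, integer positions.
_SEQ_RE = re.compile(
    r"^(?P<prefix>[gm])\.(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?P<unit>[ACGT]+)\[(?P<copies>\d+)\]$"
)


def to_position_format(description):
    """Rewrite a sequence-based description in the position-based format."""
    match = _SEQ_RE.match(description)
    if match is None:
        raise ValueError(f"not a sequence-based description: {description!r}")
    if match["end"] is not None:
        # A range plus a repeat sequence says the same thing twice.
        raise ValueError("redundant: give the range or the sequence, not both")
    start = int(match["start"])
    end = start + len(match["unit"]) - 1
    # A mono-nucleotide unit is given by a single position, not a range.
    position = str(start) if end == start else f"{start}_{end}"
    return f"{match['prefix']}.{position}[{match['copies']}]"


if __name__ == "__main__":
    print(to_position_format("g.123GGC[36]"))
    print(to_position_format("g.123TG[14]"))
```

The conversion only goes one way: the position-based result no longer states that every unit has the same sequence, so it cannot be turned back into g.123GGC[36] unless the repeat was sequenced.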

Examples

  • g.123_124[14] (when sequenced, alternatively g.123TG[14])
    a repeated di-nucleotide sequence, with the first unit located from position g.123 to g.124, is present in 14 copies.
    NOTE: when the repeat is variable in the population, sequenced, and the reference sequence has 15 units, the description g.123TG[14] is preferred over g.151_152del
    NOTE: when the repeat is variable in the population, sequenced, and the reference sequence has 15 units, the description g.123TG[17] is preferred over g.149_152dup
  • g.123_124[14];[18] (when sequenced, alternatively g.123TG[14];[18])
    a repeated di-nucleotide sequence, with the first unit located from position g.123 to g.124, is present in 14 copies on one allele and 18 copies on the other allele
  • FMR1 GGC-repeat
    in literature the Fragile-X tri-nucleotide repeat is known as the CGG-repeat. However, based on a coding DNA reference sequence (GenBank NM_002024.5) and applying the 3’ rule, the repeat has to be described as a GGC-repeat (see Recommendations).
    • c.-128_-126[79]
      an extended repeat of exactly 79 units
      NOTE: c.-128GGC[79] can only be used when the repeat has been sequenced, excluding the possibility that it is interrupted by one or more GGA-triplets
    • c.-128_-126[(600_800)]
      the repeated tri-nucleotide sequence, starting at position c.-128, has an estimated size of between 600 and 800 copies.
      NOTE: the repeat can be pure or a mix of GGC and GGA triplets.
  • HD GCA-repeat
    based on the HTT (huntingtin) coding DNA reference sequence (GenBank LRG_763t1 (NM_002111.8)) and applying the 3’ rule, the Huntington’s Disease tri-nucleotide repeat is described as a GCA (not CAG) repeat.
    • c.54GCA[21]
      NOTE: the coding DNA reference sequence (LRG_763t1 (NM_002111.8)) was determined and shown to contain an allele of 21 GCA repeats
      NOTE: on protein level the reference allele contains 23 Gln’s, described as p.Gln18[23] (alternatively p.Q18[23]). The difference derives from the fact that the GCA repeat is interrupted by an ACA-triplet (“CAA” coding) at position 20.
    • c.54GCA[21]ACA[1]GCC[2]ACC[1]GCC[10]
      the coding DNA reference sequence (LRG_763t1 (NM_002111.8)) was determined and shown to contain a tri-nucleotide allele of 21 GCA, 1 ACA, 2 GCC, 1 ACC and 10 GCC-repeats.
      NOTE: when the sequence was not determined, but the repeat estimated based on PCR fragment size, the description is c.(54_56;117_119;120_122;126_128;129_131)[35]
  • g.456_457[4]466_468[9]490[12] (when sequenced, alternatively g.456TG[4]TAA[9]T[12])
    a complex repeated sequence has a first unit located from position g.456 to g.457, present in 4 copies, a second unit from position g.466 to g.468 present in 9 copies and a third unit (mono-nucleotide) starting at position g.490 present in 12 copies.
  • different genomic (g.) and coding DNA (c.) descriptions
    NC_000001.10:g.57832719ATAAA[15] and NM_021080.3:c.-136-75952ATTTT[15] describe the same repeat allele in intron 3 of the DAB1 gene. NOTE: based on the 3’ rule and the transcriptional orientation of the gene (minus strand) the description of the repeat unit differs.
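Two pieces of position arithmetic in the examples above can be checked with a short sketch. Both function names are hypothetical, and both assume that the repeat units follow each other without gaps; the second also assumes a genomic (g.) description and a change that fits within the reference tract.

```python
def first_copy_ranges(start, units):
    """Return (first, last) of the first copy of each successive unit.

    units is a list of (repeat_sequence, copy_number) pairs.
    """
    ranges = []
    position = start
    for sequence, copies in units:
        ranges.append((position, position + len(sequence) - 1))
        position += len(sequence) * copies  # skip the whole stretch
    return ranges


def indel_equivalent(start, unit_size, ref_copies, allele_copies):
    """Return the del/dup description matching a changed copy number."""
    difference = allele_copies - ref_copies
    if difference == 0 or abs(difference) > ref_copies:
        raise ValueError("no simple del/dup equivalent")
    tract_end = start + unit_size * ref_copies - 1
    # By the 3' rule the change is placed on the last units of the tract.
    first = tract_end - unit_size * abs(difference) + 1
    kind = "dup" if difference > 0 else "del"
    return f"g.{first}_{tract_end}{kind}"


if __name__ == "__main__":
    # HTT reference allele: c.54GCA[21]ACA[1]GCC[2]ACC[1]GCC[10]
    htt = [("GCA", 21), ("ACA", 1), ("GCC", 2), ("ACC", 1), ("GCC", 10)]
    ranges = first_copy_ranges(54, htt)
    total = sum(copies for _, copies in htt)
    joined = ";".join(f"{first}_{last}" for first, last in ranges)
    print(f"c.({joined})[{total}]")

    # Di-nucleotide repeat at g.123 with 15 units in the reference
    print(indel_equivalent(123, 2, 15, 14))
    print(indel_equivalent(123, 2, 15, 17))
```

For the HTT allele this reproduces the size-based description c.(54_56;117_119;120_122;126_128;129_131)[35], and for the di-nucleotide repeat it gives g.151_152del and g.149_152dup, the descriptions that g.123TG[14] and g.123TG[17] are preferred over.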

Q&A

Intron 9 of the CFTR gene ends with the sequence ...tgtgtgtgtgtttttttaacag[exon_10]. Both the TG and T stretches are variable in length (from 9 to 13 and 5 to 9 resp.). The reference sequence has 11 TG copies and 7 T's. Is it correct to describe an allele as c.1210-14TG[13]T[5] or for the T stretch as c.1210-6T[5]?

A complex case. First note that by applying the 3' rule it is a variable GT and not TG stretch. When the coding DNA reference sequence has TG11 followed by T7, the reference allele is described as c.1210-33GT[11]1210-11[6]. When only variability of the T-stretch is reported, the reference allele is described as c.1210-12[7].
To indicate the overall variability found in the population the description is c.1210-33GT[(9_13)]T[(4_8)] for the combined repeat and c.1210-12[(5_9)] for the T-stretch.
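The shift from TG to GT in the answer can be reproduced with a small sketch of the 3' rule. The function names are hypothetical, and the reference string is an assumption built from the figures stated in the question (11 TG copies, 7 T's, then aacag before exon 10), since the quoted intron sequence is abbreviated.

```python
def shift_3prime(sequence, start, unit_len, copies):
    """Move a tandem repeat to its most 3' position (0-based start index)."""
    tract_len = unit_len * copies
    # The tract can move one nucleotide 3' as long as the nucleotide that
    # follows it equals the nucleotide that would be left behind.
    while (start + tract_len < len(sequence)
           and sequence[start + tract_len] == sequence[start]):
        start += 1
    return start, sequence[start:start + unit_len]


def run_length(sequence, start):
    """Count how often the nucleotide at start is repeated from there on."""
    count = 0
    while start + count < len(sequence) and sequence[start + count] == sequence[start]:
        count += 1
    return count


# Assumed end of the intron, reconstructed from the numbers in the question.
reference = "tg" * 11 + "t" * 7 + "aacag"


def intron_offset(index):
    """Offset relative to c.1210; the last intron nucleotide is -1."""
    return index - len(reference)


if __name__ == "__main__":
    gt_start, unit = shift_3prime(reference, 0, 2, 11)
    t_start = gt_start + 2 * 11
    t_copies = run_length(reference, t_start)
    print(f"{unit.upper()}[11] starts at c.1210{intron_offset(gt_start)}")
    print(f"T[{t_copies}] starts at c.1210{intron_offset(t_start)}")
```

Under these assumptions the GT stretch starts at c.1210-33 and six T's remain from c.1210-11, because the first T of the original T-stretch is absorbed into the last GT unit.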