DNA Recommendations

Repeated sequences Variant


Definitions

Repeated sequence
a sequence where, compared to a reference sequence, a segment of one or more nucleotides (the repeat unit) is present several times, one after the other.

Description

Format (unique repeat): “prefix”“position_first_nucleotide_first_repeat_unit”“repeat_sequence”[“copy_number”], e.g. g.123CAG[16]

  • “prefix” = reference sequence used = g.
  • “position_first_nucleotide_first_repeat_unit” = first nucleotide of first repeat unit = 123
  • “repeat_sequence” = sequence repeat unit = CAG
  • [ = opening symbol for copy number allele = [
  • “copy_number” = number of repeat units = 16
  • ] = closing symbol for copy number allele = ]
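
As a rough illustration of the format above, the following Python sketch splits a unique-repeat description into its parts. The regular expression and the name `parse_unique_repeat` are assumptions made for this example and are not part of the recommendations. The sketch handles only simple positions (optionally with a leading `-` or `*` and an intron offset) and a plain integer copy number.

```python
import re

# Hypothetical pattern for illustration only: prefix, position of the first
# nucleotide of the first repeat unit, repeat unit, copy number in brackets.
UNIQUE_REPEAT = re.compile(
    r"^(?P<prefix>[gmcn])\."
    r"(?P<position>[-*]?\d+(?:[-+]\d+)?)"
    r"(?P<unit>[ACGT]+)"
    r"\[(?P<copies>\d+)\]$"
)

def parse_unique_repeat(description):
    """Split e.g. 'g.123CAG[16]' into prefix, position, unit and copy number."""
    match = UNIQUE_REPEAT.match(description)
    if match is None:
        raise ValueError(f"not a simple unique-repeat description: {description!r}")
    parts = match.groupdict()
    parts["copies"] = int(parts["copies"])  # copy number as an integer
    return parts

print(parse_unique_repeat("g.123CAG[16]"))
```

A reference sequence identifier such as `NC_000014.8:` would have to be stripped before calling this helper, since the pattern starts at the prefix.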

Format (mixed repeat): “prefix”“range_repeated_sequence”“repeat_sequence_unit1”[“copy_number”]“repeat_sequence_unit2”[“copy_number”], e.g. g.123_191CAG[19]CAA[4]

  • “prefix” = reference sequence used = g.
  • “range_repeated_sequence” = position first to last nucleotide repeated sequence (range) = 123_191
  • “repeat_sequence_unit1” = sequence first repeat unit = CAG
  • [ = opening symbol for allele = [
  • “copy_number” = number of repeat units = 19
  • ] = closing symbol for allele = ]
  • “repeat_sequence_unit2” = sequence second repeat unit = CAA
  • [ = opening symbol for allele = [
  • “copy_number” = number of repeat units = 4
  • ] = closing symbol for allele = ]
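
The mixed-repeat format can be taken apart in a similar way. The sketch below, using the hypothetical helpers `parse_mixed_repeat` and `allele_sequence`, lists each unit with its copy number and writes out the nucleotides they describe. It assumes plain integer positions only. In this particular example the expanded sequence and the stated range are both 69 nucleotides long; the sketch does not assume that this holds for every allele.

```python
import re

# Hypothetical patterns for illustration; only plain integer positions are handled.
MIXED_REPEAT = re.compile(r"^([gmcn])\.(\d+)_(\d+)((?:[ACGT]+\[\d+\])+)$")
UNIT = re.compile(r"([ACGT]+)\[(\d+)\]")

def parse_mixed_repeat(description):
    """Return (prefix, start, end, [(unit, copies), ...]) for a mixed repeat."""
    match = MIXED_REPEAT.match(description)
    if match is None:
        raise ValueError(f"not a mixed-repeat description: {description!r}")
    prefix, start, end, body = match.groups()
    units = [(unit, int(copies)) for unit, copies in UNIT.findall(body)]
    return prefix, int(start), int(end), units

def allele_sequence(units):
    """Write out the nucleotides described by the listed units, in order."""
    return "".join(unit * copies for unit, copies in units)

prefix, start, end, units = parse_mixed_repeat("g.123_191CAG[19]CAA[4]")
sequence = allele_sequence(units)
print(units)
print(len(sequence), end - start + 1)
```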

Note

  • reference sequences accepted are g., m., c. and n. (genomic, mitochondrial, coding DNA and non-coding DNA)
    • NOTE: in the protein coding region, repeat descriptions are used only for repeat units whose length is a multiple of 3, i.e. which cannot affect the reading frame. Consequently, use NM_024312.4:c.2692_2693dup and not NM_024312.4:c.2686A[10], and use NM_024312.4:c.1741_1742insTATATATA and not NM_024312.4:c.1738TA[6].
  • for mixed repeats the range of the repeat sequence is given, followed by a listing of each repeat unit and the copy number of each unit, e.g. NC_000012.11:g.112036755_112036823CTG[9]TTG[1]CTG[13].
  • NM_002111.6:c.54GCA[23] describes a sequenced repeat containing 23 GCA units. NM_002111.6:c.54_110(GCA)[23] describes a repeat of 23 units, located from position c.54 to c.110, which was not sequenced and so could be interrupted by other repeat units (e.g. ACA).
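
The rule for the protein coding region in the note above reduces to a length check on the repeat unit. A minimal sketch follows; the function name `repeat_format_allowed` is hypothetical, and the caller is assumed to know whether the repeat lies in the protein coding region, since the sketch does not derive that from the position.

```python
def repeat_format_allowed(unit, in_coding_region):
    """Inside the protein coding region, the repeat format is used only
    when the unit length is a multiple of 3; elsewhere no length limit
    is applied by this sketch."""
    if not in_coding_region:
        return True
    return len(unit) % 3 == 0

# Units taken from the examples in this section.
print(repeat_format_allowed("A", in_coding_region=True))     # False
print(repeat_format_allowed("TA", in_coding_region=True))    # False
print(repeat_format_allowed("GCA", in_coding_region=True))   # True
print(repeat_format_allowed("TG", in_coding_region=False))   # True
```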

Examples

  • NC_000014.8:g.101179660TG[14] (g.101179660(TG)[14] when not sequenced)
    a repeated TG di-nucleotide sequence starting at position g.101179660 on human chromosome 14, with 14 TG copies
  • NC_000014.8:g.101179660TG[14];[18] (g.101179660(TG)[14];[18] when not sequenced)
    a repeated TG di-nucleotide sequence starting at position g.101179660 on human chromosome 14, is present in 14 TG copies on one allele and 18 TG copies on the other allele
  • FMR1 repeat
    in literature the Fragile-X tri-nucleotide repeat is known as the CGG-repeat. However, based on a coding DNA reference sequence (GenBank NM_002024.5) and applying the 3’ rule, the repeat has to be described as a GGC-repeat (see Recommendations)
    • NM_002024.5:c.-128GGC[79]
      a GGC tri-nucleotide repeat starting at position c.-128 contains 79 units
      NOTE: NM_002024.5:c.-128GGC[79] can only be used when the repeat has been sequenced, ruling out that it is interrupted by one or more GGA triplets
    • NM_002024.5:c.-128(GGC)[(600_800)]
      a GGC tri-nucleotide repeat starting at position c.-128, not sequenced, has an estimated size of between 600 and 800 copies
      NOTE: the repeat can be pure or a mix of GGC and GGA triplets
  • HD repeat
    in literature the Huntington’s Disease tri-nucleotide repeat, encoding a poly-Gln repeat on protein level, is known as the CAG repeat. However, based on the HTT (huntingtin) coding DNA reference sequence (GenBank LRG_763t1 or NM_002111.8) and applying the 3’ rule, the repeat has to be described as a GCA repeat
    • LRG_763t1:c.54GCA[23]
      a sequenced GCA tri-nucleotide repeat starting at position c.54 contains 23 units, on protein level described as NP_002102.4:p.(Gln18)[25] (NOTE: the GCA repeat is followed by ACAG extending the encoded Gln-repeat by 2)
    • LRG_763t1:c.54_149GCA[23]ACA[1]GCC[2]ACC[1]GCC[10]
      the allele sequenced from position c.54 to c.149 contains 23 GCA, 1 ACA, 2 GCC, 1 ACC and 10 GCC repeats
  • NC_000012.11:g.112036755_112036823CTG[9]TTG[1]CTG[13]
    a complex repeated sequence from position g.112036755 to g.112036823 on chromosome 12 with first a CTG unit present in 9 copies, then a TTG unit present in 1 copy and then a CTG unit present in 13 copies
  • different genomic (g.) and coding DNA (c.) descriptions
    NC_000001.10:g.57832719ATAAA[15] and NM_021080.3:c.-136-75952ATTTT[15] describe the same repeat allele in intron 3 of the DAB1 gene
    NOTE: based on the 3’ rule and the transcriptional orientation of the gene (minus strand) the description of the repeat units differs
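
The relation between the two repeat units in the last example can be checked with a short sketch. It assumes only what the note states: the gene lies on the minus strand, so the coding DNA unit is read from the reverse complement of the genomic unit, and a tandem repeat can be read in any rotation of its unit. The helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Complement each nucleotide, then reverse the order."""
    return sequence.translate(COMPLEMENT)[::-1]

def rotations(unit):
    """All cyclic shifts of a repeat unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}

genomic_unit = "ATAAA"   # from NC_000001.10:g.57832719ATAAA[15]
coding_unit = "ATTTT"    # from NM_021080.3:c.-136-75952ATTTT[15]

print(reverse_complement(genomic_unit))                             # TTTAT
print(coding_unit in rotations(reverse_complement(genomic_unit)))   # True
```

Which rotation is actually used in a description depends on the position of the repeat in the reference sequence; this sketch only confirms that the two units are compatible.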

Q&A

Intron 9 of the CFTR gene ends with the sequence ...tgtgtgtgtgtttttttaacag[exon_10]. Both the TG and T stretches are variable in length (from 9 to 13 and 5 to 9 resp.). The reference sequence has 11 TG copies and 7 T's. Is it correct to describe an allele as c.1210-14TG[13]T[5] or for the T stretch as c.1210-6T[5]?

A complex case. First note that, by applying the 3' rule, it is a variable GT stretch and not a TG stretch. When the coding DNA reference sequence has TG11 followed by T7, the reference allele is described as c.1210-33_1210-6GT[11]T[6]. When only variability of the T-stretch is reported, the reference allele is described as c.1210-12T[7].
To indicate the overall variability found in the population the description is c.1210-33_1210-6GT[9_13]T[4_8] for the combined repeat and c.1210-12T[5_9] for the T-stretch.
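
The positions in this answer can be reproduced by counting back from the exon. The sketch below rebuilds the end of intron 9 from the question (11 TG copies, 7 T's, then aacag) and reads the stretch as GT units, the reading the answer arrives at by applying the 3' rule. The helper names and the hard-coded exon start c.1210 are assumptions for this example only.

```python
def count_units(sequence, start, unit):
    """Count consecutive copies of `unit` beginning at index `start`."""
    copies = 0
    while sequence.startswith(unit, start + copies * len(unit)):
        copies += 1
    return copies

# End of intron 9 as given in the question; exon 10 follows directly.
intron_end = "TG" * 11 + "T" * 7 + "AACAG"

def intron_position(index):
    """Intronic c. position of a 0-based index, counted back from c.1210."""
    return f"1210-{len(intron_end) - index}"

# Combined repeat, read as GT units followed by the remaining T's.
gt_start = intron_end.index("GT")
gt_copies = count_units(intron_end, gt_start, "GT")
t_start = gt_start + 2 * gt_copies
t_copies = count_units(intron_end, t_start, "T")
last = t_start + t_copies - 1
combined = (f"c.{intron_position(gt_start)}_{intron_position(last)}"
            f"GT[{gt_copies}]T[{t_copies}]")

# T stretch on its own: the T of the last GT unit also belongs to it.
run_start = t_start - 1
t_only = f"c.{intron_position(run_start)}T[{count_units(intron_end, run_start, 'T')}]"

print(combined)  # c.1210-33_1210-6GT[11]T[6]
print(t_only)    # c.1210-12T[7]
```

The combined description counts 6 T's rather than 7 because the first T of the stretch is already part of the last GT unit.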