Clade and lineage nomenclature, July 4, 2020

Clade and lineage nomenclature aids in genomic epidemiology studies of active hCoV-19 viruses

Due to the naturally expanding genetic diversity of hCoV-19 viruses, GISAID introduced a nomenclature system for major clades, developed by Sebastian Maurer-Stroh et al, based on marker mutations within 6 high-level phylogenetic groupings from the early split of S and L, to the further evolution of L into V and G and later of G into GH and GR.

GISAID clades are augmented with more detailed lineages assigned by the Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) tool, aiding in the understanding of patterns and determinants of the global spread of the pandemic strain causing COVID-19. A third effort uses a Year-Letter nomenclature to facilitate discussion of large-scale diversity patterns of hCoV-19 and label clades that persist for at least several months and have significant geographic spread.

Clade definitions in GISAID are informed by the statistical distribution of genome distances in phylogenetic clusters (see Han et al, 2019) followed by merging of smaller lineages into major clades based on shared marker variants. Instead of generic letters A, B, C we chose actual letters of marker mutations (alphabet for non-synonymous and number for synonymous substitutions) to make the system more tangible and specific for this virus. For example, S-D614G is one of several genetic markers characterizing a new clade that rose sharply since February 2020 and the letter G was chosen as its name then. Clade and name extensions are triggered when a clade can be subdivided as described above. Another unique feature is to allow dropping of letters and numbers in front to avoid monotonous letter/number chains as used in other viruses. Using specific combinations of 9 genetic markers, 95% of the hCOV-19 data in GISAID can be classified in 6 size-balanced clades. For example, starting with S and L (initiated also by other research groups Tang et al, 2020), S continued at moderate levels and L split into initially equal G and V versions with G reaching 50% in March 2020 and splitting further into GR and GH. The list of the 9 marker variants is as follows:

   S:    C8782T,T28144C includes NS8-L84S
   L:    C241,C3037,A23403,C8782,G11083,G25563,G26144,T28144,G28882 (WIV04-reference sequence)
   V:    G11083T,G26144T NSP6-L37F + NS3-G251V
   G:    C241T,C3037T,A23403G includes S-D614G
   GH: C241T,C3037T,A23403G,G25563T includes S-D614G  + NS3-Q57H
   GR:  C241T,C3037T,A23403G,G28882A includes S-D614G  + N-G204R

Clade definitions in GISAID are augmented with more detailed lineages assigned, e.g. by the Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) tool by Rambaut et al, an additional effort aiding in the understanding of patterns and determinants of the global spread of the pandemic strain causing COVID-19. For each strain the clade information is provided in the “Virus detail” section of the metadata. Lineages assigned by this software tool, are characterized by a combination of genetic and epidemiological support. This hierarchical, dynamic nomenclature describes a lineage as a cluster of sequences seen in a geographically distinct region with evidence of ongoing transmission in that region. Multiple sources of information are taken into account, including phylogenetic information as well as a variety of metadata associated with that sequence. The finer scale of this nomenclature system can help tease apart outbreak investigations and as rates of international travel increases will facilitate tracking viral imports across the globe.

Another effort, here by Hodcroft et al, uses a Year-Letter nomenclature to facilitate discussion of large-scale diversity patterns of hCoV-19 and label clades that persist for at least several months and have significant geographic spread. Each clade name consists of the year when the clade emerged and a capital letter starting with A for each year. Clades are defined by signature mutations. New major clades are named once the frequency of a clade exceeds 20% in a representative global sample and that clade differs in at least two positions from its parent clade, currently using the clades 19A, 19B, 20A, 20B, and 20C. 


Figure 1. Rough correspondence of different clade nomenclature systems in format: Rambaut et al (GISAID) Hodcroft et al

In summary, while PANGOLIN lineages provide more detailed outbreak cluster information, the other two nomenclatures offer a large-scale overall view of clade trends and all are in good overall agreement.