Sunday, March 21, 2010

Beyond linear sequence comparisons: the use of genome-level characters for phylogenetic reconstruction

The first whole genomes to be compared for phylogenetic inference were those of mitochondria, which provided the first sets of genome-level characters for phylogenetic reconstruction. Most powerful among these characters have been comparisons of the relative arrangements of genes, which have convincingly resolved numerous branching points, including some that had remained recalcitrant even to very large molecular sequence comparisons. Now the world faces a tsunami of complete nuclear genome sequences. In addition to the tremendous amount of DNA sequence that is becoming available for comparison, there is also the potential for many more genome-level characters to be developed, including the relative positions of introns, the domain structures of proteins, gene family membership, the presence of particular biochemical pathways, aspects of DNA replication or transcription, and many others. These characters can be especially convincing because of their low likelihood of reverting to a primitive condition or occurring independently in separate lineages, so reducing the occurrence of homoplasy. The comparisons of organelle genomes pioneered the way for the use of such features for phylogenetic reconstructions, and it is almost certainly true, as ever more genomic sequence becomes available, that further use of genome-level characters will play a big role in outlining the relationships among major animal groups.
13.1 Why do we need anything other than molecular sequence comparisons?

Over the past few decades, the comparison of nucleotide and amino acid sequences has revolutionized our understanding of evolutionary relationships for many groups of organisms. The broader field of systematics has been reinvigorated and a generation of evolutionary biologists have come to accept that molecular sequence comparisons are an essential component for inferring the phylogeny of any group. These studies have led to extensive revision of animal systematics and to the overturning of previous reliance on features of the coelom and segmentation (Adoutte et al., 1999).
In the 1980s, when comparing molecular sequences for phylogenetic inference was first becoming common, some asserted with great confidence that all evolutionary relationships would soon be convincingly resolved solely with this type of data, leading to much consternation. However, some of the relationships that were equivocal in early molecular studies have remained highly recalcitrant even with many more DNA sequence data in hand. There are several potential explanations, including:

1. Multiple nucleotide or amino acid substitutions may have occurred at a single site, obscuring any accumulated signal.



ANIMAL EVOLUTION



2. Convergent or parallel substitutions may have occurred among different lineages due to having only four (for nucleotides) or 20 (for amino acids) possible character states, exacerbated by convergent biases in base composition (Naylor and Brown, 1998), which may even cause ever-increasing confidence measures for incorrect associations with ever larger data sets (Phillips et al., 2004).
3. The analysis may show artefactual association of the more rapidly changing lineages (Felsenstein, 1978), including the attraction of long branches to the base of the ingroup in association with the outgroup (which is almost always a long branch; Philippe and Laurent, 1998).
4. In some cases, non-orthologous gene copies may be inadvertently compared among various lineages due to ancestral gene duplications followed by differential losses, or due to incomplete sampling.
5. Differing views of scientists on alignments, exclusion sets, and weighting schemes frequently cannot be arbitrated based on objective criteria and can lead to radically different phylogenetic reconstructions.
6. The most difficult problems are when the time of shared ancestry is short relative to the subsequent time of divergence, where there has been little opportunity to accumulate signal and ample time for it to have been erased.
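
The first explanation in the list above, multiple substitutions at a single site, can be made concrete with a small calculation. The sketch below assumes the Jukes-Cantor model (equal rates among the four nucleotides), which is a simplification chosen here for illustration and is not a model named in the text; the function names are hypothetical. It shows that the observed proportion of differing sites levels off towards 0.75 as the true number of substitutions per site grows, so deep divergences leave little recoverable signal.

```python
import math


def jc_expected_p_distance(d):
    """Expected proportion of differing sites after d substitutions per site
    (Jukes-Cantor assumption: four states, all changes equally likely)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc_corrected_distance(p):
    """Invert the relationship: estimate substitutions per site from an
    observed proportion p of differing sites (defined only for p < 0.75)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) under Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # True divergence keeps growing, but the observable difference saturates.
    for d in (0.1, 0.5, 1.0, 2.0, 5.0):
        print(f"true d = {d:4.1f}  observed p = {jc_expected_p_distance(d):.4f}")
```

Near saturation a tiny change in the observed difference corresponds to a very large change in the inferred distance, which is one way to see why adding more sequence does not always rescue a recalcitrant node.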

Molecular sequence comparison is now a mature field that has influenced the culture of systematics. Many have come to expect that the future of systematics will be dominated by creating ever more sophisticated methods for teasing a weak signal from noisy data. This causes concern that differing preferences for various methods will ensure that no consensus on many evolutionary relationships will ever be reached.
However, an alternative is possible, that there may be other, less explored, types of characters that could be powerful for resolving these contentious relationships. There is no doubt that comparisons of some characters have identified certain robust synapomorphies (shared and derived character states) that have supported long-standing, little-contested evolutionary relationships, such as the monophyly of mammals, tetrapods, and echinoderms. These synapomorphies are subjectively judged to be of characters so unlikely to revert to an earlier condition or to occur multiple times in parallel that they could only have arisen once in the common ancestor of the group. Can new sets of characters be found that would meet these criteria to provide confident resolution of some problematic evolutionary relationships? Although there is a broad range of character types to explore, we will focus here specifically on comparison of features of genomes.

13.2 Comparisons of mitochondrial genomes have laid the foundation

Sequences from mitochondrial genes and genomes have been used extensively for phylogenetic inference, with complete mtDNA sequences being publicly available for more than 1000 animal species. (For a summary of the characteristics of animal mtDNAs, see Boore, 1999.) It has long been argued (e.g. Boore and Brown, 1998) that the relative arrangement of the (normally) 37 genes in animal mitochondrial genomes constitutes an especially powerful type of character for phylogenetic inference, and so constitutes the first set of genome-level features to be used extensively for animal phylogeny. Briefly summarized, these genes are present in nearly all animal groups, are unambiguously homologous, and can potentially be rearranged into an enormous number of states such that convergent rearrangements are very unlikely (and demonstrated to be uncommon). All genes on each strand are transcribed together in cases where it has been studied (Clayton, 1992), so selection on gene arrangements is expected to be minimal. A summary of the evolutionary relationships convincingly demonstrated by these types of data (and in many cases left unresolved by all other studies) is found in Boore (2006), but here are a few of the more significant conclusions of deep-branch phylogenetic relationships: (1) the superphylum Eutrochozoa includes cestode platyhelminths (von Nickisch-Rosenegk et al., 2001) and the phylum Phoronida (Helfenbein and Boore, 2004); (2) Sipuncula is closely related to Annelida rather than to Mollusca (Boore and Staton, 2002); (3) Annelida is more closely related to Mollusca than to Arthropoda (Boore and Brown, 2000);

GENOMES AND PHYLOGENY


(4) Arthropoda is monophyletic and, within this phylum, Crustacea is united with Hexapoda to the exclusion of Myriapoda and Onychophora (Boore et al., 1995, 1998); (5) Pentastomida is not a phylum, but rather a type of crustacean, and joins with Cephalocarida and Maxillopoda to the exclusion of other major crustacean groups (Lavrov et al., 2004).

13.3 Nuclear genomes, a treasure-trove of phylogenetic characters

By a great margin, more DNA sequence is being generated than ever before. Facilities built and techniques developed for sequencing the human genome are now focusing on many other organisms. As recently as a year ago, the nine largest genome sequencing centres (see Table 13.1) collectively produced well over 170 billion nucleotides of DNA sequence per year: approximately 57-fold the coverage of the human genome. With next-generation sequencing platforms now in regular use, that number is exploding. Imminently there will be complete genomes of at least draft quality for many dozens of animals representing a phylogenetically diverse sample and including several equivocally placed lineages (Figure 13.1, Table 13.2).

Table 13.1 URLs for the largest public DNA sequencing centres

DNA sequencing centre | Website
Wellcome Trust Sanger Institute | http://www.sanger.ac.uk/
DOE Joint Genome Institute | http://www.jgi.doe.gov/
Washington University Genome Sequencing Center | http://genome.wustl.edu/
Broad Institute | http://www.broad.mit.edu/
Baylor College of Medicine Genome Center | http://www.hgsc.bcm.tmc.edu/
Beijing Genomics Institute | http://www.genomics.org.cn/en/index.php
Riken Genomic Sciences Center | http://www.gsc.riken.go.jp/
J. Craig Venter Institute | http://www.jcvi.org/
Genoscope | http://www.genoscope.cns.fr/spip/

Figure 13.1 This reconstruction of the major branches of animal evolution is used to plot the numbers of taxa with complete genome sequences done and under way. The taxonomic ranks shown are arbitrary, split for illustration, but not meant to be consistent among the major groups, and taxa listed do not comprehensively cover all of life. Branch length holds no meaning. While opinions may differ on particular genomes as to whether they are complete versus needing more work, and whether they are well enough along to consider them ‘under way’, it is clear that there will soon be a large and phylogenetically broad sampling of genome sequences.

Taxon | Complete | Under way
Mammals | 10 | 40
Birds | 1 | 1
Reptiles | 1 | 1
Amphibians | 1 | 1
Coelacanths | 0 | 2
Bony fish | 5 | 6
Cartilaginous fish | 0 | 2
Jawless fish | 0 | 2
Cephalochordates | 1 | 0
Urochordates | 2 | 1
Hemichordates | 0 | 1
Echinoderms | 1 | 0
Mollusks | 1 | 2
Flatworms | 0 | 3
Annelids | 2 | 0
Arthropods | 22 | 23
Priapulids | 0 | 1
Tardigrades | 0 | 1
Nematodes | 3 | 21
Cnidarians | 1 | 1
Placozoans | 1 | 0
Poriferans | 0 | 1

In these genomic data are many higher-order features, beyond the linear sequences, that constitute genome-level characters that are potentially
useful for phylogenetic reconstruction, including: (1) gene content, including components of multiunit complexes such as the ribosome, spliceosome, DNA replication machinery, or oxidative phosphorylation enzymes and the presence versus absence of particular biochemical pathways (e.g. de Rosa et al., 1999; Fitz-Gibbon and House, 1999; House and Fitz-Gibbon, 2002; Huson and Steel, 2004; Snel et al., 1999, 2005); (2) the relative arrangements of genes (Boore and Brown, 1998); (3) movements of genes among intracellular compartments (i.e. plastid, mitochondrion, nucleus) (e.g. Nugent and Palmer, 1991); (4) insertions of segments of DNA, including transposons and numts (Fukuda et al., 1985; Richly and Leister, 2004); (5) variation in intron positions (e.g. Qiu et al., 1998); (6) secondary structures of rRNAs or tRNAs (e.g. Murrell et al., 2003); (7) details of genome-level processes, such as the rearrangements that generate antibody diversity (Frieder et al., 2006); and (8) deviations from the ‘universal’ genetic code (Telford et al., 2000; Santos, 2004). Many others are likely to be found.

Table 13.2 Nuclear genome sequencing projects that are completely drafted (i.e. not necessarily having every gap closed) or under way, as summarized in Figure 13.1. Some of the taxa listed as under way are currently funded to only low (generally 2X) coverage. There are many other taxa not listed here whose genomes are being investigated at even lower levels of coverage.

Taxonomy | Organism | Common description

COMPLETE GENOMES
Chordata, Mammalia | Bos taurus | Cow
Chordata, Mammalia | Callithrix jacchus | Marmoset
Chordata, Mammalia | Canis familiaris | Dog
Chordata, Mammalia | Mus musculus | Mouse
Chordata, Mammalia | Homo sapiens | Human
Chordata, Mammalia | Macaca mulatta | Rhesus macaque
Chordata, Mammalia | Monodelphis domestica | Opossum
Chordata, Mammalia | Ornithorhynchus anatinus | Duck-billed platypus
Chordata, Mammalia | Pan troglodytes | Chimpanzee
Chordata, Mammalia | Pongo pygmaeus abelii | Orangutan
Chordata, Aves | Gallus gallus | Red jungle fowl
Chordata, Sauria | Anolis carolinensis | Anole lizard
Chordata, Amphibia | Xenopus tropicalis | Western clawed frog
Chordata, Teleostei | Danio rerio | Zebrafish
Chordata, Teleostei | Gasterosteus aculeatus | Stickleback
Chordata, Teleostei | Oryzias latipes | Medaka
Chordata, Teleostei | Takifugu rubripes | Japanese pufferfish
Chordata, Teleostei | Tetraodon nigroviridis | Green spotted pufferfish
Chordata, Cephalochordata | Branchiostoma floridae | Lancelet
Chordata, Urochordata | Ciona intestinalis, C. savignyi | Sea squirts
Echinodermata, Echinozoa | Strongylocentrotus purpuratus | Purple sea urchin
Mollusca, Bivalvia | Lottia gigantea | Owl limpet
Annelida, Oligochaeta | Helobdella robusta | Leech
Annelida, Polychaeta | Capitella capitata | None
Arthropoda, Coleoptera | Tribolium castaneum | Red flour beetle
Arthropoda, Diptera | Aedes aegypti | Yellow fever mosquito
Arthropoda, Diptera | Anopheles gambiae | Malaria mosquito
Arthropoda, Diptera | Culex pipiens | House mosquito
Arthropoda, Diptera | Drosophila ananassae, D. erecta, D. grimshawi, D. melanogaster, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. simulans (8), D. virilis, D. willistoni, D. yakuba | Fruit flies
Arthropoda, Hemiptera | Pediculus humanus corporis | Louse
Arthropoda, Hymenoptera | Apis mellifera | Honeybee
Arthropoda, Hymenoptera | Nasonia vitripennis | Parasitic wasp
Arthropoda, Lepidoptera | Bombyx mori | Silkworm
Arthropoda, Crustacea | Daphnia pulex | Water flea
Arthropoda, Chelicerata | Ixodes scapularis | Deer tick
Nematoda, Chromadorea | Caenorhabditis briggsae, C. elegans, C. remanei | Roundworms
Nematoda, Chromadorea | Pristionchus pacificus | None
Nematoda, Chromadorea | Meloidogyne incognita | Root-knot nematode
Cnidaria, Anthozoa | Nematostella vectensis | Sea anemone
Placozoa | Trichoplax adhaerens | Tablet animal

GENOMES IN PROGRESS
Chordata, Mammalia | Cavia porcellus | Guinea pig
Chordata, Mammalia | Choloepus hoffmanni | Two-toed sloth
Chordata, Mammalia | Cryptomys sp. | Mole-rat
Chordata, Mammalia | Cynocephalus volans | Flying lemur
Chordata, Mammalia | Dasypus novemcinctus | Nine-banded armadillo
Chordata, Mammalia | Dipodomys panamintinus | Kangaroo rat
Chordata, Mammalia | Echinops telfairi | Lesser hedgehog tenrec
Chordata, Mammalia | Elephantulus sp. | Elephant shrew
Chordata, Mammalia | Equus caballus | Horse
Chordata, Mammalia | Erinaceus europaeus | Western European hedgehog
Chordata, Mammalia | Felis catus | Cat
Chordata, Mammalia | Gorilla gorilla | Gorilla
Chordata, Mammalia | Loxodonta africana | African elephant
Chordata, Mammalia | Macaca fascicularis | Crab-eating macaque
Chordata, Mammalia | Macropus eugenii | Wallaby
Chordata, Mammalia | Manis pentadactyla | Chinese pangolin
Chordata, Mammalia | Microcebus murinus | Mouse lemur
Chordata, Mammalia | Mustela putorius furo | Ferret
Chordata, Mammalia | Myotis lucifugus | Little brown bat
Chordata, Mammalia | Nomascus leucogenys | Gibbon
Chordata, Mammalia | Ochotona princeps | Pika
Chordata, Mammalia | Oryctolagus cuniculus | European rabbit
Chordata, Mammalia | Otolemur garnettii | Bushbaby
Chordata, Mammalia | Pan paniscus | Bonobo
Chordata, Mammalia | Papio anubis | Baboon
Chordata, Mammalia | Peromyscus californicus, P. leucopus, P. maniculatus, P. polionotus | Mice
Chordata, Mammalia | Procavia capensis | Hyrax
Chordata, Mammalia | Pteropus vampyrus | Flying fox
Chordata, Mammalia | Saimiri sp. | Squirrel monkey
Chordata, Mammalia | Sorex araneus | European common shrew
Chordata, Mammalia | Spermophilus tridecemlineatus | Ground squirrel
Chordata, Mammalia | Sus scrofa | Pig
Chordata, Mammalia | Tarsius syrichta | Tarsier
Chordata, Mammalia | Tenrec ecaudatus | Common tenrec
Chordata, Mammalia | Tupaia belangeri | Tree shrew
Chordata, Mammalia | Tursiops truncatus | Dolphin
Chordata, Mammalia | Vicugna pacos | Alpaca
Chordata, Aves | Taeniopygia guttata | Zebra finch
Chordata, Testudines | Chrysemys picta | Painted turtle
Chordata, Amphibia | Xenopus laevis | African clawed frog
Chordata, Coelacanthiformes | Latimeria chalumnae | South African coelacanth
Chordata, Coelacanthiformes | Latimeria menadoensis | Indonesian coelacanth
Chordata, Teleostei | Astatotilapia burtoni | Tilapia
Chordata, Teleostei | Lepisosteus oculatus | Spotted gar
Chordata, Teleostei | Metriaclima zebra | Tilapia
Chordata, Teleostei | Oreochromis niloticus | Tilapia
Chordata, Teleostei | Paralabidochromis chilotes | Tilapia
Chordata, Teleostei | Salmo salar | Atlantic salmon
Chordata, Chondrichthyes | Callorhinchus milii | Elephant shark
Chordata, Chondrichthyes | Raja erinacea | Skate
Chordata, Hyperotreti | Eptatretus burgeri | Hagfish
Chordata, Hyperoartia | Petromyzon marinus | Sea lamprey
Chordata, Urochordata | Oikopleura dioica | Tunicate
Hemichordata, Enteropneusta | Saccoglossus kowalevskii | Acorn worm
Mollusca, Gastropoda | Aplysia californica | Sea hare
Mollusca, Gastropoda | Biomphalaria glabrata | Snail
Platyhelminthes, Cestoda | Echinococcus multilocularis | Tapeworm
Platyhelminthes, Cestoda | Taenia solium | Pork tapeworm
Platyhelminthes, Turbellaria | Schmidtea mediterranea | Flatworm
Platyhelminthes, Trematoda | Schistosoma mansoni, S. japonicum | Blood flukes (schistosomes)
Arthropoda, Diptera | Drosophila americana, D. auraria, D. equinoxialis, D. hydei, D. littoralis, D. mercatorum, D. mimica, D. miranda, D. novamexicana, D. repleta, D. silvestris | Fruit flies
Arthropoda, Diptera | Glossina morsitans | Tsetse fly
Arthropoda, Diptera | Lutzomyia longipalpis | Sand fly
Arthropoda, Diptera | Phlebotomus papatasi | Sand fly
Arthropoda, Hemiptera | Acyrthosiphon pisum | Pea aphid
Arthropoda, Hemiptera | Rhodnius prolixus | Kissing bug
Arthropoda, Hymenoptera | Nasonia giraulti, N. longicornis | Parasitic wasps
Arthropoda, Crustacea | Jassa slatteryi | Amphipod
Arthropoda, Crustacea | Parhyale hawaiensis | Amphipod
Arthropoda, Chelicerata | Limulus polyphemus | Horseshoe crab
Arthropoda, Chelicerata | Tetranychus urticae | Spider mite
Arthropoda, Myriapoda | Strigamia maritima | Centipede
Priapula | Priapulus caudatus | Priapulid worm
Tardigrada | Hypsibius dujardini | Water bear
Nematoda, Chromadorea | Ancylostoma caninum | Canine hookworm
Nematoda, Chromadorea | Ascaris lumbricoides | Human intestinal roundworm
Nematoda, Chromadorea | Brugia malayi | Filarial roundworm
Nematoda, Chromadorea | Caenorhabditis brenneri, C. japonica | None
Nematoda, Chromadorea | Cooperia oncophora | Intestinal worm
Nematoda, Chromadorea | Dictyocaulus viviparus | Bovine lungworm
Nematoda, Chromadorea | Haemonchus contortus | Barber pole worm
Nematoda, Chromadorea | Heterorhabditis bacteriophora | None
Nematoda, Chromadorea | Necator americanus | New World hookworm
Nematoda, Chromadorea | Nematodirus battus | Thread-necked worm
Nematoda, Chromadorea | Nippostrongylus brasiliensis | Rat intestinal nematode
Nematoda, Chromadorea | Oesophagostomum dentatum | Nodule worm
Nematoda, Chromadorea | Onchocerca volvulus | River blindness roundworm
Nematoda, Chromadorea | Ostertagia ostertagi | Stomach worm
Nematoda, Chromadorea | Strongyloides ratti | Threadworm
Nematoda, Chromadorea | Teladorsagia circumcincta | Brown stomach worm
Nematoda, Chromadorea | Trichostrongylus vitrinus | Black scour worm
Nematoda, Enoplia | Trichinella spiralis | Trichina worm
Nematoda, Enoplia | Trichuris muris | Whipworm
Cnidaria, Hydrozoa | Hydra magnipapillata | Hydra
Porifera, Demosponge | Reniera sp. | None

Of course, the reliability of these features can only be assessed by study of their consistency with other characters, and several are already suspect. Convergent gene losses, for example, may be common as organisms independently evolve smaller genomes or no longer experience selection for maintaining a particular biochemical pathway; in contrast, convergent gain of genes seems much less likely. Independent evolution of smaller genomes may also lead to parallel losses of the most expendable structures in RNA or protein genes. There is a certain time-horizon that limits the usefulness of any particular type of character; for example, once retro-elements degrade in sequence beyond the point where the insertion can be reliably inferred to be of single origin, the insertion is no longer useful as a phylogenetic character. Certain changes in the genetic code and in tRNA secondary structures of mitochondria are known to have occurred convergently (although occasional homoplasy has not disqualified the use of either morphological characters or molecular sequence comparisons). There is also the problem in the case of closely spaced sequential internodes where random partitioning of polymorphisms, including those of genome-level characters, can lead to incorrect inference of phylogeny (e.g. Salem et al., 2003). See Boore (2006) for additional caveats and precautions.
Already there have been important insights gained from comparing such features, including: (1) tarsiers have been shown to be the sister group to the clade of monkeys and apes rather than to the prosimians based on patterns of short interspersed nuclear element (SINE) integration (Schmitz et al., 2001); (2) patterns of SINE and long interspersed nuclear element (LINE) insertions have also supported the monophyly of toothed plus baleen whales, that hippopotamuses are the sister group to cetaceans, that camels are the most basal cetartiodactyls (Nikaido et al., 1999), and that river dolphins are paraphyletic (Nikaido et al., 2001); (3) animal interphylum relationships have been clarified by comparisons of the gene membership within Hox clusters (de Rosa et al., 1999); and (4) a study of the presence of spliceosomal introns supports the monophyly of Actinopterygii and clarifies several relationships within the group, including the basal position of bichirs (Venkatesh et al., 1999). For further discussion see Murphy et al. (2004), Okada et al. (2004), and Boore (2006).


13.4 What are the advantages of using these genome-level characters?

In general, these types of features would be expected to change in a saltatory, non-clocklike manner. Favouring such characters may seem, at first, to be wrong-headed, since great effort has been expended in many studies to identify clocklike characters, to enable accurate molecular clock estimates of time of divergence. But it is this aspect that makes these genome-level characters especially useful for addressing the most difficult branch points, those with a short time of shared history followed by a long period of divergence, as mentioned above. It is for resolving these relationships that clocklike behaviour guarantees failure, since the ratio of signal to noise will closely match the ratio of the two time periods. Rather it is the least clocklike of characters that are expected to prevail, where an occasional and abrupt change may have occurred and then persisted (Figure 13.2). Admittedly, the concomitant disadvantage is that many such characters must typically be examined in order to find those that happened to have changed during the period of shared ancestry and so mark the relationship (see Boore, 2006, for further analysis and discussion).
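
The trade-off described above can be explored with a toy calculation. The sketch below is only an illustration under stated assumptions that are not in the original text: each character is treated as changing at its own constant Poisson rate, a change on the short internode counts as signal, and any later change on either of the two descendant lineages is taken to erase it. The function name and the numbers used are hypothetical.

```python
import math


def retained_signal_probability(rate, internode, divergence):
    """Probability that a character changed during the shared internode and
    then stayed unchanged on both descendant lineages.

    rate       -- assumed expected changes per unit time for this character
    internode  -- duration of shared ancestry (short)
    divergence -- duration of separate evolution after the split (long)
    """
    changed_on_internode = 1.0 - math.exp(-rate * internode)
    untouched_afterwards = math.exp(-2.0 * rate * divergence)
    return changed_on_internode * untouched_afterwards


if __name__ == "__main__":
    internode, divergence = 1.0, 50.0  # arbitrary time units, ratio 1:50
    for rate in (1.0, 0.1, 0.01, 0.001):
        p = retained_signal_probability(rate, internode, divergence)
        print(f"rate = {rate:<6} P(intact synapomorphy) = {p:.3e}")
```

Under these assumptions the rarely changing characters are the ones most likely to carry an intact synapomorphy, but the absolute probability for any single character stays small, which matches the point that many such characters must be examined.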



Figure 13.2 Illustration of why clocklike characters (a) may be less informative than non-clocklike characters (b) when the internode between subsequent lineage splits is short. Each of the four shapes is meant to be a character with states indicated by patterning. In (a) the circle and triangle are not informative and the square and pentagon are homoplasious. The two changes accumulated in the common ancestor of taxa 1 and 2 (for the pentagon and circle), which were at one point synapomorphies, have been erased by subsequent changes. In (b), the changes are rarer and saltatory. The pentagon and triangle are not informative and the circle is constant, but the square is informative for uniting taxa 1 and 2.




13.5 What about clades without representative genome sequences?

This enormous data set provides a new class of characters that could lead to definitive resolution of some branches of the tree of life, not only for these taxa but also for others where targeted searches for identified characters could be fruitful. As shown in Figure 13.1, whole-genome sampling will include many major lineages, but not all. Fortunately, we can use genomes in hand to identify sets of genome-level characters that can be diagnostic for the relationships of related groups without genome projects. One could then determine gene order by using Southern hybridization, for example, or probe a large DNA insert library (i.e. in bacterial artificial chromosome (BAC) or fosmid vectors) to find a clone to sequence for the region of interest of the genome. Gene rearrangements, losses, and duplications can also be identified using comparative genomic hybridization (CGH) chips with tiled large-insert clones, as has been done for a sampling of diverse human populations (Sharp et al., 2005) and more broadly across the great apes (Locke et al., 2003) or by using arrays of oligonucleotides (representational oligonucleotide microarray analysis (ROMA); Sebat et al., 2004).
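
One way to picture how genomes in hand might yield candidate characters for targeted follow-up is to compare gene orders directly. The sketch below is a hypothetical illustration, not a method described in the text: gene orders are written as lists of (gene, strand) pairs, the genome is assumed to be circular, and an adjacency is treated as a candidate shared derived feature when every ingroup genome has it and no outgroup genome does. The gene names and orders are invented for the example.

```python
def adjacencies(order, circular=True):
    """Return the set of gene adjacencies in a signed gene order.

    order is a list of (gene, strand) pairs with strand +1 or -1. Each
    adjacency is stored in a canonical form so that reading the genome from
    the opposite strand yields the same set.
    """
    pairs = list(zip(order, order[1:]))
    if circular and len(order) > 1:
        pairs.append((order[-1], order[0]))
    result = set()
    for (g1, s1), (g2, s2) in pairs:
        forward = ((g1, s1), (g2, s2))
        reverse = ((g2, -s2), (g1, -s1))  # same junction, other strand
        result.add(min(forward, reverse))
    return result


def candidate_synapomorphies(ingroup_orders, outgroup_orders):
    """Adjacencies present in all ingroup orders and in no outgroup order."""
    shared = set.intersection(*(adjacencies(o) for o in ingroup_orders))
    ancestral = set().union(*(adjacencies(o) for o in outgroup_orders))
    return shared - ancestral


def plus_strand(genes):
    """Helper for the example: place every gene on the + strand."""
    return [(g, 1) for g in genes]


if __name__ == "__main__":
    # Invented gene orders for illustration only.
    outgroup = plus_strand(["cox1", "cox2", "atp8", "atp6", "cox3"])
    taxon_a = plus_strand(["cox1", "atp8", "cox2", "atp6", "cox3"])
    taxon_b = plus_strand(["cox1", "atp8", "cox2", "cox3", "atp6"])
    for adjacency in sorted(candidate_synapomorphies([taxon_a, taxon_b], [outgroup])):
        print(adjacency)
```

A junction flagged this way could then be checked in a taxon lacking a genome project, for example by sequencing a clone spanning that region as described above.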


13.6 What are the main challenges before us?

First, we must increase the representation of understudied groups of animals for large-scale genomic sequencing. There is no reason to believe that taxa that have been traditionally studied intensively, i.e. those with higher species richness, greater breadth of niche occupation, more important roles in pathogenesis, or amenability to laboratory experimentation, will be more informative toward the goals of understanding broad patterns of the evolution of animals and their genomes. Next, we need to have a codification of nomenclature for genes that is based on assessment of orthology (Dehal and Boore, 2006). The renaming of genes to indicate orthology is not feasible because it would render large bodies of literature difficult to interpret and because scientists who study model organisms (and who have largely done the naming) are invested in their parochial nomenclature. Thus, the solution must be a lexicon superimposed on the names already in place. Third, a system must be devised for codifying the genome-level characters themselves for entry into databases and matrices for broad comparisons. Lastly, we need the community to devise standards of interpretation and analysis, such as the use of cladistic reasoning rather than associating taxa by similarity alone (Boore, 2006). Then it seems likely that genome-level characters will provide the best data set for convincingly reconstructing relationships for some of the most hotly contended nodes in the tree of life and for establishing a framework for all organismal relationships.
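
The difference between cladistic reasoning and grouping by overall similarity can be shown with a small character matrix. The sketch below is a hypothetical example: the taxa, the five binary genome-level characters, and the use of a single outgroup to mark the ancestral state are all assumptions made for illustration, not data from the text.

```python
from itertools import combinations

# Rows are taxa, columns are invented presence (1) / absence (0) characters.
OUTGROUP = (0, 0, 0, 0, 0)
MATRIX = {
    "A": (1, 1, 1, 1, 0),  # shares character 1 with B, plus unique changes
    "B": (1, 0, 0, 0, 0),
    "C": (0, 0, 0, 0, 1),
}


def overall_similarity(a, b):
    """Count characters with matching states, whether ancestral or derived."""
    return sum(x == y for x, y in zip(a, b))


def shared_derived(a, b, outgroup):
    """Count characters where both taxa share a state that differs from the
    outgroup state (a simple stand-in for a synapomorphy count)."""
    return sum(x == y != o for x, y, o in zip(a, b, outgroup))


def best_pair(score):
    """Return the pair of taxa with the highest score."""
    return max(combinations(sorted(MATRIX), 2),
               key=lambda pair: score(MATRIX[pair[0]], MATRIX[pair[1]]))


if __name__ == "__main__":
    by_similarity = best_pair(overall_similarity)
    by_synapomorphy = best_pair(lambda a, b: shared_derived(a, b, OUTGROUP))
    print("grouped by overall similarity:", by_similarity)
    print("grouped by shared derived states:", by_synapomorphy)
```

In this invented matrix, B and C resemble each other only because both retain ancestral states, whereas A and B share the single derived state; counting matches alone would group the wrong pair.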
