Sunday, March 21, 2010

MicroRNAs and metazoan phylogeny: big trees from little genes

Understanding the evolution of a clade, from either a morphological or genomic perspective, first and foremost requires a correct phylogenetic tree topology. This allows for the polarization of traits so that synapomorphies (innovations) can be distinguished from plesiomorphies and homoplasies. Metazoan phylogeny was originally formulated on the basis of morphological similarity, and in some areas of the tree was robustly supported by molecular analyses, whereas in others it was strongly repudiated. Nonetheless, some areas of the tree still remain largely unknown, despite decades, if not centuries, of research. This lack of consensus may be largely due to apomorphic body plans combined with apomorphic sequences. Here, we propose that microRNAs (miRNAs) may represent a new data set that can unequivocally resolve many relationships in metazoan phylogeny, ranging from the interrelationships among genera to the interrelationships among phyla. miRNAs, small non-coding regulatory genes, show three properties that make them excellent candidates for phylogenetic markers: (1) new miRNA families are continually being incorporated into metazoan genomes through time; (2) they show very low homoplasy, with only rare instances of secondary loss, and only rare instances of substitutions occurring in the mature gene sequence; and (3) they are almost impossible to evolve convergently. Because of these three properties, we propose that miRNAs are a novel type of data that can be applied to virtually any area of the metazoan tree, to test among competing hypotheses or to forge new ones, and to help finally resolve the correct topology of the metazoan tree.
15.1 Introduction

Since the dawn of molecular phylogenetics, the relationships between animal groups, from species to the deepest nodes in Metazoa, have been the domain of ribosomal, mitochondrial, and nuclear protein-coding genes. Orthologous genes are amplified and sequenced, the sequences are aligned, and the alignment is analysed with increasingly sophisticated phylogenetic algorithms to gain an estimate of relationships. It is unarguable that our understanding of metazoan phylogeny has progressed through the use of these genes and the application of standard phylogenetic methods. Many relationships originally proposed on morphological grounds have been confirmed, while others, such as the grouping of annelids and arthropods as the Articulata, have been strongly refuted, leading to a new understanding of morphological evolution (Eernisse and Peterson, 2004; Halanych, 2004). However, many areas of the metazoan tree have remained recalcitrant, yielding trees with low statistical support or with little resemblance to any credible scenario of morphological evolution. It has often been assumed that these problems would disappear as more data (i.e. more genes and/or more taxa) were applied to the questions at hand. A sampling of the literature on multigene phylogenetics demonstrates that this has not been the case. Indeed, despite the fact that the amount of sequence data in public databases such as NCBI’s GenBank doubles every 10 months, many phylogenetic questions remain as intractable today as they were before the advent of molecular



ANIMAL EVOLUTION



systematics. As just one example, from one of our parochial areas of interest, the interrelationships among three lophotrochozoan phyla, the nemerteans, annelids, and molluscs, still remain effectively unknown. This is despite a number of multigene studies that have recently been published, including complete 18S + 28S ribosomal RNA genes (Passamaneck and Halanych, 2006), complete mitochondrial genomes (Yokobori et al., 2008), multiple PCR-amplified nuclear housekeeping genes (Helmkampf et al., 2008a; Peterson et al., 2008), and expressed sequence tag (EST) studies (Dunn et al., 2008; Struck and Fisse, 2008), with all three possible arrangements of these three phyla being advocated by at least one data set (Figure 15.1).
Three problems have always plagued (and will forever plague) the field of molecular phylogenetics: differential rates of molecular evolution, long internodes caused by a recent origin of the crown group, and fast, deep radiations. Indeed, in our reading of the metazoan phylogenetic literature, a large number of questions were robustly answered in the first or second pass using 18S rDNA, and were then largely confirmed using other types of data and/or algorithms. But the remaining nodes, which are usually hampered by at least one of these three problems, have remained largely intractable despite the ever-increasing number of taxa and genes being applied. Because of this, we believe that it is not more of the same data that will ultimately answer these questions, but new types of data.
It was originally hoped that large-scale genomic changes, such as gene rearrangements in mitochondrial genomes (Boore et al., 1998) or insertion–deletion events, retroposon integrations, or gene duplications in nuclear genomes (Rokas and Holland, 2000), would provide this new data set, and provide a complementary approach to sequence-based phylogenetic estimation. These sources have provided robust support to topologies previously identified in sequence-based phylogenetic studies, notably the placement of phoronids and brachiopods within the Protostomia (Helfenbein and Boore, 2004) in the case of mitochondrial gene order, or resolution of the whale–hippo clade by retroposon analysis (Shimamura et al., 1997; Nikaido et al., 1999) in the case of nuclear genome changes. Ultimately, however, these structural changes have not been the panacea it was hoped they would be. The most comprehensive coding of mitochondrial gene order demonstrated that in some clades, such as the vertebrates, rearrangement was too slow or non-existent, leading to a polytomy, whereas in other clades, such as the molluscs, rearrangement was too fast, leading to a nonsensical tree (Fritzsch et al., 2006). Mutational decay of the flanking regions surrounding retroposons makes their detection, at best, difficult in taxa that diverged more than c. 50 million years ago (Ma), restricting their utility



Figure 15.1 The three proposed hypotheses for the interrelationships among nemerteans, annelids, and molluscs with respect to arthropods. (a) The Neotrochozoa hypothesis (Peterson and Eernisse, 2001) posits that annelids and molluscs are each other’s closest relatives with respect to nemerteans, found using morphological characters (Peterson and Eernisse, 2001) as well as an analysis using concatenated amino acid sequences of nuclear protein-coding genes (Peterson et al., 2008). (b) Mollusca + Nemertea was found with concatenated amino acid sequences of nuclear protein-coding genes (Hausdorf et al., 2007; Helmkampf et al., 2008a; Struck and Fisse, 2008), as well as concatenated amino acid sequences of mitochondrial protein-coding genes (Yokobori et al., 2008). (c) Annelida + Nemertea was found using combined 18S rDNA + 28S rDNA by Passamaneck and Halanych (2006) and concatenated amino acid sequences of nuclear protein-coding genes (Dunn et al., 2008).

MICRORNAS



to the post-Mesozoic portions of the phylogeny (Luo, 2000; Rokas and Holland, 2000). Moreover, the utility of these rare events has been hampered by their very rarity; although in some fortuitous examples a strong synapomorphy is captured, they are simply not present in sufficient numbers such that investigators can reliably base a research program around using them to test hypotheses concerning metazoan interrelationships.
The ultimate problem in resolving evolutionary relationships is homoplasy: similarity caused not by shared ancestry but by convergent evolution, or loss and reversion to the primitive condition. Homoplasy occurs in every data set, from examples like the torpedo shape of fish, whales, and ichthyosaurs, to the rapid gene rearrangements in molluscan mitochondrial genomes, to multiple substitutions or convergent changes in gene sequences that are limited to only four nucleotide or 20 amino acid character states. Both elevated sequence evolution in some taxa and long internodes cause homoplasy in molecular data sets, causing informative synapomorphies to be eroded to misleading homoplasies. The key to resolving intractable nodes will lie in data sets that minimize homoplasy as much as possible, but whose characters arise or change at a high enough rate that they record the divergences in question. In this paper, we propose that short, highly conserved genes within the non-coding portion of the genome, specifically miRNAs, may be one such data set. Not only is it almost impossible to evolve the same miRNA twice independently, but miRNAs are continually being added to metazoan genomes through time; only rarely are they secondarily lost, and nucleotide substitutions to the mature gene sequence are infrequent. Importantly, ascertaining the miRNA complement of a taxon does not require any prior knowledge of the miRNA sequences themselves, greatly facilitating their utility for attacking phylogenetic questions at all scales of the animal tree, from phyla to species.


15.2 Background

Briefly, miRNAs are small, c. 22 nucleotides, non-coding genes that negatively regulate protein-coding genes by binding, with imperfect complementarity, to sites in their 3′ untranslated regions (UTRs), thereby subjecting the transcript to cleavage or to blockage of its translation (Zhao and Srivastava, 2007; Filipowicz et al., 2008; Hobert, 2008; Stefani and Slack, 2008). The first retrospectively recognized animal miRNA, lin-4, was discovered in the nematode worm Caenorhabditis elegans, where it is involved in regulating the timing of cell cycle division in the larval worm by binding to target sites in the protein-coding gene lin-14 and preventing its translation (Lee et al., 1993; Wightman et al., 1993). Because lin-4 could not be found outside of nematodes, it was considered to be a quirk of the nematode developmental process. The wider significance of this discovery became clear when it was shown that this type of gene regulation exists in other systems, particularly vertebrates (Ruvkun et al., 2004; Wickens and Takayama, 1994). This occurred with the finding of a second miRNA, let-7, which was again originally discovered in C. elegans (Reinhart et al., 2000), but was soon found in numerous other taxa including fruitflies and vertebrates (Pasquinelli et al., 2000), and quickly led to the discovery of many small regulatory RNAs subsequently named miRNAs (Lagos-Quintana et al., 2001; Lau et al., 2001; Lee and Ambros, 2001). let-7 had three intriguing characteristics that held promise for a future role in phylogenetic reconstruction for these small RNA genes (Pasquinelli et al., 2000). First, the mature gene product of let-7 is unchanged in sequence between nematodes, humans, and Drosophila, despite a total of almost 2000 million years of independent evolution in these three taxa. Second, let-7 was found in every protostome and deuterostome analysed, with no suggestions of secondary loss. Third, the gene was not present in any non-metazoan genome and was not detectable by Northern analysis in sponges or cnidarians, suggesting that the gene arose within Eumetazoa at the base of the nephrozoan triploblasts (i.e. protostomes and deuterostomes). Subsequent studies confirmed this pattern for let-7 as the gene was found in, for example, chaetognaths, nemerteans, and polyclad and triclad flatworms, but not in acoel flatworms or ctenophores (Pasquinelli et al., 2003).
miRNAs are defined by their mode of biogenesis, which is intimately related to their unique hairpin secondary structure (Figure 15.2) and not





Figure 15.2 Alignment and secondary structure of representative sequences of the miRNA bantam. The mature sequence of the Drosophila melanogaster (Dme) bantam gene (miRBase) was used as a query against the trace archive sequences of Daphnia pulex (Dpu), Ixodes scapularis (Isc), and Capitella sp. (Csp) using the default settings (see Wheeler et al., 2009). About 85 nucleotides of the best hits were then folded using mfold (Zuker et al., 1999). Shown at the top is an alignment of these best hits using the default settings of ClustalW (MacVector, version 9.5.2), and shown below are the structures of two of these sequences, D. melanogaster (left) and Capitella (right) as determined by mfold. The mature, star, and loop regions are indicated.





by their specific nucleotide sequence. There are two components to a miRNA: the mature gene product, which is what binds to the 3′ UTRs of target genes, and the star sequence, the complement of the mature sequence, which is often degraded but is sometimes used as a gene product as well. miRNAs, which can be located either in intergenic regions or in introns, are transcribed as long primary transcripts that are capped and polyadenylated in typical Pol II fashion. However, because of its internal complementarity and the spacing of the complementary stretches, the primary miRNA transcript folds into a hairpin structure, which is recognized by an enzyme complex involving at least two proteins, Drosha and Pasha, which cleave the primary transcript into a c. 70 nucleotide precursor miRNA (Kim, 2005). This pre-miRNA is then exported into the cytoplasm where it is further processed by another RNase enzyme called Dicer, and the mature gene product is then incorporated into an RNA–protein moiety that serves as the repressive entity with respect to messenger RNA translation and/or stability. Hence, miRNA biogenesis relies solely on miRNA structure and not on miRNA sequence per se, greatly facilitating their utility for phylogenetics because it obviates the need for a researcher to know any particular miRNA sequence (see below).
miRNAs are named in sequential order of discovery, with identical or near identical mature sequences in the same or different organism given the same number (Ambros et al., 2003). miRNAs given different numbers have different primary sequences and are assumed to have arisen independently of other named miRNAs. This can be shown using a standard maximum parsimony analysis. If the first 20 miRNAs listed for both the fly Drosophila melanogaster and the human are aligned (Figure 15.3, left), and analysed using bootstrap analysis (Figure 15.3, right; see legend for details) one can easily see the orthology between similarly named miRNAs in the fly and human (e.g. let-7, miR-1), and the paralogy of similarly named miRNAs in each taxon (e.g. miR-2). Further, the unique nature of each numbered miRNA or groups of miRNAs is readily apparent as they share virtually no similarity with any other miRNA in the alignment (Figure 15.3, left) and do not cluster together in the bootstrap analysis (Figure 15.3, right).
However, phylogenetic analyses are rarely, if ever, used to help name miRNAs, and thus nomenclature problems can and do arise. For example, the two copies of miR-13 group with miR-2 (Figure 15.3, right), not unexpected given a cursory look at the alignment (Figure 15.3, left), and hence there are five, not three, copies of miR-2 in the fly genome. Even worse is when the same miRNA is given different names in different organisms. For example, Sempere et al. (2006) reconstructed the protostome-specific set of miRNAs to include miR-8, and the deuterostome-specific set of miRNAs to include miR-141 and miR-200, and part of the reason for this was that the seed sequences of these genes, which

Figure 15.3 Alignment (left) and phylogenetic analysis (right) of the first 20 miRNAs listed for the fly Drosophila melanogaster (Dme) and the human Homo sapiens (Hsa) in miRBase. The alignment used the same parameters as Figure 15.2. Right: The 70% bootstrap tree. Sequences were analysed by PAUP, version 4.0b10 (Swofford, 2002) using maximum parsimony. Nodes found less than 70% of the time (1000 replications) were collapsed into polytomies. Note that similarly named miRNAs in the two systems cluster together, as do obvious paralogues in each system (e.g. let-7, miR-1). Note also that some differently numbered miRNAs (e.g. miR-2 and miR-13) group together as well, and as such constitute miRNA families, similar to, for example, the let-7 family or miR-1 family.



are positions 2–8 of the mature gene products, were slightly different (Figure 15.4, left). And because the seed sequence is the most important area of the mature gene sequence for target recognition it is primarily used for family-level classification (Filipowicz et al., 2008). But the use of only the seed sequence to name (and hence classify) miRNAs is a functional rather than a phylogenetic distinction, and in this case it is clear from a bootstrap analysis (Figure 15.4, right) that this is the same gene family, with the protostome versions called miR-8, and deuterostome versions called miR-141/200.
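The seed-based comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the two 8-nucleotide 5′ prefixes (written as RNA) are quoted from memory of the miRBase entries for fly miR-8 and human miR-200a, so they should be checked against miRBase before being relied on.

```python
def seed(mature: str) -> str:
    """Return the seed, i.e. positions 2-8 (1-based) of a mature miRNA."""
    if len(mature) < 8:
        raise ValueError("need at least 8 nucleotides to extract a seed")
    return mature[1:8]


def seed_differences(a: str, b: str) -> list:
    """1-based positions in the mature sequence where the two seeds differ."""
    return [i + 2 for i, (x, y) in enumerate(zip(seed(a), seed(b))) if x != y]


# Assumed 5' prefixes (first 8 nucleotides only), not full mature sequences.
MIR8_PREFIX = "UAAUACUG"     # Drosophila miR-8 (assumption)
MIR200A_PREFIX = "UAACACUG"  # human miR-200a (assumption)

print(seed(MIR8_PREFIX), seed(MIR200A_PREFIX))
print(seed_differences(MIR8_PREFIX, MIR200A_PREFIX))
```

A seed-only classifier would separate these two sequences on the strength of one changed position, which is the functional (rather than phylogenetic) distinction the text warns about.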


15.3 miRNAs as phylogenetic characters

miRNAs show three characteristics that make them outstanding candidates to arbitrate among competing phylogenetic hypotheses and to forge new ones: (1) new miRNA families are continuously being added to metazoan genomes through time; (2) once incorporated into a gene regulatory network, there are only rare instances of secondary gene loss and they show only rare nucleotide substitutions to the mature gene product; and (3) there is an infinitesimally small chance that miRNAs with the same mature sequence will evolve more than once.


15.3.1 Continuous addition of miRNA families to metazoan genomes

Sempere et al. (2006) showed that the miRNA repertoires of both fly and human were added sequentially through time such that each node leading to the fly, or to the human, could be characterized by

Figure 15.4 Homologous, but differently numbered, miRNAs in protostomes and deuterostomes. miR-8 in protostomes is clearly the same miRNA as miR-141 and miR-200 in deuterostomes, and indeed is supported as such in a 70% bootstrap analysis (right), despite some chordate paralogues possessing changes in the seed sequence (nucleotides 2–8) with respect to miR-8. Specifically, the human miR-141 and miR-200a, and the third paralogue found in the genomic traces of the cephalochordate Branchiostoma floridae (Bfl3), have T to C changes in position 4 (left). Other abbreviations: Sko, Saccoglossus kowalevskii; Spu, Strongylocentrotus purpuratus; Aga, Anopheles gambiae; Ame, Apis mellifera; Bmo, Bombyx mori; Lgi, Lottia gigantea.




a distinctive miRNA or set of miRNAs. To explore this further, we again traced the phylogenetic history of 132 uniquely numbered D. melanogaster miRNAs (see Heimberg et al., 2008, for vertebrates), but this time used many more genomes combined with 454 sequencing of small RNA libraries (Wheeler et al., 2009). We chose the arthropod example for three reasons. First, the miRNA repertoire of D. melanogaster is the most extensively studied of any model organism (Ruby et al., 2007; Stark et al., 2007a,b), and we can be confident we are examining almost every miRNA in the organism. Second, there are a large number of arthropod genomes available, including 12 from the genus Drosophila alone, allowing us to trace the phylogenetic acquisition over a range of taxonomic scales. And third, and most importantly for testing this data set for phylogenetic utility, there is an accepted phylogeny, allowing us to map miRNA gain (and losses, see below) against a known topology (Stark et al., 2007a).
Figure 15.5 shows the phylogenetic history of all 132 D. melanogaster miRNAs considered, as well as the ancient miRNAs that should be present in Drosophila, as determined by Wheeler et al. (2009), but have been secondarily lost. Where we identify the gain of a new miRNA, it is shown in black under the node with paralogues of previously existing miRNA families underlined (see figure legend for details). Importantly, every node since the divergence between D. melanogaster and demosponges, where a sequenced genome is available and/or where a miRNA library has been constructed and sequenced (e.g. Priapulida; Wheeler et al., 2009), is characterized by the addition of at least one novel miRNA. This often involves the innovation of new families (e.g. miR-2 at the base of Protostomia), but sometimes additionally involves the generation of a paralogue from an existing gene (e.g. miR-13 at the base of Ecdysozoa). Hence, miRNAs could be used to resolve the interrelationships of taxa at virtually every level in the taxonomic hierarchy, from species to phyla.
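To make the idea of node-specific miRNA gains concrete, here is a minimal Python sketch that treats miRNA families as presence/absence characters and asks which families are shared by every member of a candidate clade and found nowhere else. The matrix is a deliberately tiny, simplified illustration built from the examples mentioned in the text (let-7, bantam, miR-2, miR-13); it is not the data set of Wheeler et al. (2009), and the function name is hypothetical.

```python
from typing import Dict, Set

# Toy presence/absence data (illustrative assumption, not the published matrix).
PRESENCE: Dict[str, Set[str]] = {
    "Nematostella": set(),                                 # cnidarian
    "Homo": {"let-7"},                                     # deuterostome
    "Capitella": {"let-7", "bantam", "miR-2"},             # annelid
    "Daphnia": {"let-7", "bantam", "miR-2", "miR-13"},     # arthropod
    "Drosophila": {"let-7", "bantam", "miR-2", "miR-13"},  # arthropod
}


def clade_specific(presence: Dict[str, Set[str]], clade: Set[str]) -> Set[str]:
    """Families present in every clade member and absent from all other taxa."""
    inside = [presence[taxon] for taxon in clade]
    outside = [fams for taxon, fams in presence.items() if taxon not in clade]
    shared = set.intersection(*inside)
    seen_elsewhere = set().union(*outside)
    return shared - seen_elsewhere


protostomes = {"Capitella", "Daphnia", "Drosophila"}
ecdysozoans = {"Daphnia", "Drosophila"}
print(sorted(clade_specific(PRESENCE, protostomes)))
print(sorted(clade_specific(PRESENCE, ecdysozoans)))
```

A real analysis would also have to allow for secondary loss, so a family missing from one clade member should not automatically be discarded; this sketch ignores that complication.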
We emphasize that we are showing the phylogenetic history of the D. melanogaster miRNAs because they are well known and because the large number of genomes available enables such a study through bioinformatics alone. This does not imply that the other terminal tips will not have a similar number of miRNAs; groups such

[Figure 15.5: tree with terminal taxa Demospongia, Cnidaria, Acoela, Deuterostomia, Eutrochozoa, Priapulus, Ixodes, Daphnia, Tribolium, Aedes, D. virilis, D. mojavensis, D. grimshawi, D. willistoni, D. persimilis, D. pseudoobscura, D. ananassae, D. yakuba, D. erecta, D. sechellia, D. simulans, and D. melanogaster; 139 gains and 7 known losses (miR-242, miR-365, miR-1993, miR-2001, miR-153, miR-750, miR-71).]

Figure 15.5 Gains and secondary losses of 132 differently numbered miRNAs in Drosophila melanogaster. Gains are shown in black below the node, and the seven secondary losses are shown above the node in grey (and where they were originally acquired are shown boxed below the node). Underlined miRNAs are paralogues of previously acquired miRNAs; those that are starred have slightly different seed sequences in other species of Drosophila, but fold properly and hence are considered gains at that point on the tree. This figure only considers gains and losses on the lineage leading to the single terminal D. melanogaster and does not consider those leading to other terminals, although groups like beetles or chelicerates will clearly have their own sets of clade-specific miRNAs. Results from Demospongia to Daphnia are taken from Wheeler et al. (2009); the trace archives of the remaining insects were searched using all D. melanogaster miRNAs as query sequences. Potential hits were folded using the program mfold and assessed using standard structural criteria (see Wheeler et al., 2009, for materials and methods).



as mosquitoes or chelicerates will have their own clade-specific set of miRNAs that (we suspect) can (and hopefully will) be used to ascertain their internal phylogenetics. Indeed, Wheeler et al. (2009) showed that each major lineage of metazoans, except for Deuterostomia, could be characterized by the acquisition of at least one novel miRNA family. For example, ambulacrarians were




characterized by the addition of five novel miRNA families, and eleutherozoan echinoderms were characterized by the addition of 10 novel miRNA families. Even cnidarians have a novel miRNA family found only in Hydra and Nematostella and not anywhere else in the animal kingdom. And within these groups, the hemichordate worm Saccoglossus kowalevskii has at least an additional 34 miRNAs not found in the two echinoderms analysed, the sea urchin Strongylocentrotus purpuratus and the starfish Henricia sanguinolenta, and the hydrozoan cnidarian Hydra has at least an additional 17 miRNAs not found in Nematostella (Peterson et al., unpublished data). These novel miRNAs could then be used as phylogenetic markers to explore hemichordate and hydrozoan interrelationships, respectively, assuming they show low homoplasy. Fortunately, if they are similar to virtually all other known miRNAs, this will indeed be the case.


15.3.2 Minimal secondary gene loss and rare substitutions to the mature sequence

miRNA homoplasy results from the possible combination of two factors, the first being the convergent evolution of the same miRNA in two taxa. The second is either complete loss from the genome, or nucleotide substitutions in the mature sequence that destroy the ability to recognize its true orthology. As argued below, independent evolution of miRNAs is extremely limited, but secondary gene loss and substitutions to the mature sequence can and do occur, and could obscure not only the interrelationships among the miRNAs but among the animal taxa as well. Nonetheless, in the Drosophila example discussed above (Figure 15.5), there are only seven secondary losses in D. melanogaster as compared with 139 gains; these losses are shown in grey in Figure 15.5 with their point of origin shown below the node and their inferred location of loss above the node. Note that loss can occur at any point in the evolutionary history: two of the genes not present in the fly were lost at the base of Ecdysozoa (miR-242 and miR-365), whereas one gene (miR-71) was lost in Drosophila after this lineage split from Aedes but before the diversification of the 12 species under consideration (Figure 15.5).
If it could be shown that for most metazoan taxa miRNA gains outnumber miRNA losses by over an order of magnitude, as they do in this example, then their utility as phylogenetic markers would be unsurpassed, assuming that the mature sequence does not degrade over time. Sempere et al. (2006) argued that this was indeed the case, otherwise it would not be possible to map the origin of these 139 miRNAs with such minimal numbers of secondary losses, as in this example (Figure 15.5). Further, Sempere et al. (2006) showed that miRNAs were some of the most, if not the most, conserved genetic elements in the genome, with most fly and eutherian mammal miRNAs showing no substitutions to the mature sequence. But because their focus was necessarily on flies and vertebrates, it could be argued these evolutionary patterns were particular to flies and vertebrates. Subsequently, Wheeler et al. (2009) quantified the number and position of substitutions of all 93 shared miRNAs across 14 nephrozoan taxa, and because this study relied primarily on isolating mature sequences in small RNA libraries it was not biased towards finding only conserved miRNAs. These authors analysed 16,729 nucleotides and showed that the substitution rate of all known and novel miRNAs across these 14 taxa, whose independent evolutionary history spans over 7800 million years, is only 3.5% (567 total substitutions). When compared with 18S rDNA, one of the most conserved genes in the metazoan genome, this rate is impressively slow: aligning 18S rDNA from the same 14 taxa and removing the unalignable regions using Gblocks resulted in a substitution rate of 7.3% (Wheeler et al., 2009). Hence, miRNAs evolve more than twice as slowly as the most conserved positions in a gene that is often used for reconstructing the deepest nodes in the tree of life.
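The two comparisons in this section ("over an order of magnitude" and "more than twice as slowly") follow directly from the figures quoted in the text, as the short Python check below shows. It uses only the numbers stated above; the variable names are illustrative.

```python
# Figures quoted in the text for the Drosophila example and Wheeler et al. (2009).
gains, losses = 139, 7
mirna_rate_pct = 3.5      # substitution rate of mature miRNAs, per cent
rdna_18s_rate_pct = 7.3   # substitution rate of Gblocks-trimmed 18S rDNA, per cent

gain_to_loss = gains / losses
rate_ratio = rdna_18s_rate_pct / mirna_rate_pct

print(f"gains per loss: {gain_to_loss:.1f}")
print(f"18S rDNA rate / miRNA rate: {rate_ratio:.2f}")
```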


15.3.3 Exceedingly small probability of the independent evolution of the same miRNA

In terms of convergent evolution, each unique 22-nucleotide sequence occurs by chance once for every 1.76 × 10¹³ nucleotides (4²²), or once for




every 5864 human-genome-sized chunks of DNA queried. However, this is not an accurate estimate of the chance of the same miRNA evolving twice independently. For example, we took the (arbitrarily chosen) protostome-specific bantam miRNA gene (see Figure 15.2) from D. melanogaster and searched both protostome and deuterostome genomes for this sequence in taxa that diverged from one another at least 500 Ma (Figure 15.7). In no case was the very same 23-nucleotide sequence found in any of these genomes (and aside from hits to D. melanogaster it is not found in the nucleotide database deposited at GenBank, which consisted of 24,006,283,182 letters as of June 2008). Nonetheless, 23-nucleotide sequences were found in the two arthropods, the water flea Daphnia and the tick Ixodes, that are identical to each other but that differ from that of D. melanogaster by a single nucleotide at position 11. A single sequence was also found in the genomic traces of the sea urchin S. purpuratus that differs from the fly bantam sequence by a single nucleotide, at position 13; the best hits in all of the remaining deuterostomes have numerous differences, many of which are distributed in positions 2–6. The putative orthologue of bantam in the polychaete annelid Capitella shares the same nucleotide at position 11 as the water flea and the tick, but differs from all of the arthropods at positions 17, 20, and 23. Because clearly orthologous miRNAs often differ by two or three nucleotides, rather than computing the probability for 23 nucleotides, a more appropriate calculation is for the occurrence of a stretch of 19 nucleotides, which is expected every 2.75 × 10¹¹ bases, or once in every 91 human-genome equivalents, with the possibility of a few nucleotide substitutions (see below).
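The chance-occurrence figures in this paragraph can be reproduced with a short Python calculation. It assumes, as the text implicitly does, that the four nucleotides are equally likely and independent, and it takes the human genome to be 3 × 10⁹ nucleotides, a round figure that is an assumption here rather than a value stated in the text. The function names are illustrative.

```python
HUMAN_GENOME_NT = 3_000_000_000  # assumed round figure for one human genome


def expected_spacing(length: int) -> int:
    """Nucleotides scanned, on average, per chance hit to one specific
    sequence of the given length (equal, independent base frequencies)."""
    return 4 ** length


def genome_equivalents(length: int, genome_nt: int = HUMAN_GENOME_NT) -> float:
    """The same expectation expressed in human-genome-sized chunks."""
    return expected_spacing(length) / genome_nt


for n in (22, 19):
    print(n, f"{expected_spacing(n):.2e}", int(genome_equivalents(n)))
```

Shortening the required match from 22 to 19 nucleotides reduces the expected spacing by a factor of 4³ = 64, which is why the estimate falls from thousands of genome equivalents to fewer than a hundred.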
On the other hand, these numbers are deceptively low because there are more constraints on a miRNA than the mature sequence of 22 nucleotides; it must also fold with a free energy value lower than about –20 kcal/mol and often lower than –25 kcal/mol. In addition, the spacing has to be such that the mature sequence, which has to be located in one of the two hairpin arms, occurs within about two nucleotides from the loop, with the entire pre-miRNA generally being 60–80 nucleotides long. Further, the structure cannot contain large, and in particular asymmetrical, internal loops or bulges (Ambros et al., 2003). Thus, if one compares the two bantam miRNA sequences from Drosophila and the annelid Capitella it is obvious that these are real miRNA genes; they have the requisite free energy values and structure to be processed and thus function as bona fide miRNA genes (Figures 15.2 and 15.6). But when the deuterostome sequences are folded in silico, it is readily apparent that none of these are miRNAs, let alone orthologues of bantam. In the hemichordate (Sko), amphioxus (Bfl), and lamprey (Pma, see Figure 15.6) the free energies of these sequences are extremely high, c. –8 kcal/mol. In both the zebrafish (Dre, Figure 15.6) and ascidian (Cin) they have relatively low free energy values (c. –22 kcal/mol), but in both cases large and asymmetrical bulges and loops are present. Finally, in the sea urchin (Spu), which has the highest nucleotide similarity with the protostome sequences, not only is the free energy too high (–15 kcal/mol), but it too has large and asymmetrical bulges (Figure 15.6).
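The structural screen described in this paragraph can be sketched as a simple filter. The Python below is an illustration only: it applies the –20 kcal/mol cut-off and the bulge criterion to the free energy values reported for Figure 15.6, it does not fold anything itself (the values come from mfold in the original study), and the class, field, and function names are hypothetical. The bulge flags follow the descriptions in the text; the lamprey's is not described there and is assumed to be False.

```python
from dataclasses import dataclass
from typing import List

MAX_DG = -20.0  # kcal/mol; folds must be at or below this to pass


@dataclass
class Fold:
    name: str
    dg: float                      # initial dG reported for Figure 15.6
    large_asymmetric_bulges: bool  # as described in the text


def failed_criteria(fold: Fold, max_dg: float = MAX_DG) -> List[str]:
    """List the structural criteria that a candidate hairpin fails."""
    reasons = []
    if fold.dg > max_dg:
        reasons.append("free energy too high")
    if fold.large_asymmetric_bulges:
        reasons.append("large asymmetrical bulges or loops")
    return reasons


FOLDS = [
    Fold("Dme bantam", -25.50, False),
    Fold("Csp bantam", -29.80, False),
    Fold("Pma", -8.80, False),  # bulges not described in the text; assumed
    Fold("Spu", -15.70, True),
    Fold("Dre", -22.90, True),
]

for fold in FOLDS:
    print(fold.name, failed_criteria(fold) or "passes")
```

The zebrafish hit shows why free energy alone is not sufficient: it clears the energy threshold and is rejected only on the bulge criterion.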
These non-folds are consistent with the observed substitution profile of the mature miRNA sequence as revealed by Wheeler et al. (2009). These authors found that most substitutions occurred at the 3′ end of the mature sequence, but other regions of the gene, especially nucleotide 1 and nucleotide 10, showed a relatively high percentage of substitutions. Importantly, the two most infrequent places for substitutions to occur are the seed region (positions 2–8) and the 3′ complementarity region spanning nucleotides 13–16, especially position 15, in concordance with the hypothesized importance of these two regions for base pairing with the 3′ UTR of targets (Filipowicz et al., 2008). Thus, unlike the protostome substitutions, which occur in statistically likely places (positions 11, 17, 20, and 23, see Figure 15.2), in deuterostomes, differences occur in the most conserved areas of miRNAs, positions 2–8 and 13–15. Conservation of sequence of orthologous miRNAs is explained by the constraints governing not only folding but base-pairing with targets, and these structural considerations also explain why the same miRNA gene sequence evolving twice independently is highly unlikely.
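The positional argument can be made explicit with a small helper that assigns each position of the mature sequence to a region. This is a minimal sketch: the region boundaries (seed 2–8, 3′ complementarity 13–16) and the protostome substitution positions are taken from the text, while the function names are hypothetical.

```python
SEED = range(2, 9)                 # positions 2-8
THREE_PRIME_COMP = range(13, 17)   # positions 13-16

def region(pos: int) -> str:
    """Classify a 1-based position in the mature miRNA sequence."""
    if pos in SEED:
        return "seed"
    if pos in THREE_PRIME_COMP:
        return "3' complementarity"
    return "other"

def hits_constrained_region(positions) -> bool:
    """True if any substitution falls in the seed or 3' complementarity region."""
    return any(region(p) != "other" for p in positions)

# Protostome bantam substitutions reported in the text (Figure 15.2).
protostome_subs = [11, 17, 20, 23]
print({p: region(p) for p in protostome_subs})
print(hits_constrained_region(protostome_subs))
```

None of the protostome substitutions falls in a constrained region, whereas any difference at positions 2–8 or 13–16 would be flagged.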





[Figure 15.6: alignment rows Dme, Dpu, Isc, Csp, Spu, Sko, Bfl, Cin, Pma, Dre, and Consensus (positions 10–90), with predicted folds and initial dG values (kcal/mol): Dme-bantam –25.50, Csp-bantam –29.80, Pma –8.80, Spu –15.70, Dre –22.90.]








Figure 15.6 Alignment of the bantam gene taken from Figure 15.2 with the best hits from six different deuterostome genomes including the lamprey Petromyzon marinus (Pma) and the zebrafish Danio rerio (Dre) (other abbreviations are listed in Figure 15.4). Note that although some similarity is found in the mature sequence, especially with the sea urchin Strongylocentrotus purpuratus (Spu), there is no similarity in the star region, and, contra the protostome sequences (Dme and Csp), the deuterostome sequences do not show canonical folds (bottom), highlighting the improbability of evolving two miRNAs with the same mature sequence twice independently.



[Figure 15.7: three panels, Symsagittifera roscoffensis (left), Caenorhabditis elegans (centre), and Ciona intestinalis (right), each mapping numbered miRNA families onto a tree of Cnidarians, Lophotrochozoans, Ecdysozoans, Ambulacrarians, Cephalochordates, and Vertebrates.]

Figure 15.7 Primitive repertoire versus secondary loss of miRNAs. Shown are three taxa with reduced complements of miRNAs: the acoel flatworm Symsagittifera roscoffensis (Sempere et al., 2007; Wheeler et al., 2009), the nematode Caenorhabditis elegans (Ruby et al., 2006), and the ascidian urochordate Ciona intestinalis (Norden-Krichmaer et al., 2007). The miRNAs found in each taxon are shown in a grey box; the ones in black are known to characterize that particular node based on extensive comparative analyses (Wheeler et al., 2009). The 37 miRNA families known to characterize vertebrates (Heimberg et al., 2008) are not shown; only the three shared between vertebrates and ascidians are. Note that, unlike the acoel flatworm, the nematode and the ascidian, while missing many primitive miRNAs, do possess protostome- or chordate-specific miRNAs, respectively. In fact, the ascidian is grouped as the sister taxon to the vertebrates given that it shares three miRNA families with them (Heimberg et al., 2008). The nematode is clearly a protostome, but cannot (at the moment) be allied with the ecdysozoans on the basis of miRNAs, an alliance hypothesized from numerous other data sets.




15.4 miRNAs in organisms with fast molecular evolution and frequent gene loss

We wish to take a moment to emphasize that miRNAs are not immutable, but are components of genomes that will experience some of the same processes that affect other components, especially when there is a high rate of gene loss and/or a high substitution rate. Nonetheless, the pattern that emerges from such instances will still allow an investigator to draw accurate, if imprecise, conclusions concerning the taxon’s phylogenetic position. Ascidian urochordates, nematode worms, and acoel flatworms are all characterized by high rates of molecular evolution, and both nematodes (Copley et al., 2004) and ascidians (Hughes and Friedman, 2005) are further characterized by large amounts of secondary gene loss. Both nematodes and ascidians have taken a phylogenetic ‘bump up’ recently, with nematodes going from basal bilaterians to near relatives of arthropods (Aguinaldo et al., 1997), and ascidians moving from basal chordates to the sister taxon of vertebrates (Delsuc et al., 2006). Acoels, on the other hand, have followed the opposite phylogenetic trajectory: they were originally included within the Platyhelminthes, but multiple studies with different genes have suggested that acoels and nemertodermatids form a grade or clade at the base of Bilateria (recently reviewed in Baguñà et al., 2008).
Because all three of these hypotheses are controversial, they serve as useful test cases for the utility of miRNAs. Figure 15.7 shows the phylogenetic distribution of miRNAs in the acoel flatworm Symsagittifera roscoffensis (Wheeler et al., 2009), the nematode C. elegans (Ruby et al., 2006), and the ascidian Ciona intestinalis (Norden-Krichmaer et al., 2007). In stark contrast to both the nematode and the ascidian, the acoel possesses only a subset of the bilaterian set of miRNAs, and no miRNAs that characterize higher clades, namely Protostomia or Platyhelminthes (Figure 15.7, left; see also Sempere et al., 2007). Clearly, if acoels were in fact Platyhelminthes, or even within the protostomes or deuterostomes as suggested by recent EST studies (Philippe et al., 2007; Dunn et al., 2008), and have secondarily lost miRNAs so as to artificially appear to be basal bilaterians, then they have lost miRNAs in an extremely unlikely pattern. The position recovered by Dunn et al. (2008), which also corresponds to the traditional morphological hypothesis that allies them with the Platyhelminthes, would require the loss of 26 nephrozoan-specific and 12 protostome-specific miRNAs, in addition to some unknown set of platyzoan-specific miRNAs.
The miRNA complement of C. elegans is well known from deep 454 sequencing (Ruby et al., 2006), and it is clear that it has lost a number of miRNAs (Figure 15.7, centre), as it possesses just over half of the reconstructed ancestral bilaterian miRNA-family complement (19 of 34). Nonetheless, nematodes are clearly not basal bilaterians, as they also have half of the protostome-specific miRNAs (6 of 12). The ascidian C. intestinalis has also lost many miRNA families (Figure 15.7, right), as it possesses only 14 of the 34 miRNA families present in the last common ancestor of protostomes and deuterostomes, but it also has the chordate-specific miRNA miR-217, and three miRNAs otherwise found only in vertebrates (Heimberg et al., 2008), which is entirely consistent with the hypothesis that ascidians, and not cephalochordates, are the sister taxon to the vertebrates (Delsuc et al., 2006). Thus, it is these mosaic patterns of miRNA gene loss that distinguish a secondary reduction in miRNA content from primary absence, and that distinguish nematodes and ascidians from acoels (Sempere et al., 2007).
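The logic that separates secondary reduction from primary absence can be sketched as a two-question test: is part of the ancestral complement missing, and are any miRNAs of a more derived clade present? The counts below come from the text (19 of 34 and 6 of 12 for C. elegans; 14 of 34 plus miR-217 and three vertebrate-shared families for C. intestinalis); the function name and return labels are hypothetical, and the acoel is entered only qualitatively because the text does not give its count of retained bilaterian families.

```python
def classify_reduction(missing_ancestral: bool, derived_present: int) -> str:
    """Interpret a reduced miRNA repertoire.

    missing_ancestral: some ancestral bilaterian families are absent
    derived_present:   number of families specific to a more derived clade
    """
    if not missing_ancestral:
        return "full ancestral complement"
    if derived_present > 0:
        return "secondary loss (mosaic)"
    return "consistent with primary absence"

# (retained, total) ancestral bilaterian families; derived families present.
observations = {
    "C. elegans": ((19, 34), 6),       # 6 of 12 protostome-specific
    "C. intestinalis": ((14, 34), 4),  # miR-217 plus three vertebrate-shared
}

for taxon, ((kept, total), derived) in observations.items():
    verdict = classify_reduction(kept < total, derived)
    print(f"{taxon}: {kept / total:.0%} retained, {verdict}")

# Acoel: only a subset of bilaterian families, none from higher clades.
print("S. roscoffensis:", classify_reduction(True, 0))
```

This is deliberately coarse: it captures the qualitative contrast drawn in the text, not a statistical test of how unlikely a given loss pattern is.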
There is a good reason for suspecting high gene loss and/or high rates of sequence evolution in miRNA genes in these two taxa. The presence and sequence constraint of miRNA is probably dictated, to a large degree, by targeting numerous messenger RNA gene products. Several studies have shown that miRNAs can regulate up to hundreds of protein-coding genes (Lim et al., 2005; Baek et al., 2008; Selbach et al., 2008), and because miRNAs regulate so many different transcripts, and must functionally interact with the 3′ UTR of all targets, it is difficult to lose the gene or change the primary sequence. Nematodes and ascidians have both lost a considerable fraction of their protein-coding genome, and consequently, in nematodes at least, each miRNA probably regulates, at best, only one or a few protein-coding genes (Ambros and Chen, 2007). This allows for individual miRNAs to be more easily lost when their target messenger RNA is lost, or else to track the target site without being constrained by other targets. Thus, if a taxon is known to have a high proportion of secondary gene losses, it is likely that there will also be a relatively high number of missing and/or unrecognizable miRNAs. The pattern should nevertheless be both mosaic and random, still allowing for an accurate (but possibly imprecise) placement on the metazoan tree of life.


15.5 Returning to the lophotrochozoan problem . . .

Ultimately, taxa such as nematodes are exceptions; most organisms have not experienced drastic gene losses. We hypothesize that lophotrochozoans in particular, which show little secondary gene loss and very few secondary modifications to their genomes (Tessmar-Raible and Arendt, 2003; Raible et al., 2005), will make a near-perfect test case for miRNA phylogenetics. Returning to the problem introduced earlier in the paper, the interrelationships among nemerteans, annelids, and molluscs with respect to arthropods (Figure 15.1), the miRNAs are unequivocal. Both Eutrochozoa and Neotrochozoa are monophyletic, as nemerteans share with annelids and molluscs three unique miRNA families, one of which is the star sequence of miR-958, and annelids and molluscs share two miRNA families not found in nemerteans or any other taxon, one of which is the star sequence of an ancient miRNA family, miR-133 (Wheeler et al., 2009). Further, nemerteans do not share any miRNAs with either the annelids or the molluscs to the exclusion of the other, nor do they share with annelids and molluscs second copies of miR-10 and miR-22. Thus, among the three possible arrangements of these three taxa (Figure 15.1), miRNAs support the topology derived from morphological and embryological considerations (Figure 15.1a) (Peterson and Eernisse, 2001).
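The comparison among the three possible arrangements amounts to counting miRNA families shared exclusively by each pair of taxa. The sketch below reproduces only the counts given in the text (three families shared by all three taxa, two shared by annelids and molluscs alone); the family labels `f1`–`f3` and `g1`–`g2` are hypothetical placeholders, not real miRNA names.

```python
from itertools import combinations

# Hypothetical family labels; only the counts come from the text.
families = {
    "nemertean": {"f1", "f2", "f3"},
    "annelid": {"f1", "f2", "f3", "g1", "g2"},
    "mollusc": {"f1", "f2", "f3", "g1", "g2"},
}

def exclusive_pair_support(families):
    """Count families shared by each pair of taxa and absent from all others."""
    support = {}
    for a, b in combinations(sorted(families), 2):
        others = set().union(
            *(s for taxon, s in families.items() if taxon not in (a, b))
        )
        support[(a, b)] = len((families[a] & families[b]) - others)
    return support

support = exclusive_pair_support(families)
for pair, n in support.items():
    print(pair, n)
```

Only the annelid plus mollusc pairing receives exclusive support, which is the pattern the text describes as favouring Neotrochozoa within Eutrochozoa.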


15.6 Methodology for miRNA phylogenetics

Because the primary sequence of the mature sequences of miRNAs is so fundamentally conserved, miRNA phylogenetics is essentially a binary system, involving simply the presence or absence of given miRNAs in different organisms. miRNAs can be identified as present in an organism by bioinformatic searches in genomes, Northern analysis, or sequencing of libraries targeting the products of Dicer cleavage, usually with new high-throughput sequencing technologies (e.g. Wheeler et al., 2009). The literature on discovery and validation of miRNAs is too large to cover in this chapter, and we point the reader towards the general reviews cited above as a starting point to this literature, as well as to Ambros et al. (2003), which explains the requirements for annotation in miRBase, the online miRNA repository (Griffiths-Jones et al., 2006).
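Because the characters are binary, a set of repertoires converts directly into a presence/absence matrix of the kind used as input to character-based phylogenetic analysis. The following is a minimal sketch; the taxon and family names are hypothetical placeholders and the function name is introduced here for illustration.

```python
def presence_matrix(repertoires):
    """Return the sorted family list and one 0/1 string per taxon."""
    families = sorted(set().union(*repertoires.values()))
    rows = {
        taxon: "".join("1" if fam in present else "0" for fam in families)
        for taxon, present in repertoires.items()
    }
    return families, rows

# Hypothetical repertoires, for illustration only.
repertoires = {
    "taxon_A": {"fam_1", "fam_2"},
    "taxon_B": {"fam_1", "fam_2", "fam_3"},
    "taxon_C": {"fam_1"},
}

families, rows = presence_matrix(repertoires)
print(families)
for taxon, row in rows.items():
    print(taxon, row)
```

Each column is one miRNA family and each row one taxon; a 0 records only that the family was not observed, which is why demonstrating true absence matters so much for this kind of data.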
With the exception of the phylogenetic position of acoels (Sempere et al., 2007; Wheeler et al., 2009), the utility of miRNAs as phylogenetic characters has not been fully tested. The main problem for miRNA-based phylogenetics is in positively demonstrating absence, which (needless to say) is far more difficult than demonstrating presence, especially in organisms without sequenced genomes. miRNAs with low expression levels will be hard to detect with both Northern analysis and libraries. As an extreme example, the miRNA lsy-6, which was discovered in C. elegans by genetic screens (Johnston and Hobert, 2003), is expressed in fewer than 10 cells, and consequently has yet to be found in small RNA libraries even by extremely deep sequencing (Ruby et al., 2006). Furthermore, reaction kinetics for Northern analyses indicate that only a few base changes will result in non-detection (Sempere et al., 2006; Pierce et al., 2008), so a negative result could reflect a few nucleotide changes rather than the absence of the gene product.
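The effect of expression level on detection can be illustrated with a simple sampling model. The sketch assumes that library reads are drawn independently and that a given miRNA makes up a fixed fraction of all reads; the fraction used below is a hypothetical value chosen for illustration, not a measurement from any of the cited studies.

```python
def p_detect(read_fraction: float, n_reads: int) -> float:
    """Probability of at least one read, assuming independent reads."""
    return 1.0 - (1.0 - read_fraction) ** n_reads

# Hypothetical miRNA contributing one read in ten million.
RARE_FRACTION = 1e-7

for depth in (10**5, 10**6, 10**7, 10**8):
    print(f"{depth:>11,d} reads: P(detect) = {p_detect(RARE_FRACTION, depth):.3f}")
```

Under this model a very rare miRNA is likely to be missed at modest depth, so its absence from a shallow library says little about its absence from the genome.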
Although the absence of a gene can never be proved, there are several relatively straightforward ways to strongly suggest absence. First, studies should strive to sequence libraries deeply enough to provide some confidence that the absence is real. The cost of next-generation sequencing technology is dropping quickly, and as this happens it will become possible to sequence many organisms deeply and obtain a near-complete understanding of their miRNA complement. Second, until the methodology of miRNA-based phylogenetics is more fully developed, studies should focus on understanding the relationships of non-genomic organisms to genomic organisms, rather than tackling questions where there are no genomes available at all. Genomes are never completed in the true sense of the word, but absence in both a finished genome and in a small RNA library is extremely unlikely to be a false negative. Working in a context where at least some of the organisms have sequenced genomes allows for the demonstration that the putative miRNA folds correctly in at least some of the organisms, precluding the possibility that the shared library reads are a degraded and highly conserved fragment of another gene. For non-genomic organisms, it is experimentally feasible to amplify miRNA loci using genome walking from the taxon of interest to demonstrate the necessary structural features (Wheeler et al., 2009). The third, and probably most important, approach, much like any other form of phylogenetic inference, is taxon sampling. Studying more than one organism per clade of interest (especially for library construction) has the benefit of sampling two different transcriptomes, which are likely to differ in their miRNA expression levels but which will help establish the polarity of individual characters and distinguish synapomorphies from plesiomorphies. As an example, miR-750 was reconstructed as a lophotrochozoan-specific miRNA by Sempere et al. (2007), but the presence of miR-750 in the small RNA library of the priapulid Priapulus caudatus (Wheeler et al., 2009) demonstrated that this was in fact a protostome-specific miRNA that had been lost at the base of Insecta (Wheeler et al., 2009; see Figure 15.5).
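The miR-750 example shows how adding one taxon moves the inferred origin of a miRNA to a deeper node. A minimal sketch of that reasoning places the origin at the most recent common ancestor of all taxa in which the miRNA is found. The tree below is a deliberately simplified assumption (Insecta is attached directly to Ecdysozoa), and the `PARENT` table and function names are hypothetical.

```python
# Simplified, assumed tree expressed as child -> parent links.
PARENT = {
    "Annelida": "Lophotrochozoa",
    "Mollusca": "Lophotrochozoa",
    "Insecta": "Ecdysozoa",
    "Priapulida": "Ecdysozoa",
    "Lophotrochozoa": "Protostomia",
    "Ecdysozoa": "Protostomia",
}

def lineage(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def mrca(taxa):
    """Most recent common ancestor of the given taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon))
    # Walk upward from the first taxon; the first shared node is the MRCA.
    for node in lineage(taxa[0]):
        if node in common:
            return node
    return None

before = mrca(["Annelida", "Mollusca"])
after = mrca(["Annelida", "Mollusca", "Priapulida"])
print(before, "->", after)
```

With the priapulid included, the inferred origin shifts from Lophotrochozoa to Protostomia, and the absence in Insecta is then read as a loss rather than as primary absence.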


15.7 Conclusions

We see two main advantages to miRNA-based phylogeny. First, as demonstrated in Sempere et al. (2006) and also in our Figure 15.5, they are applicable over a wide range of phylogenetic scales, from species divergences within a genus to phylum-level relationships at the base of Bilateria. Second, the constraints on miRNA structure mean that any taxon can be queried for its complement of miRNAs, without having prior knowledge of a single miRNA sequence, simply by building a small RNA library. Given that miRNAs are continually added over time, rarely change in primary sequence, and are only rarely secondarily lost, they are potentially the near homoplasy-free data set that systematists have long wished for, and one that can be used to resolve the interrelationships among eumetazoan taxa at virtually any hierarchical level.
15.8 Acknowledgements

KJP would like to thank the National Science Foundation for funding and T. Littlewood and M. Telford for the invitation to contribute to the symposium. EAS would like to thank the Lerner-Gray fellowship from the American Museum of Natural History, the Systematics Research Fund of the Systematics Association and the Yale Enders Fund for funding. We thank D. Pisani and S. Smith for helpful suggestions on the manuscript.
