Phylogenetic signal in highly variable chloroplast DNA markers

Recent advances in understanding the patterns and mechanisms of molecular evolution in coding and non-coding DNA regions have led to more sophisticated approaches to molecular phylogenetic reconstruction and have increased confidence in phylogenetic trees, which now serve as a framework for hypothesis testing in almost all biological sub-disciplines. Currently, the kind of marker used to answer a given phylogenetic question, the selection of included taxa, and the density of taxon sampling are to a large extent chosen subjectively by the investigators. For example, recent studies aiming to unravel angiosperm relationships range from analyses using complete chloroplast sequences for a dozen species to analyses of several hundred taxa for which only a limited number of nucleotides have been sequenced. Whether a generally ideal taxon sampling scheme can be recommended that maximizes the phylogenetic structure for a given DNA marker remains unclear. So far, there has been only limited work on optimal sampling and sequencing strategies that minimize cost while keeping the phylogenetic structure of nucleotide data sets as high as possible.

We assess the impact of sampling parameters such as marker choice, taxon selection, and density of taxon sampling on phylogenetic structure in relation to the sequencing effort required. Our long-term objective is to develop models that predict phylogenetic structure from initial data. Such models could enable researchers to assemble a minimum-cost data set optimally suited to addressing a given set of phylogenetic questions. Beyond theoretical considerations and simulation analyses, we analyze large-scale empirical data sets available from GenBank and other public sources such as TreeBASE.