Abstract

Many parts of the life sciences, including phylogenetics, phylogenomics or ecology, have become data-intensive due to increasingly cheap high-throughput sequencing technologies, the digitization of large biological collections or data contributions from citizen science. An increasing number of available and computationally accessible methods for downstream analysis that produce derived data (e.g., phylogenetic trees or character data automatically extracted from images) further contributes to the production of large quantities of potentially reusable data. This opens up new opportunities for big data studies, but also creates new challenges for infrastructure and method development. To fully use the potential of available data, two things are necessary: cyberinfrastructure for sharing and maintaining scientific data, and policies of journals and funding agencies that encourage its publication. In part, this has already been addressed by databases like Dryad and by recommendations of an increasing number of journals and funding agencies, although additional measures still need to be taken. Just as important as the availability of scientific data is its reusability. This includes the use of open and well-defined formats, as well as the semantic annotation of data with metadata of different kinds, which provides an unambiguous description and links to related information and resources. These annotations should ideally be machine-interpretable to allow reliable automated data collection for large-scale studies. In addition to its value for data reuse, proper annotation can also increase the reproducibility of studies if the methods used and the steps of workflows are directly documented using attached metadata. Ideally, this would be done by the researchers who produce the data. However, they are often unfamiliar with the necessary annotation technologies like the Resource Description Framework (RDF), advanced file formats like NeXML or biological ontologies. Therefore, software that makes this process more convenient is a key requirement in the age of big data and the semantic web.

To address these needs regarding phylogenetic data types, two approaches are followed in this thesis, which cater to both developers of bioinformatical software and researchers from any discipline dealing, e.g., with multiple sequence alignments or phylogenetic trees at any step of their workflows. First, programming libraries are introduced that provide required reusable software components. JPhyloIO allows reading and writing phylogenetic data from and to various file formats through a single memory-efficient interface, while making full use of the metadata model of each format. LibrAlign provides flexible and easily extendible GUI components for displaying and editing biological sequences and multiple sequence alignments (MSAs) closely together with any type of attached metadata. Second, these libraries form the basis of applications newly developed here that address the described needs of researchers. At the same time, new functionality exposed through these libraries is available to all developers and enables creation or extension of software for diverse biological applications that simplify data reuse through efficient annotation.

Among the developed applications, the Taxonomic Editor of the EDIT Platform for Cybertaxonomy models taxonomic workflows and persistently links all data elements to the specimen they were derived from. This is a major advantage over the traditional approach of linking all information to a taxon, because data remains reusable and interpretable if the assignment of specimens to taxa changes in taxonomic revisions. In this thesis, the Taxonomic Editor is extended to support molecular sequence data with the help of the functionality provided by LibrAlign and JPhyloIO. The two main phylogenetic data types are addressed by PhyDE 2 and TreeGraph 2, editors for multiple sequence alignments and phylogenetic trees, respectively. PhyDE 2 is a reimplementation of the currently used version of PhyDE based on LibrAlign and JPhyloIO. Although it is currently in a proof-of-concept state and does not yet offer the full feature set of the previous version, its new codebase is much easier to maintain and extend. This significantly simplifies future development towards advanced metadata modeling and towards using the potential of the new libraries. TreeGraph 2 offers versatile formatting and editing options in a user-friendly way and models any type of metadata associated with tree nodes and branches, while offering a variety of options to visualize these annotations. It makes use of JPhyloIO to read and write phylogenetic trees and their metadata.

In addition to fostering data reuse, another major goal of this thesis is to enable the comparison and combination of results from alternative methods, which is also closely linked to metadata modeling and increased reproducibility of studies. Many alternative methods to construct MSAs or phylogenetic trees are available, and choosing among them is usually non-trivial. As a result, researchers often need to carefully check for agreements and conflicts between results from alternative approaches and possibly also present a synthesis across alternatives. AlignmentComparator implements different algorithms to visually compare alternative MSAs of the same dataset in detail and allows users to identify and annotate differently and identically aligned regions. It can also be used to track subsequent automatic or manual alignment changes in workflows. TreeGraph 2 completes the required functionality by providing an interactive comparison feature for phylogenetic trees. It also allows statistical support values derived from alternative methods to be mapped onto a single reference topology, thereby highlighting topological conflicts.

Together, the developed applications support visualizing, editing and comparing all major data types of phylogenetics and related fields. They have the potential to allow convenient and complete modeling of the necessary metadata across entire phylogenetic workflows, so that these workflows produce optimally reusable data in an easily reproducible way. Easy reuse of the developed functionality is ensured by providing key functionality in separate libraries, which simplify the development and extension of further tools that offer features for easier data reuse and increased reproducibility. All developed products are freely available at http://bioinfweb.info/Software.
