Database Research Challenges in Supporting Taxonomic Names in Cell and Molecular Biology Databases.

Jessie Kennedy

Napier University

17/1/2003

j.kennedy@napier.ac.uk

The field of biological taxonomy involves taxonomists classifying and naming groups of organisms, which provides others, e.g. cell and molecular biologists, with a framework for identifying, categorising and referring to organisms. However, the process of discovering, classifying and naming all organisms on Earth is far from complete, and the continuing accumulation of knowledge results in revisions of existing classifications, with associated changes in taxon concepts and names. Although we need names or labels to refer to things, we cannot simply assume a single, common reference classification that uniquely categorises and names all organisms. The same organism may have been classified at different times according to different taxonomic opinions and consequently have several alternative names. Without halting the advancement of our knowledge of existing biodiversity, it is difficult to see how we can (in the foreseeable future) achieve a single, static index of species names that provides unique identifiers for all organisms. Therefore, we must acknowledge this issue and deal with it adequately in biological information resources that reference groups of organisms, or taxa.

Biological databases are a relatively new medium for the storage of biological information. The emphasis in the design and development of these databases has been primarily on recording the data generated from experiments on particular groups of organisms, such as nucleotide sequences, proteins, metabolic pathways and gene expression [1,2,3,4], rather than on the seemingly trivial reference to the source organism. Biologists interact with these databases using labels to refer to the specimens or organisms, including common names, generic names and species names. These are the same taxa and names used in the taxonomic literature, but without reference to the taxonomic concept associated with the label. Biological taxonomy can provide the framework by which biological information is stored, retrieved and exchanged, but biological databases must represent the taxonomic constructs accurately rather than simply use an undefined label. The major challenges in biology are to answer the "bigger" questions, which require integrating data from different experiments (and hence databases). To ensure that valid conclusions are drawn from any analysis that integrates data from different sources, it is vital that like is compared with like. This cannot be guaranteed with an unattributed name.

Several challenges for database research arise from the need to allow users to reference organisms by name while accurately representing the meaning and usage of taxonomic names. To represent taxonomic concepts adequately, the minimum information required is the full taxonomic name and a reference to the author and publication in which the concept was described [5,6]. Therefore, a biologist who names (identifies) an organism must cite the publication used for the identification. That publication will either be a taxonomic work (and hence define the concept) or should itself cite a taxonomic work, in order to fix the taxonomic concept associated with the name. This allows others to be sure of the concept associated with the name. It does not, however, allow them to compare that concept automatically with other concepts, unless they are experts in the taxonomic group concerned. In order to interpret the relationships between taxonomic names (concepts), one must know not only the classification assumed by the original publication but also the nomenclatural and taxonomic changes that relate that classification to others. There are two general ways in which this can be done. First, if a sufficient description of the taxon concept has been captured [7], it could be possible to determine the similarity of concepts automatically. However, for most historical classifications too little information has been recorded to enable this, so although this is the most useful approach in the long term, it would only be a solution for future taxonomic revisions. The second mechanism is for taxonomic experts to define the relationships between taxa explicitly [8,9,10,14]. This is limited in that few other relationships can ever be inferred and little automation can be provided. Both approaches require work by expert taxonomists, and even if this work were completed, existing systems provide insufficient support to take advantage of it.
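The distinction between a bare name and an attributed concept can be illustrated with a minimal sketch. It assumes the three fields named above (full name, author, publication) are enough to identify a concept. The class, function names and the placeholder taxon "Aus bus" with its two publications are hypothetical and are not taken from any of the cited models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    """A name as used in one publication (hypothetical minimal record)."""
    name: str         # full taxonomic name
    author: str       # author of the publication defining the concept
    publication: str  # citation of the work in which it was described


def same_name(a: TaxonConcept, b: TaxonConcept) -> bool:
    # Label-only comparison: what an unattributed name allows.
    return a.name == b.name


def same_concept(a: TaxonConcept, b: TaxonConcept) -> bool:
    # Attributed comparison: name, author and publication must all agree.
    return a == b


# Placeholder data: one name, two (hypothetical) revisions.
smith = TaxonConcept("Aus bus", "Smith", "Smith 1900 (hypothetical)")
jones = TaxonConcept("Aus bus", "Jones", "Jones 1950 (hypothetical)")

print(same_name(smith, jones))     # the labels match
print(same_concept(smith, jones))  # the concepts are not known to match
```

Equality of the attributed records only establishes that two references point to the same published concept. It says nothing about how two different concepts relate, which is the harder problem discussed above.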

In order to model the reality of taxonomy and nomenclature, database management systems must provide support for storing and manipulating the structures and properties of this type of data [11,12,13]. Classifications are hierarchies; however, when all revisions of the classifications of groups of organisms are considered, we have in effect a graph of overlapping hierarchies. There are many database research challenges in supporting taxonomy, but perhaps the major challenge is:

Modelling and manipulating large, distributed hierarchies and graphs of complex objects.

Current database systems provide limited facilities for modelling graphs, although many research database systems provide some of the functionality required. However, none to our knowledge provides all of the functionality required [12], nor are they in widespread use, nor do they provide the support expected for multi-user environments with large-scale data requirements.

· Most graph databases (or support for graphs in other databases) treat nodes simply as labels. We need to be able to store objects (e.g. specimens) and use them in one or more graphs (e.g. classification hierarchies, type hierarchies, placement hierarchies). The objects (specimens) must therefore be independent of the graphs in which they occur, and the graphs must be able to support complex objects as opposed to labels. We therefore need database modelling concepts that allow us to describe objects and relationships, from which we can then compose hierarchies and graph structures.

· Pattern matching is a common querying mechanism in graph databases, but patterns are typically simple paths in a graph. We require not only simple pattern matching but also patterns that match on the attributes of the nodes and edges in the paths of the graph.

· The levels in classification hierarchies are called ranks. Not every classification makes use of all possible ranks, although those that are used must appear in the given order. We need to be able to query by rank (level) in the graph, where rank is not semantically equal to depth: from a given node at a particular rank in one classification, a node at a depth of 2 below it will not necessarily have the same biological rank as a node at a depth of 2 below the corresponding node in another classification. Additionally, to ensure the semantic integrity of the database, we need to be able to specify constraints on the graph, e.g. that nodes of a particular rank can only exist below nodes of a higher rank in the hierarchy.

· Taxonomies are directed; in queries we therefore need to be able to traverse the graph or tree in a specified direction.

· The result of querying a graph could be a node or a sub-graph. If sub-graphs are returned, the structure of the graph must be maintained.

· Having stored our classifications and being able to query them, we also need to be able to compare taxa or concepts. As discussed above, this could be done in two ways: by capturing a definition of the concept in terms of, for example, its circumscription (the members or child nodes of a given node), or by creating another edge between nodes that explicitly specifies the relationship between two taxa in different classifications (e.g. subset of). The two mechanisms have different graph query processing requirements.
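Several of the requirements listed above can be made concrete with a small in-memory sketch. It is not the prototype described in this paper. It assumes a simplified three-rank order, hypothetical taxon names and specimen identifiers, and a comparison based purely on circumscription by specimens. It shows specimens as objects shared between two classifications, a rank constraint on edges, a query by rank rather than depth, and a circumscription-based comparison.

```python
from dataclasses import dataclass, field

# Assumption: a drastically simplified rank order, highest rank first.
RANK_ORDER = ["family", "genus", "species"]


@dataclass(frozen=True)
class Specimen:
    """An object that exists independently of any classification."""
    id: str


@dataclass
class Taxon:
    name: str
    rank: str
    children: list = field(default_factory=list)
    specimens: list = field(default_factory=list)


def add_child(parent: Taxon, child: Taxon) -> None:
    # Integrity constraint: a child must have a strictly lower rank.
    if RANK_ORDER.index(child.rank) <= RANK_ORDER.index(parent.rank):
        raise ValueError(f"{child.rank} cannot be placed below {parent.rank}")
    parent.children.append(child)


def descendants_at_rank(taxon: Taxon, rank: str) -> list:
    # Query by rank, following parent-to-child edges only; depth is ignored.
    found = []
    for child in taxon.children:
        if child.rank == rank:
            found.append(child)
        found.extend(descendants_at_rank(child, rank))
    return found


def circumscription(taxon: Taxon) -> set:
    # All specimens placed in the taxon or in any taxon below it.
    members = set(taxon.specimens)
    for child in taxon.children:
        members |= circumscription(child)
    return members


def compare(a: Taxon, b: Taxon) -> str:
    ca, cb = circumscription(a), circumscription(b)
    if ca == cb:
        return "congruent"
    if ca < cb:
        return "subset of"
    if ca > cb:
        return "superset of"
    return "overlaps" if ca & cb else "disjoint"


s1, s2, s3 = Specimen("s1"), Specimen("s2"), Specimen("s3")

# Classification 1 uses the genus rank: family -> genus -> species.
fam1, gen1 = Taxon("Family1", "family"), Taxon("Genus1", "genus")
sp_a = Taxon("SpeciesA", "species", specimens=[s1, s2])
sp_b = Taxon("SpeciesB", "species", specimens=[s3])
add_child(fam1, gen1)
add_child(gen1, sp_a)
add_child(gen1, sp_b)

# Classification 2 skips the genus rank and lumps the same specimens.
fam2 = Taxon("Family2", "family")
sp_c = Taxon("SpeciesC", "species", specimens=[s1, s2, s3])
add_child(fam2, sp_c)

print([t.name for t in descendants_at_rank(fam1, "species")])  # depth 2
print([t.name for t in descendants_at_rank(fam2, "species")])  # depth 1
print(compare(sp_a, sp_c), "/", compare(fam1, fam2))
```

The alternative mechanism, expert-asserted relationship edges between taxa in different classifications, would replace compare with a lookup over stored edges. This sketch makes no attempt at the distribution, scale or multi-user support discussed above.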

We have built a prototype to support the functionality we require for taxonomic systems, but it does not scale to large systems, nor has it been implemented on a platform with a sufficiently wide user base to encourage adoption of the approach. Providing this sort of functionality and support in commercial systems is a major challenge in database research.

The development of taxonomy is a specialised field, and the process is typically limited to small groups of organisms. For pragmatic reasons, therefore, there would need to be many autonomous taxonomic databases, each resolving part of the overall taxonomic graph, with an integrating database server providing a portal for all taxonomic names and for synonym resolution. Any other biological database could then consult the taxonomy server for appropriate name usage and for resolution of possible synonymy or homonymy, with some indication of similarity that could be used to guide the integration of data within and between databases. Developing such a support mechanism is another major challenge.
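As a rough illustration of the portal idea, the sketch below keeps a single in-memory index over concepts contributed by several source databases. All names in it (TaxonomyPortal, register, assert_synonym, resolve, the source and concept identifiers) are hypothetical, and a real server would have to federate autonomous databases rather than hold one local dictionary.

```python
from collections import defaultdict


class TaxonomyPortal:
    """Hypothetical integrating index over autonomous taxonomic sources."""

    def __init__(self):
        self._by_name = defaultdict(list)  # name -> [(source, concept_id)]
        self._synonyms = defaultdict(set)  # concept_id -> related concept_ids

    def register(self, source: str, concept_id: str, name: str) -> None:
        # A source database contributes one attributed concept for a name.
        self._by_name[name].append((source, concept_id))

    def assert_synonym(self, a: str, b: str) -> None:
        # Expert-asserted synonymy between two concepts, stored both ways.
        self._synonyms[a].add(b)
        self._synonyms[b].add(a)

    def resolve(self, name: str) -> dict:
        hits = list(self._by_name.get(name, []))
        related = set()
        for _, concept_id in hits:
            related |= self._synonyms.get(concept_id, set())
        return {
            "concepts": hits,
            # More than one concept for the name: the caller must choose.
            "ambiguous": len(hits) > 1,
            "synonym_concepts": sorted(related),
        }


portal = TaxonomyPortal()
portal.register("db_a", "a:1", "Aus bus")
portal.register("db_b", "b:7", "Aus bus")
portal.register("db_b", "b:9", "Cus dus")
portal.assert_synonym("a:1", "b:9")

result = portal.resolve("Aus bus")
print(result["ambiguous"], result["synonym_concepts"])
```

A client database would call resolve before integrating records filed under a name and treat an ambiguous result as a signal that the records may not refer to the same taxon. The indication of similarity mentioned above is not modelled here.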

References

1. Stoesser G., Baker W., van den Broek A., Camon E., Garcia-Pastor M., Kanz C., Kulikova T., Lombard V., Lopez R., Parkinson H., Redaschi N., Sterk P., Stoehr P., Tuli MA; The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 29:17-21(2001)

2. Barker, W.C., Garavelli, J.S., Haft, D.H., Hunt, L.T., Marzec, C.R., Orcutt, B.C., Srinivasarao, G.Y., Yeh, L.L., Ledley, R.S., Mewes, H., Pfeiffer, G., Tsugita, A.; The PIR-International Protein Sequence Database. Nucleic Acids Res. 26:27-32 (1998)

3. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M.; KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27:29-34 (1999)

4. Brazma, A. et al. ArrayExpress — a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. (in press)

5. Yoon, N and Rose, J, An Information Model for the Representation of Multiple Biological Classifications. In: Alexandrov VN, Dongarra JJ, Juliano BA, Renner RS, Tan CJK, editors. Computational Science - ICCS 2001: International Conference, San Francisco, CA, USA, May 2001 Proceedings, Part 1. New York:Springer. pp 937-946.

6. Nozomi Ytow, David R. Morse and David Mcl. Roberts, Nomencurator: a nomenclatural history model to handle multiple taxonomic views, Biological Journal of the Linnean Society, Volume 73, Issue 1, May 2001, Pages 81-98.

7. Pullan MR, Watson MF, Kennedy JB, Raguenaud C, Hyam R. (2000) The Prometheus Taxonomic Model: a practical approach to representing multiple classifications. Taxon 49: 55-75.

8. VegBank. http://www.bio.unc.edu/faculty/peet/lab/PEL/vegbank/vegbranchhelp/VegBankInfo.htm

9. Moretax: http://www.bgbm.org/BioDivInf/Projects/MoreTax/standard_liste_en.htm

10. Universal Biological Indexer and Organizer: http://www.ubio.org/

11. Raguenaud, C, Kennedy, J and Barclay P, The Prometheus Taxonomic Database, Bio-Informatics and Biomedical Engineering, 2000, Arlington, Virginia, USA, 08/11/2000-10/11/2000, pp63-70, IEEE Computer Society Press, ISBN-0-7695-0862-6, 2000

12. Raguenaud, C. Managing Complex Data in an Object-Oriented Database, PhD Thesis, Napier University, January 2002

13. Raguenaud, C., Pullan, M., Watson, M., Kennedy, J., Newman, M., Barclay, P. Implementation of the Prometheus Taxonomic Model: a comparison of database systems. Taxon 51(1), pp 131-142, May 2002.

14. J. H. Beach, S. Pramanik, J. H. Beaman, "Hierarchic taxonomic databases". Ch. 15 in Fortuner, R., ed. Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision. Johns Hopkins Univ. Press, Baltimore. pp. 241-256 (1993)