Challenges in Data Management for Biology and Medicine
Russ B. Altman
Stanford University
http://www.smi.stanford.edu/people/altman/

THE COMPLEXITIES OF BIOLOGICAL DATA

The problems we face with representing and manipulating biological knowledge are not unique to biology, but biology is particularly difficult because it has a confluence of many technical challenges that often appear isolated in other domains. One key point is that we are rarely talking about something being impossible in a relational model, but instead are often referring to its relative difficulty (relative to ad hoc solutions, object-oriented solutions, text-based solutions, and others). The key features of biological data that make them difficult to represent and manipulate efficiently include:

* DUAL NATURE OF SEQUENCE INFORMATION--as a block and as individuals.
We care about sequences as entities (genes, exons, transcripts, etc.), but we also care about individual elements of sequences, such as particular bases (for SNPs) or amino acids (for protein mutations and functional analyses). It is very difficult to find efficient representations that meet this dual requirement of accessing entire blocks while also being able to "explode" them into their elements. Furthermore, we parse the blocks in several ways, such as into pre-splice and post-splice mRNA, and then need to talk about subsegments.

* IMPORTANCE OF HIERARCHICAL DATA REPRESENTATIONS.
Biological structure is organized hierarchically from atoms up to organisms (and even ecosystems). There is therefore both an elaborate "part-of" reality to biology that is constantly re-invented, as well as a fundamental "is-a" reality in terms of classification (especially with regard to genes and species). Rapid reasoning up and down these two intermingled trees is fundamental to many biological analyses, and yet is sometimes cumbersome to support in existing systems.

* GREAT DATA TYPE VARIETY, ESPECIALLY FOR PHENOTYPE DATA.
As we enter the post-genomic period, it becomes clear that a major challenge is relating genotype data to phenotype data. With few exceptions, phenotype data is collected in non-standard ways and is represented in databases in an ad hoc manner. This great heterogeneity is a major challenge to the post-genomic computational analysis of data, because it is much more difficult to aggregate data from different groups, and because there is an infinity of experimental procedures that can be used to collect related, but non-identical, data. The great heterogeneity of techniques for collecting data is a relative strength of biology, because models can be tested in a virtual infinity of ways, but it is a major representational challenge to the computational community, because we cannot easily identify representational schemes that both capture the information in sufficient detail and also allow aggregation and analysis with general-purpose algorithms.

* DISTRIBUTED NATURE OF BIOLOGICAL DATA.
It is generally agreed that the best databases in biology are those that are carefully curated by specialists with domain expertise. A consequence is that most databases are very limited in their scope, with high-quality primary data and a drop-off in the quality of ancillary data as you move away from the primary data. This is often addressed with a set of URL links between resources, which is fine for human biologists, but less useful for computational engines. Thus, the assumption of relatively narrow data collections in a distributed environment needs to be supported in a more robust manner.

* TENSION BETWEEN HUMAN AND COMPUTER CLIENTS.
There is an unfortunate tension between computational representations and formats that are human-usable and those that are set up well for computational analysis. Data locality is very important for human processing (e.g., "what's all the available information about gene X?") but often conflicts with computational goals such as normal form and efficient query.
In addition, complex computational representations that enable powerful algorithms often require "views" of the data that are not intuitive to biologists. Sometimes these views can be deconvoluted in order to present information to human users, but this is typically a difficult and expensive activity. The dual requirement of "easy to understand" and "powerful for computation" is difficult to meet with current systems, and should only become more difficult as our understanding of biological complexity increases. This is simply not material that can be easily modeled as seven ellipsoids interacting with one another qualitatively.

These issues lead to three observations about current biological information systems.

* LOTS OF REDUNDANT EFFORT SPENT CREATING MIDDLEWARE
Almost every project I am aware of devotes a huge effort to creating "biology-friendly" middleware on top of either a relational database (recently) or an elaborate file system of text files (in the past). This should be of concern to funding agencies, since precious research dollars are being used in many cases to invent the same kinds of interfaces. Of course, each is slightly different in order to accommodate either the biological situation or the biases and interests of the architects.

* POOR WEB PERFORMANCE WITHOUT EXTRA ENGINEERING
Another distressing trend is the effort that must go into optimizing web resources in order to support real-time queries on the web or by computational engines. Many investigators are surprised to learn that relatively reasonable relational database implementations of information resources can be difficult to tune for real-time query performance. This poor performance is really a result of all the complicating issues mentioned above. However, it also leads to substantial amounts of time being spent worrying about how to deliver information that is perfectly well represented and available, but just takes too long to retrieve with current DB + middle layer models.
This leads to strategies for compiling content down for performance, which can make maintenance more difficult.

* INADEQUATE QUERY AND USER INTERFACE
In tension with performance and the need for expressive power is the need to reduce these data into forms that can be queried and understood by users. The query model for any information system will reflect its underlying organization (as well as the organization of any middle layer). The user interaction can therefore be as complicated and intricate as the underlying information system, and this may be a major barrier to acceptance. Thus, a major challenge in building information systems is the creation of robust interfaces that give access to the power and richness of the data, but which can also be used by opinion-leader biologists who are impatient to "just get the answer." There are currently few general-purpose tools for creating such environments, and thus much effort and funding is spent on special-purpose, domain-specific interfaces that are expensive and single-use.
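
As a small illustration of the "dual nature of sequence information" discussed above, the following Python sketch treats one sequence both as a set of named blocks and as individually addressable residues, and derives a post-splice product from its subsegments. It is only a sketch under stated assumptions: the names Segment and AnnotatedSequence, the half-open coordinate convention, and the toy sequence are all hypothetical and are not taken from any existing system. It also says nothing about efficiency at scale, which is where the real difficulty lies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical named subsegment: half-open interval [start, end)."""
    name: str
    kind: str   # e.g. "exon" or "intron" (illustrative labels only)
    start: int
    end: int


@dataclass
class AnnotatedSequence:
    """Hypothetical container: one residue string plus its annotated blocks."""
    residues: str
    segments: List[Segment] = field(default_factory=list)

    def block(self, segment: Segment) -> str:
        """Block view: return a whole subsegment as one entity."""
        return self.residues[segment.start:segment.end]

    def residue_at(self, position: int) -> str:
        """Element view: return a single base or amino acid (0-based)."""
        return self.residues[position]

    def segments_covering(self, position: int) -> List[Segment]:
        """Map an individual position back to the blocks that contain it."""
        return [s for s in self.segments if s.start <= position < s.end]

    def spliced(self) -> str:
        """Post-splice view: concatenate exon blocks in sequence order."""
        exons = sorted(
            (s for s in self.segments if s.kind == "exon"),
            key=lambda s: s.start,
        )
        return "".join(self.block(s) for s in exons)


# Toy pre-splice sequence with two exons and one intron (made-up data).
gene = AnnotatedSequence(
    residues="ATGGTAAGTCCATGA",
    segments=[
        Segment("exon1", "exon", 0, 3),
        Segment("intron1", "intron", 3, 9),
        Segment("exon2", "exon", 9, 15),
    ],
)

print(gene.block(gene.segments[0]))                   # whole block
print(gene.residue_at(4))                             # single element
print([s.name for s in gene.segments_covering(4)])    # element -> blocks
print(gene.spliced())                                 # derived subsegments
```

Even in this toy form, every question about an individual position requires a scan over the annotated segments. An interval index would help, at the cost of more machinery, which is the kind of trade-off described above.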