Data Integration: The Ocean Biogeographic Information System
Edward Vanden Berghe1, Karen I. Stocks2, J. Frederick Grassle1
1Institute of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
2San Diego Supercomputer Center, University of California San Diego, La Jolla, California, USA
Informed management of the environment has to be supported by data (Richardson & Poloczanska 2008; Stokstad 2008). Often marine biological data are the result of projects with a limited taxonomic, temporal, and spatial cover. Taken in isolation, datasets resulting from these projects are only of limited use in the interpretation of large-scale phenomena. More specifically, they fail to inform on a scale commensurate with the problems humankind is confronted with: pollution, global change, invasive species, harmful algal blooms, and the loss of biodiversity to name but a few. Individual studies are restricted in the amount of data they can generate but, by combining the results from many studies, massive databases can be created, making possible analyses on a more relevant, much larger scale. It is the ambition of the Ocean Biogeographic Information System (OBIS; www.iobis.org; see Box 17.1) community to provide a sound basis for management decisions by integrating data from many sources, and thus facilitating badly needed regional, ecosystem, and global analyses. OBIS does so by facilitating publication of data, and stimulating open and free access for all potential users. Indeed, OBIS is often mentioned as the organization best suited for this role (see, for example, Poloczanska et al. 2008).
Box 17.1 OBIS “Biography”
The Ocean Biogeographic Information System (OBIS) is an online, user-friendly system for absorbing, integrating, and assessing data about life in the oceans. It is recognized by many as the prime provider of information on the distribution of marine species. OBIS aims to stimulate new research that generates new hypotheses about evolutionary processes and species distributions by providing software tools for data exploration and analysis. All data are freely available over the Internet and interoperable with similar databases. OBIS integrates data from many sources, over a wide range of marine themes, from poles to equator, from microbes to whales. It is the largest provider of information on the distribution of marine species, and one of the largest contributors to the Global Biodiversity Information Facility (GBIF). Any organization, consortium, project, or individual may contribute. OBIS was created as the data integration component of the Census of Marine Life; the international portal is hosted by Rutgers University, New Jersey, USA. A global network of 15 Regional and Thematic OBIS Nodes assures the worldwide scientific support needed to fulfill the global mandate.
OBIS was conceived as the data integration component of the Census of Marine Life (Box 17.1). It is very much a “work in progress”: we know that many important datasets are not available through OBIS. However, we do think that the present content is sufficient to start exploring global patterns of biodiversity, taking into account a wide range of life forms; this exercise was not possible before OBIS brought the relevant data together into one consolidated, quality-controlled system.
In the first part of this chapter, we discuss some of the issues we encountered while working on OBIS. In the second part, development of OBIS, in terms of both technology and content, is discussed. In the third part, some of the possible analyses are illustrated, and the content of the database is explored.
17.2. List of Acronyms
17.3. The Data Sharing Challenge
The willingness to share data is a prerequisite for data portals. The advantages of sharing data are clear and numerous, and have prompted many organizations, including the International Council for Science (ICSU) and the Intergovernmental Oceanographic Commission (IOC), to adopt a policy of open access to data. The physical oceanographers have set an example with the World Ocean Database (WOD) and derived products such as the World Ocean Atlas (WOA), published by the US National Oceanic and Atmospheric Administration (NOAA) (Boyer et al. 2006). Much of our understanding of global patterns is based on these global databases (see, for example, Levitus 1996; Conkright & Levitus 1996). The advantages might be clear, but practice often lags behind. This led the participants at the Ocean Biodiversity Informatics (OBI) conference in Hamburg, 2004, to formulate a public statement summarizing the benefits (Box 17.2) (Vanden Berghe et al. 2007a).
- is good scientific practice and necessary for advancement of science
- enables greater understanding through more data being available from different places and times
- improves quality control due to better data organization, and discovery of errors during analysis
- secures data from loss
- overall cost/benefit
- importance to science
- long-term benefits to society and the environment
- increased value by being publicly available
We also call upon employers of scientists, academic institutions and funding agencies and editors of scientific journals, to:
- promote on-line availability of data used in published papers
- promote comprehensive documentation of data, including metadata and information on the quality of the data
- reward on-line publication of peer reviewed electronic publications and on-line databases in the same way conventional paper publications are rewarded in the hiring and promotion of scientists
- encourage and support scientists to share currently unavailable data by placing it in the public domain in accordance with publicly available standards, or in formats compatible with other users
Here are a few of the benefits of data sharing.
- Sharing data is a way to avoid data loss related to institutional discontinuities or poor archiving (Froese et al. 2003); the very fact of sharing data creates redundancy, and this will assist in recovery of data after accidental destruction of a dataset.
- Sharing data makes the data more visible, and so increases the opportunities to create collaborative ventures with scientists outside the immediate environment.
- It facilitates re-use of the data for purposes that they were not originally collected for; every time a datum is used in some analysis or consulted through a website, society's return on investment in collecting the data increases.
- Not all countries are fortunate enough to have the expertise and/or the resources to set up data management systems of their own; data sharing ventures can be the framework for data repatriation to developing countries, and assist them in fulfilling their reporting obligations in the framework of international conventions such as the Convention on Biological Diversity.
- Last but not least, by sharing data it becomes possible to create the large data systems we need to support proper management of our natural resources.
Any initiative relying on the willingness to share data has to take into account the sociology of science: data owners will have to see clearly the advantages of sharing data, and will need incentives to do so. Scientists have to be compensated for the time that they spend making the data available for re-use, and for the loss of exclusive access to the data, and the competitive advantage associated with this. An obvious example of such an incentive is when data are shared between several data providers, with the intent to analyze the pooled dataset and to publish the results jointly. Examples include the North Sea Benthos Project of the International Council for the Exploration of the Sea (ICES) (Rees et al. 2007; Vanden Berghe et al. 2007b); MacroBen (Somerfield et al. 2009; Vanden Berghe et al. 2009); and other initiatives of the European Union (EU) Network of Excellence “Marine Biodiversity and Ecosystem Functioning” (MarBEF). The incentive is, in this case, clearly the opportunity to analyze a larger dataset than the one available from a single data provider, and to become a co-author on the resulting papers.
However, the model of co-authorship as incentive for data sharing does not scale: it is not tenable with large databases such as OBIS or the WOD/WOA. There are too many individual data contributors, so papers based on the complete dataset would have to list thousands of authors. Also, even if the number of data contributors were more reasonable, it does not always make sense for contributors to become co-authors; in principle, anyone listed as an author on a paper should have made a direct intellectual contribution to the paper, and share responsibility for the conclusions. A recent trend to include too many colleagues as co-authors is putting pressure on science's credit system (Greene 2007; Sekercioglu 2008). In many cases, citation of the source of the data would be more appropriate. However, this needs a formal system of indexing, just as the citations of “classical” publications are indexed by the Institute for Scientific Information (ISI). And, of course, use or re-use of a dataset should contribute to the career advancement of any person involved in the collection or management of the data. Several initiatives have started to address data citation. There is a working group of the Global Biodiversity Information Facility (GBIF) discussing this issue, organized in response to a discussion at the e-Biosphere conference; another working group, jointly organized by the Scientific Committee on Oceanic Research (SCOR) and the International Oceanographic Data and Information Exchange (IODE), recently published a first report (SCOR & IODE 2008).
When trying to persuade someone to do something, one has the choice of using a carrot or a stick. Data citation and co-authorship are clear examples of the former. However, the stick can also be used creatively and fairly, with everyone having to comply with the same rules. The prime example of appropriate use of the “stick” is the requirement by several major scientific journals to publish gene sequences in GenBank or a similar public and openly accessible repository before the paper is published. The information itself is shared and made public through GenBank, and the papers cite the accession number. At the same time, the GenBank information becomes citable through this accession number, so that it works to the advantage of the scientist depositing the sequence information. It is an excellent example, and a possible model for the biogeographic community. Many journals now have a policy of asking authors to make their data available after publication (see, for example, Science; www.sciencemag.org/about/authors/prep/gen_info.dtl#dataavail). However, it seems that these requests are not enforced, and that the GenBank strategy of asking for inclusion of the accession number in the paper is a better guarantee that data will be made public.
Data are often collected using public funding, so many feel that for this reason alone they should be publicly available; sometimes there is a contractual obligation to make data available after publication of results. Funding agencies finance research to further our understanding of the environment; withholding raw data hampers the process by which the results of the funded activities can be used, thus clearly contravening the original intention of the support (Dittert et al. 2001). One of the roles of a data portal such as OBIS is to offer a service assisting beneficiaries of public funding in fulfilling their contractual obligations.
Too many datasets are lying dormant, some of them on hard drives, often in difficult-to-access electronic formats; others are only available on paper. The physical oceanographers have set an example with the Global Oceanographic Data Archaeology and Rescue (GODAR) project, through which many datasets, at risk of being lost, were recovered and integrated into the WOD. The cost of “recovering” data is typically only a fraction of the cost of collecting the samples and generating the data. In the case of a Guinean trawling survey, the data recovery cost 0.2% of the initial survey cost (Zeller et al. 2005). More important even than these economic arguments is the historic aspect of environmental data: they are irreplaceable, and once lost they cannot be collected again.
Metadata, data about the data, are essential when sharing data. They make it possible for users to judge fitness-for-use (Chapman 2005), so that the data are not inadvertently used for purposes for which they are not suited; part of this fitness-for-use statement is a description of quality control and quality assurance methods applied to the data. Metadata facilitate data discovery through their inclusion in metadata repositories such as the Global Change Master Directory (GCMD; gcmd.gsfc.nasa.gov) of the National Aeronautics and Space Administration (NASA). They are essential in creating an audit trail, so that any datum can be traced to its origin. Part of the audit trail is a list of all those involved in collecting, managing, and controlling the quality of the data, which makes it possible to give appropriate credit.
Making data publicly available is a critical step, but it is only the first. To be available for large-scale analysis, data have to be integrated and their quality controlled. Creating these integrated databases is a second step up the ladder from raw data through information and knowledge to wisdom (Fig. 17.1). Data integration requires knowledge about the data being handled, and often is a time-consuming business; it is important to avoid duplication of effort, and to preserve any efforts expended. Without mechanisms to preserve these efforts, any large-scale analysis would have to redo this step of data integration.
Figure 17.1 The Wisdom Pyramid. Reproduced with permission, from a presentation by C. Besancon, UNEP-WCMC.
An important aspect of the integration of individual datasets is to check for consistency between them, and where inconsistencies are found, to resolve them. Obvious examples here are the spelling of taxonomic names, or detection of outlier distribution points caused by misidentification or errors of georeferencing. This reconciliation process is an extra opportunity for quality control, in addition to what is possible at the level of single datasets; conflicts between datasets are flags for potential problems. Data warehouses such as OBIS can add value by resolving these inconsistencies in consultation with specialists and end users, and with the original data providers. Quality-control procedures have to be documented, so that end users can judge whether data are reliable enough for their purposes.
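The reconciliation of taxonomic name spellings described above can be sketched in a few lines. This is a minimal illustration and not the procedure used by OBIS: the reference list, the `reconcile_name` function, and the 0.85 similarity cutoff are all assumptions made for the example, and the fuzzy matching relies on Python's standard `difflib` module. Near-matches are flagged rather than silently corrected, so that a specialist or the original data provider can confirm them.

```python
import difflib

# Hypothetical reference list; a real system would draw on a full
# taxonomic register rather than three hard-coded names.
REFERENCE_NAMES = ("Gadus morhua", "Clupea harengus", "Mytilus edulis")


def reconcile_name(name, reference=REFERENCE_NAMES, cutoff=0.85):
    """Match a reported taxon name against the reference list.

    Returns (matched_name, status), where status is "exact", "fuzzy"
    (probable misspelling, to be reviewed), or "unmatched".
    """
    # Collapse stray whitespace before comparing.
    cleaned = " ".join(name.split())
    if cleaned in reference:
        return cleaned, "exact"
    # Closest reference name whose similarity ratio reaches the cutoff.
    close = difflib.get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"
    return None, "unmatched"


for reported in ("Gadus morhua", "Gadus morhau", "Homo sapiens"):
    print(reported, "->", reconcile_name(reported))
```

Names returned with the status "fuzzy" or "unmatched" are the conflicts that, in the terms used above, act as flags for potential problems.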
Neither data managers nor data users should be fooled into thinking that there is such a thing as a database without errors. No matter how much time goes into quality control, there always will be a certain error rate. It is by using the data, sharing them with others for their own analyses, and critically looking at the results that erroneous data can be detected. It is important for any data system to capture this information by providing mechanisms for user feedback, and by promptly acting on such feedback. In those cases where there are several levels of aggregation (as is the case for many of the OBIS datasets), this can lead to complications: errors detected at a higher level of aggregation (for example, at the level of OBIS or GBIF) have to be communicated to and corrected by the original data provider. Obviously, at any step in this communication things can go wrong, with delays in correcting obvious mistakes, and frustrated end users as a result.
Data integration comes at a price: it is rarely possible to integrate data over many sources without losing detail. Information on sampling devices or sampling effort is difficult to standardize across many datasets. Temporary taxonomic names make sense within one study but not across several studies (Paterson et al. 2000). The opportunistic exploitation of available resources will usually result in very unequal sampling in the area of interest, because the sampling effort is governed by external factors that are not under the control of the data manager. Any analysis based on such data collections has to deal with heavy observational bias. However, these drawbacks should be weighed against the larger footprint of the data, and hence stronger signals. For example, combining several datasets to create a consolidated dataset with a much larger latitudinal range will extend the range over which any latitudinal gradient is expressed, and make this gradient easier to detect. Also, the increased number of observations will result in an increase in statistical power of any analysis done on the combined dataset.
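The gain from a wider latitudinal range and a larger number of observations can be made concrete with a small calculation. The sketch below assumes a simple linear regression of some response on latitude, with independent errors of equal variance; under those assumptions the standard error of the fitted slope is the residual standard deviation divided by the square root of the summed squared deviations of latitude from its mean. The station latitudes and the residual standard deviation are hypothetical, and the calculation ignores the observational bias discussed above.

```python
import math


def slope_standard_error(latitudes, residual_sd):
    """Standard error of an ordinary least-squares slope against latitude.

    Assumes independent errors with a common standard deviation.
    """
    mean_lat = sum(latitudes) / len(latitudes)
    sum_sq_dev = sum((lat - mean_lat) ** 2 for lat in latitudes)
    return residual_sd / math.sqrt(sum_sq_dev)


# Hypothetical sampling designs (degrees of latitude).
single_study = [50, 52, 54, 56, 58]        # one regional survey
pooled_studies = list(range(10, 71, 5))    # several surveys, 10 to 70 degrees

se_single = slope_standard_error(single_study, residual_sd=1.0)
se_pooled = slope_standard_error(pooled_studies, residual_sd=1.0)
print(f"single study: {se_single:.4f}, pooled: {se_pooled:.4f}")
```

With these illustrative numbers the slope is estimated roughly ten times more precisely from the pooled design, because both the spread of latitudes and the number of stations have increased.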
17.4. Development of OBIS
OBIS was created as the data integration component of the Census of Marine Life (Grassle & Stocks 1999; Grassle 2000; Yarincik & O'Dor 2005). From the start it was conceived as a global and distributed system, giving control of data to data providers (Fornwall 2000), with strong ties to existing national and international biodiversity information systems (Fornwall 2000; Grassle 2005). Today, OBIS has evolved into a community of practice, consisting of people and organizations sharing a vision to make marine biogeographic data, from all over the world, freely available over the World Wide Web. OBIS is not limited to data from Census-related projects; any organization, consortium, project, or individual may contribute to OBIS.
From the OBIS portal (the first website page connecting to the data), the user can do the following:
- search where a marine genus and/or species is recorded in the data published through OBIS;
- download data published in OBIS for any species, including location, depth, date and time collected, source datasets, and verified taxonomic name information;
- plot species locations on a range of flat and spherical views of the world, including polar views, using the C-Squares Mapper;
- plot species against background maps of sea temperature, depth, and salinity using the KGS Mapper;
- use environmental data for the locations of these data to predict the species' potential range on the KGS Mapper;
- explore relationships between species and environmental data on the KGS Mapper to see which parameters best explain a species' distribution;
- browse down a taxonomic hierarchy to get lists of all species in OBIS for a phylum, class, order, or other higher taxonomic group;
- plot maps of all data at a higher taxonomic level;
- search for lists of species recorded in OBIS by country (exclusive economic zone), sea or ocean, large marine ecosystems (LMEs), Food and Agriculture Organization (FAO) and ICES fishery areas, Longhurst's pelagic regions, depth, date, and by entering latitude–longitude coordinates;
- connect to other sources of information on the species, including genetic data, published literature, and images.
A workshop critical to the genesis of OBIS was held at the Rutgers University Institute of Marine and Coastal Sciences, New Jersey, in October 1997. The framework of the workshop was essentially that different groups were asked which project, to be completed on a scale of five to seven years, would most advance science. The strong consensus of the participants, consisting mainly of benthic ecologists, taxonomists, and statisticians, was to bring together and make publicly available the data that already existed, taking stock of what was known, rather than to launch new sampling campaigns. From this OBIS was defined as “An on-line world-wide marine atlas ‘infrastructure’ providing scientists with the capability of operating in a four-dimensional environment so that analyses, modelling and mapping can be accomplished in response to user demand through accessing and providing relevant data.” The key characteristics of the then to-be-developed system were interoperability through common definition of metadata standards and protocols for a distributed, multi-tiered architecture. A website was built to demonstrate the OBIS concept (Stocks et al. 2000); this website is being preserved as a reference document, and can still be visited at www.marine.rutgers.edu/OBIS. The first OBIS workshop was held in Washington, DC, in November 1999.
Early growth of OBIS was initiated through the announcement, in May 2000, of eight grants by the US Government Agencies in the National Oceanographic Partnership Program (NOPP) together with the Alfred P. Sloan Foundation. These grants involved researchers in more than 60 institutes in 15 countries, and addressed infrastructural issues as well as taxon-based projects of data acquisition (Grassle 2000; Decker 2001; Zhang & Grassle 2003). A ninth, National Science Foundation (NSF)-funded project (SeamountsOnline; Stocks 2009) was added soon afterward, and the nine projects formed the core of the early OBIS (Table 17.1). In 2001, an NSF project was awarded to Rutgers University to create an international portal; by February 2002, all NOPP-funded data projects and the NSF-funded SeamountsOnline were made interoperable through the OBIS portal (Zhang & Grassle 2003). At that point, the portal provided access to over 400,000 occurrence records.
Institutionally, OBIS is growing rapidly as a distributed system with an international secretariat and portal (iOBIS) hosted by the Institute of Marine and Coastal Sciences of Rutgers University, and Regional OBIS Nodes (RONs) on all continents (Fig. 17.2 and Table 17.2). RONs were created to serve national or regional needs better and to achieve global coverage. The RON network is still expanding: several RONs were added in 2006 and 2007 (China, Korea, Philippines) and discussions are continuing to create new ones (Arctic, Oman, and possibly Mexico). The RON network has been very active and very successful in connecting datasets. Each RON is self-sustaining and is the geographical backbone for further development of OBIS data content. The institutes hosting the RONs are an asset for OBIS as a network and have proven to be very supportive of OBIS activities and objectives.
Figure 17.2 Locations of Regional OBIS Nodes (yellow squares), international secretariat (red circle), and proposed mirror sites (orange circles).
In addition to the Regional Nodes, OBIS has thematic nodes for major subsets of marine life. OBIS Spatial Ecological Analysis of Megavertebrate Populations (OBIS SEAMAP), the repository for data on marine birds, turtles, and mammals, is developing new ways to visualize migrations of these animals and to understand their habitats (Halpin et al. 2006, 2009). The Biogeoinformatics of Hexacorals website maintains an authoritative, global anemone and coral database (Fautin 2000). FishBase contains comprehensive information on finfishes (Froese & Pauly 2009). The OBIS microorganisms component (MICROBIS) is breaking completely new ground by defining the known world of microorganisms using new molecular approaches to define microbial taxa. The Continuous Plankton Recorder (CPR), managed by the Sir Alister Hardy Foundation for Ocean Science (SAHFOS), provides a unique and very large dataset. One of the strengths of the CPR data is that they have been collected in a standard way for more than half a century (see, for example, Reid et al. 1998; Beaugrand et al. 2004).
Data generated by the field projects of the Census ultimately will all be available through the OBIS website. This is essential if OBIS is to play its role in integrating Census data, and support the Census Synthesis. All field projects are producing high-quality data. However, as with many projects, data generated by a single project are usually restricted to a single theme defined on the basis of habitat, geographical region, or taxonomic scope. The power of the OBIS database is the integration of data from all these fields in a single coherent taxonomic framework, presenting a view that is truly global and facilitating analysis across scientific disciplines.
OBIS has strong relationships with several UN organizations. Data are exchanged with the Fisheries Department of the FAO, and links to the species information pages on the FAO site are displayed on the OBIS site. Collaboration with the IOC and its IODE program has centered on data standards and protocols. There have been joint activities on capacity building in Africa, with training workshops on the use of OBIS standards and tools; data logging workshops have been organized, focusing on sponges and on mollusks. Close collaboration between OBIS and IODE has resulted in the formal adoption, in June 2009, of OBIS as an activity of the IOC under its IODE program (see below).
OBIS was one of the earliest Associate Members of GBIF (www.gbif.org), which publishes data on all species. OBIS is a very active participant in GBIF activities, and one of the largest publishers of data to GBIF, reflecting its role as a specialist network for marine species. GBIF recommends that marine data be published first through OBIS, because OBIS can add special value and will manage the subsequent publication of data through GBIF. This also avoids duplication of data being separately published in GBIF and OBIS.
OBIS works closely with other players in the field of biodiversity informatics. OBIS exchanges information and is reciprocally linked with the Barcode of Life (BOL). As the marine component of the latter is being developed, OBIS will forge even stronger links. OBIS and its web interface can be used as a geographical window on the BOL information; OBIS distribution records can be used to document occurrence of a species in a region or country, and thus assist in management of property rights to genetic resources. Together with the European Node of OBIS (EurOBIS), OBIS has collaborated on the development of the World Register of Marine Species (WoRMS, see below). This venture forms the basis of the marine community's contribution to the Catalogue of Life (CoL). For many of the species it contains, the content of WoRMS goes far beyond the pure taxonomic information contained in the CoL. This content is made available to the Encyclopedia of Life (EOL).
From the outset, OBIS was conceived as a distributed system, leaving control over data publication in the hands of the data custodians (Fornwall 2000). The structure and content of the data exchanged were formatted following the Darwin Core format (Vieglais et al. 2000), an extensible markup language (XML)-based standard originally developed at the University of Kansas. Later, the Darwin Core was adopted by the Taxonomic Databases Working Group (TDWG) as one of its standards, and further developed. Several “extensions” of the Darwin Core exist: specific user communities have expanded the number of terms defined in the data exchange format to serve the needs of their community better. OBIS also defined an extension (known as the OBIS Schema), to address the specific needs of the oceanographic and marine biology community better. For example, one of the features of the OBIS Schema is that the location of an observation can be ascribed to a set of two points needed to define a transect line instead of a sampling point; this makes it possible to capture accurately the position of data resulting from a trawl. All extensions of the Darwin Core are still compatible with the original standard. It is this compatibility that forms the basis of interoperability between different content providers and aggregators, and that allows OBIS data to be published through GBIF.
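The idea of describing a location either as a single point or as the two end points of a transect can be illustrated with a small data structure. The sketch below is hypothetical: the class and field names are chosen for the example and are not the element names of the Darwin Core or the OBIS Schema, and the midpoint is a plain average that is only adequate for short tows that do not cross the 180° meridian.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OccurrenceRecord:
    """Illustrative record; field names are not the actual schema terms."""
    scientific_name: str
    latitude: float                       # sampling point, or start of transect
    longitude: float
    end_latitude: Optional[float] = None  # set only for transects such as trawls
    end_longitude: Optional[float] = None

    def is_transect(self) -> bool:
        """True when both end coordinates are present."""
        return self.end_latitude is not None and self.end_longitude is not None

    def representative_point(self) -> Tuple[float, float]:
        """The sampling point itself, or the midpoint of the transect line."""
        if not self.is_transect():
            return (self.latitude, self.longitude)
        return ((self.latitude + self.end_latitude) / 2,
                (self.longitude + self.end_longitude) / 2)


point = OccurrenceRecord("Gadus morhua", 54.0, 3.0)
trawl = OccurrenceRecord("Gadus morhua", 54.0, 3.0,
                         end_latitude=54.2, end_longitude=3.4)
print(point.is_transect(), trawl.is_transect())
```

Because the end coordinates are optional, a record that carries only a single point remains valid, which mirrors the way an extension can add terms while staying compatible with the original standard.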
The original protocol defining computer-to-computer communication to exchange the Darwin Core data was the Z39.50 protocol (Vieglais et al. 2000); this was soon replaced with the Distributed Generic Information Retrieval protocol (DiGIR; Blum et al. 2001). Originally, the OBIS website was built as a pure distributed system, with no data residing in the portal server; an exception was made only for datasets from custodians who did not have a provider service set up. All queries to the data provider were performed in real time, as the end user was requesting the data through the OBIS portal (Zhang & Grassle 2003). This proved to be too slow, and too critically dependent on the availability of all providers at all times. For reasons of performance and reliability, a system was developed where all available data (including a link back to the data provider's own website) were stored in a cache, maintained in a database at the OBIS secretariat. This cache also made it possible to build indices on different sets of polygons, and to calculate summary information for the different taxa (Rees & Zhang 2007).
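The difference between querying every provider in real time and serving queries from a central cache can be sketched as follows. This is an illustration of the general pattern only, not the OBIS implementation: the `harvest` and `search` functions, the record fields, and the provider callables are all hypothetical, and no DiGIR request is actually made. The point of the design is that a provider being offline during a harvest leaves the previously cached copy in place, so user queries never depend on provider availability.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def harvest(providers: Dict[str, Callable[[], List[Record]]],
            cache: Dict[str, List[Record]]) -> List[str]:
    """Refresh the cache from each provider; return the providers that failed.

    A failed provider keeps whatever copy of its data is already cached.
    """
    failed = []
    for name, fetch in providers.items():
        try:
            cache[name] = fetch()
        except OSError:  # includes ConnectionError and timeouts
            failed.append(name)
    return failed


def search(cache: Dict[str, List[Record]], species: str) -> List[Record]:
    """Answer a user query from the cache alone, without contacting providers."""
    return [record
            for records in cache.values()
            for record in records
            if record.get("species") == species]
```

Each cached record would carry a link back to the provider's own website (a `source_url` field, say), so that the origin of a record stays traceable even though it is served from the cache.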
The technology behind the present OBIS system is several years old, and in need of an overhaul. Possible tools and technology for a new incarnation of OBIS have been discussed in the OBIS community and with relevant experts. All developments at iOBIS adhere strictly to the relevant standards wherever they exist. For geographic information system (GIS) and web-based mapping we will work with Open Geospatial Consortium (OGC) compliant tools, and closely collaborate with the people developing GeoServer. Access to OBIS data will no longer be restricted to the iOBIS website, with its canned queries, but will also be possible through standards-compliant web services.
As mentioned above, metadata are an essential element of data warehouses. The DiGIR protocol itself carries some metadata: the data standard is documented, there is room for an abstract to give a verbal description of the original purpose and intent of the data, and contact information, both for technical and for scientific aspects, can be listed; it is also possible to include a uniform resource locator (URL) that points back at the website of the data provider. Although these are all the essential elements, many users wanted to include richer metadata: this gives end users the ability to judge the coverage of data in OBIS better, and to assess fitness for use. For this reason, OBIS started collaborating with the GCMD; all OBIS-related metadata are visible as a separate collection on their site (gcmd.gsfc.nasa.gov/KeywordSearch/Home.do?Portal=OBIS&MetadataType=0). One of the great advantages of this system is that users can maintain their own metadata records through the GCMD web interface. OBIS will also expand its metadata activities to accommodate metadata in other widely accepted standards in use by members of the OBIS community.
A taxonomic reference list, including information on classification and synonymy, is an essential tool in the quality control process of taxonomically resolved data. It is needed as a controlled vocabulary, to make sure that data from different datasets are not only compatible at the technical level, but also at the content level. Differently spelled names, or differently interpreted taxonomic names, have to be reconciled before any analysis of the integrated content can be done.
The initial website, launched in February 2002, already included a taxonomy name service, built in partnership with Species 2000 and FishBase. A prototype name service provided common name/scientific name and synonym translation (Zhang & Grassle 2003). Later versions of the portal implemented these taxonomic name services, through integration with the Interim Register of Marine and Non-marine Genera (IRMNG) (Rees & Zhang 2007), developed by Tony Rees of the Australian OBIS Node. One of the objectives was to be able to discriminate between marine and non-marine taxa, and between fossil and extant taxa. Several providers of data to OBIS do not have a simple way of discriminating between these in their databases, so IRMNG was conceived as the basis for this filtering mechanism.
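The filtering mechanism described above, discriminating marine from non-marine and extant from fossil taxa at the genus level, can be sketched as a simple lookup. The table, the `keep_record` function, and the three-way return value are assumptions made for illustration; they do not reproduce the structure or interface of IRMNG. Genera that are missing from the lookup are returned as undecided rather than rejected, so that they can be reviewed.

```python
from typing import Dict, Optional, Tuple

# Hypothetical genus-level lookup: genus -> (is_marine, is_extant).
GENUS_FLAGS: Dict[str, Tuple[bool, bool]] = {
    "Gadus": (True, True),      # cod: marine, extant
    "Quercus": (False, True),   # oak: non-marine, extant
    "Isotelus": (True, False),  # trilobite: marine, fossil
}


def keep_record(scientific_name: str,
                flags: Dict[str, Tuple[bool, bool]] = GENUS_FLAGS) -> Optional[bool]:
    """Decide whether a record belongs in a database of living marine species.

    Returns True or False when the genus is known, and None when it is not.
    """
    parts = scientific_name.split()
    if not parts:
        return None
    entry = flags.get(parts[0])  # the genus is the first word of the name
    if entry is None:
        return None
    is_marine, is_extant = entry
    return is_marine and is_extant


for name in ("Gadus morhua", "Quercus robur", "Isotelus gigas"):
    print(name, "->", keep_record(name))
```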
A standard register of taxonomic names of European marine species (European Register of Marine Species, ERMS) was compiled using funding from the European Commission Marine Science and Technology research program (Costello et al. 2001). ERMS was made internally consistent, expanded with a consistent classification, and turned into a relational database for use by the European OBIS node with support from the EU Network of Excellence MarBEF. Under the aegis of OBIS, ERMS has developed into WoRMS. WoRMS has nearly 150,000 valid species names, of which 68,700 have at least one record in OBIS. The OBIS website is now using WoRMS as the standard source for names of marine species. WoRMS provides correct names for the OBIS community and is recognized as the marine component of CoL.
The number of records in the OBIS databases has grown according to expectations (though there was a setback from November 2007 to May 2008, owing to a change in personnel; Fig. 17.3). After the initial development phase from 2002 to 2004, the growth in the number of records has been linear. If the current growth can be sustained, OBIS will publish over 30 million records by October 2010.
|Figure 17.3 (A) Number of records in the OBIS cache (millions). (B) Average number of records per dataset (thousands). (C) Number of individual datasets published through OBIS.
An issue worth noting is the size of an average dataset, which has been decreasing steadily (Fig. 17.3). This trend is to be expected, as OBIS has first connected the largest, most important databases. Obviously, this has implications for future planning for OBIS. Smaller average datasets mean more work for the same gain. In practice, this will necessitate more data management time in OBIS, either at the level of the secretariat, or at the RONs, or both. In this respect, the linear growth of OBIS content is good news: sustaining it with ever-smaller datasets means that data acquisition and quality control are becoming more efficient.
Table 17.3 lists the largest datasets available through OBIS. It is gratifying to see two South African datasets in the top 20, a clear example of the strength of the RON network and the global nature of collaboration within OBIS. From the list it is clear that most of the large datasets are monitoring datasets, in many cases fisheries monitoring (for example the South African line fisheries data, several fisheries datasets from the US NOAA, Fisheries and Oceans Canada (DFO), and New Zealand's National Institute of Water and Atmospheric Research (NIWA)). Several other datasets are in fact aggregations of many individual datasets (for example the WOD01 Plankton database, European Seabirds at Sea, benthic data from the Joint Nature Conservation Committee of the UK, FishBase occurrence records). One of our most valued contributors is SAHFOS, with the data from the CPR. The Smithsonian Institution's National Museum of Natural History makes the data from its catalogue available, as do many other museums. However, the real value of OBIS is in the 679 datasets that are not listed in this table. The large datasets are often already available online, through the website of the data provider. But many of the smaller datasets would to a large extent be undiscoverable and remain unused, if it were not for OBIS.
The RONs are instrumental in achieving global coverage, and collectively provide about half of the data available through OBIS. The African and the European nodes are the largest with well over 3 million records each. All of the Census projects provide data. The champions here are OBIS SEAMAP with nearly 2.5 million data points, International Census of Marine Microbes (ICoMM) with 1.5 million, and Census of Antarctic Marine Life (CAML) with 900,000. Also, History of Marine Animal Populations (HMAP) contributes a substantial dataset, with 250,000 records and, not surprisingly, extends the time for which data are available (Fig. 17.4).
|Figure 17.4 Number of records in OBIS cache, as a function of time. In most cases, this corresponds with the year the observation was made. For historical data, this is the estimated year the organism was alive.
The map in Figure 17.5 illustrates the very uneven availability of data within OBIS. Most of the data are from coastal waters; the shallow waters of the European Atlantic coast, the Pacific coast of Alaska, and the Atlantic and Gulf of Mexico coasts of the USA are especially well represented. In open waters, the Northern Atlantic is well covered. The large volume of data here is mainly from the CPR. The northern hemisphere is much better covered than the southern; exceptions here are South Africa (mainly the west and south coasts), and part of the coast of Argentina. The southern Pacific is particularly poorly represented; the southern Atlantic and Indian Oceans also represent major gaps in coverage. Some of the mega-diverse coastal areas also have a disappointing number of records, such as the coral reefs of eastern Africa and the Red Sea, and the coasts of the Coral Triangle.
|Figure 17.5 Number of records in OBIS per 1° × 1° square of latitude and longitude, corrected for differences in surface area of the squares. Red is high numbers, blue low, and white for squares without a single observation.
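The correction for surface area mentioned in the caption of Figure 17.5 is needed because squares of equal size in degrees shrink towards the poles. The chapter does not state how the correction was computed; the Python sketch below shows one common approach, assuming a spherical Earth with a mean radius of 6,371 km. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def cell_area_km2(lat_south, size_deg=1.0, radius_km=EARTH_RADIUS_KM):
    """Area of a size_deg x size_deg square whose southern edge is at lat_south."""
    phi_south = math.radians(lat_south)
    phi_north = math.radians(lat_south + size_deg)
    d_lambda = math.radians(size_deg)
    # Area on a sphere: R^2 * (longitude span) * (difference in sin(latitude)).
    return radius_km ** 2 * d_lambda * (math.sin(phi_north) - math.sin(phi_south))


def records_per_km2(n_records, lat_south, size_deg=1.0):
    """Record count in a square, normalized by the square's surface area."""
    return n_records / cell_area_km2(lat_south, size_deg)
```

Under these assumptions, a 1° square at 60° N has roughly half the area of a 1° square at the equator, so the same raw count corresponds to about twice the record density.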
The series of maps in Figure 17.6 illustrates that most of the data in OBIS are from surface waters. The top-most map represents essentially the same information as in Figure 17.5, but at a lower resolution. Consecutive maps illustrate the number of records deeper than 100, 500, 1,000, and 2,500 m, respectively. In all five maps, the ocean floor deeper than the depth in question is drawn in light grey, to illustrate the amount of seafloor at this depth. The bottom map clearly shows that most of the seafloor is completely unexplored. We hope that this part of the oceans will be better represented as the Census deep-sea data become available.
|Figure 17.6 Number of observations in OBIS deeper than a given depth, per 5° × 5° square. Depths are 0, 100, 500, 1,000, and 2,500 m, respectively. Ocean floor deeper than this depth is shaded light grey. Color coding same as in Figure 17.5.
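The counts behind maps like those in Figure 17.6 amount to binning records by grid square and depth threshold. The sketch below is one way to do this in Python; the record layout (latitude, longitude, depth in metres) and the choice to treat each threshold as inclusive are assumptions, not details taken from OBIS.

```python
import math
from collections import Counter

DEPTHS_M = (0, 100, 500, 1000, 2500)  # thresholds named in the caption


def cell_of(lat, lon, size_deg=5):
    """Return the south-west corner of the grid square containing a point."""
    return (math.floor(lat / size_deg) * size_deg,
            math.floor(lon / size_deg) * size_deg)


def count_by_depth(records, depths=DEPTHS_M, size_deg=5):
    """records: iterable of (lat, lon, depth_m) tuples.

    Returns {threshold: Counter mapping grid square -> number of records
    at or below that depth}.
    """
    counts = {d: Counter() for d in depths}
    for lat, lon, depth in records:
        cell = cell_of(lat, lon, size_deg)
        for d in depths:
            if depth >= d:
                counts[d][cell] += 1
    return counts
```

Because each record is counted in every threshold it exceeds, the map for 0 m contains all records and each deeper map is a subset of the one above it.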
Not surprisingly, there is also a strong bias in taxonomic coverage. Larger and commercial species are clearly better represented, as is evident from the list in Table 17.4. Of the 50 taxa listed, 37 are vertebrates; of these, 11 are birds and 23 are fish. All fish in this list are species of commercial importance. Loligo vulgaris reynaudii d'Orbigny, 1845 is the lone mollusk on the list, very likely so well represented in the database because it is also a commercial species. Apart from the data recorded as phylum Chaetognatha, nearly all other invertebrates are planktonic crustaceans; for both of these groups, this probably accurately reflects their high abundance in the best-sampled waters of the Northern Atlantic. The same is true for the two taxa that are not animals: Chaetoceros Ehrenberg, 1844, a genus of diatoms, and Ceratium fusus (Ehrenberg, 1834) Dujardin, 1841, a dinoflagellate. Most of the OBIS records are resolved to species (or even subspecies where relevant), but as is apparent from the top 50, there are exceptions. In the case of groups that are difficult to identify such as Euphausiacea, Decapoda, or Chaetognatha, this is not completely unexpected.
Table 17.5 further illustrates the bias towards larger and commercially important species, and reflects the completeness of our knowledge. The percentage completeness and degree of cover is calculated for WoRMS. Because WoRMS is not complete, the estimates of the total number of marine species compiled by Bouchet (2006) are also listed. Fish and other vertebrates are virtually complete, and well covered, with a high number of records per species. For other groups, such as the mollusks, the percentage completeness, even measured against WoRMS, is very low; also WoRMS is quite incomplete for this very species-rich group. Within the mollusks, the cephalopods are well covered, with two-thirds of the species having at least one record in OBIS, and an average of nearly 500 records per species in WoRMS. Bryozoa are poorly represented in both OBIS and WoRMS; there are records for only 690 species, where Bouchet estimates that there are 5,700 in total.
As has been noted before, OBIS is a work in progress. There are clear gaps in geographical and taxonomic coverage. Some of these gaps no doubt are the result of the uneven distribution of scientific work: some places such as open oceans and polar seas are difficult and costly to sample; some groups of organisms are more difficult and less “interesting” to study. In these cases, sparse data coverage reflects our uneven knowledge of nature, and could provide interesting guidelines to set priorities for future work. In other cases, data exist but are not available through OBIS. One of the highest priorities is to identify such datasets; this inventory will assist in defining priorities for data assimilation.
Missing data are a problem; wrong data are an even greater worry. Yet no data system is without mistakes, and OBIS is no exception. For example, a 2008 study of OBIS content found wrong records for over a third of the species present in OBIS (Robertson 2008). Responsibility for the accuracy of the data in a multi-level aggregation system such as OBIS is not a simple issue. One argument could be that OBIS is only the publisher of the data, and just as it cannot take credit as “owner” of the data, it cannot take responsibility for the mistakes in them, just as Google cannot be held responsible for the information that shows up on its pages (R. Froese, personal communication). However, Google does not claim to have expertise in the subject matter of all the sites it indexes. We like to think that OBIS has a certain degree of competence in biogeography. This makes it possible to implement at least a minimum level of quality control, which is applied to all incoming data and gradually to all data retrospectively. OBIS works with its data providers to improve the quality not only at the level of the international portal, but also at the level of the data provider. Of course, no system is perfect, and Robertson's advice of “caveat emptor” should be kept in mind. The best way of detecting errors in a database is to work with the data. We hope that any user finding errors will not be discouraged from using OBIS data, but will work together with OBIS staff at the secretariat and its data providers to improve the content.
17.5. Using OBIS
OBIS is both a secure repository for data and a ready source of data for a growing community of scientists and educators throughout the marine sciences. Education and outreach are being achieved by developing modules for use in schools and broadening our end-user community. There is hardly any downtime, and the number of visitors and of records downloaded from the OBIS website increases steadily; downloads now average 80,000 records per day (Fig. 17.7). The OBIS website will continue to host species-level links to most other species-referenced marine databases. Through the OBIS website, active (and in most cases reciprocal) links at the species level are made with CoL, Integrated Taxonomic Information System (ITIS), Barcode of Life, FishBase, FAO, and GenBank among others.
|Figure 17.7 Number of records downloaded, per day, from the OBIS website.
The content of the OBIS database is growing and maturing; it is now possible to use the OBIS database to answer scientific questions and to investigate broad patterns of distribution of biodiversity. A first series of maps was created and distributed through newsletters and conferences (for example the LME conference in Qingdao, September 2007; Group on Earth Observations (GEO) IV meeting in Cape Town, November 2007; Ocean Sciences conference in Orlando, March 2008; the October 2007 newsletter of Global Ocean Ecosystem Dynamics (GLOBEC)). As an illustration, the global map of Hurlbert's Index (es(50), the expected number of distinct species in a random sample of 50 distribution records from the database; Hurlbert 1971) is reproduced here (Fig. 17.8). This is actually the first map of biodiversity of all taxa on a global scale. Previous studies were restricted either in taxonomic or in geographical scope; the OBIS integration of datasets across its many data providers makes it possible to present this comprehensive picture. A similar analysis formed the basis of maps published in National Geographic's Ocean: An Illustrated Atlas (Grassle & Vanden Berghe 2009). A second application of Hurlbert's Index is shown in Figure 17.9, illustrating the latitudinal gradient in species richness.
|Figure 17.9 Latitudinal gradient in species richness, as measured by Hurlbert's index, es(50).
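For readers who want to reproduce an index of this kind, the standard formulation of Hurlbert's expected species number is ES(n) = Σ_i [1 − C(N − N_i, n) / C(N, n)], where N is the total number of records in a grid cell, N_i the number of records of species i, and C denotes a binomial coefficient. The Python sketch below implements that formula with records as the sampling unit, as in the text; the function name and the input format (a list of per-species record counts) are assumptions, and this is not the code used to produce the figures.

```python
from math import comb


def hurlbert_es(counts, n=50):
    """Expected number of species in a random sample of n records.

    counts: number of records per species in one grid cell.
    Raises ValueError if the cell holds fewer than n records.
    """
    total = sum(counts)
    if n > total:
        raise ValueError("sample size n exceeds the number of records")
    all_samples = comb(total, n)
    # For each species: 1 minus the probability that a sample of n
    # records drawn without replacement misses that species entirely.
    return sum(1 - comb(total - c, n) / all_samples for c in counts)
```

Cells with fewer than n records cannot be scored, which is one reason sparsely sampled squares may have to be left out of a map based on es(50).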
Figure 17.10 illustrates another use of OBIS data. Yellow dots are actual observations of lionfish (Pterois volitans (Linnaeus, 1758)), an invasive species with its home range in the Red Sea. Through environmental envelope modeling, the range to which this invader could spread can be calculated. The red area in Figure 17.10 displays the region with environmental conditions similar to those in which the species was found, and so where it might be expected to spread. Environmental envelope modeling is a good demonstration of the power of data sharing and integration. It can combine data from different sources of biogeography, and overlay these with physical and chemical oceanography data, allowing multi-disciplinary analysis. Another potential application of this and other modeling techniques is the study of shifts in species distribution in response to global change.
|Figure 17.10 Predicted potential range for Pterois volitans (Linnaeus, 1758) (lionfish), an invader from the Red Sea. Yellow dots are actual observed occurrences. The red area represents regions with oceanographic conditions similar to those where the observations were made, and so where conditions might favor the spread of this species.
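The idea behind environmental envelope modeling can be shown with a very simple rectilinear envelope: record the range of each environmental variable at the sites where the species was observed, then flag every grid cell whose conditions fall inside all of those ranges. The chapter does not specify which algorithm was used for Figure 17.10, so the Python sketch below is only an assumed, minimal variant; the variable names and all numerical values are invented for illustration.

```python
def fit_envelope(occurrences):
    """Return {variable: (min, max)} over the conditions at observed sites.

    occurrences: list of dicts mapping variable name -> value, one per site.
    """
    variables = occurrences[0].keys()
    return {v: (min(o[v] for o in occurrences),
                max(o[v] for o in occurrences))
            for v in variables}


def in_envelope(cell, envelope):
    """True if every variable of the cell lies within the observed range."""
    return all(lo <= cell[v] <= hi for v, (lo, hi) in envelope.items())


# Invented example values (not real lionfish data).
observed = [
    {"temperature_c": 24.0, "salinity": 35.0},
    {"temperature_c": 28.5, "salinity": 36.5},
]
envelope = fit_envelope(observed)
warm_cell = {"temperature_c": 26.0, "salinity": 36.0}
cold_cell = {"temperature_c": 12.0, "salinity": 35.5}
suitable = [in_envelope(c, envelope) for c in (warm_cell, cold_cell)]
```

Here the environmental layers would come from physical and chemical oceanography datasets, and the occurrence points from biogeographic data, which is where the integration of sources described above comes in.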
Publicly available fisheries data in OBIS have already played an important role in documenting examples of overfishing in the ocean (Worm & Myers 2004; Baum & Worm 2009). Other examples of use of the database are a study of the completeness of our knowledge of fish communities (Mora et al. 2007), global distribution patterns of Myxinidae (Cavalcanti & Gallo 2008) and of Cephalopoda (Rosa et al. 2008a, 2008b). It is expected that the number of papers based on data obtained from the OBIS database will grow rapidly, now that the content of OBIS has sufficiently matured and grown.
17.6. Future of OBIS
Participation in OBIS is open to any interested individual, country, or organization committed to the long-term maintenance of an accessible, relevant, biogeographic database. Present members of the federation include the NOPP-funded OBIS programs, the Census projects, the RONs, and many independent data custodians interested in developing ties with the OBIS international system of databases. The international OBIS secretariat, through the international portal, is responsible for making the entire system interoperable, maintaining standards for data exchange, and coordinating data acquisition. Each member of the federation will, in addition to maintaining its own database systems, commit to providing data through the OBIS portal. One of the priorities for OBIS at this point is to fill some of the gaps in the available data by forging relationships with more organizations, and to expand the federation.
OBIS is one of the main outputs from the Census – a four-dimensional atlas of marine life, accessible online and analyzable to test hypotheses and make predictions about diversity, distribution, and abundance of marine life. This data system will be used in ocean management, including fisheries, conservation planning, and risk assessment of invasive species. Although the Census culminates in 2010, OBIS will live on as a major legacy of the Census and a community of practice, maintaining an informatics infrastructure for managing, researching, and educating about living marine resources. OBIS is establishing itself as an integral part of the international scientific infrastructure. Its regional development, as exemplified by the establishment of RONs, will ensure that it can serve these needs both locally and globally.
In June 2009, OBIS was adopted by the IOC of the United Nations Educational, Scientific and Cultural Organization (UNESCO) as one of the activities of its IODE program. This is a clear recognition by the IOC member states that OBIS is part of the international scientific infrastructure, and gives a formal intergovernmental status to OBIS activities. This will be important in soliciting resources to fund further activities, to attract more data, and to achieve wide acceptance of OBIS data in the process of environmental decision making.
The future data needs of ocean science and ocean resource management will require a seamless coupling of biological data with physical oceanographic processes. This biophysical data framework will be built through the active integration of data from a large and diverse number of sources, including physical, chemical, and biological oceanography. GEO, with its Global Earth Observation System of Systems (GEOSS), is an international federation bringing together relevant players in this field. The Global Ocean Observing System (GOOS), the marine component of GEOSS, is hosted by the IOC. OBIS is poised to play a significant and expanding role in GOOS, and to take on the responsibility for marine biogeographic information, through involvement in GEO's Biodiversity Observing Network (GEO BON). Its position within IOC will assist in achieving this ambition.
OBIS data have been used for scientific purposes, and it is expected that this use will grow. Another objective of OBIS is to inform management of the marine environment; for example, OBIS data have been used in the preparation of scientific background documents for the Convention on Biological Diversity through an International Union for the Conservation of Nature (IUCN) project to identify areas of special ecological or biological significance. If OBIS is to reach its full potential, it needs to be made interoperable with data systems on socio-economic data, including use data. Although there are mature systems that can easily serve as sources for global physical oceanography, there seems to be no equivalent for socio-economic data.
OBIS is now at the stage where it is an essential international source of data and Web-based tools for defining habitats, communities, and biogeographical units in the marine environment. However, it still is far from a comprehensive source for all biogeographic data that have been collected; and there are large gaps in the coverage. The OBIS portal expects continued growth, and counts on input from the international community of OBIS users, including the Census National and Regional Implementation Committees (NRIC) and the Regional OBIS Nodes, to help this happen.
We are grateful for the generous support and guidance OBIS received from the Alfred P. Sloan Foundation and its staff. Parts of the development of OBIS were funded through NSF grants to Fred Grassle and Yunquing (Phoebe) Zhang. Phoebe was instrumental in building the IT infrastructure for OBIS. OBIS would not exist without the input from others, including the numerous data providers, node managers, and the members of the International Committee and Governing Board. We are also grateful for the trust of this OBIS community, and for the opportunity to develop OBIS.
|Baum, J. & Worm, B. (2009) Cascading top-down effects of changing oceanic predator abundances. Journal of Animal Ecology 78, 699–714.|
|Beaugrand, G., Edwards, M., John, A. & Lindley, A. (2004) Continuous Plankton Records: Plankton Atlas of the North Atlantic Ocean 1958–1999. Marine Ecology Progress Series (Suppl.): 1–75.|
|Blum, S., Vieglais, D. & Schwartz, P.J. (2001) DiGIR – distributed generic information retrieval. Available at http://digir.sourceforge.net/events/20011106/DiGIR.ppt.|
|Bouchet, P. (2006) The magnitude of marine biodiversity. In: The Exploration of Marine Biodiversity: Scientific and Technological Challenges (ed. C. Duarte), chapter 2. Spain: Fundacion BBVA.|
|Boyer, T.P., Antonov, J.I., Garcia, H.E., et al. (2006) World Ocean Database 2005 (ed. S. Levitus). NOAA Atlas NESDIS 60. Washington, DC: US Government Printing Office. DVD, 190 pp.|
|Cavalcanti, M.J. & Gallo, V. (2008) Panbiogeographical analysis of distribution patterns in hagfishes (Craniata: Myxinidae). Journal of Biogeography 35, 1258–1268.|
|Chapman, A.D. (2005) Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.|
|Conkright, M.E. & Levitus, S. (1996) Objective analysis of surface chlorophyll data in the northern hemisphere. In: Proceedings of the International Workshop on Oceanographic Biological and Chemical Data Management. NOAA Technical Report NESDIS 87, 33–43.|
|Costello, M.J., Emblow, C. & White, R. (eds) (2001) European Register of Marine Species. A check-list of the marine species in Europe and a bibliography of guides to their identification. Patrimoines Naturels 50, 463 pp.|
|Decker, C. (2001) The Census of Marine Life: an update on activities. In: Proceedings of the PICES.COML.IPRC Workshop on Impact of Climate Variability on Observation and Prediction of Ecosystem and Biodiversity Changes in the North Pacific (eds. V. Alexander, A.S. Bychkov, P. Livingston & S.M. McKinnell), pp. 5–9. PICES Scientific Report 18. Sidney, Canada: North Pacific Marine Science Organisation (PICES). V + 205 pp.|
|Dittert, N., Diepenbroek, M. & Grobe, H. (2001) Scientific data must be made available to all. Nature 412, 393.|
|Fornwall, M. (2000) Planning for OBIS: examining relationships with existing national and international biodiversity information systems. Oceanography 13(3), 31–38.|
|Fautin, D. (2000) Electronic Atlas of Sea anemones: an OBIS pilot project. Oceanography 13, 66–69.|
|Froese, R., Lloris, D. & Opitz, S. (2003) The need to make scientific data publicly available – concerns and possible solutions. In: Fish Biodiversity: Local Studies as Basis for Global Inferences (eds. M.L.D. Palomares, B. Samb, T. Diouf, et al.), pp 267–271. Brussels. 281 pp.|
|Froese, R. & Pauly, D. (eds.) (2009) FishBase. World Wide Web electronic publication. www.fishbase.org, version 09/2009.|
|Greene, M. (2007) The demise of the lone author. Nature 450, 1165.|
|Grassle, J.F. (2000) The Ocean Biogeographic Information System (OBIS): an on-line, worldwide atlas for accessing, modelling and mapping marine biological data in a multidimensional geographic context. Oceanography 13(3), 5–7.|
|Grassle, J.F. (2005) Data management and communications plan for research and operational integrated ocean observing systems, 1. Interoperatable Data Discovery, Access and Archive, Part III. Appendices. Appendix 7, pp 285–292. Biological Data Considerations, Ocean.US, Clarendon Boulevard, Suite 1350, Arlington, VA 22201-3667, USA.|
|Grassle, J.F. & Stocks, K.I. (1999) A Global Ocean Biogeographic Information System (OBIS) for the Census of Marine Life. Oceanography 12(3), 12–14.|
|Grassle, J.F. & Vanden Berghe, E. (2009) Census of Marine Life. In: Ocean: An Illustrated Atlas (eds. S.A. Earle & L.K. Glover). Washington, DC: National Geographic. 352 pp.|
|Halpin, P.N., Read, A.J., Best, B.D., et al. (2009) OBIS-SEAMAP 2.0: developing a research data commons for the ecological studies of marine mammals, seabirds and seaturtles. Oceanography 22(2), 104–115.|
|Halpin, P.N., Read, A.J., Best, B.D., et al. (2006) OBIS-SEAMAP: developing a biogeographic research data commons for the ecological studies of marine mammals, seabirds, and sea turtles. Marine Ecology Progress Series 316, 239–246.|
|Hurlbert, S.H. (1971) The nonconcept of species diversity: a critique and alternative parameters. Ecology 52, 577–586.|
|Levitus, S. (1996) Interannual-to-decadal variability of the temperature–salinity structure of the world ocean. In: Proceedings of the International Workshop on Oceanographic Biological and Chemical Data Management. NOAA Technical Report NESDIS 87, 51–54.|
|Mora, C., Tittensor, D.P. & Myers, R.A. (2007) The completeness of taxonomic inventories for describing the global diversity and distribution of marine fishes. Proceedings of the Royal Society B 275, 149–155.|
|Paterson, G., Boxshall, G., Thomson, N. & Hussey, C. (2000) Where are all the data? Oceanography 13(3), 21–24.|
|Poloczanska, E., Hobday, A.J. & Richardson, A.J. (2008) Global database is needed to support adaptation science. Nature 453, 720.|
|Rees, H.L., Eggleton, J.D., Rachor, E. & Vanden Berghe, E. (2007) Structure and dynamics of the North Sea Benthos. ICES Cooperative Research Report 288. Copenhagen. 259 pp.|
|Rees, T. & Zhang, Y. (2007) Evolving concepts in the architecture and functionality of OBIS, the Ocean Biogeographic Information System. In: Proceedings of Ocean Biodiversity Informatics: An International Conference on Marine Biodiversity Data Management Hamburg, Germany, 29 November – 1 December, 2004 (eds. E. Vanden Berghe, et al.), pp. 167–176. IOC Workshop Report, 202, VLIZ Special Publication 37.|
|Reid, P.C., Edwards, M., Hunt, H.G. & Warner, A.J. (1998) Phytoplankton change in the North Atlantic. Nature 391, 546.|
|Richardson, A.J. & Poloczanska, E. (2008) Under-resourced, under threat. Science 320, 1294.|
|Robertson, D.R. (2008) Global biogeographical data bases on marine fishes: caveat emptor. Diversity and Distributions 14, 891–892.|
|Rosa, R., Dierssen, H.M., Gonzalez, L. & Seibel, B.A. (2008a) Ecological biogeography of cephalopod molluscs in the Atlantic Ocean: historical and contemporary causes of coastal diversity patterns. Global Ecology and Biogeography 17, 600–610.|
|Rosa, R., Dierssen, H.M., Gonzalez, L. & Seibel, B.A. (2008b) Large-scale diversity patterns of cephalopods in the Atlantic open ocean and deep sea. Ecology 89, 3449–3461.|
|SCOR & IODE (2008) SCOR/IODE Workshop on Data Publishing, Oostende, Belgium, 17–19 June 2008. IOC Workshop Report No. 207. Paris: UNESCO. 23 pp.|
|Sekercioglu, C.H. (2008) Quantifying coauthor contributions. Science 322, 371.|
|Somerfield, P.J., Arvanitidis, C., Vanden Berghe, E., et al. (2009) MarBEF, databases and the legacy of John Gray. Marine Ecology Progress Series 382, 221–224.|
|Stocks, K. (2009) SeamountsOnline: an online information system for seamount biology. Version 2009-1. Available at http://seamounts.sdsc.edu.|
|Stocks, K., Zhang, Y., Flanders, C. & Grassle, J.F. (2000) OBIS: Ocean Biogeographic Information System. The Institute of Marine and Coastal Science, Rutgers University. Available at http://marine.rutgers.edu/OBIS.|
|Stokstad, E. (2008) Proposed rule would limit fish catch but faces data gaps. Science 320, 1706–1707.|
|Vanden Berghe, E., Appeltans, W., Costello, M.J. & Pissierssens, P. (eds.) (2007a) Proceedings of “Ocean Biodiversity Informatics”: An International Conference on Marine Biodiversity Data Management Hamburg, Germany, 29 November – 1 December, 2004. Paris, UNESCO/IOC, VLIZ, BSH, 2007. vi + 192 pp.|
|Vanden Berghe, E., Claus, C., Appeltans, W., et al. (2009) MacroBen integrated database on benthic invertebrates of European continental shelves: a tool for large-scale analysis across Europe. Marine Ecology Progress Series 382, 225–238.|
|Vanden Berghe, E., Rees, H.L. & Eggleton, J.D. (2007b) NSBP 2000 data management. In: Structure and dynamics of the North Sea Benthos (eds. H.L. Rees, J.D. Eggleton, E. Rachor, & E. Vanden Berghe), pp 7–20. Copenhagen: ICES Cooperative Research Report 288. 259 pp.|
|Vieglais, D., Wiley, E.O., Robins, C.R. & Peterson, A.T. (2000) Harnessing museum resources for the Census of Marine Life: the FISHNET project. Oceanography 13(3), 10–13.|
|Worm, B. & Myers, R.A. (2004) Managing fisheries in a changing climate. Nature 429, 15.|
|Yarincik, K. & O'Dor, R. (2005) The Census of Marine Life: goals, scope and strategy. Scientia Marina 69 (Suppl. 1), 201–208.|
|Zeller, D., Froese, R. & Pauly, D. (2005) On losing and recovering fisheries and marine science data. Marine Policy 29, 69–73.|
|Zhang, Y. & Grassle, J.F. (2003) A portal for the Ocean Biogeographic Information System. Oceanologica Acta 25, 193–197.|