Making Geological Data Available
‘Long tail’ data is the difficult-to-get-at data that sits in libraries and institutes and on the computers of individual scientists. Informatics specialists contrast it with the smaller number of large, more accessible datasets (e.g. Sinha et al., 2013). The name ‘long tail’ derives from graphs of the size of datasets plotted against their number: there are relatively few large datasets and a great many smaller ones. Geological science has more long tail data than sciences like physics or meteorology, probably because historically it has been less associated with big science infrastructure and sensors. Much of this long tail data is found in geological surveys, the institutes created by nations to survey and inventory geological resources. Most countries in the world have geological surveys, and these organizations are vital for the optimal use of geological resources that underpins modern nations. For many large and long-established geological surveys, improving the discoverability of long tail geoscience data involves making historical paper data available in digital form (Stephenson, 2019).
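The rank-size picture behind the name can be illustrated with a minimal sketch. The dataset sizes below are invented purely for illustration (they are not measurements from any survey), and the function name `tail_share` and the choice of treating the two largest datasets as the 'head' are assumptions.

```python
def tail_share(sizes, head_count):
    """Return (fraction of datasets, fraction of total volume) in the tail.

    The 'head' is the `head_count` largest datasets; the rest is the tail.
    """
    ranked = sorted(sizes, reverse=True)
    tail = ranked[head_count:]
    return len(tail) / len(ranked), sum(tail) / sum(ranked)


# Hypothetical dataset sizes in gigabytes: two large, many small.
example_sizes = [500, 300, 20, 15, 10, 8, 5, 5, 3, 2, 2, 1, 1, 1, 1]

count_fraction, volume_fraction = tail_share(example_sizes, head_count=2)
print(f"{count_fraction:.0%} of datasets hold {volume_fraction:.0%} of the volume")
```

In this made-up case most of the datasets sit in the tail while holding only a small share of the total volume, which is the shape the term describes.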
Between 2018 and 2019, a unique collaboration between the British Geological Survey (BGS) and computer scientists of the GeoBiodiversity Database (GBDB) opened up some of this data. The BGS holds an abundance of biostratigraphical data, collected over almost two centuries to exacting and consistent standards and associated with about 3 million fossils and thousands of localities and stratigraphic sections. The data has great potential for science, but much of it is contained within paper documents or simple document scans and so is inaccessible to big data tools. It needs lifting from the page and into cyberspace.
A core from the BGS National Geological Repository, which contains records of 23,500 wells and boreholes, 600 km of drillcore, 6,000,000 washed cuttings samples, 200,000 thin sections, 1,000,000 mineralogy and petrology samples, 3,000,000 fossils, 250,000 micro-palaeontological slides, and 10,000 biostratigraphic and palaeontological reports.
GBDB (the official database of the International Commission on Stratigraphy, ICS) is almost unique among large databases in holding sequences of fossils tied to sections, rather than just geographically defined spot collection points. To date, GBDB and BGS scientists have placed live, manipulable data from more than 6,000 UK stratigraphic sections on a public-access website, the BGS-GBDB portal. The project is also using machine-learning methods to extract biostratigraphical information directly from text.
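The difference between a spot collection and a sequence of fossils tied to a section can be sketched as a data structure. This is an illustrative model only and not the GBDB schema: the class names, field names, placeholder taxa and heights are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpotCollection:
    """Fossils recorded at a geographic point, with no stratigraphic order."""
    latitude: float
    longitude: float
    taxa: list[str]


@dataclass
class Occurrence:
    """One fossil occurrence at a measured height within a section."""
    taxon: str
    height_m: float  # height above the base of the section, in metres


@dataclass
class Section:
    """A stratigraphic section whose occurrences are tied to heights."""
    name: str
    occurrences: list[Occurrence] = field(default_factory=list)

    def sequence(self) -> list[str]:
        """Return taxa ordered from the base of the section upward."""
        ordered = sorted(self.occurrences, key=lambda o: o.height_m)
        return [o.taxon for o in ordered]

    def first_appearance(self, taxon: str) -> Optional[float]:
        """Return the lowest height at which the taxon occurs, if any."""
        heights = [o.height_m for o in self.occurrences if o.taxon == taxon]
        return min(heights) if heights else None


# Hypothetical section with placeholder taxon names.
section = Section(
    name="Hypothetical section",
    occurrences=[
        Occurrence("Taxon B", 12.5),
        Occurrence("Taxon A", 3.0),
        Occurrence("Taxon B", 7.0),
    ],
)
print(section.sequence())
print(section.first_appearance("Taxon B"))
```

A spot collection can only say which taxa were found at a place, whereas the section model can also answer questions about order, such as where a taxon first appears.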
During an 8-week stay at BGS in the UK, GBDB scientists scanned 8,200 documents, and 1,800 outcrops/cores were entered into the BGS-GBDB portal.
The BGS-GBDB collaboration was one of the first activities of the fledgling Deep-time Digital Earth (DDE) program, a part of the International Union of Geological Sciences. DDE works with UNESCO, the International Geosphere-Biosphere Programme (IGBP), the Global Sedimentary Geology Program (GSGP), the International Geoscience and Geopark Program (IGGP), the Commission of the Geologic Map of the World (CGMW), the Global Geochemical Baseline (GGB), the International Lithosphere Program (ILP), and OneGeology. DDE will also apply the full FAIR data concept (Findable, Accessible, Interoperable, and Reusable) and link to desktop systems for geoscientists all over the world, as well as to students and teachers in classrooms and on the internet.
The BGS-GBDB portal at http://www.geobiodiversity.co.uk/bgsportal.aspx
Geology could be said to have lagged behind other physical sciences in capitalizing on its big data, but DDE will enable bridges to be built between ‘data islands’ and data to be interrogated using modern tools, helping to tackle some of the most important and pressing questions of our time.
The BGS-GBDB team posing in front of William Smith’s geological map at BGS headquarters in Nottingham.
Stephenson, M.H., 2019, The uses and benefits of big data for geological surveys: Acta Geologica Sinica, v. 93, S3, p. 64–65.
Sinha, A.K., Thessen, A.E., and Barnes, C.G., 2013, Geoinformatics: Toward an integrative view of Earth as a system, in Bickford, M.E., ed., The Web of Geological Sciences: Advances, Impacts, and Interactions: Geological Society of America Special Paper 500, p. 591–604, doi:10.1130/2013.2500(19).