From Data to Publications Back to Data
Research Infrastructures (RIs) are an essential part of the European scientific ecosystem. They produce ever-increasing quantities of scientific data and are among the top data producers in the scientific landscape. Analysing such huge volumes of data is increasingly challenging for the scientists using these facilities. The current model, in which data is exported to the scientists’ home institutes and only then reduced and analysed, no longer scales. The European Commission (EC) has recognised this growing problem and has proposed to combine the resources of Pan-European e-infrastructures (e.g. EGI, GÉANT, EUDAT, etc.), which offer compute, storage and networking services, with the data produced by the RIs into the European Open Science Cloud (EOSC). The EOSC was officially launched on 23 November 2018 and has been, since July 2020, an Association with over 100 members.
The ambitious EOSC project implies that, in order to make the data available to everyone, the data has to comply with the FAIR principles: it must be Findable, Accessible, Interoperable and Reusable. In line with these principles, the Photon and Neutron (PaN) facilities in Europe have strived to make the open data they produce easily accessible to users and the public, by providing the scientific data management needed to enable Open Science.
One of the first steps towards FAIR data at PaN facilities is the adoption of a common FAIR data policy framework. Such a framework regulates and describes data stewardship by defining how data and metadata are curated, from the generation of raw data in each experiment through to the analysis of that data.
This is why the harmonisation of PaN-specific data policies, together with the management of Intellectual Property Rights (IPR) and ethical issues, is one of the key objectives of the PaNOSC Science Cluster. In particular, because the FAIR data principles define the concepts of Open Data more clearly, current policies at PaNOSC partners are being updated to better align with the current understanding of FAIR principles, while respecting the specific needs of the PaN community.
The updated research data policy framework sets out the rights and obligations of both research infrastructures and researchers, in terms of acquisition, storage, preservation and sharing of data generated at the facility and its associated metadata. The data format is an essential part of making data interoperable and machine-readable. Thus, for the raw data, a common data format, i.e. NeXus/HDF5, is recommended, which in addition to the detector data includes sample, instrument and scientific metadata. This ensures a higher compatibility and reusability of data, as well as of all the analysis tools developed by any of these RIs.
The data policy recommends that users ensure raw and processed data are collected with accurate metadata, in order to fulfil the FAIR principles. It also specifies that access to raw data, facility-processed data, auxiliary data, results and the associated metadata is restricted to the experimental team during the embargo period (i.e., the maximum period during which the data generated by an experiment remains private), after which data and metadata have to be made publicly accessible.
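The embargo rule described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the default embargo length of three years is an assumption made for the example, since the actual period is set by each facility's policy.

```python
from datetime import date


def embargo_end(collected: date, embargo_years: int = 3) -> date:
    """Return the first day on which the data may become public.

    The three-year default is an assumption for illustration only;
    the real embargo length is defined by the facility's data policy.
    """
    try:
        return collected.replace(year=collected.year + embargo_years)
    except ValueError:
        # Data collected on 29 February with a non-leap target year:
        # fall back to 1 March of the target year.
        return date(collected.year + embargo_years, 3, 1)


def is_public(collected: date, today: date, embargo_years: int = 3) -> bool:
    """True once the embargo period has elapsed."""
    return today >= embargo_end(collected, embargo_years)
```

A data catalogue could run a check of this kind when deciding whether a dataset is visible only to the experimental team or to everyone.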
For their part, facilities shall generate DOIs for one or more specific datasets, to be cited in publications. This allows the same datasets to be reused by other research teams from the same or from different domains.
Having an open access data policy with data in well-defined formats has many benefits:
- Raw data becomes open to scrutiny by other researchers, helping to reproduce results and prevent cases of scientific fraud. Open access policies thus foster reproducibility and scientific integrity;
- It makes previously measured data available for further analysis without the need to measure the same sample again;
- It promotes reusability and interdisciplinary research;
- Scientists can mine data in previously unknown ways or apply new methods to existing data.
The full strength of the approach will be achieved once all datasets, from detector data to final publication, as well as the open source data analysis software, are accessible via a machine-readable Persistent Identifier (PID) such as a Digital Object Identifier (DOI), to the full benefit of the experimental team and the scientific community.
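To illustrate what "machine readable" means for a DOI, the following minimal sketch turns a DOI string into a resolvable URL on the public `https://doi.org/` resolver. The helper name `doi_to_url` is hypothetical, the pattern is a deliberately loose syntax check rather than a full validator, and the DOI in the usage example is made up.

```python
import re
from urllib.parse import quote

# Loose check only (an assumption for this sketch): "10.", a registrant
# code of 4 to 9 digits, a slash, and a non-empty suffix without spaces.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Turn a DOI such as '10.1234/abc' into an https://doi.org/ URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()  # accept the common 'doi:' prefix
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash separators.
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical DOI, used purely as an example.
print(doi_to_url("doi:10.1234/example.dataset.42"))
```

Because the identifier resolves through a stable URL, software as well as people can follow a citation in a publication back to the underlying dataset.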
In addition to providing a policy framework for PaN facilities in Europe, the PaNOSC Science Cluster aims to create a common analysis environment with analysis software available through a Data Search Portal and Data Analysis Portal connected to the facility-specific services, such as authentication, metadata catalogues, file location information and remote analysis services.
Thus, for the proper implementation of FAIR data, PaNOSC members have started to establish a useful set of metadata, and to design an extensible query API (Application Programming Interface) that enables domain-specific federated search across the PaNOSC data repositories on these terms. Such a federated data catalogue will be a major entry point to further services and software for data analysis and simulation, which will continue to be developed over time. To make data accessible and findable across the federated catalogue, sites have implemented the OAI-PMH protocol, so that their data and metadata can be harvested and indexed by OpenAIRE/re3data and B2Find.
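As a rough illustration of how a harvester talks to an OAI-PMH endpoint, the sketch below builds a `ListRecords` request and extracts identifiers and titles from a response in the standard `oai_dc` (Dublin Core) format. The endpoint `https://example.org/oai`, the function names and the sample record are hypothetical; a real harvester would also fetch the URL over HTTP and follow resumption tokens, which is omitted here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier and title from each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
        })
    return records


# Hypothetical response with a single made-up record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:dataset-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical neutron scattering dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))
print(parse_records(SAMPLE))
```

Since every site exposes the same verbs and the same baseline metadata format, an aggregator can index all of them with a single harvesting client.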
In addition to federated data catalogue services, and to foster re-usability of the data, the Common Portal for Data Analysis Services facilitates starting a data analysis session after a dataset of interest has been collected. The Portal is deployed as part of the EOSC to provide federated remote analysis of data across the facilities, via both remote desktop environments and Jupyter Notebooks.
However, PaN facilities cannot achieve the goal of making data open and FAIR alone. The contribution of researchers is key to making EOSC and FAIR data a reality. This is why, as also mentioned in the PaNOSC data policy framework, authors shall embrace FAIR research practices as well. For instance, they are strongly encouraged to link the software used to obtain the results of their analyses with the raw data and metadata, and to make such software and the results openly accessible. Furthermore, it is crucial that they make available the analysis procedure description, scripts, software and software environments that completely describe the process of data analysis, from the raw data and metadata to the published results, to allow others to reproduce that analysis.
To provide full compatibility of our community AAI with EOSC and the other community services, the PaN Authentication and Authorization Infrastructure (AAI) service UmbrellaID is now integrated with the eduTEAMS service operated by GÉANT.