Published the report on current technical elements of data analysis at PaNOSC partner sites

In November 2019, the project’s deliverable D4.1 – Report on current technical elements of data analysis at PaNOSC partner sites was submitted and published. The document includes the results of a survey of existing data analysis infrastructure, which was designed by and distributed to all PaNOSC partner facilities (CERIC-ERIC, ILL, XFEL, ESS, ELI and ESRF). The survey aimed to build up a view of data analysis needs and services at each facility and to provide a basis for developing a set of requirements for WP4 services. The document also includes the responses from PaNOSC partners, as well as from the ISIS Neutron and Muon Facility, which is a partner in the ExPaNDS project and offered to participate in the survey.

The survey was split into five sections:

  1. Scientists and Data: Data generation, user community and scientific nature of each facility, both today and as forecast for 2023.
  2. Data analysis and reduction: Tools and services concerning data analysis and reduction.
  3. Technology: Determines the IT infrastructure available for data analysis purposes.
  4. Security: Provides a global view of the security requirements and solutions at each facility.
  5. Other information: Additional information that could be useful for the development of analysis services for PaNOSC.

As a result of the survey, a common practice in the provision of data analysis infrastructure at photon and neutron facilities in the frame of PaNOSC has been identified. Based on the requirements and existing services from partners, the advice is to focus further work on remote data analysis through Jupyter notebooks and remote desktop services.

Figure: Remote exploration of scientific data stored in the HDF5 file format.

The Jupyter Notebook approach shows great potential for reproducibility, user convenience and for a further move towards FAIR data. However, it is not applicable for all analysis requirements. In particular, notebooks cannot be used for existing analysis tools that are based on a graphical user interface, whereas remote desktop services can cater for all existing software, as they make a graphical desktop available in a web browser.

The goal is to put together demonstrator use cases that show how facility data can be analysed remotely, paving the way for creating analysis templates. Together with WP3, a portal is under development that aims to serve as an entry point to remotely start an analysis session for a given data set.

The development and deployment of this portal across multiple sites is likely to be challenging due to the differences in AAI, data and computing infrastructure. Another challenge for data access through EOSC is that, for large data sets, the only realistic option is to carry out the analysis close to the data, while each facility uses somewhat unique hardware and infrastructure. It may thus be necessary to exclude some of the largest data sets from the demonstrator. Finally, it was noted that automatically proposing appropriate data analysis templates for given data sets is non-trivial and will remain a challenge for EOSC and the community for a long time. The efforts to better capture metadata (WP3) will help to move closer to this goal, but within this project progress is expected to be limited to a set of example data sets and associated data analysis processes.

The use of containers as a way to ship software environments and enhance reproducibility shows potential and will be explored further.

To ensure that the project is sustainable in the future, it will be important to keep an eye on how the container ecosystem evolves over time. It will also be necessary to monitor, and if appropriate engage with, other developments in technologies relevant to the EOSC vision, including advances in the HDF5 and Jupyter ecosystems.

Download the deliverable here.
