Data Analysis

In this section you will find a brief description of the Data Analysis Services being developed and provided.

PaN portal

PaNOSC is developing the PaN Portal for Data Analysis Services to enable users to start a data analysis session as soon as a dataset has been collected. The Portal provides access to both remote desktop environments and Jupyter Notebooks, so that users can remotely analyse data from PaN facilities during or after the experiment.

Source code for the portal can be found on GitHub:

Documentation for the portal development is available on the PaNOSC Confluence site:

Jupyter Notebooks

PaNOSC has chosen Jupyter Notebooks and JupyterLab from the Jupyter project as its general-purpose data analysis tools. Notebooks allow code and documentation to be intermingled in one document in the web browser. Notebooks are proving very popular in data science, partly because they support Python as a programming language. A number of the scientific Use Cases for PaNOSC request solutions based on Jupyter notebooks. All PaNOSC sites have implemented a Jupyter notebook service. EGI provides a Jupyter notebook service for all PaNOSC users with an UmbrellaId.
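To illustrate how a notebook intermingles documentation and code, the following minimal sketch builds a two-cell notebook as plain JSON, following the general layout of the Jupyter notebook format (version 4). The helper name build_notebook and the cell contents are hypothetical and chosen only for illustration; they are not part of any PaNOSC service.

```python
import json


def build_notebook(cells):
    """Assemble a notebook dict in the nbformat 4 JSON layout.

    `cells` is a list of (cell_type, source_text) pairs; this helper is
    a hypothetical illustration, not part of Jupyter or PaNOSC.
    """
    nb_cells = []
    for kind, text in cells:
        cell = {"cell_type": kind, "metadata": {}, "source": text}
        if kind == "code":
            # Code cells additionally carry an execution counter and outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    return {
        "cells": nb_cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


# One documentation cell followed by one code cell (made-up content).
notebook = build_notebook([
    ("markdown", "# Example analysis\nLoad a dataset and compute its mean."),
    ("code", "values = [1.0, 2.0, 3.0]\nprint(sum(values) / len(values))"),
])

# The serialised text is what would be stored in an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Saving notebook_json to a file with the .ipynb extension should give a document that a Jupyter service can open, with the explanatory text and the executable code side by side.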

Binder is a service built on top of Jupyter notebooks to make scientific data analysis reproducible. EGI provides a Binder service for all PaNOSC users with an UmbrellaId.

Remote Desktops

PaNOSC offers virtual machines (VMs) for scientists to run applications which cannot be converted to Jupyter notebooks. The VMs are accessed through a remote desktop which exports the graphics to a browser. The PaN portal and its main back-end service VISA use Guacamole to export the desktop to a web browser. Extra features have been implemented to allow sharing of desktops between scientists.

Milestones reached

  • Existing data analysis requirements and solutions from all partner sites (including ExPaNDS) have been surveyed [1] [2];
  • All sites now provide remote desktop analysis services or remote Jupyter Notebook analysis services at various stages of maturity (some already in production with large user numbers);
  • Provision of a citizen science prototype environment (OSCOVIDA) for remote and reproducible data analysis of COVID-19 infection data.

Ongoing activities

  • Developing standard data analysis notebooks for specific techniques;
  • Providing tools for Notebook-based data analysis at the facilities, and contributing to open source data analysis tools that are used in PaNOSC and elsewhere, specifically h5py, h5glance and hdf5plugin;
  • Developing web-based viewers for HDF5 files: h5nuvola and h5web;
  • Providing an infrastructure (e.g., JupyterHub, or Jupyter-Slurm) so that notebooks can be executed remotely on the computing and data infrastructure of the facility;
  • Exploring the use of software package managers to deploy versioned software at HPC installations and to provide the same software in portable containers, supporting remote and cloud-based provision of analysis software environments.

Common Portal Achievements

  • Possible use cases of the Portal have been listed [3];
  • The Portal architecture has been defined using a microservices approach (foundation services, user services and compute services) [4] [5], for more flexible integration into site-specific infrastructures.

After initial deployment at the facilities to provide remote analysis services for local data, the Portal will be deployed as part of the EOSC to provide federated analysis of data across the facilities.

