WP3 – Workflow metadata and metrics tools

The IO-SEA stack will significantly change the way users interact with the data storage service on supercomputers. Up to now, data storage has been treated as a generic service, sized by the supercomputer designers and shared by all running workflows. The amount of control given to users to run their simulations is often reduced to choosing between shared file systems with different characteristics in terms of storage space, bandwidth, data retention policies, and so on. In such environments, the actual I/O performance observed by a given application can be deeply affected by the other applications running in parallel and sharing the same storage resources. Because of this exposure to shared resources, I/O performance is often not considered as critical as compute performance, and it is usually less well understood by application developers.

With IO-SEA, new concepts will be exposed to users, allowing them to identify the data they need and to access it through ephemeral services when running their applications. IO-SEA users will have to explicitly size and set up the ephemeral I/O services needed for their workflows, and system administrators will have to manage them. Users will therefore need a deep understanding of the I/O behaviour of their applications, and system administrators will need tools to manage the storage infrastructure and the ephemeral services. WP3 will equip both with the right set of tools to obtain the information they need.

The IO-SEA tools will collect and store a set of metadata and metrics on workflows. The information will be presented to users and system administrators through Graphical User Interfaces and made accessible through APIs. Instrumentation will be activated automatically for all jobs run on the IO-SEA infrastructure in order to build a knowledge base.
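To give an idea of what programmatic access to such workflow metrics could look like, the sketch below queries a hypothetical REST endpoint for the I/O metrics of a finished job. The endpoint path, the JSON field names, and the `IOSEA_METRICS_URL` variable are illustrative assumptions for this example, not part of the actual IO-SEA API.

```python
# Hypothetical sketch: fetch I/O metrics for one workflow job from a
# metrics service and print a short summary. The endpoint, JSON fields
# and the IOSEA_METRICS_URL variable are illustrative assumptions only.
import json
import os
import urllib.request

METRICS_URL = os.environ.get("IOSEA_METRICS_URL", "http://localhost:8080")


def fetch_job_metrics(job_id: str) -> dict:
    """Return the metrics record for a single job as a dictionary."""
    url = f"{METRICS_URL}/api/v1/jobs/{job_id}/metrics"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    metrics = fetch_job_metrics("job-12345")
    # Assumed fields: total bytes read/written and aggregate bandwidth.
    print(f"read:      {metrics['bytes_read'] / 2**30:.1f} GiB")
    print(f"write:     {metrics['bytes_written'] / 2**30:.1f} GiB")
    print(f"bandwidth: {metrics['avg_bandwidth_mib_s']:.0f} MiB/s")
```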

The tools will also monitor the data storage infrastructure, with a focus on specific aspects such as the ephemeral services (for instance, checking that they have actually started before control is passed to the application, and that they are correctly shut down afterwards), the data nodes (are they experiencing transient hardware errors or system-level saturation?) and the object store (is it able to handle requests?). These tools will complement, rather than replace, legacy monitoring tools for disk-based storage subsystems and tape libraries.
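As a concrete illustration of the readiness check mentioned above, the minimal sketch below polls a TCP endpoint until an ephemeral service answers or a timeout expires, before handing control to the application. The host name, port, and timeout are invented for the example and are not tied to any specific IO-SEA component.

```python
# Minimal sketch of an ephemeral-service readiness check: poll a TCP
# endpoint until the service answers or a timeout expires. Host, port
# and timeout values are illustrative assumptions.
import socket
import time


def wait_until_ready(host: str, port: int, timeout_s: float = 60.0) -> bool:
    """Return True once a TCP connection succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(1.0)  # service not up yet, retry
    return False


if __name__ == "__main__":
    if wait_until_ready("ephemeral-io-node01", 9000):
        print("ephemeral service is up, handing control to the application")
    else:
        raise SystemExit("ephemeral service did not start in time")
```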

With this gathered information, a recommendation system can then be built that leverages the workflow metadata and metrics, together with the infrastructure monitoring data, to produce recommendations on ephemeral service configuration, data placement, and failure prediction.
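As a rough illustration, a first version of such a system could start from simple rules over the collected metrics before moving to more elaborate models. In the sketch below, the record fields, thresholds, and sizing formula are invented for the example and do not describe the actual IO-SEA recommendation logic.

```python
# Toy rule-based sketch of a recommendation step: look at aggregated
# metrics from past runs of a workflow and suggest an ephemeral-service
# size. Field names and thresholds are invented for illustration.
from dataclasses import dataclass


@dataclass
class WorkflowStats:
    peak_bandwidth_gib_s: float   # highest observed aggregate bandwidth
    working_set_tib: float        # amount of data actually touched
    read_write_ratio: float       # >1 means read-dominated


def recommend_ephemeral_nodes(stats: WorkflowStats,
                              node_bandwidth_gib_s: float = 10.0) -> int:
    """Suggest how many data nodes the ephemeral service should use."""
    # Size for peak bandwidth with ~30% headroom, at least one node.
    nodes = int(stats.peak_bandwidth_gib_s * 1.3 / node_bandwidth_gib_s) + 1
    # Read-dominated workflows may benefit from an extra caching node.
    if stats.read_write_ratio > 3.0:
        nodes += 1
    return nodes


if __name__ == "__main__":
    stats = WorkflowStats(peak_bandwidth_gib_s=42.0,
                          working_set_tib=1.5,
                          read_write_ratio=5.2)
    print(f"recommended ephemeral data nodes: {recommend_ephemeral_nodes(stats)}")
```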
