Causal discovery with ESS data
This page describes a data set from the European Spallation Source (ESS). The data was generated by the ACCP system (a refrigeration system), and system experts have constructed a ground-truth causal graph from their knowledge of the system. If you use the data, please cite:
S. W. Mogensen, K. Rathsman, P. Nilsson, Causal discovery in a complex industrial system: A time series benchmark, in Proceedings of the 3rd Conference on Causal Learning and Reasoning (CLeaR), 2024, https://doi.org/10.48550/arXiv.2310.18654
The data can be downloaded from https://zenodo.org/records/10641290. The full data set is quite large, and a smaller version is available for testing at https://zenodo.org/records/10679737. Code for preprocessing and analysis is available on GitHub, along with a description of how to get started with the data. The GitHub repo also contains a preprocessed data set that allows a quick look at the data. The following provides a short introduction to the data; a five-minute talk presenting the paper is also available.
Introduction and terminology
A number of process variables (PVs) describing the state of the ACCP were measured over time. Observations are available from three separate time intervals (Periods 1, 2, and 3). Each PV belongs to a subsystem, and each subsystem is a physical component of the ACCP system (such as a compressor). The causal graph was constructed at the subsystem level, and each subsystem may have several PVs describing its state.
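Because the ground-truth graph is defined over subsystems while measurements are per PV, one way to compare a PV-level result against it is to map each PV edge to the subsystems of its endpoints. The following is a minimal sketch of that idea, not the procedure used in the paper or the repo. The function name, the PV names, and the subsystem names are all hypothetical, and dropping edges within a single subsystem is an assumption made here for illustration.

```python
def lift_edges(pv_edges, pv_to_subsystem):
    """Map directed PV-level edges to directed subsystem-level edges.

    Edges whose endpoints lie in the same subsystem are dropped
    (an assumption of this sketch, not a rule from the paper).
    """
    subsystem_edges = set()
    for source, target in pv_edges:
        s = pv_to_subsystem[source]
        t = pv_to_subsystem[target]
        if s != t:
            subsystem_edges.add((s, t))
    return subsystem_edges


# Hypothetical PV and subsystem names, for illustration only.
pv_to_subsystem = {
    "pv_a": "compressor_1",
    "pv_b": "compressor_1",
    "pv_c": "valve_1",
}
pv_edges = [("pv_a", "pv_b"), ("pv_a", "pv_c"), ("pv_b", "pv_c")]

subsystem_edges = lift_edges(pv_edges, pv_to_subsystem)
print(subsystem_edges)
```

Here the two edges from compressor PVs to the valve PV collapse into a single subsystem edge, and the edge between the two compressor PVs is discarded.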
In the raw data, the sampling is irregular: the PVs are measured at different time points, and for each PV these time points are not necessarily equidistant. The GitHub repo contains a regularly sampled data set and the code that constructs it by temporal aggregation. This aggregation may be problematic in the context of causal discovery (see also the paper). However, for many PVs, significant changes in the system are thought to occur on a time scale considerably longer than the sampling interval, so temporal aggregation over short intervals may be a reasonable way to obtain a data set in a standard format.
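To illustrate what temporal aggregation of irregular samples can look like, here is a minimal sketch using only the Python standard library. It is not the preprocessing code from the repo: the function names, the use of the mean as the aggregate, the bin width, and the example values are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_grid(samples, bin_width, start=0.0):
    """Average irregular (time, value) samples within fixed-width bins.

    Returns a dict mapping bin index to the mean value in that bin.
    Bins that contain no samples are absent from the result.
    """
    bins = defaultdict(list)
    for t, value in samples:
        index = int((t - start) // bin_width)
        bins[index].append(value)
    return {index: mean(values) for index, values in bins.items()}


def align_pvs(pv_samples, bin_width, start=0.0):
    """Build one row per non-empty bin with one column per PV.

    A PV with no samples in a given bin gets None in that row.
    """
    aggregated = {
        name: aggregate_to_grid(samples, bin_width, start)
        for name, samples in pv_samples.items()
    }
    all_bins = sorted({i for agg in aggregated.values() for i in agg})
    rows = []
    for i in all_bins:
        row = {"bin_start": start + i * bin_width}
        for name, agg in aggregated.items():
            row[name] = agg.get(i)
        rows.append(row)
    return rows


# Hypothetical PVs measured at different, non-equidistant time points.
pv_samples = {
    "pv_a": [(0.2, 1.0), (0.7, 3.0), (2.1, 5.0)],
    "pv_b": [(0.5, 10.0), (1.4, 20.0)],
}
rows = align_pvs(pv_samples, bin_width=1.0)
for row in rows:
    print(row)
```

The missing entries (None) show one reason aggregation needs care: a PV may have no observation in a given interval, and how such gaps are filled is a modelling choice that can affect causal discovery.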