Constraint-based causal discovery with tiered background knowledge and latent variables in single or overlapping datasets

This paper introduces the tFCI and tIOD algorithms, which leverage tiered background knowledge to make constraint-based causal discovery more efficient and more informative in settings with latent variables and overlapping datasets, and it demonstrates theoretical gains under oracle conditions.

Causal Inference, Graphical Models, Latent Variables, Temporal Structure, Multi-Cohort Studies

Christine W. Bang, Vanessa Didelez

Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany; Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany

Generated by grok-3

Background Problem

This research addresses causal discovery in settings with latent variables and multiple overlapping datasets, such as multi-cohort studies in the life sciences, where some variables are never measured jointly or are not measured at all. Constraint-based methods such as FCI and IOD suffer from reduced identifiability in the presence of latent variables, which leads to less informative causal graphs. The paper aims to improve both the efficiency and the informativeness of causal discovery by leveraging tiered background knowledge, often derived from temporal structure, to restrict equivalence classes and to reduce the computational cost of identifying causal relationships across long time spans.

Method

The paper proposes two main algorithms, tiered FCI (tFCI) and tiered IOD (tIOD), which extend the Fast Causal Inference (FCI) and Integrating Overlapping Datasets (IOD) algorithms by incorporating tiered background knowledge. Such knowledge typically arises from temporal orderings (e.g., variables in later tiers cannot cause variables in earlier tiers).

Core Idea: Use tiered background knowledge to restrict conditional independence tests to the ‘past’ of variables and orient cross-tier edges, thereby reducing the search space for causal structures and enhancing informativeness.
tFCI: Applied to a single dataset with latent variables. The ‘simple’ version limits independence tests to past variables (sound and complete); the ‘full’ version additionally orients edges based on tiered knowledge (sound but not complete).
tIOD: Designed for multiple overlapping datasets, with analogous ‘simple’ (sound and complete) and ‘full’ (sound) versions. It restricts tests and orientations using tiered knowledge, reduces the number of candidate edge removals and v-structures, and discards inconsistent graphs.
Implementation Steps: Both algorithms construct initial graphs, apply conditional independence tests limited by the tiered ordering, orient edges (in the full versions, cross-tier edges in particular), and then apply orientation rules to refine the output graphs, which are partial ancestral graphs (PAGs) or partial mixed graphs (PMGs).
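
The two uses of tiered knowledge described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the variable names, the `TIER` mapping, and all function names are hypothetical. It assumes that the ‘past’ of a pair means all variables in tiers no later than the later of the two, and it treats a later-to-earlier directed edge as one possible reason to discard a candidate graph.

```python
from itertools import combinations

# Hypothetical tiers for illustration: a lower number means earlier in time.
TIER = {"birth_weight": 1, "diet_age5": 2, "activity_age5": 2, "bmi_age10": 3}


def past_of(x, y, tier=TIER):
    """Variables in tiers no later than the later of x and y (excluding x and y)."""
    cutoff = max(tier[x], tier[y])
    return sorted(v for v in tier if v not in (x, y) and tier[v] <= cutoff)


def candidate_sets(x, y, max_size, tier=TIER):
    """Yield conditioning sets for testing x against y, drawn only from the past."""
    pool = past_of(x, y, tier)
    for k in range(min(max_size, len(pool)) + 1):
        yield from combinations(pool, k)


def orient_cross_tier(x, y, tier=TIER):
    """Place an arrowhead at the later variable of a cross-tier edge.

    'o' denotes an undetermined (circle) edge mark, as in a PAG.
    """
    if tier[x] == tier[y]:
        return f"{x} o-o {y}"  # same tier: the tiers alone give no orientation
    early, late = (x, y) if tier[x] < tier[y] else (y, x)
    return f"{early} o-> {late}"


def violates_tiers(directed_edges, tier=TIER):
    """True if any directed edge (cause, effect) runs from a later to an earlier tier."""
    return any(tier[cause] > tier[effect] for cause, effect in directed_edges)
```

For example, `list(candidate_sets("birth_weight", "diet_age5", max_size=2))` yields only the empty set and `("activity_age5",)`, because `bmi_age10` lies in the future of both variables, and `orient_cross_tier("bmi_age10", "diet_age5")` returns `"diet_age5 o-> bmi_age10"`. The circle mark at the earlier variable is left undetermined because tiered knowledge only rules out the later variable being a cause of the earlier one.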

Experiment

The paper presents no empirical experiments on real or synthetic datasets; it relies on theoretical analysis and illustrative examples to demonstrate the effectiveness of tFCI and tIOD. The setup assumes an oracle (perfect knowledge of conditional independencies), and the examples show how tiered knowledge reduces the number of graphs visited by tIOD compared to IOD (e.g., from 73 to 18 graphs in one case). The results indicate that simple tIOD is more efficient and often more informative, in that it outputs fewer PAGs; Proposition 11 specifies the conditions for the efficiency gains. These gains match expectations under oracle conditions. However, without finite-sample analysis or tests on real data, practical performance remains uncertain, especially with respect to statistical errors in conditional independence tests, so real-world applicability cannot yet be assessed.
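
The intuition behind the efficiency gain can be illustrated with a back-of-the-envelope count. The sketch below is not taken from the paper: the setup (3 tiers of 4 variables each, conditioning sets of size at most 3) and the function name `n_conditioning_sets` are hypothetical. It counts a worst-case upper bound on conditioning sets for one pair of variables, ignoring the adjacency-based pruning that FCI-type algorithms also apply.

```python
from math import comb


def n_conditioning_sets(pool_size, max_size):
    """Number of subsets of size 0..max_size from a pool of pool_size variables."""
    return sum(comb(pool_size, k) for k in range(max_size + 1))


# Hypothetical setup: 3 tiers with 4 variables each; both variables of the pair are in tier 1.
vars_per_tier, n_tiers, max_size = 4, 3, 3
unrestricted_pool = vars_per_tier * n_tiers - 2  # every other variable may enter a set
restricted_pool = vars_per_tier * 1 - 2          # only the other tier-1 variables may

unrestricted = n_conditioning_sets(unrestricted_pool, max_size)  # 176
restricted = n_conditioning_sets(restricted_pool, max_size)      # 4

print(f"unrestricted: {unrestricted}, restricted to the past: {restricted}")
```

Under these assumptions the pair in the earliest tier has 4 candidate conditioning sets instead of 176. The saving shrinks for pairs in later tiers, whose past covers more of the variables.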

Further Thoughts

The concept of tiered background knowledge is a valuable contribution to causal discovery, particularly for temporal data such as cohort studies, but its practical impact remains uncertain without empirical validation on real datasets with finite samples. An interesting direction would be to integrate this approach with time series causal discovery methods, as hinted in the discussion, in order to handle overlapping multivariate time series data; this could bridge gaps in fields such as econometrics or climate science, where temporal causality is critical. Additionally, combining tiered knowledge with partial expert knowledge, as suggested, could align with hybrid AI systems in which human expertise and data-driven methods coexist, enhancing robustness in domains such as epidemiology. A critical concern is the assumption of a single underlying maximal ancestral graph (MAG); in reality, cohort studies might reflect heterogeneous populations or settings, leading to inconsistent marginal models. Future work could explore relaxing this assumption or integrating methods to reconcile inconsistencies, perhaps drawing on federated learning paradigms to handle data privacy and heterogeneity across cohorts.