From science to practice: my work bridges the gap between causality theory and its practical applications, with a focus on healthcare data and the challenges that arise when working with multiple data sets.
In many instances, data are collected at different points in time: baseline information is gathered in a prior study and the outcome data are collected later.
At the intersection of causal inference and record linkage, I seek to develop statistical methods that propagate the uncertainty inherent in record linkage procedures to ensure reliable causal inference on linked data.
CoMeEcon second episode with Martijn Gösgens! Together with Nuria, we gather young researchers to talk about computational statistics methods and their development over time.
I presented a poster at the European Causal Inference Meeting 2025 in Gent, on Causal Record Linkage: Critical Issues and Novel Approaches to False Discovery Propagation. [abstract, poster]
Combining data from various sources empowers researchers to explore innovative questions, for example those raised by conducting healthcare monitoring studies. However, the lack of a unique identifier often poses challenges. Record linkage procedures determine whether pairs of observations collected on different occasions belong to the same individual using partially identifying variables (e.g. birth year, postal code). Existing methodologies typically involve a compromise between computational efficiency and accuracy. Traditional approaches simplify this task by condensing information, yet they neglect dependencies among linkage decisions and disregard the one-to-one relationship required to establish coherent links. Modern approaches offer a comprehensive representation of the data generation process, at the cost of computational overhead and reduced flexibility. We propose a flexible method that adapts to varying data complexities, addressing registration errors and accommodating changes of the identifying information over time. Our approach balances accuracy and scalability, estimating the linkage using a Stochastic Expectation Maximization algorithm on a latent variable model. We illustrate the ability of our methodology to connect observations using large real data applications and demonstrate the robustness of our model to the quality of the linking variables in a simulation study. The proposed algorithm FlexRL is implemented and available in an open source R package.
@article{robachetal2025,author={Robach, Kayané and {van der Pas}, Stéphanie L and {van de Wiel}, Mark A and Hof, Michel H},title={A flexible model for record linkage},journal={Journal of the Royal Statistical Society Series C: Applied Statistics},volume={74},number={4},pages={1100-1127},year={2025},month=feb,issn={0035-9254},doi={10.1093/jrsssc/qlaf016},url={https://doi.org/10.1093/jrsssc/qlaf016},eprint={https://academic.oup.com/jrsssc/article-pdf/74/4/1100/62206007/qlaf016.pdf},}
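To give a feel for what a Stochastic Expectation Maximization algorithm on a latent variable model looks like, here is a minimal Python sketch. It is not the FlexRL model and does not use the R package: it swaps in the classical two-class mixture over binary agreement patterns (the kind of traditional approach the abstract describes as condensing information), with no one-to-one constraint. The function name `stochastic_em`, the starting values, and the add-one smoothing are all illustrative assumptions.

```python
import random


def stochastic_em(patterns, n_iter=200, burn_in=100, seed=0):
    """Stochastic EM for a two-class mixture over binary agreement patterns.

    patterns: list of 0/1 tuples, one per record pair, one entry per field.
    Returns (pi, m, u) averaged over the iterations after burn-in, where
    pi is the share of matching pairs and m[k] / u[k] are the agreement
    probabilities of field k among matches / non-matches.
    """
    rng = random.Random(seed)
    n_fields = len(patterns[0])
    # Hypothetical starting values (matches rare, agreeing on most fields).
    pi, m, u = 0.1, [0.9] * n_fields, [0.1] * n_fields
    pi_sum, m_sum, u_sum = 0.0, [0.0] * n_fields, [0.0] * n_fields

    for it in range(n_iter):
        # S-step: sample each latent match indicator from its posterior.
        z = []
        for g in patterns:
            like_m, like_u = pi, 1.0 - pi
            for k, agree in enumerate(g):
                like_m *= m[k] if agree else 1.0 - m[k]
                like_u *= u[k] if agree else 1.0 - u[k]
            z.append(rng.random() < like_m / (like_m + like_u))

        # M-step: update parameters from the completed data, with add-one
        # smoothing so that no probability reaches exactly 0 or 1.
        n1 = sum(z)
        n0 = len(z) - n1
        pi = (n1 + 1) / (len(z) + 2)
        for k in range(n_fields):
            agree1 = sum(g[k] for g, zi in zip(patterns, z) if zi)
            agree0 = sum(g[k] for g, zi in zip(patterns, z) if not zi)
            m[k] = (agree1 + 1) / (n1 + 2)
            u[k] = (agree0 + 1) / (n0 + 2)

        # Average the parameter draws once the chain has settled.
        if it >= burn_in:
            pi_sum += pi
            for k in range(n_fields):
                m_sum[k] += m[k]
                u_sum[k] += u[k]

    kept = n_iter - burn_in
    return pi_sum / kept, [s / kept for s in m_sum], [s / kept for s in u_sum]


# Toy usage: agreement on (birth year, postal code) for a handful of pairs.
toy_patterns = [(1, 1)] * 5 + [(0, 0)] * 30 + [(1, 0)] * 5 + [(0, 1)] * 3
print(stochastic_em(toy_patterns))
```

Sampling the latent indicators instead of carrying their expectations keeps each iteration cheap, which is the general appeal of the stochastic variant. A model that enforces one-to-one links and models registration errors, as the paper describes, would need a considerably richer latent structure than this sketch.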
False Discovery estimation in Record Linkage
Kayané Robach, Michel H Hof, and Mark A van de Wiel
Integrating data from multiple sources expands research opportunities at low cost. However, due to different data collection processes and privacy constraints, unique identifiers are unavailable. Record Linkage (RL) algorithms address this by probabilistically linking records based on partially identifying variables. Since these variables lack the strength to perfectly combine information, RL procedures yield an imperfect set of linked records. Therefore, assessing the false discovery proportion (FDP) in RL is crucial for ensuring the reliability of subsequent analyses. In this paper, we introduce a novel method for estimating the FDP in RL for two overlapping data sets. We synthesise data from their estimated empirical distribution and use it along with real data in the linkage process. Since synthetic records cannot form links with real entities, they provide a means to estimate the amount of falsely linked pairs. Notably, this method applies to all RL techniques and across diverse settings where links and non-links have similar distributions, which is typical in complex tasks with poorly discriminative linking variables and multiple records sharing similar information while representing different entities. By identifying the FDP in RL and selecting suitable model parameters, our approach makes it possible to assess and improve the reliability of linked data. We evaluate its performance using established RL algorithms and benchmark data applications before deploying it to link siblings from the Netherlands Perinatal Registry, where the reliability of previous RL applications has never been confirmed. Through this application, we highlight the importance of accounting for linkage errors when studying mother-child dynamics in healthcare records.
@article{robachetal2026,author={Robach, Kayané and Hof, Michel H and {van de Wiel}, Mark A},title={False Discovery estimation in Record Linkage},journal={Statistics in Medicine},volume={},number={},pages={},year={2025},month=sep,issn={},doi={},url={https://doi.org/10.1002/sim.70292},eprint={},}
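The synthetic-record idea in the abstract above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the paper's estimator: synthetic records are drawn field by field from the empirical marginals of one file, the linkage is a naive exact-agreement rule without a one-to-one constraint, and the rescaling of synthetic links to the size of the real file is a hypothetical choice. All function names are made up for illustration.

```python
import random


def synthesise(records, n, rng):
    """Draw n synthetic records, sampling each field independently from
    its empirical marginal in `records` (an illustrative assumption)."""
    columns = list(zip(*records))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


def link_by_agreement(file_a, file_b, rng):
    """Toy linkage: each record of B is linked to one record of A chosen at
    random among those agreeing on every field (no one-to-one constraint)."""
    links = []
    for j, rec_b in enumerate(file_b):
        candidates = [i for i, rec_a in enumerate(file_a) if rec_a == rec_b]
        if candidates:
            links.append((rng.choice(candidates), j))
    return links


def estimate_fdp(file_a, file_b, n_synth, rng):
    """Estimate the false discovery proportion among links to real records."""
    if n_synth <= 0:
        raise ValueError("n_synth must be positive")
    # Append synthetic records to file A; indices >= len(file_a) are synthetic.
    augmented = file_a + synthesise(file_a, n_synth, rng)
    links = link_by_agreement(augmented, file_b, rng)
    n_real_links = sum(1 for i, _ in links if i < len(file_a))
    n_synth_links = len(links) - n_real_links
    if n_real_links == 0:
        return 0.0
    # Links to synthetic records are false by construction. Rescale their
    # count to the size of the real file (hypothetical scaling choice).
    expected_false = n_synth_links * len(file_a) / n_synth
    return min(1.0, expected_false / n_real_links)


# Toy usage with (birth year, postal code) as linking variables.
rng = random.Random(0)
file_a = [(1980, "1011"), (1980, "2022"), (1975, "1011"), (1990, "3033")]
file_b = [(1980, "1011"), (1975, "1011"), (1990, "3033")]
print(estimate_fdp(file_a, file_b, n_synth=8, rng=rng))
```

The key property the sketch relies on is the one stated in the abstract: a synthetic record cannot correspond to a real entity, so every link it receives is known to be false and can serve as a yardstick for chance agreement among the real records.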
Contact: k dot c dot robach at amsterdamumc dot nl