From science to practice: my work bridges the gap between causality theory and its practical applications, particularly in survival analysis, and addresses the complications that arise when working with multiple datasets.
In many instances, data are collected at different points in time: baseline information is gathered in a prior study and the outcome data are collected later.
At the intersection of causal inference and record linkage, I seek to develop statistical methods that propagate the uncertainty inherent in record linkage procedures to ensure reliable causal estimates.
Combining data from various sources empowers researchers to explore innovative questions, including those raised by tallying casualties and conducting healthcare monitoring studies. However, the lack of a unique identifier often poses challenges. Record linkage procedures determine whether pairs of observations collected on different occasions belong to the same individual (referred to as links) using partially identifying variables (e.g., initials, birth year, zip code). Existing methodologies typically involve a compromise between computational efficiency and accuracy. Traditional approaches simplify this task by condensing information, yet they neglect dependencies among linkage decisions and disregard the one-to-one relationship required to establish coherent links. Modern approaches offer a comprehensive representation of the data generation process, at the expense of substantial computational overhead and reduced flexibility. We propose a flexible method for determining the set of links that adapts to varying data complexities: it addresses registration errors, including inaccuracies and missing values, and accommodates changes in the identifying information over time. Our approach balances computational scalability and accuracy, estimating the linkage by maximum likelihood using a Stochastic Expectation Maximisation algorithm on a latent variable model. We illustrate the ability of our methodology to connect observations using two large real data applications and demonstrate the robustness of our model to the quality of the linking variables in a simulation study. The proposed algorithm, FlexRL, is implemented and available in an open-source R package.
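To make the linkage task concrete, here is a minimal Python sketch of linking two small files on partially identifying variables while enforcing the one-to-one constraint. It is not the FlexRL algorithm: it swaps the latent variable model and Stochastic Expectation Maximisation for a plain agreement count with greedy matching. The records, the function names `agreement` and `link_one_to_one`, and the threshold of 2 are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy records: (initials, birth year, zip code).
file_a = [
    ("KR", 1990, "1081"),
    ("JD", 1985, "1012"),
    ("AB", 1972, "3511"),
]
file_b = [
    ("JD", 1985, "1013"),  # zip code changed or was misrecorded
    ("KR", 1990, "1081"),
    ("XY", 2001, "9999"),  # no counterpart in file_a
]


def agreement(rec_a, rec_b):
    """Count the fields on which two records agree."""
    return sum(a == b for a, b in zip(rec_a, rec_b))


def link_one_to_one(a, b, threshold=2):
    """Greedy one-to-one linkage (illustrative only).

    Pairs are visited from highest to lowest agreement score, and a pair
    is accepted only if neither record has already been linked.
    """
    scored = sorted(
        (
            (agreement(rec_a, rec_b), i, j)
            for (i, rec_a), (j, rec_b) in product(enumerate(a), enumerate(b))
        ),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_a, used_b, links = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining pairs agree on too few fields
        if i not in used_a and j not in used_b:
            links.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(links)


links = link_one_to_one(file_a, file_b)
print(links)  # pairs of (index in file_a, index in file_b)
```

A deterministic rule like this returns a single set of links with no measure of uncertainty, and it treats every disagreement the same way, whether it comes from a registration error or from a genuine change such as a move. Those are the limitations that a likelihood-based model of the linkage is meant to address.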
Contact: k dot c dot robach at amsterdamumc dot nl LinkedIn: GitHub: