Can synthetic data accurately mimic oncology clinical trials? (Journal article)

abstract

  • 1554 Background: There is strong interest among researchers, the pharmaceutical industry, medical journal editors, research funders, and regulators in sharing clinical trial data. Reusing data extracts the most utility possible from patient contributions, and the majority of patients do want to share their data for secondary research purposes. However, data access for secondary analysis remains a challenge. A key reason why authors and data custodians do not make individual-level data directly available to data users is concern over breaches of patient privacy. Synthetic data generation (SDG) is an effective way to address privacy concerns and can enable broader sharing of clinical trial datasets. However, a key question is whether analyses of the generated data reproduce the original results well enough to draw reliable conclusions. Methods: We synthesized datasets from five pragmatic breast cancer clinical trials performed by the REaCT group (https://react.ohri.ca/) using a sequential synthesis method, a type of machine learning. The published analysis of each trial was repeated on each synthetic dataset to evaluate reproducibility. We evaluated reproducibility on three criteria: (a) decision agreement: the direction and statistical significance of the primary endpoint effect estimates are the same as in the real data; (b) estimate agreement: the parameter estimates from the synthetic data fall within the 95% confidence interval of the real data; and (c) confidence interval overlap: the overlap between the real and synthetic confidence intervals is above 50%. In addition, we evaluated privacy using a membership disclosure metric, which measures the ability of an adversary to determine from the synthetic data that a target individual was in the original dataset, computed as an F1 classification accuracy score. Results: Decision agreement and estimate agreement held across all five trials, and the confidence interval overlap was high. The risks of membership disclosure were all below the established 0.2 threshold. Conclusions: In this study, we successfully generated synthetic datasets that accurately replicated the original data from five oncology trials and yielded the same results as the original published studies, with a very low risk of membership disclosure. With proper modeling techniques, synthetic datasets can play a key role in data democratization and the reuse of oncology clinical trial data. [Table: see text]
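
The three reproducibility criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Estimate` container, the function names, the `null` parameter, and the example numbers are all hypothetical, and the overlap formula (the shared interval length averaged over both interval widths) is one common definition assumed here because the abstract does not state which formula was used.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """Hypothetical container: a point estimate and its 95% confidence interval."""
    value: float
    lower: float
    upper: float


def decision_agreement(real: Estimate, synth: Estimate, null: float = 0.0) -> bool:
    """(a) Same effect direction and same statistical significance.

    `null` is the no-effect value (0.0 for differences, 1.0 for ratios).
    Significance is read off the interval: significant if it excludes `null`.
    """
    same_direction = (real.value - null) * (synth.value - null) > 0
    real_significant = not (real.lower <= null <= real.upper)
    synth_significant = not (synth.lower <= null <= synth.upper)
    return same_direction and real_significant == synth_significant


def estimate_agreement(real: Estimate, synth: Estimate) -> bool:
    """(b) Synthetic point estimate lies inside the real 95% interval."""
    return real.lower <= synth.value <= real.upper


def ci_overlap(real: Estimate, synth: Estimate) -> float:
    """(c) Assumed definition: shared length averaged over both widths.

    Returns 1.0 for identical intervals and a negative value for
    intervals that do not intersect.
    """
    shared = min(real.upper, synth.upper) - max(real.lower, synth.lower)
    real_width = real.upper - real.lower
    synth_width = synth.upper - synth.lower
    return 0.5 * (shared / real_width + shared / synth_width)


# Made-up numbers for illustration only; they are not results from the study.
real = Estimate(value=0.50, lower=0.10, upper=0.90)
synth = Estimate(value=0.45, lower=0.05, upper=0.85)

print(decision_agreement(real, synth))   # direction and significance match
print(estimate_agreement(real, synth))   # 0.45 lies inside [0.10, 0.90]
print(ci_overlap(real, synth) > 0.5)     # criterion (c) threshold of 50%
```

Under these assumptions, a synthetic dataset passes when all three checks are satisfied for the primary endpoint; the membership disclosure metric would be evaluated separately.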

authors

  • El Kababji, Samer
  • Mitsakakis, Nicholas
  • Fang, Xi
  • Beltran-Bless, Ana-Alicia
  • Pond, Gregory
  • Vandermeer, Lisa
  • Radhakrishnan, Dhenuka
  • Mosquera, Lucy
  • Clemons, Mark J
  • El Emam, Khaled

publication date

  • June 1, 2023