Exploration of E. coli contamination drivers in private drinking water wells: An application of machine learning to a large, multivariable, geo-spatio-temporal dataset
Academic Article


  • Groundwater resources are under increasing threat from contamination and overuse, posing direct risks to human and environmental health. The purpose of this study is to better understand the drivers of, and relationships between, well and aquifer characteristics, sampling frequencies, and microbiological contamination indicators (specifically E. coli), as a precursor to improving the knowledge and tools used to assess aquifer vulnerability and well contamination in Ontario, Canada. A dataset of 795,023 microbiological testing observations over an eight-year period (2010 to 2017) from 253,136 unique wells across Ontario was employed. Variables in this dataset include the date and location of each test, test results (E. coli concentration), well characteristics (well depth, location), and hydrogeological characteristics (bottom-of-well stratigraphy, specific capacity). Association rule analysis, univariate and bivariate analyses, regression analyses, and variable discretization techniques were used to identify relationships between E. coli concentration and the other variables in the dataset. These relationships can be used to identify drivers of contamination, their relative importance, and therefore the potential public health risks associated with the use of private wells in Ontario. Key findings are that: i) bedrock wells completed in sedimentary or igneous rock are more susceptible to contamination events; ii) while shallow wells pose a greater risk to consumers, deep wells are also subject to contamination events and pose a potentially unanticipated risk to the health of well users; and iii) well testing practices are influenced by the results of previous tests. Further, while there is a general correlation between the months with the greatest testing frequencies and the E. coli concentrations found in samples, an offset in this timing is observed in recent years: testing remains highest in July, while peaks in adverse results occur up to three months later. These trends prompt a need to further explore the bases for such occurrences.


  • White, Katie
  • Dickson, Sarah
  • Majury, Anna
  • McDermott, Kevin
  • Hynds, Paul
  • Brown, R Stephen
  • Schuster-Wallace, Corinne

publication date

  • June 2021