Artificial intelligence as an instrument for the reduction of infant and youth mortality: understanding its determinants and predicting outcomes / Inteligência artificial para reduzir a mortalidade e identificar os padrões de vida saudável
Project name: Artificial intelligence as an instrument for the reduction of infant and youth mortality: understanding its determinants and predicting outcomes / Inteligência artificial para reduzir a mortalidade e identificar os padrões de vida saudável
Project code: AI4Life
Project timeline: 2020 - 2023
Research group: Intelligent Systems
Objective
Abstract
The digital transformation of Public Administration is imperative for developing the mechanisms needed to reduce the number of years of life lost and to increase the population's quality of life. The main objective of this project is to leverage existing information in public administration databases and other sources to support decision-makers in choosing the best response to emerging diseases, adapting public health intervention programs, and improving the future capacity of health systems.
The data proposed for this work provide important information on the morbidity and the sociodemographic and socioeconomic context of the entire Portuguese population. They are available mainly in the following databases: 1) Hospital Morbidity Database (BDMH); 2) Information System of Death Certificates (SICO); 3) National Health Service Information and Monitoring System (SIM@SNS) and Monitoring System of Regional Health Administrations (SIARS), which includes records of follow-up of children in Primary Health Care. In addition, several other important sources of data will be consulted.
INSTITUTIONS
Main Contractor: Instituto de Engenharia Mecânica (IDMEC)
Participating Institutions: Faculdade de Medicina da Universidade de Lisboa (FM/ULisboa); Faculdade de Ciências Médicas (FCM/UNL); Direcção-Geral da Saúde (DGS)
Main Research Unit: Laboratório Associado de Energia, Transportes e Aeronáutica (LAETA)
TEAM
Principal Investigator: Susana Margarida da Silva Vieira
Researchers: Maria Cristina de Brito Eusébio Bárbara Prista Caetano; Fernando Miguel Teixeira Xavier; Joaquim Paul Laurens Viegas; Cátia Matos Salgado
Co-Principal Investigator: João Miguel Costa Sousa
PROJECT SUMMARY
As the first relatively extensive study on the determinants of infant and youth mortality in Portugal to make use of Machine Learning techniques, this work contributes to the development of this research area in multiple ways. First, the exploratory data analysis and data preprocessing steps were valuable in their own right, as they yielded useful insights, even though their main purpose was to prepare the data for the subsequent modelling steps. Additionally, by using sophisticated feature selection methods, important determinants with a highly nonlinear relation to infant and youth mortality could be discerned in a way that classical statistical analysis would not allow. Furthermore, the prediction model is capable of making new predictions online, which in turn can be used to simulate scenarios and analyze their potential impact on infant and youth mortality. This will allow decision makers (Public Health professionals, for instance) to study the effects of each variable, and of combinations of variables, on infant and youth mortality.
To the best of our knowledge, no tool of this kind (incorporating a Machine Learning model) had been conceived prior to this project.
The clustering analysis performed for this work is also a valuable contribution: although clustering is a fairly common unsupervised learning technique, this was the first time (to the best of the authors' knowledge) that an approach of this sort was applied to mortality data concerning Portugal. Particularly noteworthy in this regard is the clustering approach whereby municipalities are grouped according not only to the mortality variable being considered, but also to a given "mortality determinant". Finally, the project's success in attaining its goals paves the way for future applications of Machine Learning algorithms in infant and youth mortality studies, particularly those concerning Portugal.
METHODOLOGY
This section provides a brief overview of the methodology used in this work, to frame the outputs of the project. Figure 1 illustrates the general workflow followed by the authors.
The available data comprised 178 databases drawn from various sources, including the Directorate-General of Health, Statistics Portugal, the World Health Organization, PORDATA, the Portuguese Environment Agency, OpenRouteService, the Childhood Obesity Surveillance Initiative, and the Health Behaviour in School-Aged Children study. Two of these databases, belonging to the Directorate-General of Health, are restricted and not accessible to the general public.
This study systematically categorizes the databases into three primary types: External data, Mortality data, and Auxiliary data. External data represent factors indirectly related to mortality. These data cover six areas (Economics, Healthcare, Society, Demographics, Education, and Environment) and were captured over a six-year period (2014-2019). These variables serve as the potential determinants whose impact on mortality the study aims to investigate.
Regarding the external dataset, the predominant challenges stemmed from a substantial number of missing values and from disparities in the sampling frequency of the data, which mixed monthly and annual measures. To address the issue of missing values, a hybrid approach involving both feature elimination and data imputation was employed. Features with more than 75% of their values missing were discarded. For the remaining features with missing values, a data imputation approach based on K-Nearest Neighbours (with K=5) was adopted. Using this hybrid approach, 72 features were rejected.
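The hybrid treatment of missing values described above can be illustrated with a minimal sketch. This is not the project's code: the function names (drop_sparse_features, nan_euclidean, knn_impute), the toy data, and the choice of distance are assumptions made for illustration. Missing entries are represented as None, the distance between two rows is computed only over features observed in both and rescaled by the share of usable features, and each missing entry is filled with the mean of that feature over the K nearest rows that have it observed. The sketch also assumes features are on comparable scales; with real data they would usually be standardized first.

```python
import math


def drop_sparse_features(rows, names, max_missing=0.75):
    """Discard features whose fraction of missing values exceeds max_missing."""
    n = len(rows)
    keep = [
        j for j in range(len(names))
        if sum(r[j] is None for r in rows) / n <= max_missing
    ]
    kept_rows = [[r[j] for j in keep] for r in rows]
    kept_names = [names[j] for j in keep]
    return kept_rows, kept_names


def nan_euclidean(a, b):
    """Euclidean distance over features observed in both rows (assumed metric).

    The squared sum is rescaled by (total features / shared features) so that
    rows sharing few observed features are not artificially close.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(squared * len(a) / len(pairs))


def knn_impute(rows, k=5):
    """Fill each missing entry with the mean over the k nearest donor rows."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows in which feature j is observed.
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy data (hypothetical): "sparse" is missing in 4 of 5 rows (80% > 75%).
names = ["gdp", "doctors", "sparse"]
rows = [
    [1.0, 10.0, None],
    [2.0, None, None],
    [3.0, 30.0, None],
    [4.0, 40.0, None],
    [5.0, 50.0, 7.0],
]

kept_rows, kept_names = drop_sparse_features(rows, names)
filled = knn_impute(kept_rows, k=2)  # k=2 only because the toy table is tiny
print(kept_names)
print(filled[1])
```

In this toy run the "sparse" column is eliminated and the missing "doctors" value in the second row is replaced by the mean of its two nearest neighbours. The project itself reports using K=5; in practice a library implementation such as scikit-learn's KNNImputer with n_neighbors=5 would be a natural choice, although the source does not state which implementation was used.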