Machine learning for the prediction of preclinical airway management in injured patients: a registry-based trial
Abstract
Objective
The aim of this study was to determine the feasibility of using machine learning to establish the need for preclinical airway management for injured patients based on a standardized emergency dataset.
Methods
A registry-based, retrospective analysis was conducted of adult trauma patients who were treated by physician-staffed emergency medical services in southwestern Germany between 2018 and 2020. The primary outcome was to assess the feasibility of using the random forest (RF) and Naive Bayes (NB) machine learning algorithms to predict the need for preclinical airway management. The secondary outcome was to use a principal component analysis to determine the attributes that can be used and advanced for future model development.
Results
In total, 25,556 adults with multiple injuries were identified, including 1,451 patients (5.7%) who required airway management. Key attributes were auscultation, injury pattern, oxygen therapy, thoracic drainage, noninvasive ventilation, catecholamines, pelvic sling, colloid infusion, initial vital signs, preemergency status, and shock index. The area under the receiver operating characteristic curve was 0.96 for RF (95% confidence interval [CI], 0.96–0.97) and 0.93 for NB (95% CI, 0.92–0.93; P<0.01). For the prediction of airway management, RF yielded a higher precision-recall area than NB (0.83 [95% CI, 0.80–0.85] vs. 0.66 [95% CI, 0.61–0.72], respectively; P<0.01).
Conclusion
To predict the need for preclinical airway management in injured patients, attributes that are commonly recorded in standardized datasets can be used with machine learning. In future models, the RF algorithm could be used because it has robust prediction accuracy.
INTRODUCTION
International guidelines recommend preclinical airway management as a potential life-saving procedure for severely injured patients with traumatic brain injury and a Glasgow Coma Scale (GCS) <9; severe respiratory insufficiency, for example, due to thoracic trauma or airway injuries; or trauma-associated shock [1-3]. However, preclinical airway management is a high-risk procedure due to imminent hypoxia, challenging environmental conditions, and varying clinician experience in managing difficult airway situations [4,5]. Because hemodynamic conditions and the patient’s state of awareness can change quickly, preclinical trauma care is a highly dynamic situation. Therefore, an ability to predict or exclude the need for airway management would assist decision-making.
In recent years, several machine learning models that can predict the need for endotracheal intubation in intensive care patients have been published. They are based on electronic medical record systems and common clinical hemodynamic and laboratory parameters [6-9]. In preclinical trauma medicine, no such model exists.
German emergency medical services are divided into paramedic and emergency physician systems (ground or air), which are alerted by the rescue coordination center in parallel or sequentially depending on the emergency. Certain medical interventions, such as drug therapy or airway management, are restricted by law to emergency physicians except when needed for resuscitation or when an emergency physician is unavailable. German emergency physicians are recruited mainly from fields such as anesthesiology, internal medicine, and surgery. The specialization can be achieved in parallel with main medical specialist training after two years of clinical practice, which must contain at least a 6-month rotation in the accident and emergency department or intensive care unit [5,10]. For quality improvement, the German state of Baden-Wuerttemberg (population, 11.1 million in 2020; area, 35,751 km²; capital, Stuttgart) created a Center for Quality Management in Emergency Medical Services in 2011. Since then, all paramedics and preclinical emergency physicians have had to provide anonymous, digital documentation to the minimal emergency dataset (MIND) [10,11]. The MIND has the advantage of being used throughout Germany, and it also contains internationally standardized examination findings, diagnoses, and interventions that are used in the German Trauma Registry and the German Resuscitation Registry. Divided into subcategories according to the Advanced Trauma Life Support (ABCDE) algorithm at first contact and hospital admission and supplemented by a free text anamnesis and history (including vital signs diagram) of pharmaceutical therapy and medical interventions, the MIND provides nationwide, standardized, emergency documentation. Although the free text and history sections are not available digitally, the MIND seems suitable for research with machine learning.
Therefore, the aim of this study was to evaluate the feasibility of building machine learning models to predict the need for preclinical airway management in trauma patients. As a first step, attributes of the MIND that define patients who need preclinical airway management were identified. Second, two machine learning algorithms were tested to demonstrate the accuracy of the models.
METHODS
Ethical statements
This study is reported based on the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) statement [12]. The trial was approved by the Ethics Committee of the State Medical Association of Rhineland-Palatinate (No. 2021-15767-retrospektiv). The study is a retrospective registry analysis with anonymized data. Informed consent was waived due to the retrospective nature of the study.
Design and setting
Adult patients with multiple injuries who were primarily treated by a physician-staffed ground or air ambulance from 2018 to 2020 were selected from the MIND. Deceased patients and those requiring resuscitation were excluded. Briefly, the MIND files of the remaining patients were preprocessed for attribute selection using medical causality and a principal component analysis (PCA). With the resulting attributes, Naive Bayes (NB) and random forest (RF) models were trained and tested to determine their accuracy in predicting whether those injured patients were given preclinical airway management. Patient selection, dataset creation, and the analyses are illustrated in Fig. 1.
Definition
The MIND does not yet contain anesthesia as an attribute. Therefore, emergency general anesthesia in any injured patient was defined as documentation of invasive airway management, positive end tidal CO2 without noninvasive ventilation (NIV) at admission, documented invasive ventilation at admission and the use of a muscle relaxant, or any use of a muscle relaxant. The main assumption was the correct indication of preclinical emergency anesthesia.
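The composite outcome definition above can be sketched as a small labeling function. This is only an illustration: the record type and all field names are hypothetical and do not correspond to actual MIND attribute names, and the grouping of the four criteria is one possible reading of the text. Under this reading the third criterion is already covered by the fourth (any use of a muscle relaxant); it is kept only to mirror the wording.

```python
from dataclasses import dataclass


@dataclass
class MindRecord:
    """Hypothetical subset of one MIND record; field names are illustrative only."""
    invasive_airway: bool = False                    # invasive airway management documented
    etco2_at_admission: bool = False                 # positive end-tidal CO2 at admission
    niv_at_admission: bool = False                   # noninvasive ventilation at admission
    invasive_ventilation_at_admission: bool = False  # invasive ventilation at admission
    muscle_relaxant: bool = False                    # any muscle relaxant given


def had_airway_management(r: MindRecord) -> bool:
    """Composite outcome label built from the four criteria in the text."""
    return (
        r.invasive_airway
        or (r.etco2_at_admission and not r.niv_at_admission)
        or (r.invasive_ventilation_at_admission and r.muscle_relaxant)
        or r.muscle_relaxant
    )
```

A record with positive end-tidal CO2 and documented NIV at admission would not be labeled as airway management under this sketch, because the capnography signal is then attributed to NIV.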
Attribute selection and data preprocessing
The MIND includes more than 550 anonymized attributes, including specialization of the physician, standardized clinical examination findings, medical diagnoses, injury patterns in relation to particular body parts (classified as none, mild, moderate, severe, or deadly by the attending physicians), blunt or penetrating trauma, and vital signs at first contact and hospital admission, including the GCS, heart rate, systolic blood pressure, respiratory rate, oxygen saturation, end tidal CO2, temperature, blood glucose level, and pain level. Furthermore, electrocardiogram findings (at first contact and hospital admission), medication (without dosage or timing), treatment (NIV, invasive airway management, thoracic drainage, pelvic sling), infusion therapy (crystalloid/colloid infusion, blood products), age, preemergency status (PES; a preclinically adapted classification of the American Society of Anesthesiologists), time on site, and transport time are recorded in the dataset [13].
The datasets of patients with cardiac arrest were excluded because abstaining from resuscitation could bias the weighting of certain attributes. Only datasets with at least two of the following three attributes, initial GCS, systolic blood pressure, and oxygen saturation, were included because those parameters represent the guidelines’ recommendations [1-3].
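The documentation criterion (at least two of the three initial parameters recorded) could be expressed as follows. The function name and the use of `None` for an undocumented value are assumptions made for this sketch.

```python
from typing import Optional


def sufficiently_documented(
    gcs: Optional[int],
    systolic_bp: Optional[float],
    spo2: Optional[float],
) -> bool:
    """True if at least two of initial GCS, systolic blood pressure,
    and oxygen saturation are recorded (None means not documented)."""
    recorded = sum(value is not None for value in (gcs, systolic_bp, spo2))
    return recorded >= 2
```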
In data preprocessing, generally accepted attributes in the training set with potential correlations but no medical causality were excluded from the machine learning analysis (e.g., place of accident), as were causal attributes without any frequent occurrence in one of the two classes. Attributes correlating with indications for airway management were identified using international guidelines about respiratory, neurological, or hemodynamic findings and injury patterns [2,3,14]. However, because critical volume loss and (developing) shock are not directly recorded in the MIND, surrogate parameters such as pelvic sling or tranexamic acid were also included.
The imputation of missing data was not considered due to the nominal character of most attributes. Because the remaining attributes all contributed with different weightings, a PCA was performed on the whole dataset using the wrapper method with a bidirectional search and a C4.5 decision tree (J48) with tenfold cross-validation (settings in Supplementary Table 2) [15]. The Java-based software Weka ver. 3.8.4 (University of Waikato, Hamilton, New Zealand) was used for the PCA and machine learning [16,17]. Statistical comparison of the attributes between the two classes (airway management and no airway management) was performed with the chi-square test, U-test, or t-test, as appropriate, in Microsoft Excel (Microsoft Corp., Redmond, WA, USA). A P-value of less than 0.05 was defined as significant. Continuous variables are expressed as means and standard deviations, and categorical variables are expressed as percentages.
Class balancing, training, and testing
The data were split into a 60% training set and 40% test set 10 times with a randomized split procedure to define the performance of the algorithms with different frequencies of invasively ventilated patients. In general, machine learning algorithms tend to learn and predict the majority class, whereas most studies are interested in the minority class. To handle that class imbalance problem for the minority class that received airway management, the synthetic minority oversampling technique (SMOTE) was used to triple the airway management class in the training sets, but not in the test sets. SMOTE creates each new minority instance by interpolating between an existing minority instance and one of its k=5 nearest minority neighbors (Supplementary Table 3) [18]. This procedure was chosen because Weka does not offer a cross-validation that uses SMOTE in training but not in testing. Tripling the minority class was considered appropriate to improve the predictions while preventing overfitting. For supervised machine learning, the NB and RF methods were chosen (Supplementary Table 4). Both algorithms can handle missing values.
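The oversampling step can be illustrated with a minimal SMOTE-style sketch in plain Python. It is not the Weka filter used in the study: it handles numeric features only (most MIND attributes are nominal), assumes at least two minority instances, and all function names are hypothetical. Each minority instance yields two synthetic neighbors, so the minority class in the training set ends up three times its original size; the test set is never passed through this function.

```python
import math
import random


def smote(minority, per_instance=2, k=5, seed=0):
    """Create per_instance synthetic points for every minority point by
    interpolating towards one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, base in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(per_instance):
            mate = rng.choice(neighbours)
            gap = rng.random()  # position on the segment from base to mate
            synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def triple_minority(train_x, train_y, minority_label=1, k=5, seed=0):
    """Return a training set in which the minority class is tripled."""
    minority = [x for x, y in zip(train_x, train_y) if y == minority_label]
    extra = smote(minority, per_instance=2, k=k, seed=seed)
    return train_x + extra, train_y + [minority_label] * len(extra)
```

Because every synthetic point lies on a line segment between two real minority points, the new instances stay inside the region already occupied by the minority class rather than duplicating existing records.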
Model performance
All results are presented as means with 95% confidence intervals (CIs). As performance criteria, overall correctness, kappa value, the area under the receiver operating characteristic curve (AUC-ROC), sensitivity (need for airway management), specificity (no need for airway management), positive predictive value (PPV) and negative predictive value (NPV), and the precision-recall (PRC) area were chosen [15]. The Matthews correlation coefficient (MCC) was used to measure the quality of the two presented classes of very different sizes (range: –1, total disagreement; 0, random prediction; +1, perfect prediction) [19]. The cost-benefit calculation for the RF algorithm was performed automatically for the lowest overall error rate. The performance across all 10 test sets was averaged and compared with a t-test (P<0.05 as significant, calculated in Microsoft Excel).
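For orientation, the threshold-dependent criteria listed above can all be derived from the four cells of a binary confusion matrix. The following sketch uses the standard textbook formulas; the function name is hypothetical, and it assumes that no row or column total of the matrix is zero. The study itself obtained these values from Weka.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard performance measures from a 2x2 confusion matrix,
    with airway management as the positive class."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, needed for Cohen's kappa
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (accuracy - p_chance) / (1 - p_chance),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

Unlike overall correctness, the MCC uses all four cells symmetrically, which is why it remains informative when one class is much larger than the other.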
RESULTS
Out of more than 130,000 injured patients, 26,765 patients with multiple injuries were selected. Of the selections, 869 resuscitations, 6 fatal cases, and 335 insufficiently documented datasets were then excluded, leaving 25,556 datasets with 1,451 cases (5.67%) of airway management.
Data preprocessing identified 31 attributes with potential correlation or medical causality. In the PCA, 24 attributes were selected, among them auscultation, injury pattern without the upper limbs or soft parts, oxygen therapy, NIV, tranexamic acid and catecholamines, pelvic sling, vital signs, PES, and shock index. With the exception of initial systolic blood pressure and respiratory rate (P>0.05), the groups with and without airway management differed significantly (Table 1). For further information about nonselected attributes see Supplementary Table 1.
In overall correctness, the RF outperformed the NB (97.80% [95% CI, 97.57%–98.03%] vs. 93.55% [95% CI, 93.11%–93.99%], respectively; P<0.01). The RF reached a significantly higher kappa value (0.78 [95% CI, 0.75–0.80]) than the NB (0.54 [95% CI, 0.52–0.56]; P<0.01). In the AUC-ROC analysis, the RF reached 0.96 (95% CI, 0.96–0.97), and the NB reached 0.93 (95% CI, 0.92–0.93; P<0.01) (Fig. 2A). Furthermore, the RF model had a significantly higher MCC than the NB approach (0.78 [95% CI, 0.76–0.80] vs. 0.56 [95% CI, 0.54–0.57], respectively; P<0.01).
In sensitivity, that is, in predicting the use of airway management, the difference between the NB and RF results was not statistically significant (0.75 [95% CI, 0.73–0.76] vs. 0.73 [95% CI, 0.71–0.76], respectively; P=0.38). The RF achieved the higher PPV (0.85 [95% CI, 0.84–0.87]; NB, 0.46 [95% CI, 0.44–0.49]; P<0.01). This also resulted in a larger PRC area for the RF (0.83 [95% CI, 0.80–0.85]; NB, 0.66 [95% CI, 0.61–0.72]; P<0.01) (Fig. 2B).
Both algorithms yielded a very high specificity (RF, 0.993 [95% CI, 0.992–0.994] vs. NB, 0.947 [95% CI, 0.942–0.952]; P<0.01), a high NPV (RF, 0.984 [95% CI, 0.980–0.987] vs. NB, 0.984 [95% CI, 0.983–0.985]; P=0.85), and a high PRC area (RF, 0.996 [95% CI, 0.996–0.997] vs. NB, 0.992 [95% CI, 0.992–0.993]; P<0.01) (Table 2).
The average threshold of the RF model was 0.51 (95% CI, 0.49–0.53). Due to the decision process used by the NB, no average threshold can be given for it. The three most important attributes in the RF were systolic blood pressure (0.306±0.019), head injury (0.305±0.013), and initial heart rate (0.294±0.018) (Fig. 3).
DISCUSSION
This study set out to develop a decision model for determining the necessity of preclinical airway management in adult trauma patients. Commonly recorded preclinical attributes such as injury pattern, certain examination findings, vital signs, and emergency medical interventions were found to be most influential in forecasting the need for preclinical airway management. Both models developed here showed excellent results in excluding the need for airway management, but only the RF model had satisfactory accuracy in predicting it. Therefore, the feasibility of using machine learning to predict the need for airway management in preclinical trauma patients has been confirmed, but the models need to be advanced. Nonetheless, even before a final model can be implemented in the electronic medical records, the attributes determined here can already be used clinically to alert emergency physicians about trauma patients at increased risk of requiring airway management. For example, the absence of severe head or thoracic injury, catecholamine therapy, thoracic drainage, or NIV could justify a later evaluation of airway protection. To the best of our knowledge, this analysis is the first to use machine learning to forecast airway management in a preclinical environment. However, several factors need to be considered to interpret and advance the results.
Database, attribute selection, and model comparison
The more distinct the pathological findings in the initial parameters, the better the classification by the algorithms could be. However, differences in attributes such as GCS or oxygen saturation were marginal, and their averages were physiological, which was partly reported in other clinical modeling studies [8,20,21]. This could be explained by belated documentation of paramedically stabilized vital signs.
Attribute choice is always a compromise between overgeneralization (selecting only attributes with strong correlation or causality) and overfitting (selecting many attributes, even those with weak correlation). The PCA in this study filtered in attributes with strong indirect correlations with airway management. For example, the use of catecholamines can be interpreted as a surrogate for hemodynamic instability before or after airway management in emergency anesthesia. Other surrogates were colloid infusion, pelvic sling, and tranexamic acid for potential blood loss (attribute tourniquet not included in MIND). NIV can be discussed as a surrogate for respiratory failure or a method of preoxygenation. Although the shock index is only to some extent reliable for the diagnosis of shock, it had weight in combination with other attributes [22,23]. Because preclinical emergency physicians in Germany usually lack point-of-care and radiographic findings, they have to use a less-reliable clinical examination with baseline vital signs for their time-critical decision-making. The surrogate parameters used in this study can therefore be seen as a replacement for real-time vital signs. They also reflect to some extent the recommendations for airway management in patients with traumatic respiratory disorder, brain injury, and shock [1-3]. Future prediction models in preclinical airway management should combine attributes emphasized in the guidelines with selected surrogates that reflect the dynamics of preclinical emergency medicine to compensate for any lack of real-time parameters.
Compared with other studies, a main distinction of this study is the restriction to initial vital signs and adaptation to preclinical conditions [2,3]. Siu et al. [20] used an additional blood gas analysis with sequential organ failure assessments at multiple time points for their RF model to predict the need for intubation in the first 24 hours after a critical care admission (sensitivity, 0.88; specificity, 0.66; AUC-ROC, 0.86; PPV, 0.73; NPV, 0.85). Arvind et al. [6] reported an AUC-ROC of 0.84 and a PRC area of 0.3 for their RF model for predicting mechanical ventilation in COVID-19 patients based on vital signs and a blood gas analysis. In neonatal intensive care, Clark et al. [8] demonstrated a boosted logistic regression model with an AUC-ROC of 0.84. Politano et al. [21] could predict urgent intubation in a trauma intensive care unit with an AUC-ROC of 0.770 to 0.865 with the help of a boosted logistic regression using multiple sampling windows for vital signs along with age, oxygen partial pressure, and days since extubation.
Model performance
With regard to the performance of both algorithms, several factors about their basic method of calculation and the prevalence of airway management must be considered. In this study, the ROC curve alone overestimates the model performance because of the class imbalance problem (94% without emergency anesthesia) and the very high specificities and negative predictive values. Therefore, the goodness of class prediction can best be evaluated by the PRC area, which showed that the RF had a robust prediction accuracy [24].
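The effect of class imbalance on precision can be made concrete with Bayes' rule. The short sketch below plugs the averaged sensitivities and specificities reported in the Results, together with the overall prevalence of 1,451 in 25,556, into the standard PPV formula. The function name is hypothetical, and because the inputs are averages over 10 test sets, the output is only an approximation of the reported PPVs.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity,
    and the prevalence of the positive class (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


prevalence = 1451 / 25556  # about 5.7% airway management
nb_ppv = ppv_from_rates(0.75, 0.947, prevalence)  # NB, averaged values
rf_ppv = ppv_from_rates(0.73, 0.993, prevalence)  # RF, averaged values
```

Under these inputs the implied PPVs come out near 0.46 for the NB and 0.86 for the RF, close to the reported 0.46 and 0.85. A specificity gap of less than five percentage points, which barely moves the ROC curve, therefore roughly halves the precision at this prevalence, which is why the PRC area separates the two models more clearly.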
The basic assumption of the NB is the independence of all attributes without any correlation. Such a level of independence is almost never found in real-world data. In this study, the auscultation findings, respiratory rate, and oxygen saturation all influence one another, as do the GCS score and face and/or head injury. The decision process in favor of or against a class is performed by comparing the summed probability of the test case to the summed probability of the class, which leads to the poor calibration observed here. The advantage of an NB approach is its fast calculation and simple implementation. Also, the arithmetic means and variance are parameterized independently of all other variables [15].
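How the independence assumption enters the NB decision can be seen in a minimal categorical Naive Bayes sketch. It is an illustration under stated assumptions, not the Weka implementation used in the study: all attributes are treated as nominal, Laplace smoothing is applied, missing values (`None`) are skipped, every attribute is assumed to occur at least once in training, and the function names are hypothetical. Each attribute contributes its own factor to the class score, regardless of how strongly it is correlated with the others.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    levels = defaultdict(set)            # attribute index -> observed values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                value_counts[(j, label)][value] += 1
                levels[j].add(value)
    return class_counts, value_counts, levels


def predict_nb(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the class
        for j, value in enumerate(row):
            if value is None:
                continue  # a missing attribute contributes no factor
            counts = value_counts[(j, label)]
            # Laplace smoothing keeps unseen values from giving probability zero
            score += math.log(
                (counts[value] + 1) / (sum(counts.values()) + len(levels[j]))
            )
        scores[label] = score
    return max(scores, key=scores.get)
```

When two attributes carry largely the same information, their factors are still multiplied as if they were separate pieces of evidence, so the resulting class probabilities are pushed towards 0 or 1, which is one common explanation for poorly calibrated NB outputs.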
Unlike in the NB, independence is not a basic assumption of an RF. Decision trees have the advantage of using the same attributes on different levels in different dependencies. In contrast to a single decision tree model, an RF uses the bagging procedure, by which multiple random trees each calculate a prediction. Those are then averaged to reach a final decision. This explains not only why the RF achieved better results than the NB but also the weights of certain attributes whose differences were marginal. Those same effects also appear in the PCA, because it also uses a decision tree model. Therefore, the RF is robust to outliers, works well with nonlinear data, and has a lower risk of overfitting than single decision trees. As a result, the RF could handle even the relatively small prevalence of airway management cases in the test sets, achieved a good PRC area, and had a robust performance [15,25]. Because the prevalence varies between the different test sets, the resulting RFs differ, and a single final model cannot be given.
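The bagging idea described above can be sketched with a deliberately simplified ensemble. Instead of full random trees, each member is a one-level decision stump on one randomly chosen numeric feature, trained on a bootstrap sample; the averaged votes give a class probability that can be compared with a decision threshold. This is a toy illustration with hypothetical function names, not the RF configuration of the study, and the stumps only model the case in which higher feature values indicate the positive class.

```python
import random


def fit_stump(xs, ys, feature):
    """Pick the threshold with the fewest training errors for the rule
    'predict 1 when x[feature] > threshold'."""
    best = None
    for t in sorted({x[feature] for x in xs}):
        errors = sum((x[feature] > t) != bool(y) for x, y in zip(xs, ys))
        if best is None or errors < best[0]:
            best = (errors, t)
    return feature, best[1]


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: every stump sees its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    n, n_features = len(xs), len(xs[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)         # random feature choice
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], feature))
    return forest


def predict_proba(forest, x):
    """Average the 0/1 votes of all stumps to get a class-1 probability."""
    return sum(x[f] > t for f, t in forest) / len(forest)
```

Averaging over many differently sampled members is what dampens the influence of single outliers and reduces the overfitting risk relative to one deep tree.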
Further limitations
Given the limitations discussed above and those that follow, this study represents only a first attempt to build a sustainable, general model for predicting preclinical airway management. Overreliance on machine learning in high-risk situations can result in potential patient hazards. Future models are also needed for internal and neurological patients. These results were developed in a physician-staffed emergency medical system and therefore cannot be simply transferred to paramedic systems [26]. The weighting of certain attributes could be changed by alterations in clinical practice. The timing of interventions is missing from MIND, which limits the applicability of the models presented here. Unlike previous prediction models for resuscitation, attributes such as trauma site were not included in the data used here. Whereas in resuscitation, the site of cardiac arrest is directly linked to bystander cardiopulmonary resuscitation, there is no such correlation for trauma site or mechanism and airway management, only for trauma severity [3,27]. Unfortunately, that severity can only be assessed by the primary physical exam and not by later radiographic findings and hospital data. Although this study used data from a statewide emergency medical service, no independent external test set from another German region was used here. Therefore, statements about the stability of the models with regard to noise and overfitting must be made with caution. Unlike in other studies, the imputation of missing values in this study was not reasonable, mainly due to static nominal, binary, or ordinal attributes [6,20]. Whether emergency physicians postponed endotracheal intubation because of a potentially difficult airway or a lack of experience cannot be stated because no further clinical records were available [5]. Also, the correct indication for airway management and primary assessment according to the ABCDE algorithm could not be checked in every single case due to the retrospective design and dataset structure.
In machine learning, unsupervised deep learning neural networks have recently outperformed supervised approaches such as the RF. However, those deep learning models require a large amount of data and computing power. Network creation is complex, unstandardized, and time-consuming. Because this study focused on a simple binary problem, and the data structure was inconsistent, RF and NB were chosen. The supplementary data contain a first approach to a deep learning neural network, but it performed worse than the RF in predicting the need for airway management (Supplementary Table 5 and Supplementary Fig. 1). Nonetheless, a deep learning application might be suitable for future models, especially with real-time attributes [15].
CONCLUSION
In conclusion, this study has shown the feasibility of using a machine learning model to predict the need for airway management in injured patients. The RF model combined a satisfactory prediction performance with an excellent ability to exclude the need for airway management in trauma patients. Because the many attributes available can be a hindrance in quickly assessing trauma patients, models such as those presented here could already be used as surveillance tools in the background or to send the intubation probability to the hospital, where additional resources could be activated. Embedded in a continuous electronic medical record and expanded by data about internal patients, real-time parameters, and point-of-care tests, an RF-based prediction model could be made more reliable and support preclinical decision-making or quality management. In the future, patients at risk could be identified early with the help of such a machine learning model.
SUPPLEMENTARY MATERIAL
Supplementary materials are available at https://doi.org/10.15441/ceem.22.335. Further supplementary data, including single random forest models, are available upon reasonable request via e-mail. Due to data protection, the datasets cannot be published, but research with the database is possible upon request to the Center for Quality Management in Emergency Medical Services Baden-Wuerttemberg (SQR-BW).
Notes
CONFLICT OF INTEREST
No potential conflict of interest relevant to this article was reported.
FUNDING
This work was supported by the Department of Anesthesiology, Operative Intensive Care Medicine and Emergency Medicine, Ludwigshafen Municipal Hospital.
AUTHOR CONTRIBUTIONS
Conceptualization: AL; Data curation: AL, TL, JE; Formal analysis: AL; Funding acquisition: WZ; Investigation: AL; Methodology: AL; Project administration: TV; Resources: TL, JE; Software: AL; Supervision: WZ, MT, TV; Validation: AL; Visualization: AL; Writing–original draft: AL; Writing–review & editing: WZ, TL, JE, MT, TV.
All authors read and approved the final manuscript.
References
Capsule Summary
What is already known
Preclinical airway management is a high-risk procedure. Other than a Glasgow Coma Scale of less than 9 or acute respiratory insufficiency, there are few methods to predict the need for preclinical airway management.
What is new in the current study
We developed and validated a machine learning model to predict the need for airway management in injured patients.