Application of convolutional neural networks for distal radio-ulnar fracture detection on plain radiographs in the emergency room
Abstract
Objective
Recent studies have suggested that deep-learning models can satisfactorily assist in fracture diagnosis. We aimed to evaluate the performance of two such models in wrist fracture detection.
Methods
We collected image data of patients who visited the emergency department with wrist trauma. A dataset extracted from January 2018 to May 2020 was split into training (90%) and test (10%) datasets, and two types of convolutional neural networks (i.e., DenseNet-161 and ResNet-152) were trained to detect wrist fractures. Gradient-weighted class activation mapping was used to highlight the regions of radiograph scans that contributed to the decision of the model. Performance of the convolutional neural network models was evaluated using the area under the receiver operating characteristic curve.
Results
For model training, we used 4,551 radiographs from 798 patients and 4,443 radiographs from 1,481 patients with and without fractures, respectively. The remaining 10% (300 radiographs from 100 patients with fractures and 690 radiographs from 230 patients without fractures) was used as a test dataset. The sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of DenseNet-161 and ResNet-152 in the test dataset were 90.3%, 90.3%, 80.3%, 95.6%, and 90.3% and 88.6%, 88.4%, 76.9%, 94.7%, and 88.5%, respectively. The area under the receiver operating characteristic curves of DenseNet-161 and ResNet-152 for wrist fracture detection were 0.962 and 0.947, respectively.
Conclusion
We demonstrated that DenseNet-161 and ResNet-152 models could help detect wrist fractures in the emergency room with satisfactory performance.
INTRODUCTION
Wrist fractures are commonly diagnosed using simple radiographic images [1], and the corresponding treatment depends on the shape and stability of the fracture. Computed tomography can provide a more accurate assessment of the presence and type of fracture and a better joint evaluation [2]; nevertheless, radiographs remain a rapid, low-cost primary method for early wrist trauma evaluation [3]. However, up to 30% of wrist fractures are misdiagnosed on radiographs [4], which can result in mistreatment.
Recently, studies have used various deep-learning models as assistive methods for more accurate and efficient fracture diagnoses; several models (e.g., VGGNet, Inception-ResNet, Faster RCNN, and ViDi) have shown satisfactory performance [5-10]. However, because previous studies collected and analyzed heterogeneous image data, there were limitations to the clinical use of these models, especially in emergency room (ER) scenarios.
In the present study, we collected image data of emergency room patients with wrist trauma. Two types of convolutional neural networks (CNN) (i.e., DenseNet-161 and ResNet-152) were applied. The purpose of this study was, therefore, to evaluate the performance of the fracture detection model by analyzing the accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) of each CNN.
METHODS
Study participants
We included data of ER patients with wrist trauma who underwent plain radiography between January 2018 and May 2020. Bitmap (BMP) images of 1,776×2,132 pixels were retrieved from the hospital’s picture archiving and communication system (PACS) using INFINITT PACS M6 software (INFINITT Healthcare, Seoul, Korea). Poor-quality images and those lacking radiologist classifications were excluded. Annotations and personal information were omitted. Three image views (i.e., anteroposterior [AP] and both oblique views) were included for each patient. Images were classified into non-fracture and fracture groups based on dual radiological reporting. The fracture group included images of radial, ulnar, and radio-ulnar fractures. The participation flowchart is presented in Fig. 1. Approval from the institutional review board of Hallym University Sacred Heart Hospital was obtained (2020-07-030), and participant informed consent was waived because of the retrospective nature of the study; moreover, this study adhered to the principles of the Declaration of Helsinki.
Dataset construction
We split the dataset into two subsets: training and testing. Radiographs taken between January 2018 and December 2019 were included in the training dataset, and those taken between January 2020 and May 2020 in the testing dataset. Furthermore, we allocated 10% of the training dataset, using patient identification numbers, to a tuning dataset for hyperparameter tuning. The three datasets were separated from each other. The number of radiographs in the fracture group was approximately half that of the non-fracture group. This class imbalance was mitigated by oversampling: the number of fracture-group radiographs in the training dataset was doubled by duplicating each image with a 10% zoom-in, as sketched below. No other augmentation, such as flipping or rotation, was applied.
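For illustration, the oversampling step can be sketched as follows: each fracture radiograph is duplicated with a fixed 10% center zoom, i.e., the central 90% of the image is cropped and resized back to the original dimensions. This is a minimal sketch, not the study's exact implementation; the interpolation mode is our assumption.

```python
from PIL import Image

def zoom_in_10_percent(img: Image.Image) -> Image.Image:
    """Duplicate-augmentation for oversampling: crop the central 90%
    of the radiograph, then resize back to the original dimensions."""
    w, h = img.size
    crop_w, crop_h = int(w * 0.9), int(h * 0.9)
    left, top = (w - crop_w) // 2, (h - crop_h) // 2
    cropped = img.crop((left, top, left + crop_w, top + crop_h))
    return cropped.resize((w, h), Image.BILINEAR)  # interpolation assumed
```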
Data preprocessing
Images were preprocessed using a contrast-limited adaptive histogram equalization (CLAHE) algorithm to enhance local contrast [11]. CLAHE addresses the noise amplification in small, near-homogeneous image regions that often occurs when standard adaptive histogram equalization is applied: before the cumulative distribution function is calculated, histogram values are clipped to a predefined limit. Consequently, the processed images revealed fractures more clearly. The CLAHE algorithm was implemented using OpenCV ver. 4.1.2.30 in Python. All images were then reduced to 550×660 pixels in consideration of memory capacity, batch sizes, training times, and model performance.
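A minimal version of this preprocessing step with OpenCV is shown below. The clip limit and tile grid size are common defaults rather than values reported in the study, and the file path is a placeholder; the target size follows the 550×660-pixel dimensions stated above.

```python
import cv2

# Load an 8-bit grayscale radiograph (placeholder path).
img = cv2.imread("wrist_ap.png", cv2.IMREAD_GRAYSCALE)

# CLAHE: clip the per-tile histogram before equalization to limit
# noise amplification in small, near-homogeneous regions.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # assumed values
enhanced = clahe.apply(img)

# Downscale to 550x660 pixels (cv2.resize takes (width, height)).
resized = cv2.resize(enhanced, (550, 660), interpolation=cv2.INTER_AREA)
```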
CNN
In this study, we adopted two CNN architectures: DenseNet-161 and ResNet-152. DenseNet-161 consists of dense blocks, within which the output feature maps of every layer are propagated to all deeper layers as input; the architecture thus reuses all previous feature maps to classify target objects without adding layers [12]. ResNet-152 is built from residual blocks [13]: a skip connection, in which an input feature map is added to the output feature map of a deeper layer, enables the CNN to learn residual features and to be trained with much deeper layers [13]. This type of CNN won the ImageNet Large Scale Visual Recognition Challenge in 2015 in image classification, detection, and localization [14]. The two CNN models used in this study were pretrained on the ImageNet dataset and fine-tuned during training.
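A brief sketch of how the two ImageNet-pretrained backbones can be adapted for this task is shown below, using torchvision. The single-logit classification head is our assumption, consistent with the binary cross-entropy loss described next; the paper does not specify the exact head.

```python
import torch.nn as nn
from torchvision import models

# Load both backbones with ImageNet-pretrained weights.
densenet = models.densenet161(pretrained=True)
resnet = models.resnet152(pretrained=True)

# Replace the 1,000-class ImageNet heads with a single fracture logit
# (assumed head; the study reports only binary classification).
densenet.classifier = nn.Linear(densenet.classifier.in_features, 1)
resnet.fc = nn.Linear(resnet.fc.in_features, 1)
```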
The Adam optimizer was used to train the two CNN models, with a beta-1 of 0.9 and a beta-2 of 0.999, using a binary cross-entropy loss function. The initial learning rate was 1e-4, and a learning rate decay policy was used: every 10 epochs, the learning rate was decreased by 90% until it reached 1e-7. The batch size was 4. The weight decay was 1e-4, and early stopping was used, with a starting point of 30 epochs and a patience of 20, to avoid overfitting. Dropout was not applied to either CNN. Both models were implemented on the PyTorch deep-learning framework and trained on an NVIDIA GeForce Titan RTX graphics-processing unit (NVIDIA, Santa Clara, CA, USA).
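The reported configuration translates to PyTorch roughly as follows. This is a sketch under stated assumptions: `model` stands for either fine-tuned backbone, and the 1e-7 learning rate floor and the early stopping logic would be enforced inside the training loop, which is omitted here.

```python
import torch
from torch import nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999),  # beta-1 and beta-2
                             weight_decay=1e-4)
# Decrease the learning rate by 90% every 10 epochs; the 1e-7 floor
# would be checked in the training loop.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on a single logit
```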
Gradient-weighted class activation mapping (Grad-CAM) was used to present the region of the radiograph scan that contributed to the classification decision of the artificial intelligence (AI) model. Grad-CAM is a generalized version of class activation mapping [15]. Grad-CAM results were obtained from the feature maps of the last convolutional layer of a CNN, generated from an input image, together with their gradients [16]. The gradients of the feature maps were averaged using global average pooling to obtain per-channel weights, and the feature maps were multiplied by these weights along the channel dimension [16]. The final color map was obtained by summing the weighted feature maps element-wise and clipping negative values to zero [16].
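A minimal Grad-CAM sketch following this formulation is given below: the gradients of the fracture logit with respect to the last convolutional feature maps are global-average-pooled into channel weights, the weighted feature maps are summed over channels, and negative values are clipped to zero. `model` and `target_layer` are placeholders (e.g., the last dense block of DenseNet-161).

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image):
    """Return a [0, 1]-normalized class activation map for one image."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    logit = model(image.unsqueeze(0))   # single fracture logit
    model.zero_grad()
    logit.sum().backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1))      # weighted sum, ReLU
    return cam / (cam.max() + 1e-8)
```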
Statistical analysis
The normality of data distributions was evaluated using the Kolmogorov-Smirnov test to select the appropriate parametric and non-parametric statistical methods. Categorical variables were analyzed using the chi-square test. Continuous variables were expressed as mean±standard deviation and analyzed using the Student’s t-test. Performance of the CNN models was evaluated using the separate test dataset, and AUROCs were computed. The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value were calculated at the point of the receiver operating characteristic (ROC) curve that maximized the Youden index. The Youden index, J=sensitivity+specificity−1, is a criterion for finding the optimal threshold of an ROC curve, regardless of prevalence [6,17]. We used DeLong’s test to compare the performances of the two models. Two-tailed tests were used for all comparisons, and group differences with P<0.05 were considered statistically significant. All statistical analyses were performed using IBM SPSS Statistics ver. 21.0 (IBM Corp., Armonk, NY, USA).
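For illustration, the AUROC and the Youden-optimal operating point can be computed as follows; `y_true` and `y_score` are placeholders for the test-set labels and the model's predicted fracture probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auroc = roc_auc_score(y_true, y_score)

# J = sensitivity + specificity - 1 = tpr - fpr; choose the threshold
# that maximizes it.
j = tpr - fpr
best_threshold = thresholds[np.argmax(j)]
```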
RESULTS
Dataset details
Patients’ demographic characteristics are summarized in Table 1. We included and analyzed the radiographs of 2,609 patients with wrist trauma admitted to the ER from January 2018 to May 2020. Among 898 patients with fractures, 22 (2.4%) had ulnar fractures alone, 482 (53.7%) had radial fractures alone, and 394 (43.9%) had radio-ulnar fractures. The mean ages of the overall, non-fracture, and fracture groups were 42.1, 41.5, and 44.2 years, respectively (P=0.004). The distribution of age groups did not differ significantly between patients with and without fractures (P=0.936). However, the sex ratio differed significantly between the non-fracture and fracture groups (53.8% vs. 45.9% men, respectively; P<0.01). For model training, we used 4,551 radiographs from 798 patients with fractures and 4,443 radiographs from 1,481 patients without fractures. The remaining 10% of the whole dataset (300 radiographs from 100 patients with fractures and 690 radiographs from 230 patients without fractures) was used as a test dataset (Table 2).
Performance of DenseNet-161 and ResNet-152 in wrist fracture detection
The sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of DenseNet-161 and ResNet-152 with the test dataset are shown in Table 3 (90.3%, 90.3%, 80.3%, 95.6%, and 90.3% vs. 88.6%, 88.4%, 76.9%, 94.7%, and 88.5%, respectively). The confusion matrices and ROC curves for the test dataset are shown in Figs. 2 and 3. The AUROCs of DenseNet-161 and ResNet-152 for wrist fracture detection were 0.962 and 0.947, respectively. DeLong’s test demonstrated that the AUROC of DenseNet-161 differed significantly from that of ResNet-152 (P<0.05).
Fracture localization
The Grad-CAM algorithm highlighted the areas most important to the detection and classification decisions of DenseNet-161 and ResNet-152 (Fig. 4). The percentages indicate the predicted probability of a wrist fracture. The probabilities for the pediatric patient images in Fig. 4A and 4B were both 100%, and those for the adult patient images in Fig. 4C and 4D were 100% and 99.4%, respectively.
Missed diagnosis
Fig. 5 presents false-negative and false-positive detections of wrist fractures. Fractures yielding false negatives were mainly undisplaced, minimally displaced, or ulnar styloid process fractures (Table 4). Conversely, the deep-learning models mainly misclassified old fractures or artifacts as new fractures, which led to false-positive results.
DISCUSSION
The present study demonstrated that DenseNet-161 and ResNet-152 models could be trained to satisfactorily detect wrist fractures in the ER. Fractures are a frequent issue in medical litigation, and misdiagnoses or delays can result in prolonged pain and long-term complications. Thus, the presence of a fracture should be judged quickly and carefully. Several studies have applied deep learning to real clinical cases rather than curated experimental data alone; in Table 5, we summarize such results for the detection of wrist fractures [6-10,18,19]. Olczak et al. [9] supported the use of AI to identify orthopedic radiographs, reporting human-level accuracies of 83% and 84% for AI compared with 82% for clinicians. In clinical practice, during overnight shifts in small ERs when radiologists and orthopedic surgeons are absent, AI tools can be used for triage. Lindsey et al. [8] evaluated the utility of a trained model by measuring its effect on the diagnostic accuracy of a group of emergency medicine clinicians and reported that the ability of clinicians to diagnose wrist fractures improved with the aid of a deep-learning model: average clinician sensitivity and specificity improved from 80.8% to 91.5% and from 87.5% to 93.9%, respectively. These findings suggest that AI can be used efficiently in clinical practice to aid in fracture diagnosis.
In the present study, the AUROCs of DenseNet-161 and ResNet-152 were 0.962 and 0.947, respectively, similar to or somewhat lower than those reported in previous studies (AUROCs of 0.80–0.98) [6-8,10]. We suggest the following explanations. First, our models were trained and evaluated on a single dataset obtained from a confined ER environment. Although data from a specific environment limit generalizability, they reflect the prevalence and image characteristics of that setting, so the resulting models may be better suited to use in comparable real settings; Thian et al. [18], who also collected data from an ER, found an AUROC of 0.90, lower than those of other studies. Second, our dataset was relatively small compared with those of other studies: Lindsey et al. [8] used 31,490 wrist radiographs during model training, and Olczak et al. [9] obtained 65,264 wrist images. Third, fracture interpretation is typically radiologist-dependent, which can explain differences in model performance across studies. Fourth, differences in learning methods could have affected the results: the DenseNet-161 and ResNet-152 used in the present study were trained for classification only, with images labeled as fracture or non-fracture, whereas in recurrent CNN models [6] the fracture areas were indicated and the models were trained directly for localization. Fifth, our data included radiographs of children. Different classifications are sometimes required for children because the degree of growth plate fusion varies with bone age; the distal radial and ulnar epiphyses appear at ages 1 and 5 years, respectively, and tend to close between the ages of 16 and 18 years. Furthermore, fracture types vary in children. Sixth, we included radiographs with splints, which were easily interpretable, to increase resemblance to real clinical data. During the initial learning phase, the models recognized fractures partly by first recognizing the splints, which could produce false-positive results; a splint also acts as an artifact that can affect image interpretation.
Our study had the following strengths. First, we developed a model representative of the real-world clinical environment: it included all wrist radiographs from ER patients over a defined period, unlike previous studies [7,9] in which AI was trained using heterogeneous datasets. Second, DenseNet comprises dense blocks with densely connected layers [12] and has shown improved accuracy, without performance degradation or overfitting, as parameters increase; it encourages feature reuse and substantially reduces the number of parameters and the amount of computation required to achieve state-of-the-art performance. DenseNet had previously been applied to the diagnosis of ankle and hip fractures [20,21]; this study was the first to apply it to wrist fracture diagnosis. Third, previous studies [6,18] used radiographs comprising AP and lateral views, whereas our radiograph data consisted of AP and bilateral oblique views, which may have improved accuracy over two-view assessment.
This study had some limitations. First, the dataset was small because the study was conducted over a short period at a single institution. To mitigate this limitation, we used data augmentation to increase the sample size; subsequent studies with large datasets are required for external validation and more accurate model development. Second, because DenseNet-161 and ResNet-152 learn the most predictive features on their own, they can produce good visualizations of fractures and identify features previously missed by humans, but they may also rely on unintended cues; in particular, wrist angle or splint application at the time of radiography may have played a significant role in the models’ judgments. Third, this study did not compare the diagnostic ability of the DenseNet-161 and ResNet-152 models with that of clinicians.
In summary, this study demonstrated that DenseNet-161 and ResNet-152 models could be trained to detect wrist fractures in the ER with satisfactory performance.
Notes
No potential conflict of interest relevant to this article was reported.
Acknowledgements
This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) and funded by the Korean government (MSIT) (No. NRF-2019R1G1A1011227).
Capsule Summary
What is already known
Various deep-learning models have recently been studied as assistive methods for more accurate and efficient fracture diagnosis, and the models have shown satisfactory performance. However, because prior studies collected and analyzed heterogeneous image data, the clinical applicability of these models in the emergency room remains limited.
What is new in the current study
This study developed models representative of emergency room settings that recognize wrist fractures with satisfactory performance.