Abstract
Objective
We previously developed and validated an artificial intelligence-based electrocardiogram (ECG) analysis tool (ECG Buddy) in a Korean population. This study investigated the performance of this tool in a US population, specifically assessing the left ventricular (LV) dysfunction score and LV ejection fraction (LVEF)-ECG feature for predicting LVEF <40%. The study used N-terminal pro-B-type natriuretic peptide (NT-ProBNP) as a comparator.
Methods
We identified emergency department (ED) visits from the MIMIC-IV dataset with information on LVEF <40% or ≥40% and matched 12-lead ECG data recorded within 72 hours of the ED visit. The performance of ECG Buddy’s LV dysfunction score and the LVEF-ECG feature was compared with that of NT-ProBNP using area under the receiver operating characteristic curve (AUC) analysis.
Results
A total of 22,599 ED visits were analyzed. The LV dysfunction score had an AUC of 0.905 (95% confidence interval [CI], 0.899–0.910), with a sensitivity of 85.4% and specificity of 80.8%. The LVEF-ECG feature had an AUC of 0.908 (95% CI, 0.902–0.913), sensitivity of 83.5%, and specificity of 83.0%. NT-ProBNP had an AUC of 0.740 (95% CI, 0.727–0.752), with a sensitivity of 74.8% and specificity of 62.0%. The ECG-based predictors demonstrated superior diagnostic performance compared to NT-ProBNP (all P<0.001). In the sinus rhythm subgroup, the LV dysfunction score achieved an AUC of 0.913 and LVEF-ECG had an AUC of 0.917, both outperforming NT-ProBNP (AUC, 0.748; 95% CI, 0.732–0.763; all P<0.001).
INTRODUCTION
Accurate assessment of left ventricular (LV) dysfunction is critical in the emergency department (ED), where timely and effective treatment decisions often depend on evaluation of cardiac function. In conditions such as respiratory distress or shock, early identification of LV dysfunction can guide appropriate interventions. Echocardiography is the standard method for evaluating LV systolic function. However, it requires skilled operators and dedicated devices. In addition, performing echocardiography on every patient who could benefit from LV function information can be challenging during periods of high patient volume in the ED.
In recent years, artificial intelligence (AI) has become a valuable tool for analyzing electrocardiograms (ECGs) to detect conditions like myocardial infarction, heart failure, and electrolyte imbalances [1–3]. As ECGs are more readily available in the ED, AI-based ECG analysis offers a scalable alternative to echocardiography.
ECG Buddy (ARPI Inc), an AI-driven ECG analysis tool, can assess various emergencies and cardiac function abnormalities using 12-lead ECG images. Studies have demonstrated that ECG Buddy outperforms clinical experts in diagnosing critical conditions such as ST-elevation myocardial infarction (STEMI), hyperkalemia, and right ventricular dysfunction [4–7]. Additionally, its evaluation of LV systolic function has been shown to outperform N-terminal pro-B-type natriuretic peptide (NT-ProBNP), a standard biomarker used in heart failure assessment, and to match the accuracy of point-of-care ultrasound (POCUS) performed by emergency physicians [8]. However, the previous external validation studies have primarily been conducted in a Korean population [4–11], leaving a gap in evidence regarding its performance in populations with different racial profiles, particularly non-East Asian groups. This limitation must be addressed in broader populations.
This study investigated the performance of ECG Buddy in a US population, particularly its ability to predict heart failure with reduced ejection fraction (HFrEF), defined as LV ejection fraction (LVEF) <40%, and compared its diagnostic accuracy with that of NT-ProBNP.
METHODS
Ethics statement
The study used deidentified data from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database [12,13], which is publicly available for research purposes and complies with the Health Insurance Portability and Accountability Act (HIPAA) standards for de-identification. No patient consent was required due to the retrospective nature of the study and the use of anonymized data.
Study design
This study was a retrospective analysis designed to validate the performance of ECG Buddy, an AI-based ECG analysis tool, in predicting LV systolic dysfunction (LVEF <40%) in a US population. We chose 40% as the cutoff for defining LV dysfunction on the basis of established clinical guidelines for HFrEF [14,15].
Data were obtained from the MIMIC-IV dataset version 2.2 and its associated ECG dataset (MIMIC-IV-ECG ver. 1.0), which contains comprehensive de-identified health records and ECG waveform data from patients admitted to the EDs and critical care units of the Beth Israel Deaconess Medical Center (Boston, MA, USA) [12,13].
Study population and data collection
Selection of ED visits from the MIMIC-IV dataset is shown in Fig. 1. Briefly, we began by filtering 331,794 discharge notes using a regular expression search for echocardiography-related terms, reducing the pool to 97,929 records. Next, we used the GPT-4o mini model (OpenAI) to label each note based on three key requirements: (1) echocardiogram performed during admission; (2) heart-related surgeries or procedures; and (3) evidence of LVEF <40% (Supplementary Material 1). We excluded records without an echocardiogram or those with major cardiac surgeries/procedures, resulting in 33,780 records.
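The regular expression prefilter described above can be illustrated with a minimal Python sketch. The term list in `ECHO_PATTERN` and the function names are assumptions made for illustration; the actual pattern used in the study is not reported in the text.

```python
import re

# Hypothetical term list: the study's actual search pattern is not given.
ECHO_PATTERN = re.compile(
    r"\b(echo(cardiogra(m|phy|phic))?|tte|tee|lvef|ejection fraction)\b",
    re.IGNORECASE,
)


def mentions_echo(note_text: str) -> bool:
    """Return True if the note contains any echocardiography-related term."""
    return ECHO_PATTERN.search(note_text) is not None


def filter_notes(notes):
    """Keep only the notes that mention an echocardiography-related term."""
    return [note for note in notes if mentions_echo(note)]


if __name__ == "__main__":
    sample = [
        "TTE showed LVEF 35% with global hypokinesis.",
        "Admitted with cellulitis of the left leg.",
    ]
    print(len(filter_notes(sample)))  # number of notes retained
```

A prefilter of this kind only narrows the pool cheaply; it does not establish that an echocardiogram was performed, which is why the labeling and manual review steps follow.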
For patients with multiple mentions of echocardiographic examinations, we selected the first mention to capture LV function as close to the time of the initial ECG as possible. We then chose the single ECG for each visit that was chronologically nearest the time of hospital arrival, drawing on ECGs performed between 24 hours before and 72 hours after arrival, which yielded 26,042 potential cases.
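The ECG selection rule (the recording nearest to arrival, restricted to 24 hours before through 72 hours after) might be sketched as follows. The function name `pick_ecg` and its parameters are hypothetical, and the handling of ties or of the window boundaries in the study is not described, so inclusive bounds are assumed here.

```python
from datetime import datetime, timedelta


def pick_ecg(arrival, ecg_times,
             before=timedelta(hours=24), after=timedelta(hours=72)):
    """Return the ECG timestamp closest to arrival within the window.

    The window is [arrival - before, arrival + after], inclusive
    (an assumption). Returns None when no ECG falls inside it.
    """
    candidates = [t for t in ecg_times
                  if arrival - before <= t <= arrival + after]
    if not candidates:
        return None
    return min(candidates, key=lambda t: abs(t - arrival))


if __name__ == "__main__":
    arrival = datetime(2020, 1, 1, 12, 0)
    ecgs = [
        datetime(2019, 12, 30, 12, 0),  # 48 h before arrival: outside window
        datetime(2020, 1, 1, 10, 0),    # 2 h before arrival
        datetime(2020, 1, 1, 13, 30),   # 1.5 h after arrival: nearest
    ]
    print(pick_ecg(arrival, ecgs))
```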
At this stage, one board-certified emergency medicine specialist performed a targeted manual review of the discharge summaries to verify or correct the presence of LV dysfunction (LVEF <40%). Cases with ambiguous or missing LVEF data were excluded, reducing the dataset to 24,947 cases. Finally, we included only ED visits, resulting in a final dataset of 22,599 patient visits. Our enrollment criteria were as follows: (1) documentation of an echocardiogram (per GPT-4o mini screening); (2) no major cardiac surgeries or structural procedures during the admission; and (3) a definitive statement of LVEF (i.e., an explicit value or a clear description of LV dysfunction) in the discharge note. This multistep process enabled us to efficiently isolate ED visits with confirmed LV dysfunction status while minimizing the burden of manual chart review.
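For readers who want to trace the attrition at each step, the counts reported above can be tabulated with a short Python sketch. The step labels are paraphrases of the text, and the helper name `attrition` is introduced here for illustration only.

```python
# Counts as reported in the text; labels are paraphrased.
FUNNEL = [
    ("All discharge notes", 331_794),
    ("Echocardiography-related terms (regex)", 97_929),
    ("Echo performed, no major cardiac surgery/procedure", 33_780),
    ("ECG available within the arrival window", 26_042),
    ("Unambiguous LVEF after manual review", 24_947),
    ("ED visits only", 22_599),
]


def attrition(funnel):
    """Return (step name, records removed, fraction retained) per step."""
    rows = []
    for (_, previous), (name, current) in zip(funnel, funnel[1:]):
        rows.append((name, previous - current, current / previous))
    return rows


if __name__ == "__main__":
    for name, removed, retained in attrition(FUNNEL):
        print(f"{name}: removed {removed:,}, retained {retained:.1%}")
```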
Extraction of ECG biomarkers using ECG Buddy
ECG Buddy is a deep learning-based AI platform that analyzes 12-lead ECG images, available via smartphones, desktop computers, or direct electronic health records integration. It classifies heart rhythms and generates 10 digital biomarkers, termed quantitative ECG scores, for a range of emergencies, cardiac dysfunctions, and hyperkalemia [4–7]. The underlying deep learning model was pretrained on various open ECG datasets (49,731 total recordings) using a self-supervised learning scheme and then fine-tuned on a clinical dataset of 47,194 annotated ECGs from more than 32,968 patients who visited Seoul National University Bundang Hospital ED (Seongnam, Korea) from 2017 to 2019 [7]. ECG Buddy has been approved as a class II medical device in Korea and is available in various app stores. The developers have recently added a new biomarker, LVEF-ECG, which directly estimates the patient’s LVEF. This feature is not yet available for public use, as it is currently under evaluation for the Korean Ministry of Food and Drug Safety approval. In this study, we used the research version of ECG Buddy for Windows, which allows batch analysis of large ECG datasets for both biomarkers. We extracted the LV dysfunction score (a risk score for LV systolic dysfunction, LVEF <40%), ranging from 0 to 100, and LVEF-ECG, ranging from 20 to 65, to test their performance in identifying reduced LVEF (<40%) as determined from review of discharge notes.
Statistical analysis
The performance of the software for screening LV dysfunction was assessed using the area under the receiver operating characteristic curve (AUC). The initial NT-ProBNP levels within 72 hours for each of the ED visits were used as comparators. Sensitivity and specificity were determined by binning the predictors using the thresholds maximizing their Youden index. In addition to the primary analysis conducted on the entire cohort, a planned secondary subgroup analysis was performed for patients with noted sinus rhythms (normal sinus, sinus bradycardia, and sinus tachycardia). This approach allowed us to evaluate the software’s performance in the absence of arrhythmias (e.g., atrial fibrillation), which can confound ECG-based predictions. A P-value of <0.05 indicated statistical significance, and R ver. 4.1.0 (R Foundation for Statistical Computing) was used for data analysis.
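The two quantities used here, the AUC and the Youden-optimal threshold, can be illustrated with a minimal pure-Python sketch. The study itself used R, so this is not the authors' code; the function names are assumptions, and the sketch treats higher scores as indicating disease (positive when score >= threshold).

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
        points.append((thr, tp / n_pos, tn / n_neg))
    return points


def youden_threshold(scores, labels):
    """Pick the point maximizing J = sensitivity + specificity - 1."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.3, 0.8, 0.9, 0.7]  # toy data, not study data
    labels = [0, 0, 1, 1, 1, 0]
    print(round(auc(scores, labels), 3))
    print(youden_threshold(scores, labels))
```

For a predictor where lower values indicate disease, such as an estimated LVEF, the values would be negated (or the comparison reversed) before applying these functions. The pairwise AUC loop is quadratic in the number of cases and is meant only to make the definition explicit; a rank-based computation is preferable for cohorts of this size.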
RESULTS
A total of 22,599 ED visits were analyzed. Patients were divided into two groups by LV dysfunction status. The baseline characteristics of the two groups are presented in Table 1 (missing data are shown in Supplementary Table 1). Patients with LV dysfunction were older (median age, 75 years vs. 72 years; P<0.001) and more likely to be male (64.5% vs. 46.0%, P<0.001). The racial composition was predominantly White in both groups, though the proportion of Black/African American patients was slightly higher in the LVEF <40% group (22.1% vs. 18.2%, P<0.001). Dyspnea was more common in the LV dysfunction group (33.6% vs. 25.5%, P<0.001).
The performance of the ECG biomarkers is shown in Fig. 2 and Table 2. In the overall population, the LV dysfunction score generated by ECG Buddy had an AUC of 0.905 (95% confidence interval [CI], 0.899–0.910). The sensitivity was 85.4% and specificity was 80.8%, with a threshold of 0.352 determined by the Youden index. The LVEF-ECG feature had an AUC of 0.908 (95% CI, 0.902–0.913), with a sensitivity of 83.5% and specificity of 83.0% using a threshold of 47.178. In comparison, NT-ProBNP had an AUC of 0.740 (95% CI, 0.727–0.752), with a sensitivity of 74.8% and specificity of 62.0%, at a threshold of 2,828.5.
In the sinus rhythm subgroup (n=16,527), the LV dysfunction score achieved an AUC of 0.913 (95% CI, 0.906–0.920), with a sensitivity of 83.1%, specificity of 85.5%, and threshold of 0.352. The LVEF-ECG feature had an AUC of 0.917 (95% CI, 0.910–0.924), with a sensitivity of 85.5% and specificity of 83.8%, at a threshold of 49.138. NT-ProBNP’s performance in the sinus rhythm subgroup showed an AUC of 0.748 (95% CI, 0.732–0.763), with a sensitivity of 79.7% and specificity of 59.2%, using a threshold of 1,796.5.
In both the overall population and the sinus rhythm subgroup, ECG Buddy demonstrated superior diagnostic performance compared with NT-ProBNP (all P<0.001) (Supplementary Tables 2, 3).
DISCUSSION
In this study, we evaluated the diagnostic accuracy of two ECG-based AI biomarkers, the LV dysfunction score and LVEF-ECG, in comparison to NT-ProBNP for detecting left ventricular systolic dysfunction in ED patients. Both ECG-based biomarkers outperformed NT-ProBNP in identifying patients with LVEF <40%. This marks the first demonstration of ECG Buddy’s accuracy in a US population, highlighting the potential of its digital biomarkers to serve as effective diagnostic tools for LV dysfunction in emergency settings and offering a more accessible and rapid alternative to traditional methods like NT-ProBNP.
Previous studies have shown ECG Buddy’s ability to outperform clinical experts in screening for conditions such as STEMI, hyperkalemia, and right ventricular dysfunction using 12-lead ECGs. Additionally, its performance in screening for LV dysfunction has been shown to be on par with POCUS performed by emergency physicians [4–9]. However, these previous studies were primarily conducted in a Korean population. One of the strengths of this study is the analysis of a US population with a broader racial profile, increasing its applicability. This validation in a more diverse population provides new evidence of the generalizability of ECG Buddy across different racial and ethnic groups.
In the fast-paced environment of an ED, timely and accurate identification of LV dysfunction is critical for guiding treatment decisions. While echocardiography is the gold standard for assessing LVEF, it can be challenging to perform on all patients due to logistical constraints, especially during periods of high patient volume. As a result, NT-ProBNP is often used to screen for heart failure in EDs. However, NT-ProBNP is not particularly accurate or cost-effective. In contrast, ECG Buddy’s AI-driven approach offers a more accurate and scalable solution, particularly in emergency settings where ECGs are already used to evaluate multiple differential diagnoses, such as arrhythmias and myocardial ischemia. Additionally, its image-based analysis enables easy and rapid deployment through smartphones or PCs, simplifying implementation in real-world settings.
In this study, we leveraged a language model (GPT-4o mini) to perform large-scale analysis of discharge notes. With 97,929 discharge notes to review, manual inspection would have been extremely time consuming and resource intensive. Using GPT-4o mini, we efficiently filtered the dataset to approximately 26,000 cases, significantly reducing the workload and enabling a more manageable review process. While some errors were observed in determining LVEF <40%, necessitating manual chart reviews, the combination of automated filtering and selective manual review proved highly effective in handling a large clinical dataset.
This study had several limitations. First, this study was conducted retrospectively using archived ECG data rather than in a real-time ED setting. Therefore, further prospective validation is required to assess the performance of ECG Buddy in actual clinical workflows. Second, this study relied on discharge notes to determine LVEF, which introduces potential variability, as the timing of echocardiography relative to the ECG in the ED was unclear. Cardiac function may also change during hospitalization due to disease progression and treatments. Third, NT-ProBNP levels can be influenced by various clinical factors, such as renal function, age, and other comorbidities, which may partially explain the observed differences in performance between NT-ProBNP and ECG-based biomarkers. Further prospective studies are needed to better understand these confounding variables.
In summary, this study demonstrates that ECG Buddy’s AI-driven digital biomarkers, LV dysfunction score and LVEF-ECG, provided superior diagnostic accuracy compared with NT-ProBNP for identifying LV systolic dysfunction in a US ED population. As the first validation of ECG Buddy in a racially diverse US population, these findings confirm its utility as a reliable tool for detecting LV dysfunction across different clinical and demographic groups. To ensure reproducibility and generalizability, external validation studies are planned in diverse healthcare settings. Future research will explore the broader clinical applications of these biomarkers in real-world environments.
NOTES
Author contributions
Conceptualization: HL, JEH, Y Choi; Data curation: HL, JEH, Y Choi; Formal analysis: HL, JEH, Y Choi; Funding acquisition: JK, Y Cho, JEH, Y Choi, WYK, KJS, YHJ; Investigation: HL, JEH, Y Choi; Methodology: HL, JEH, Y Choi, WYK, KJS, YHJ; Project administration: HL; Resources: HL, JK, Y Cho; Supervision: JEH, Y Choi, WYK, KJS, YHJ; Validation: JEH, Y Choi; Visualization: HL; Writing–original draft: HL; Writing–review & editing: all authors. All authors read and approved the final manuscript.
Conflicts of interest
Joonghee Kim developed the algorithm and founded the start-up company, ARPI Inc, where he serves as the chief executive officer. Youngjin Cho works for ARPI Inc as a research director. Haemin Lee works for ARPI Inc as a data scientist. Woon Yong Kwon and You Hwan Jo are editorial board members of this journal, but were not involved in the peer reviewer selection, evaluation, or decision process of this article. The authors have no other conflicts of interest to declare.
Funding
This study was supported by a Korea Health Technology R&D Project grant through the Korea Health Industry Development Institute (KHIDI), funded by the Korean Ministry of Health and Welfare (No. RS-2023-00265933), and by the Seoul National University Bundang Hospital Research Fund (No. 14-2022-0014).
Data availability
Data analyzed in this study, MIMIC-IV ver. 2.2 database, are available in PhysioNet (https://doi.org/10.13026/6mm1-ek67).
Supplementary materials
Supplementary materials are available from https://doi.org/10.15441/ceem.24.342.
Supplementary Material 1. GPT-4o mini (OpenAI) prompt design (accessed August 20, 2024).
Supplementary Table 1. Missing values for patient characteristics (n=22,599)
Supplementary Table 2. Performance summary in a subset of patients with both ECG and NT-ProBNP measurements (n=8,649)
Supplementary Table 3. Contingency tables for binary classification
REFERENCES
1. Cho Y, Kwon JM, Kim KH, et al. Artificial intelligence algorithm for detecting myocardial infarction using six-lead electrocardiography. Sci Rep 2020; 10:20495.
2. Zhao Y, Xiong J, Hou Y, et al. Early detection of ST-segment elevated myocardial infarction by artificial intelligence with 12-lead electrocardiogram. Int J Cardiol 2020; 317:223-30.
3. Makimoto H, Hockmann M, Lin T, et al. Performance of a convolutional neural network derived from an ECG database in recognizing myocardial infarction. Sci Rep 2020; 10:8445.
4. Kim D, Hwang JE, Cho Y, et al. A retrospective clinical evaluation of an artificial intelligence screening method for early detection of STEMI in the emergency department. J Korean Med Sci 2022; 37:e81.
5. Kim D, Jeong J, Kim J, et al. Hyperkalemia detection in emergency departments using initial ECGs: a smartphone AI ECG analyzer vs. board-certified physicians. J Korean Med Sci 2023; 38:e322.
6. Choi YJ, Park MJ, Cho Y, et al. Screening for RV dysfunction using smartphone ECG analysis app: validation study with acute pulmonary embolism patients. J Clin Med 2024; 13:4792.
7. Choi YJ, Park MJ, Ko Y, et al. Artificial intelligence versus physicians on interpretation of printed ECG images: diagnostic performance of ST-elevation myocardial infarction on electrocardiography. Int J Cardiol 2022; 363:6-10.
8. Kim JH, Jung JY, Kim J, Cho Y, Lee E, Son D. Non-inferiority analysis of electrocardiography analysis application vs. point-of-care ultrasound for screening left ventricular dysfunction. Yonsei Med J 2025; 66:172-8.
9. Choi H, Cho Y, Kim J, et al. ECG-derived global longitudinal strain using artificial intelligence: a comparative study with transthoracic echocardiography. JACC 2024; 83(13_Supplement):2360.
10. Cho Y, Yoon M, Kim J, et al. Artificial intelligence-based electrocardiographic biomarker for outcome prediction in patients with acute heart failure: prospective cohort study. J Med Internet Res 2024; 26:e52139.
11. Park MJ, Choi YJ, Shim M, et al. Performance of ECG-derived digital biomarker for screening coronary occlusion in resuscitated out-of-hospital cardiac arrest patients: a comparative study between artificial intelligence and a group of experts. J Clin Med 2024; 13:1354.
12. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV [Internet]. Version 2.2. PhysioNet; 2023 [cited DATE]. Available from: https://doi.org/10.13026/6mm1-ek67
13. Johnson AE, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023; 10:1.
Fig. 1. Data selection process from the Medical Information Mart for Intensive Care IV (MIMIC-IV) dataset [12]. LV, left ventricular; ECG, electrocardiogram; LVEF, left ventricular ejection fraction; EM, emergency medicine; ED, emergency department.
Fig. 2. Receiver operating characteristic curves for the digital biomarkers of ECG Buddy (ARPI Inc). (A) Whole rhythm population. (B) Sinus rhythm population. LVEF, left ventricular ejection fraction; ECG, electrocardiogram; AUC, area under the receiver operating characteristic curve; LV, left ventricular; NT-ProBNP, N-terminal pro-B-type natriuretic peptide.
Table 1. Patient characteristics (n=22,599)
Values are presented as median (interquartile range) or number (%). LV, left ventricular; LVEF, left ventricular ejection fraction; bpm, beats per minute; NT-ProBNP, N-terminal pro-B-type natriuretic peptide; ECG, electrocardiogram. a) Assessed at triage. Chief complaints were missing in 7,861 (34.8%), including 6,868 (35.1%) without LV dysfunction and 993 (32.9%) with LV dysfunction. Percentages are calculated based on available data (without LV dysfunction, n=12,710; with LV dysfunction, n=2,028). Missing data are shown in Supplementary Table 1.
Table 2. Performance summary