by Boostani, Mehdi; Zouboulis, Christos C.C.; Pellacani, Giovanni; Navarrete-Dechent, Cristian; Boussingault, Lucas; Kiss, Tara; Goldfarb, Noah; Cantisani, Carmen; Nádudvari, Nóra; Bánvölgyi, András; Wikonkál, Norbert Miklós; Suppa, Mariano; Paragh, Gyorgy; Kiss, Norbert
Reference: JID Innovations, 6, 3, 100463
Publication: Published, 2026-05
Peer-reviewed article
| Abstract: | Basal cell carcinoma (BCC) is the most common skin cancer. Off-the-shelf multimodal large language models are widely accessible, yet their performance for BCC remains unclear. The aim of this study was to assess BCC detection (BCC vs non-BCC) and BCC subtype classification from clinical and dermoscopic images using 3 web-based large language models (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4). We evaluated 772 images: 402 from 290 histopathology-confirmed BCCs (290 clinical, 112 dermoscopic) and 370 from an independent BCC-mimicker cohort (250 clinical, 120 dermoscopic). Standardized prompts were used. Primary outcome was BCC detection accuracy; secondary outcomes were subtype-classification accuracy and performance by lesion features. For clinical images, ChatGPT-5 achieved the highest detection accuracy (75%), followed by Claude (64.3%) and Gemini (50.7%). For dermoscopy, Claude performed best (69.8%), compared with ChatGPT-5 (55.2%) and Gemini (50.9%). Accuracy was lower in crusted and flat lesions and higher in exophytic lesions; pigmentation effects were model dependent. Subtype-classification accuracy was modest across models. Images were primarily from European centers with limited skin-type diversity; several subgroups were small. Current web-based large language models are not clinically suitable for BCC detection or subtyping. Dermatology-specific training, transparent reporting, and rigorous prospective validation are required before any clinical use. |



