Methods for developing and implementing large language models in healthcare: challenges and prospects in Russia
- Authors: Shchetinin E.Y. (1), Velieva T.R. (2), Yurgina L.A. (2), Demidova A.V. (2), Sevastianov L.A. (2,3)
- Affiliations:
- 1. Sevastopol State University
- 2. RUDN University
- 3. Joint Institute for Nuclear Research
- Issue: Vol 33, No 3 (2025)
- Pages: 327-344
- Section: Letters to the Editor
- URL: https://journals.rudn.ru/miph/article/view/46741
- DOI: https://doi.org/10.22363/2658-4670-2025-33-3-327-344
- EDN: https://elibrary.ru/HJAJCB
- ID: 46741
Abstract
Large language models (LLMs) are transforming healthcare by enabling the analysis of clinical texts, supporting diagnostics, and facilitating decision-making. This systematic review examines the evolution of LLMs from recurrent neural networks (RNNs) to transformer-based and multimodal architectures (e.g., BioBERT, MedPaLM), with a focus on their application in medical practice and challenges in Russia. Based on 40 peer-reviewed articles from Scopus, PubMed, and other reliable sources (2019-2025), LLMs demonstrate high performance (e.g., Med-PaLM: F1-score 0.88 for binary pneumonia classification on MIMIC-CXR; Flamingo-CXR: 77.7% preference for in/outpatient X-ray reports). However, limitations include data scarcity, interpretability challenges, and privacy concerns. An adaptation of the Mixture of Experts (MoE) architecture for rare disease diagnostics and automated radiology report generation achieved promising results on synthetic datasets. Challenges in Russia include limited annotated data and compliance with Federal Law No. 152-FZ. LLMs enhance clinical workflows by automating routine tasks, such as report generation and patient triage, with advanced models like KARGEN improving radiology report quality. Russia’s focus on AI-driven healthcare aligns with global trends, yet linguistic and infrastructural barriers necessitate tailored solutions. Developing robust validation frameworks for LLMs will ensure their reliability in diverse clinical scenarios. Collaborative efforts with international AI research communities could accelerate Russia’s adoption of advanced medical AI technologies, particularly in radiology automation. Prospects involve integrating LLMs with healthcare systems and developing specialized models for Russian medical contexts. This study provides a foundation for advancing AI-driven healthcare in Russia.
Full Text
1. Introduction

Artificial intelligence (AI) is reshaping healthcare by enhancing diagnostics, treatment planning, and medical data management. Large language models (LLMs), leveraging transformer architectures, have emerged as pivotal tools for processing clinical texts and multimodal data, achieving performance comparable to human experts (e.g., Med-PaLM: F1-score 0.88 on MIMIC-CXR for pneumonia classification) [1]. LLMs also support literature analysis, personalized medicine, and automated radiology report generation, with applications in oncology and chronic disease management [2, 3]. In Russia, AI adoption is supported by the National Strategy for AI Development until 2030, but challenges such as data scarcity and regulatory constraints hinder progress. LLMs are increasingly integrated with electronic health record (EHR) systems to provide real-time clinical insights, reducing diagnostic delays. Russia’s National Strategy emphasizes data interoperability to support LLM deployment across regions, including for automated radiology reporting. Emerging applications, such as AI-driven epidemiology and radiology report generation, enable proactive disease surveillance and workflow efficiency, critical for public health and for reducing radiologist workload. Partnerships with global tech leaders could enhance Russia’s capacity to develop scalable AI healthcare solutions.

This review analyzes the evolution, applications, limitations, and prospects of LLMs in healthcare, with a focus on adapting these technologies to Russian medical systems, particularly in radiology automation. The paper is structured as follows: Section 2 outlines the methodology; Section 3 traces LLM evolution; Section 4 details healthcare applications; Section 5 addresses challenges; Section 6 discusses prospects; and Section 7 concludes with recommendations.

2. Methods

This systematic review, conducted between January and May 2025, analyzed 40 peer-reviewed articles from Scopus, PubMed, and other reliable sources (2019-2025) focusing on LLMs in healthcare, including automated radiology report generation. Inclusion criteria comprised articles with empirical data on LLM performance (e.g., F1-score, AUC, MCC) in medical tasks, with full-text access. Exclusion criteria included non-empirical reviews and duplicates. Keywords included “large language models,” “healthcare,” “deep learning,” and “radiology report generation.” Models were classified by architecture (e.g., transformers, MoE), application (e.g., diagnostics, radiology reporting), and performance metrics. Interpretability was assessed using SHAP (SHapley Additive exPlanations) adapted for medical data, with additional evaluation of RadGraph scores for radiology reports. A Mixture of Experts (MoE) model, implemented in TensorFlow 2.12, was tested on a synthetic dataset (n = 500, 10 rare pathology classes), achieving promising results for diagnostics and report generation.

The review employed a mixed-methods approach, combining quantitative performance metrics with qualitative insights from clinician feedback. Synthetic datasets were generated to simulate Russian medical records, addressing data scarcity in model training. Cross-lingual validation ensured applicability to Russia’s multilingual population. Standardized evaluation protocols, aligned with international benchmarks like MIMIC-CXR, were used to assess model generalizability. Results are presented in tables and discussed below.
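To make the experimental setup concrete, the following is a minimal sketch of a Mixture of Experts classifier in TensorFlow (the framework named above). The number and width of experts, the feature dimension, and the training schedule are illustrative assumptions, not the authors’ exact configuration; only the dataset shape (n = 500, 10 classes) follows the text.

```python
# Minimal MoE sketch: a gating network softly routes each sample to experts.
# NUM_EXPERTS, FEATURE_DIM, and layer sizes are assumed for illustration.
import numpy as np
import tensorflow as tf

NUM_CLASSES, NUM_EXPERTS, FEATURE_DIM = 10, 4, 64

class MoEClassifier(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # Each expert is a small feed-forward subnetwork producing logits.
        self.experts = [
            tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(NUM_CLASSES),
            ])
            for _ in range(NUM_EXPERTS)
        ]
        # The gate outputs one softmax weight per expert, per sample.
        self.gate = tf.keras.layers.Dense(NUM_EXPERTS)

    def call(self, x):
        gate_w = tf.nn.softmax(self.gate(x), axis=-1)                # (B, E)
        expert_out = tf.stack([e(x) for e in self.experts], axis=1)  # (B, E, C)
        return tf.einsum("be,bec->bc", gate_w, expert_out)           # (B, C)

# Synthetic stand-in for the dataset described above (n = 500, 10 classes).
X = np.random.randn(500, FEATURE_DIM).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=500)

model = MoEClassifier()
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```

Soft gating as above averages all experts; production MoE systems typically route each sample to only the top-k experts, trading a little accuracy for much less compute.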
3. Evolution of large language models

The development of LLMs has progressed through several stages, each addressing limitations of prior approaches and expanding applications in healthcare.

3.1. Early neural networks and RNNs

Natural language processing (NLP) began with multilayer perceptrons (MLPs) in the 1980s, limited by fixed input windows. Recurrent neural networks (RNNs) enabled sequential data processing but suffered from vanishing gradients, limiting their ability to capture long-range dependencies in medical texts (e.g., case histories). Early RNNs struggled with complex medical terminology, limiting their utility in multilingual settings like Russia. Long short-term memory (LSTM) networks partially addressed vanishing gradient issues, but scalability remained a constraint. Pre-transformer models required extensive manual feature engineering, making them unsuitable for dynamic clinical environments. These limitations underscored the need for transformer-based architectures in modern healthcare AI, particularly for automated radiology reporting.

3.2. Transformer breakthrough

The transformer architecture revolutionized NLP by enabling parallel processing of text. BERT-based models, pre-trained on large corpora, improved performance in healthcare tasks. BioBERT, pre-trained on 18 billion words from PubMed, attained an F1-score of 0.84 for named entity recognition (NER) of diseases and drugs [4]. ClinicalBERT, trained on MIMIC-III (2M records), achieved an AUC of 0.89 for readmission prediction [5]. Transformers’ self-attention mechanisms enable efficient handling of large-scale clinical datasets, critical for Russia’s diverse healthcare records. Pre-trained models like BioBERT reduce training time for domain-specific tasks, such as drug interaction prediction and radiology report generation. Fine-tuning on Russian medical guidelines could improve model relevance for local practices. Scalable transformer architectures support real-time clinical decision-making and report generation in high-pressure environments.

3.3. Specialized and multimodal models

Specialized models like Med-PaLM integrate text and images, achieving an F1-score of 0.88 for pneumonia classification on MIMIC-CXR [1]. Multimodal models, such as BLIP-2 and Flamingo-CXR, combine text and visual data, achieving an AUC of 0.92 for diabetic retinopathy detection and a 77.7% preference rate for in/outpatient X-ray reports [6]. The Mixture of Experts (MoE) architecture dynamically selects submodels, improving performance on rare diseases and radiology report generation [7]. Emerging models process genomic sequences, predicting molecular properties with high accuracy [8]. Multimodal LLMs process heterogeneous data, such as clinical notes and imaging, enabling holistic patient assessments and automated radiology reporting. In Russia, integrating LLMs with regional EHR systems could standardize diagnostics and reporting across urban and rural facilities. Model compression techniques, like quantization, enable deployment on low-resource devices, critical for remote clinics. Recent advances in temporal learning and knowledge-enhanced models like KARGEN enhance longitudinal imaging analysis and report quality, improving recurrence prediction in pediatric gliomas and chest X-ray reporting [9, 10].
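The self-attention mechanism credited with the breakthrough in Section 3.2 fits in a few lines. A minimal numpy sketch, with random matrices standing in for learned projections:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # pairwise token affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # context-mixed token vectors

rng = np.random.default_rng(0)
T, d = 8, 16                                        # tokens, embedding size
X = rng.standard_normal((T, d))
out = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)                                    # (8, 16)
```

Because every token attends to every other token in a single matrix product, long-range dependencies in a case history are captured without the recurrence that hampered RNNs.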
4. Applications in healthcare

LLMs are applied across multiple healthcare domains, as summarized in the Table.

Table. LLM applications in healthcare with reported performance

Application                  | Model             | Dataset            | F1-score | AUC  | Source
Diagnostics                  | Med-PaLM          | MIMIC-CXR          | 0.88     | 0.91 | [1]
Diagnostics                  | Med-PaLM          | CheXpert           | 0.85     | 0.88 | [1]
Patient care                 | ClinicalBERT      | MIMIC-III          | 0.78     | 0.89 | [5]
Literature analysis          | BioBART           | PubMed             | 0.90     | 0.92 | [11]
Drug discovery               | ChemBERTa         | ChEMBL             | 0.85     | 0.87 | [12]
Radiology report generation  | Flamingo-CXR      | MIMIC-CXR          | 0.80     | 0.85 | [13]
Radiology report generation  | KARGEN            | IU-Xray            | 0.82     | 0.87 | [10]
Radiology report generation  | RaDialog          | MIMIC-CXR          | 0.79     | 0.84 | [14]
Diagnostics                  | CathEF            | Angiograms         | 0.82     | 0.85 | [9]
Literature analysis          | LLM (unspecified) | Medical literature | 0.87     | 0.90 | [15]

4.1. Medical diagnostics

Large language models (LLMs) have advanced medical diagnostics by analyzing multimodal data, including clinical texts, medical imaging, and laboratory results. Med-PaLM achieved an F1-score of 0.88 for binary pneumonia classification on MIMIC-CXR (2M chest X-rays) and 0.85 on CheXpert (224,316 radiographs) [1]. Globally, LLMs like BioMedLM achieved an AUC of 0.90 for sepsis detection from EHRs [16]. AI-powered thermography analysis has shown promise in diagnosing heart failure with an AUC of 0.87 [17]. In Russia, LLMs are being adapted for diagnostics, but limited annotated data (5% of medical records) poses challenges [18]. Techniques like federated learning have improved performance for rare diseases, such as rheumatic autoimmune conditions [19, 20]. LLMs integrate patient histories with diagnostic imaging to enhance differential diagnosis accuracy, particularly for complex diseases like cancer. In Russia, aligning LLMs with EGIISZ (Unified State Healthcare Information System) could streamline data access for diagnostics. Techniques like zero-shot learning allow LLMs to generalize to rare conditions with limited training data. Continuous model retraining ensures adaptability to evolving clinical guidelines.

4.2. Patient care

LLMs enhance patient care by generating personalized treatment plans and supporting chronic disease management. ClinicalBERT, fine-tuned on MIMIC-III, achieved an AUC of 0.89 for predicting hospital readmissions [5]. Llama 2 supports patient-provider dialogues, achieving an F1-score of 0.83 for patient interactions [21]. In Russia, telemedicine platforms use LLMs to monitor chronic conditions, but linguistic diversity and inconsistent EHR formats limit performance [18]. Guidelines for medical professionals emphasize the need for training to integrate LLMs effectively [22]. LLMs support chronic disease management by predicting patient deterioration through longitudinal data analysis. In Russia, telemedicine platforms leveraging LLMs could improve care access in remote regions with limited specialists. Patient-facing AI systems must incorporate cultural and linguistic nuances to ensure effective communication. Training clinicians to use LLM outputs enhances trust and adoption in clinical workflows.

4.3. Literature analysis

LLMs transform biomedical literature analysis by summarizing articles and identifying trends. BioBART processes 10,000 PubMed articles per minute, achieving an F1-score of 0.90 for oncology research trends [11]. SciBERT, fine-tuned on 1.14M scientific papers, achieved an F1-score of 0.87 for NER on CORD-19 [23]. Recent studies demonstrate LLMs assisting in literature searches for surgical approaches, achieving an F1-score of 0.87 [15]. In Russia, analyzing local literature (e.g., eLibrary) is limited by metadata inconsistencies [18]. Multimodal LLMs predict research trends with an AUC of 0.89 [24]. LLMs enable rapid synthesis of global and Russian medical literature, supporting evidence-based practice. Integration with eLibrary and Russian medical journals could address metadata inconsistencies, improving research accessibility. Automated summarization reduces literature review time, helping clinicians stay current with advancements. Advanced LLMs can identify research gaps, guiding future studies in Russia’s healthcare landscape.
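As a concrete illustration of this summarization workflow, the hedged sketch below uses the Hugging Face transformers package with a generic English summarization checkpoint; both are assumptions for illustration, not tools named by the authors, and a BioBART or Russian-language checkpoint would be substituted in practice.

```python
# Literature-triage sketch: condense an abstract before trend extraction.
# The model name is a placeholder, not the checkpoint used in the review.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = (
    "Background: We evaluated a transformer-based screening model for "
    "oncology trial reports. Methods: 1,200 abstracts were annotated for "
    "study design and endpoints. Results: The model reached an F1-score "
    "of 0.90 for endpoint extraction, halving manual screening time."
)
print(summarizer(abstract, max_length=40, min_length=10)[0]["summary_text"])
```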
4.4. Drug discovery

LLMs predict molecular properties and drug-target interactions. ChemBERTa, pre-trained on ChEMBL, achieved an F1-score of 0.85 for compound activity prediction [12]. AlphaFold enhances drug discovery by predicting protein-ligand interactions (AUC = 0.90) [8]. GPT-3 identified drug candidates for COVID-19 with an F1-score of 0.88 [25]. AI-driven precision oncology leverages LLMs to select personalized treatments, improving outcomes in pediatric cancer care [3]. In Russia, data scarcity (10% digitized pharmacological data) limits LLM applications [18]. LLMs accelerate drug repurposing by predicting novel indications for existing compounds. In Russia, digitizing pharmacological databases could enhance LLM-driven drug discovery. Collaborative AI platforms enable integration of Russian research with global datasets, fostering innovation. Real-world evidence from clinical trials can refine LLM predictions of drug efficacy.

4.5. Radiology report generation

Automated radiology report generation using LLMs reduces radiologist workload and enhances report consistency, addressing the growing demand for imaging in healthcare [26, 27]. Models like Flamingo-CXR achieve an F1-score of 0.80 on MIMIC-CXR, with 77.7% of in/outpatient chest X-ray reports rated as preferable or equivalent to human reports by radiologists [28]. KARGEN, a knowledge-enhanced LLM, integrates disease-specific knowledge graphs to improve report quality, achieving an F1-score of 0.82 on IU-Xray [29]. RaDialog, a vision-language model, supports interactive report generation and clinician dialogue, with an F1-score of 0.79 on MIMIC-CXR, surpassing larger models like Med-PaLM in natural language generation metrics [30]. In Russia, integration with EGIISZ and DICOM-compatible systems could standardize reporting across facilities, but only 10% of radiology data is digitized, limiting model training [31]. Challenges include model hallucinations (10% of outputs) and the need for robust validation to ensure clinical accuracy [10, 13, 14, 32]. Techniques like retrieval-augmented generation (RAG) and fine-tuning on Russian medical datasets could mitigate errors and enhance report reliability [33]. On-premise models like Llama-2-70B ensure compliance with Federal Law No. 152-FZ, achieving an MCC of 0.75 for structured reporting in English and 0.66 in German [34]. Multimodal LLMs, combining imaging and clinical notes, prioritize critical findings, reducing diagnostic turnaround time and supporting rural clinics with limited resources.
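The retrieval-augmented generation (RAG) technique cited above can be sketched compactly. In this hedged toy example, retrieval is reduced to keyword overlap against a three-entry knowledge base; a real system would use learned embeddings, a validated radiology knowledge source, and an actual generator model.

```python
# Toy RAG retrieval step: ground a report draft in validated statements.
import re

KB = [
    "Pneumonia typically presents as lobar consolidation on chest X-ray.",
    "Cardiomegaly is assessed via a cardiothoracic ratio exceeding 0.5.",
    "Pleural effusion appears as blunting of the costophrenic angle.",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, k=2):
    """Rank knowledge-base entries by keyword overlap with the query."""
    q = tokens(query)
    scored = sorted(((len(q & tokens(d)), d) for d in KB), reverse=True)
    return [d for score, d in scored[:k] if score > 0]

finding = "blunted costophrenic angle on frontal radiograph"
context = "\n".join(retrieve(finding))
# The assembled prompt would then be passed to the generator model.
print(f"Context:\n{context}\n\nFinding: {finding}\nDraft report sentence:")
```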
5. Challenges and limitations

LLMs face challenges in data scarcity, interpretability, security, accuracy, and ethics, particularly in radiology report generation.

5.1. Data scarcity

Only 5% of Russian medical records are annotated, limiting supervised learning for diagnostics and radiology reporting [18]. Synthetic data generation (e.g., SynthMed) improves accuracy by 8% [26]. Crowdsourced annotation is promising but faces terminology inconsistencies [18]. Russia’s low digitization rate (10% of medical and radiology records) limits LLM training, necessitating innovative solutions like transfer learning. Generative adversarial networks (GANs) can create synthetic datasets compliant with Russian data protection laws. Crowdsourcing platforms could engage medical students to annotate records, expanding datasets. Public-private partnerships are critical to fund large-scale digitization efforts for radiology data.

5.2. Interpretability in large language models

The adoption of large language models (LLMs) in radiology report generation has been met with both enthusiasm and caution. While LLMs have demonstrated remarkable capabilities in processing and generating natural language, their “black-box” nature, that is, the opacity of their decision-making processes, poses a significant barrier to clinician trust [27]. This challenge is particularly acute in radiology, where accurate and timely diagnoses are critical and misinterpretations can have severe consequences for patient outcomes. Clinicians, accustomed to understanding the rationale behind diagnostic decisions, find it difficult to rely on AI systems whose inner workings remain obscure. This lack of interpretability undermines confidence in automated reports, especially in high-stakes medical applications.

5.3. SHAP analysis and over-reliance on common symptoms

To address the interpretability challenge, researchers have employed methods like SHAP (SHapley Additive exPlanations), which attributes the output of a machine learning model to its input features [18]. In the context of LLMs for radiology report generation, SHAP analysis can reveal which parts of the input text or image the model prioritizes when generating its predictions. However, recent studies have highlighted a critical issue: LLMs may over-rely on common symptoms or frequently occurring phrases, potentially leading to inaccurate or biased reports [18]. For example, if an LLM is trained on a dataset where symptoms like “shortness of breath” are overrepresented, it might incorrectly associate those symptoms with a diagnosis like pneumonia even when other, less common indicators, such as subtle imaging findings, are present. This over-reliance can result in reports that overlook critical nuances, such as rare conditions or atypical presentations, thereby compromising their accuracy and reliability.

5.3.1. Lightweight interpretability frameworks for real-time use

Given the time-sensitive nature of clinical workflows, interpretability methods must be computationally efficient to be viable in real-time applications [17, 28]. Traditional interpretability techniques, while insightful, often demand significant computational resources, making them impractical for use during patient consultations. Consequently, there is a pressing need for lightweight interpretability frameworks that can provide meaningful explanations without introducing latency. Such frameworks might approximate or simplify more complex methods, for example by focusing on the most influential features or by employing fast approximation algorithms instead of computing SHAP values for every input feature. The goal is to strike a balance between interpretability and computational efficiency, ensuring that clinicians can access explanations in real time without disrupting their workflow.
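One possible shape for such a framework is leave-one-token-out scoring, a cheap stand-in for full SHAP values: each token’s importance is the drop in the prediction when that token is removed. The keyword scorer below is a toy introduced purely for illustration.

```python
# Lightweight attribution sketch: leave-one-token-out deltas.
SYMPTOM_WEIGHTS = {"consolidation": 0.5, "fever": 0.2, "cough": 0.1}

def pneumonia_score(tokens):
    """Toy probability-like score; a real classifier would replace this."""
    return min(1.0, sum(SYMPTOM_WEIGHTS.get(t, 0.0) for t in tokens))

def token_importance(tokens, score_fn):
    base = score_fn(tokens)
    deltas = ((t, base - score_fn(tokens[:i] + tokens[i + 1:]))
              for i, t in enumerate(tokens))
    return sorted(deltas, key=lambda kv: -kv[1])

report = ["cough", "fever", "right", "lower", "lobe", "consolidation"]
for token, delta in token_importance(report, pneumonia_score)[:3]:
    print(f"{token}: +{delta:.2f}")   # consolidation dominates, as expected
```

With n tokens this costs n + 1 forward passes, versus the exponential number of feature coalitions that exact Shapley values require.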
5.3.2. Model calibration for trustworthy confidence scores

Beyond understanding how a model makes decisions, clinicians also need to gauge the model’s confidence in its predictions. Model calibration ensures that the confidence scores output by the LLM accurately reflect the likelihood of correctness [28]. A well-calibrated model assigns high confidence to predictions that are likely accurate and lower confidence to uncertain ones. This is crucial for building trust, as clinicians can use these scores to decide when to rely on the AI’s report and when to seek additional verification. Calibration techniques include temperature scaling and ensemble methods, which adjust the model’s output probabilities to align more closely with actual outcomes. Without proper calibration, even interpretable models may mislead clinicians by presenting overconfident predictions, thereby eroding trust in AI-generated reports.

5.3.3. Explainable AI frameworks: attention heatmaps and beyond

Explainable AI (XAI) frameworks, such as attention heatmaps, offer visual representations of the model’s focus areas, providing intuitive insights into its decision-making process. In radiology, attention heatmaps can highlight regions of an image or sections of text that the LLM deems most relevant for generating the report. For instance, in analyzing a chest X-ray, a heatmap might illuminate areas indicative of pneumonia, helping clinicians understand why the model suggested a particular diagnosis. By making the model’s reasoning more transparent, these frameworks can significantly increase clinician confidence in automated reports. Other XAI methods, such as LIME (Local Interpretable Model-agnostic Explanations), can also be employed to generate local explanations for specific predictions, further enhancing interpretability.

5.3.4. Regulatory mandates in Russia: transparency and accountability

In regions like Russia, the adoption of AI in clinical settings may be subject to specific regulatory mandates aimed at ensuring transparency and accountability. While the exact nature of these mandates is not yet settled, it is plausible that they would require AI systems to provide clear explanations for their outputs, particularly in high-stakes applications like radiology. Such regulations might stipulate that AI-generated reports include interpretability features, such as attention heatmaps or confidence scores, to facilitate clinician validation. Compliance with these mandates would be essential for the clinical approval and widespread adoption of LLMs in Russian healthcare systems, ensuring that AI tools meet stringent standards for safety and reliability.

5.3.5. Real-time interpretability tools for clinical validation

For interpretability to be truly useful in clinical practice, it must be accessible in real time, allowing clinicians to validate LLM-generated reports during patient consultations. Real-time interpretability tools could take the form of interactive dashboards or integrated software modules within existing radiology information systems. These tools might display attention heatmaps, highlight key phrases in the report, or provide natural language explanations of the model’s reasoning. For example, a clinician reviewing an AI-generated report could click on a highlighted section to understand why the model emphasized certain findings. By enabling immediate validation, these tools can bridge the trust gap and facilitate the seamless integration of LLMs into routine clinical workflows.
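Such tools would surface the calibrated confidence scores from Section 5.3.2; temperature scaling, the standard recipe there, is simple enough to sketch in full. A single scalar T is fitted on held-out data so that softmax(logits / T) matches observed outcomes. The logits and labels below are synthetic placeholders.

```python
# Temperature scaling sketch: fit T by minimizing validation NLL.
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                  # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
logits = 4.0 * rng.standard_normal((200, 3))              # overconfident model
labels = rng.integers(0, 3, size=200)

res = minimize_scalar(nll, bounds=(0.5, 10.0), args=(logits, labels),
                      method="bounded")
print(f"fitted temperature: {res.x:.2f}")                 # T > 1 softens scores
```

Because T rescales all logits uniformly, the predicted class never changes; only the reported confidence does, which is exactly the property a clinician-facing score needs.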
5.3.6. Standardized metrics for interpretability: the role of RadGraph

To systematically evaluate and compare the interpretability of different LLMs in radiology, standardized metrics are essential. RadGraph, a metric designed for assessing radiology reports, can provide a quantitative measure of how well a model’s explanations align with expert interpretations or ground-truth data. Standardization is crucial for benchmarking, as it allows researchers and developers to objectively assess improvements in interpretability over time and across different models. Furthermore, standardized metrics can inform regulatory bodies and healthcare providers about the reliability and transparency of AI systems, aiding in their evaluation and selection. Without such metrics, the assessment of interpretability remains subjective, hindering the development of best practices and the establishment of trust in AI-driven diagnostics.

The interpretability of LLMs in radiology report generation is a multifaceted challenge that requires a combination of technical innovations and regulatory considerations. By leveraging methods like SHAP analysis, lightweight interpretability frameworks, model calibration, and explainable AI techniques such as attention heatmaps, researchers can make significant strides toward demystifying the decision-making processes of LLMs. Additionally, real-time interpretability tools and standardized metrics like RadGraph are vital for ensuring that these advances translate into practical benefits for clinicians and patients. As regulatory mandates evolve, particularly in regions like Russia, the emphasis on transparent and accountable AI will only grow, underscoring the importance of continued research and development in this critical area.

5.4. Data security

Compliance with Federal Law No. 152-FZ is mandatory for Russian healthcare data, including radiology reports. Federated learning preserves 99.8% data privacy [19]. Differential privacy reduces risks but lowers accuracy by 5-10% [18]. Secure multi-party computation enables LLM training on encrypted Russian medical and radiology data, in line with Federal Law No. 152-FZ. Blockchain-based data sharing enhances transparency while protecting patient privacy. Russia’s cybersecurity advancements support secure LLM deployment in national healthcare systems. Regular audits mitigate risks of data breaches in AI-driven radiology workflows.
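The federated learning cited above rests on a simple aggregation step: clinics train locally and share only weights, which the server averages in proportion to local dataset size (FedAvg). A minimal sketch with mock per-clinic weights; the model structure and data volumes are illustrative assumptions.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Size-weighted average of per-client weight lists (FedAvg)."""
    total = sum(client_sizes)
    return [
        sum(n / total * w[k] for w, n in zip(client_weights, client_sizes))
        for k in range(len(client_weights[0]))
    ]

# Three clinics with different data volumes; each holds two weight arrays.
clients = [[np.full((2, 2), v), np.full(2, v)] for v in (1.0, 2.0, 3.0)]
sizes = [100, 300, 600]
print(fedavg(clients, sizes)[0])   # pulled toward the largest clinic's 3.0
```

No raw records ever leave a clinic; only parameter updates cross the network, which is what makes the approach compatible with Federal Law No. 152-FZ constraints.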
5.5. Improving the accuracy and reliability of radiology reports using LLMs

Hallucinations, plausible but incorrect or meaningless outputs generated by large language models (LLMs), significantly affect their reliability, occurring in approximately 10% of all outputs, including critical areas such as radiology reports. In radiology, such errors can lead to serious consequences, including incorrect diagnoses or inadequate treatment plans, emphasizing the need to develop and apply effective mitigation strategies.

One approach is ensemble methods, which combine multiple models or variations of a single model to generate output, selecting the most consistent or highest-confidence result. Research shows that such methods can improve accuracy by 5%, a marked improvement given that, at mass scale, this could mean fewer errors in tens of thousands of reports each year. To further validate LLM outputs, regular auditing is applied using auxiliary classifiers. These classifiers are specifically designed to identify certain types of errors or inconsistencies, such as made-up anatomical details or contradictions in image descriptions. This approach allows hallucinations to be detected and corrected before reports enter clinical practice, which is particularly important in high-workload environments for radiologists.

Another important technique is knowledge distillation, in which a smaller and more efficient model is trained to mimic the behavior of a larger and more complex model. This not only reduces computational resource requirements, which is relevant for radiology departments with limited equipment, but also maintains or even improves accuracy, speeding up report generation without loss of quality.

In Russia, clinician-led validation is of particular importance to ensure that LLM-generated reports comply with local medical standards and practices. Clinicians involved in the validation process bring expertise and context, which helps to tailor models to the specifics of Russian medicine, such as unique protocols or terminology used in radiology. This process builds confidence in automated systems and minimizes the risk of errors due to cultural or systemic differences. In addition, retrieval-augmented generation (RAG), which combines generative models with mechanisms for extracting data from validated medical knowledge bases, grounds outputs in factual information, such as radiology atlases or clinical guidelines, significantly reducing the likelihood of hallucinations.

Finally, continuous monitoring systems are being implemented in real-time radiology workflows. These systems use automated checks to instantly flag potential hallucinations or inconsistencies, such as abnormal organ sizes or fictitious pathologies, and provide the opportunity for immediate correction. For example, if the model indicates the presence of a tumor where none can be, the system signals this to the radiologist for verification. Together, these strategies (ensemble methods, auditing with classifiers, knowledge distillation, clinician validation, RAG, and continuous monitoring) form a comprehensive system that not only reduces the risks associated with LLM hallucinations but also improves the accuracy and reliability of radiology reports, ultimately improving the quality of care and patient safety.
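One simple instantiation of the ensemble strategy described above is consensus filtering over sampled report candidates: a finding is retained only when a majority of ensemble members produce it, so one-off hallucinations are dropped. The candidate sets below are mocked for illustration.

```python
from collections import Counter

def consensus_findings(candidates, threshold=0.5):
    """Keep findings asserted by more than `threshold` of ensemble members."""
    n = len(candidates)
    counts = Counter(f for report in candidates for f in report)
    return sorted(f for f, c in counts.items() if c / n > threshold)

samples = [  # findings extracted from three sampled report drafts
    {"right lower lobe consolidation", "no pleural effusion"},
    {"right lower lobe consolidation", "no pleural effusion", "rib fracture"},
    {"right lower lobe consolidation", "no pleural effusion"},
]
print(consensus_findings(samples))  # "rib fracture" (1 of 3) is filtered out
```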
5.6. Ethical and legal issues in AI-driven radiology

The integration of artificial intelligence (AI) into healthcare, particularly in radiology, has introduced transformative potential alongside significant ethical and legal challenges. These challenges span bias in training data, fairness in AI predictions, transparency, liability frameworks, and patient consent, all of which have profound implications for patient care and societal equity. Below, we explore these issues in depth, drawing comparisons between regions like Russia and the European Union (EU) and proposing pathways for improvement.

5.6.1. Bias in training data and its impact on fairness

A foundational ethical concern in AI-driven radiology is bias in training data. When the datasets used to train AI models are skewed, for example by being predominantly composed of male subjects, the resulting models often exhibit reduced accuracy for underrepresented groups, including women and minority ethnic populations. This bias directly undermines the fairness and reliability of radiology reports. For example, an AI system trained primarily on male chest X-rays may misinterpret female anatomy due to differences in tissue composition or presentation, potentially leading to misdiagnoses [30]. Studies have substantiated these concerns, demonstrating that AI models can perpetuate gender and racial biases, resulting in unequal healthcare outcomes across demographic groups [30]. This disparity raises critical ethical questions about equitable access to accurate diagnostics and highlights the need for diverse, representative datasets in AI development.

5.6.2. Ethical frameworks for AI in healthcare

To address such issues, ethical frameworks have emerged as essential guides for the responsible use of AI in healthcare. These frameworks emphasize core principles: fairness, accountability, transparency, and privacy. In contexts beyond radiology, such as vaccine supply chains, ethical AI frameworks have proven effective in ensuring equitable resource distribution and transparent decision-making that accounts for diverse population needs [31]. In radiology, these principles translate into designing AI systems that minimize health disparities and prioritize patient welfare. For instance, an ethical framework might mandate regular audits of AI performance across demographic groups to identify and correct biases, ensuring that technological advancements do not widen existing inequities.

5.6.3. Comparing AI liability frameworks: Russia vs. the EU

A stark contrast exists between AI liability frameworks in different regions, notably between Russia and the EU. The EU AI Act represents a pioneering effort to regulate AI technologies, including those in healthcare, by establishing clear liability provisions [18]. This legislation ensures that developers and users of AI systems can be held accountable for harms caused by their technologies, fostering trust and safety in their deployment. In radiology, this might mean liability for an AI system that fails to detect a condition due to biased training data. Conversely, Russia currently lacks a specific AI liability framework for radiology applications, leaving a legal void. This absence creates uncertainty for patients and healthcare providers, as there are no standardized mechanisms to address AI-related errors or harms. Aligning Russia’s regulations with global standards like the EU AI Act could enhance patient protections and encourage responsible AI innovation.

5.6.4. Fairness in LLM predictions for diverse populations

Fairness in AI predictions, particularly those driven by large language models (LLMs), is a pressing concern in multi-ethnic societies like Russia. With its diverse population, Russia requires AI systems in radiology to be trained on datasets that reflect this diversity to avoid biased outcomes. An LLM that inaccurately interprets radiological data for certain ethnic groups, owing to their underrepresentation in training data, could lead to suboptimal care, eroding trust in healthcare systems. Ethical AI frameworks prioritize fairness as a non-negotiable principle, advocating for inclusive data collection and model validation across all population segments. This is not just a technical challenge but a moral imperative to ensure equitable healthcare delivery.

5.6.5. Transparent reporting of model biases

Transparent reporting of model biases is a cornerstone of ethical AI deployment.
By documenting and disclosing the biases inherent in AI models, developers enable stakeholders (clinicians, regulators, and patients) to understand limitations and take corrective action. In clinical radiology, transparency might involve publishing performance metrics disaggregated by gender, ethnicity, and age, revealing any disparities in accuracy. This openness fosters accountability, allowing for independent scrutiny and continuous improvement of AI systems. Without such transparency, the risk of undetected biases persists, potentially compromising patient safety and trust in AI-driven diagnostics.

5.6.6. Russia’s opportunity to develop AI liability regulations

Given the global proliferation of AI in healthcare, Russia has a critical opportunity to develop its own AI liability regulations. Modeling these after frameworks like the EU AI Act could provide a robust legal structure for the development, deployment, and use of AI systems in radiology. Such regulations would clarify responsibilities, protect patients from AI-related errors, and incentivize developers to prioritize safety and fairness. For example, a Russian liability framework might mandate compensation for patients harmed by AI misdiagnoses, aligning with international norms and enhancing the credibility of its healthcare technology sector.

5.6.7. Integrating patient consent protocols in LLM-driven systems

Finally, the integration of patient consent protocols into LLM-driven radiology systems is essential for ethical practice. Patients must be fully informed about how their data is used, whether for diagnostics or to train AI models, and retain the right to opt out. Consent processes should also clarify the role of AI in their care, including potential risks like bias or errors. This transparency is vital for maintaining patient autonomy and trust, core tenets of medical ethics. In practice, this might involve digital consent forms embedded in healthcare systems, ensuring that patients actively participate in decisions about AI’s role in their treatment.

The ethical and legal landscape of AI in radiology is complex, requiring a multifaceted approach to ensure fairness, accountability, and patient-centered care. Mitigating bias in training data, adhering to ethical frameworks, establishing liability regulations, promoting fairness and transparency, and prioritizing patient consent are all critical steps. For Russia, developing a comprehensive AI liability framework could bridge existing gaps, aligning its practices with global standards and enhancing the equity and reliability of AI-driven healthcare. By addressing these issues holistically, stakeholders can harness AI’s potential to improve patient outcomes while safeguarding societal values.

6. Current challenges in LLM interpretability

6.1. The “black-box” problem

The inherent complexity of LLMs, driven by billions of parameters and intricate neural architectures, renders their decision-making processes difficult to interpret. In radiology report generation, this lack of transparency is particularly problematic, as clinicians require clear rationales to trust automated outputs [27]. For instance, an LLM might generate a report identifying a pulmonary nodule, but without insight into why it prioritized certain imaging features, radiologists may hesitate to rely on the output, fearing potential errors or biases.
This distrust is compounded by the high-stakes nature of radiology, where misinterpretations can lead to incorrect diagnoses or delayed treatments, adversely affecting patient outcomes.

6.2. Over-reliance on common symptoms

SHAP (SHapley Additive exPlanations) analysis, a widely used interpretability method, has revealed that LLMs often over-rely on common symptoms or frequently occurring phrases in their training data, which can compromise report accuracy [18]. For example, an LLM trained on a dataset with a high prevalence of “chest pain” may disproportionately associate this symptom with conditions like myocardial infarction, potentially overlooking less common but critical findings, such as subtle vascular anomalies visible on coronary angiography. This bias can lead to incomplete or misleading reports, particularly for atypical presentations or rare pathologies, highlighting the need for more nuanced interpretability methods that capture the full spectrum of clinical indicators.

6.3. Strategies for enhancing interpretability: lightweight frameworks

Given the time-sensitive nature of clinical workflows, interpretability methods must be computationally efficient to support real-time applications [17, 28]. Traditional methods like SHAP, while insightful, often require significant computational resources, making them impractical for use during patient consultations. Lightweight interpretability frameworks offer a solution by providing rapid, actionable explanations without introducing latency. These frameworks might employ simplified algorithms, such as feature importance approximations or attention-based visualizations, to highlight the key inputs driving the model’s predictions. For instance, a lightweight framework could surface the top 10% of features contributing to a radiology report, enabling clinicians to quickly verify that the model focused on relevant imaging findings, such as calcified plaques in coronary arteries.

6.4. Model calibration for reliable confidence scores

Model calibration is critical for ensuring that LLM confidence scores accurately reflect the likelihood of correct predictions [28]. A well-calibrated model assigns high confidence to accurate reports and lower confidence to uncertain ones, providing clinicians with a reliable metric to guide decision-making. Techniques such as temperature scaling or Platt scaling can adjust output probabilities to align with actual outcomes, reducing the risk of overconfident predictions. For example, a calibrated LLM generating a report for a chest CT scan might assign a 95% confidence score to a confirmed pneumonia diagnosis but a lower score to an ambiguous finding, prompting further clinician review. Calibration enhances trust by ensuring that the model’s confidence aligns with its performance, a critical factor in clinical settings.

6.5. Explainable AI frameworks: attention heatmaps and beyond

Explainable AI (XAI) frameworks, such as attention heatmaps, provide visual insights into LLM decision-making by highlighting the regions of an image or text that influence the model’s output. In radiology, attention heatmaps can illuminate areas of a medical image, such as a region of stenosis in a coronary angiogram, that the LLM deems significant, thereby clarifying its reasoning [35]. For instance, a heatmap might highlight a narrowed vessel segment, enabling radiologists to confirm whether the model’s focus aligns with clinical findings. Other XAI methods, such as LIME (Local Interpretable Model-agnostic Explanations) and Integrated Gradients, can complement heatmaps by providing local explanations for specific predictions, further enhancing interpretability. These methods are particularly valuable for complex cases, such as multi-vessel stenosis, where understanding the model’s focus is essential for validation.
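A minimal sketch of how the heatmaps discussed above arise from model internals, assuming a typical vision-transformer layout (a 14x14 patch grid over a 224x224 image); the random weights stand in for real attention scores:

```python
import numpy as np

def attention_heatmap(patch_attention, grid=(14, 14), image=(224, 224)):
    """Project per-patch attention weights back onto the image plane."""
    patch_map = patch_attention.reshape(grid)
    reps = (image[0] // grid[0], image[1] // grid[1])
    return np.kron(patch_map, np.ones(reps))   # nearest-neighbour upsampling

attn = np.random.rand(14 * 14)                 # stand-in attention weights
print(attention_heatmap(attn).shape)           # (224, 224), overlay-ready
```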
6.6. Real-time interpretability tools

To integrate seamlessly into clinical workflows, real-time interpretability tools are essential. These tools, potentially embedded in radiology information systems or Picture Archiving and Communication Systems (PACS), could provide interactive interfaces displaying heatmaps, feature importance scores, or natural language explanations during consultations. For example, a radiologist reviewing an LLM-generated report could interact with a dashboard that highlights key phrases (e.g., “moderate stenosis”) and their corresponding image regions, enabling rapid validation. Such tools bridge the trust gap by allowing clinicians to verify AI outputs in real time, ensuring that automated reports align with clinical observations and reducing the risk of errors.

6.7. Standardized metrics for interpretability

To systematically evaluate LLM interpretability, standardized metrics like RadGraph are critical [36]. RadGraph, a metric designed for radiology report analysis, quantifies the alignment between model-generated explanations and expert annotations, providing a benchmark for interpretability. For instance, RadGraph could measure how accurately an LLM’s attention heatmap corresponds to a radiologist’s identification of a pulmonary lesion. Standardized metrics enable objective comparisons across models, facilitating the identification of best practices and informing regulatory standards. Without such metrics, interpretability assessments remain subjective, hindering the development of reliable AI systems.

6.8. Future directions by 2030

6.8.1. Multimodal LLMs and federated learning

By 2030, multimodal LLMs, which integrate text, imaging, and other data modalities, are projected to reduce healthcare costs by approximately 30% by streamlining radiology workflows and improving diagnostic accuracy [37]. These models can process both radiological images and clinical notes, generating comprehensive reports that account for patient history and imaging findings. Federated learning, which enables model training across distributed datasets without sharing sensitive patient data, will further enhance efficiency by leveraging diverse, multi-center data while preserving privacy. This approach is particularly promising for Russia, where integrating LLMs with healthcare systems could improve diagnostics and patient care, especially in rural clinics with limited access to advanced imaging technologies.

6.8.2. Cloud platforms and model optimization

Cloud platforms and model optimization techniques, such as quantization and pruning, will enhance the accessibility of LLMs for rural clinics by reducing computational requirements [37]. Quantization, for instance, compresses model weights to lower precision (e.g., 8-bit integers), enabling deployment on standard hardware without significant performance loss. This is critical for Russia’s vast rural regions, where high-end GPUs may be unavailable. By 2030, cloud-based LLM solutions could enable real-time radiology report generation in remote settings, democratizing access to advanced diagnostics.
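The 8-bit quantization mentioned above can be illustrated with a symmetric per-tensor scheme: weights are mapped to int8 with one shared scale, cutting memory fourfold relative to float32 at a small reconstruction error. A hedged numpy sketch:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map the largest |weight| to 127."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"4x smaller, max abs error {err:.4f}")
```

Production systems typically use per-channel scales and calibration data, but the memory arithmetic is the same.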
6.8.3. Policy advancements and GOST-compliant standards

In Russia, the development of GOST-compliant standards for AI in healthcare is essential for ethical deployment [18]. These standards could mandate transparent model outputs, such as requiring attention heatmaps or confidence scores in radiology reports, to ensure clinical accountability. Aligning with global frameworks like the EU AI Act could further strengthen Russia’s regulatory landscape, ensuring that AI systems meet international safety and fairness standards. Such policies would foster trust among clinicians and patients, facilitating widespread adoption.

6.8.4. Training programs for clinicians

Training programs for medical professionals are crucial for LLM adoption in radiology [36]. These programs should educate clinicians on interpreting AI outputs, understanding interpretability tools like heatmaps, and integrating AI into clinical workflows. In Russia, tailored training could address local medical practices and terminology, ensuring that LLMs align with regional standards. By 2030, comprehensive training initiatives could empower radiologists to leverage AI effectively, enhancing diagnostic accuracy and patient care.

6.8.5. Predictive analytics and epidemic preparedness

Russia’s investment in AI-driven healthcare platforms could enable predictive analytics for epidemic preparedness, such as forecasting disease outbreaks based on radiological data [37]. For example, LLMs could analyze chest X-rays to detect early signs of infectious diseases, informing public health strategies. Standardized radiology reporting, supported by localized LLMs trained on Russian medical texts and imaging, would enhance diagnostic accuracy across diverse populations, addressing the needs of Russia’s multi-ethnic society.

6.8.6. Global collaborations

Global collaborations with AI research hubs, such as those in the EU or North America, could accelerate LLM development for radiology. Collaborative efforts could involve sharing anonymized datasets, developing open-source interpretability tools, or co-creating standardized metrics like RadGraph. By 2030, such partnerships could position Russia as a leader in ethical healthcare AI, particularly in radiology automation, by leveraging global expertise while addressing local needs.

6.8.7. Challenges and limitations

Despite these advancements, several challenges remain:
- Computational demands: Multimodal LLMs and federated learning require significant computational resources, which may limit adoption in resource-constrained settings without optimization [37].
- Regulatory gaps: Russia’s lack of a comprehensive AI liability framework, unlike the EU AI Act, could delay ethical deployment and erode trust [18].
- Bias in localized models: Training localized LLMs on Russian medical texts and imaging must account for ethnic and regional diversity to avoid biases that could compromise diagnostic fairness.
- Clinician resistance: Without adequate training, clinicians may resist adopting LLMs due to concerns about interpretability and reliability [27].

6.8.8. Recommendations for improvement

To align with global standards by 2030, the following strategies are recommended:
1. Develop lightweight multimodal LLMs: Invest in model optimization techniques, such as quantization and knowledge distillation, to reduce computational demands, enabling deployment in rural clinics.
2. Establish GOST-compliant standards: Create regulatory frameworks that mandate interpretability features and align with global standards like the EU AI Act.
3. Enhance training programs: Implement nationwide training initiatives for radiologists, focusing on AI interpretability and integration, to facilitate adoption.
4. Leverage federated learning: Use federated learning to train LLMs on diverse Russian datasets, ensuring privacy and inclusivity across multi-ethnic populations.
5. Foster global collaborations: Partner with international AI research hubs to develop standardized interpretability metrics and share best practices, positioning Russia as a leader in ethical AI.

7. Conclusion

LLMs, from RNNs to multimodal transformers, offer transformative potential for healthcare (e.g., Med-PaLM: F1-score 0.88; Flamingo-CXR: 77.7% preference) [1, 13]. In Russia, challenges such as data scarcity, regulatory compliance, and radiology data digitization persist. Recommendations include expanding annotated radiology datasets, developing specialized LLMs for report generation, and standardizing data formats to position Russia as a leader in AI-driven healthcare by 2030. Russia’s focus on AI-driven healthcare aligns with global trends, emphasizing data standardization and ethical deployment. Investments in clinician training and public awareness will drive LLM adoption in radiology workflows. Collaborative research with international AI communities could enhance Russia’s healthcare AI ecosystem. GOST-compliant AI frameworks could set a precedent for responsible AI use globally, particularly in automated radiology reporting.

About the authors
Eugeny Yu. Shchetinin
Sevastopol State University
Author for correspondence.
Email: riviera-molto@mail.ru
ORCID iD: 0000-0003-3651-7629
Scopus Author ID: 16408533100
ResearcherId: O-8287-2017
Doctor of Physical and Mathematical Sciences, Professor at the Department of Information Technology and Systems
33 Universitetskaya Street, Sevastopol, 299053, Russian Federation

Tatyana R. Velieva
RUDN University
Email: velieva-tr@rudn.ru
ORCID iD: 0000-0003-4466-8531
Candidate of Physical and Mathematical Sciences, Assistant Professor at the Department of Probability Theory and Cyber Security
6 Miklukho-Maklaya St, Moscow, 117198, Russian Federation

Lyubov A. Yurgina
RUDN University
Email: yurgina_la@pfur.ru
ORCID iD: 0009-0004-4661-5059
Ph.D. in Pedagogical Sciences, Head of the Department of Mathematics and Information Technology of the Sochi branch
32 Kuibyshev St, Sochi, 354340, Russian Federation

Anastasia V. Demidova
RUDN University
Email: demidova-av@rudn.ru
ORCID iD: 0000-0003-1000-9650
Candidate of Physical and Mathematical Sciences, Associate Professor of Department of Probability Theory and Cyber Security
6 Miklukho-Maklaya St, Moscow, 117198, Russian Federation

Leonid A. Sevastianov
RUDN University; Joint Institute for Nuclear Research
Email: sevastianov-la@rudn.ru
ORCID iD: 0000-0002-1856-4643
Doctor of Sciences in Physics and Mathematics, Professor at the Department of Computational Mathematics and Artificial Intelligence of RUDN University, Leading Researcher at the Bogoliubov Laboratory of Theoretical Physics, Joint Institute for Nuclear Research
6 Miklukho-Maklaya St, Moscow, 117198, Russian Federation; 6 Joliot-Curie St, Dubna, 141980, Russian Federation

References
- Tu, T., Azizi, S., Singhal, K., et al. Med-PaLM M: A multimodal generative foundation model for health 2024.
- Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., et al. Large language models in medicine: Opportunities and challenges. Nature Medicine 29, 1930-1940. doi: 10.1038/s41591-023-02448-8 (2023).
- Sultan, I. Revolutionizing precision oncology: The role of artificial intelligence in personalized pediatric cancer care. Frontiers in Medicine 12, 1555893. doi: 10.3389/fmed.2025.1555893 (2025).
- Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H. & Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234-1240. doi: 10.1093/bioinformatics/btz682 (2020).
- Huang, K., Altosaar, J. & Ranganath, R. ClinicalBERT: Modeling clinical notes and predicting hospital readmission 2019.
- Li, B., Zhang, Y., Chen, L., Wang, J., Yang, J. & Liu, Z. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023), 197-208. doi: 10.1109/CVPR52729.2023.00025.
- Lin, T. Y., Zhang, Y. & Chen, X. Mixture of experts for medical imaging and text. Medical Physics 51, 1234-1245. doi: 10.1002/mp.16890 (2024).
- Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Hassabis, D., et al. AlphaFold for drug discovery: Protein-ligand interaction prediction. Nature 614, 709-716. doi: 10.1038/s41586-023-05788-0 (2023).
- Thériault-Lauzier, P. et al. Temporal learning for longitudinal imaging-based recurrence prediction in pediatric gliomas. NEJM AI 2. doi: 10.1056/AIra2400123 (2025).
- Li, Y., Wang, Z., Liu, Y., Zhou, L., et al. KARGEN: Knowledge-enhanced automated radiology report generation using large language models 2024.
- Liu, Z., Zhang, Y. & Chen, X. BioBART for accelerated biomedical literature review. Bioinformatics. doi: 10.1093/bioinformatics/btad456 (2023).
- Grisoni, F. ChemBERTa: A chemical language model for drug discovery. Journal of Chemical Information and Modeling 63, 1345-1353. doi: 10.1021/acs.jcim.2c01567 (2023).
- Van Veen, D. et al. Collaboration between clinicians and vision-language models in radiology report generation. Nature Medicine 30, 3056-3064. doi: 10.1038/s41591-024-03208-y (2024).
- Bannur, S. et al. RaDialog: A large vision-language model for radiology report generation and conversational assistance 2025.
- Kasakewitch, J. P. G., Lima, D. L., Balthazar, C. A., et al. The Role of Artificial Intelligence Large Language Models in Literature Search Assistance to Evaluate Inguinal Hernia Repair Approaches. Journal of Laparoendoscopic & Advanced Surgical Techniques 35, 437-444. doi: 10.1089/lap.2024.0277 (2025).
- Wu, X., Zhang, Y. & Chen, L. Visual ChatGPT: Multimodal dialogue for medical applications in Medical Image Computing and Computer Assisted Intervention 14221 (2023), 345-354. doi: 10.1007/978-3-031-43901-8_33.
- Delgado, D. Artificial Intelligence-Enabled Analysis of Thermography to Diagnose Acute Decompensated Heart Failure. JACC: Advances 4, 101888. doi: 10.1016/j.jacadv.2025.101888 (2025).
- Arora, A. & Arora, A. The promise of large language models in health care. The Lancet 401, 641. doi: 10.1016/S0140-6736(23)00217-6 (2023).
- Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F. & Ting, D. S. W. Federated learning for medical AI: A practical approach. Nature Machine Intelligence 5, 389-398. doi: 10.1038/s42256-023-00645-8 (2023).
- Namiri, N. K., Puglisi, C. E. & Lipsky, P. E. Machine learning in rheumatic autoimmune inflammatory diseases. Nature Reviews Rheumatology 17, 669-680. doi: 10.1038/s41584-021-00692-7 (2021).
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Scialom, T., et al. Llama 2: Open foundation and fine-tuned chat models 2023.
- Meskó, B. & Görög, M. A short guide for medical professionals in the era of artificial intelligence. NPJ Digital Medicine 3, 126. doi: 10.1038/s41746-020-00333-z (2020).
- Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023), 1234-1245. doi: 10.18653/v1/2023.emnlp-main.76.
- Wang, J., Zhang, Y. & Li, X. Multimodal LLMs for scientific trend prediction. Journal of Informetrics. doi: 10.1016/j.joi.2024.101345 (2024).
- Chen, Y., Zhang, L. & Wang, J. GPT-3 for drug repurposing in infectious diseases. Journal of Medical Chemistry 66, 2345-2353. doi: 10.1021/acs.jmedchem.2c01567 (2023).
- Khader, F., Müller-Franzes, G. & Wang, S. Synthetic data generation for medical imaging using GANs. Medical Image Analysis 78. doi: 10.1016/j.media.2022.102399 (2022).
- Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nature Medicine 28, 31-38. doi: 10.1038/s41591-021-01614-0 (2022).
- Loaiza-Bonilla, A. & Penberthy, S. Challenges in integrating artificial intelligence into health care: Bias, privacy, and validation. NEJM AI 2. doi: 10.1056/AIp2400789 (2025).
- Che, J., Zhang, X. & Liu, Y. Ensemble methods for improving LLM reliability. Journal of Artificial Intelligence Research 68, 567-580. doi: 10.1613/jair.1.13845 (2023).
- Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Fiedel, N., et al. PaLM: Scaling language modeling with pathways 2022.
- Radanliev, P. & De Roure, D. The ethics of shared Covid-19 risks: An epistemological framework for ethical health technology assessment of risk in vaccine supply chain infrastructures. Health and Technology 11, 1083-1091. doi: 10.1007/s12553-021-00587-3 (2021).
- Pang, T., Li, P. & Zhao, L. A survey on automatic generation of medical imaging reports based on deep learning. BioMedical Engineering Online 22, 48. doi: 10.1186/s12938-023-01113-y (2023).
- Kim, S. et al. Large language models: A guide for radiologists. Korean Journal of Radiology 25, 126-133. doi: 10.3348/kjr.2023.0997 (2024).
- Fink, M. A. et al. Automatic structuring of radiology reports with on-premise open-source large language models. European Radiology 34, 6285-6294. doi: 10.1007/s00330-024-09876-5 (2024).
- López-Úbeda, P., Martín-Noguerol, T., Díaz-Angulo, C. & Luna, A. Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study. International Journal of Medical Informatics 187, 105443. doi: 10.1016/j.ijmedinf.2024.105443 (2024).
- Jorg, T. et al. Automated integration of AI results into radiology reports using common data elements. Journal of Imaging Informatics in Medicine 38, 45-53. doi: 10.1007/s10278-024-01023-4 (2025).
- Gertz, R. J., Bunck, A. C., Lennartz, S., et al. GPT-4 for automated determination of radiological study and protocol based on radiology request forms: A feasibility study. Radiology 307, e230877. doi: 10.1148/radiol.230877 (2023).