Method of Counteracting Manipulative Queries to Large Language Models
| dc.contributor.author | Kovalchuk, Yehor | |
| dc.contributor.author | Kolomytsev, Mykhailo | |
| dc.date.accessioned | 2026-03-03T10:34:52Z | |
| dc.date.available | 2026-03-03T10:34:52Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | The integration of Large Language Models (LLMs) into critical infrastructure (SIEM, SOAR) has introduced new attack vectors, specifically prompt injection and jailbreaking. Traditional defense mechanisms, such as input sanitization and Reinforcement Learning from Human Feedback (RLHF), often fail against semantic obfuscation and indirect injections due to their inability to distinguish between control instructions and data context. This paper proposes a novel method for detecting manipulative prompts based on a Multi-Head DistilBERT architecture. Unlike standard binary classifiers, the proposed model decomposes the detection task into four semantic vectors: malicious intent, instruction override, persona adoption, and high-risk action. To address the scarcity of labeled adversarial datasets, we implemented a hybrid data generation strategy using Knowledge Distillation, employing a superior model (Teacher) to label synthetic attacks for the compact Student model. Experimental results on both synthetic and real-world datasets demonstrate that the proposed system achieves a Recall of 0.99, significantly outperforming traditional TF-IDF and keyword-based baselines. The solution operates effectively as a middleware layer, ensuring real-time protection with low computational latency suitable for deployment on edge devices. | |
| dc.format.pagerange | P. 114-118 | |
| dc.identifier.citation | Kovalchuk, Y. Method of Counteracting Manipulative Queries to Large Language Models / Yehor Kovalchuk, Mykhailo Kolomytsev // Theoretical and Applied Cybersecurity: scientific journal. – 2025. – Vol. 7, No. 3. – P. 114-118. – Bibliogr.: 16 ref. | |
| dc.identifier.doi | https://doi.org/10.20535/tacs.2664-29132025.3.345389 | |
| dc.identifier.uri | https://ela.kpi.ua/handle/123456789/79184 | |
| dc.language.iso | en | |
| dc.publisher | Igor Sikorsky Kyiv Polytechnic Institute | |
| dc.publisher.place | Kyiv | |
| dc.relation.ispartof | Theoretical and Applied Cybersecurity: scientific journal, Vol. 7, No. 3 | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/deed.uk | |
| dc.subject | large language models | |
| dc.subject | prompt injection | |
| dc.subject | jailbreaking | |
| dc.subject | nlp security | |
| dc.subject | distilbert | |
| dc.subject | adversarial machine learning | |
| dc.subject.udc | 004.8:004.056 | |
| dc.title | Method of Counteracting Manipulative Queries to Large Language Models | |
| dc.type | Article |
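The multi-head architecture described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the head names mirror the four semantic vectors listed in the abstract, and a small, randomly initialized `DistilBertConfig` stands in for the pretrained encoder so the sketch is self-contained (in practice one would load a checkpoint such as `distilbert-base-uncased` via `from_pretrained`).

```python
# Hedged sketch of a multi-head DistilBERT detector (NOT the paper's code).
# Four independent sigmoid heads mirror the abstract's semantic vectors;
# a tiny random config replaces the pretrained checkpoint for self-containment.
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel

HEADS = ("malicious_intent", "instruction_override",
         "persona_adoption", "high_risk_action")

class MultiHeadDistilBert(nn.Module):
    def __init__(self, config: DistilBertConfig):
        super().__init__()
        # Random init for the sketch; use DistilBertModel.from_pretrained(...)
        # with a real checkpoint in practice.
        self.encoder = DistilBertModel(config)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(config.dim, 1) for name in HEADS})

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]  # first-token representation as a pooled summary
        # One probability per semantic vector, not a single binary label.
        return {name: torch.sigmoid(head(cls)).squeeze(-1)
                for name, head in self.heads.items()}

# Tiny config so the sketch runs quickly without downloading weights.
config = DistilBertConfig(vocab_size=1000, dim=64, hidden_dim=128,
                          n_layers=2, n_heads=2, max_position_embeddings=64)
model = MultiHeadDistilBert(config).eval()
ids = torch.randint(0, 1000, (2, 16))  # two dummy token sequences ("prompts")
with torch.no_grad():
    scores = model(ids, torch.ones_like(ids))
```

In a middleware deployment like the one the abstract describes, the four per-head probabilities would be thresholded (or combined) to decide whether to block or pass a prompt; the decomposition lets the gate explain *which* manipulation pattern fired rather than emitting an opaque binary verdict.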