Prompt injections are attacks against LLM applications where an attacker is able to override the original instructions of the programmer. Neither input validation nor output filtering have so far been successful in mitigating this kind of attack. The reviewed paper presents a novel approach to defend against prompt injections by substituting (“signing”) the original instructions with random keywords very unlikely to appear in natural language.
This blog post reviews the paper named “Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications” by Xuchen Suo. I am in no way affiliated with the research.
As of January 2024, prompt injections are still an unsolved problem when deploying applications integrating LLMs are passing user input to the model. Conventional defences against similar kinds of injection attacks (like SQL injection or XSS) are not applicable since they can be bypassed by creative prompting techniques, allowing attackers to embed seemingly innocuous commands into their inputs.
The approach presented in this paper neither relies on input filtering nor on output encoding. It rather “signs” (not in a cryptographic sense) the authorized instructions and instructs the LLM to discard any instructions that were not signed this way.
Signing in this case basically means substituting by keywords. For example, instead of using the command “delete”, it is substituted with the random word “toeowx”. As long as the attacker is not aware of (and cannot brute-force) the substitution, it is expected for them to be impossible to provide signed commands to the model.
The following figure (created based on the methodology described in the paper) depicts the basic idea. The raw instructions are signed before the user-controlled input is inserted into the prompt. Afterwards (not depicted here), the model is instructed (by prompt engineering or fine-tuning) to ignore all unsigned commands.
The author claims that his proposed mechanism was effectively preventing prompt injections in 100% of investigated cases (while neither publishing the exact data set used in these experiments nor the source code).
One limitation of this approach that the author does not acknowledge explicitly is that it requires to keep track of a whitelist of approved commands (i.e. the commands that are “signed”). Although this might be feasible for some use cases, it severely restricts the model’s generic capabilities.
While it seems too good to be true that the Signed Prompts approach could be the one and only solution against prompt injection attacks, it seems to be a useful step in the right direction. I am looking forward to see some actual source code implementing this approach and to run more experiments with it.