- Comprehensive Assessment of Jailbreak Attacks Against LLMs. Preprint. 2024.
Misuse of Large Language Models (LLMs) has raised widespread concern. To address this issue, safeguards have been put in place to ensure that LLMs align with social ethics. However, recent findings have revealed an unsettling vulnerability that bypasses the safeguards of LLMs, known as jailbreak attacks. By applying techniques such as role-playing scenarios, adversarial examples, or subtle subversion of safety objectives in a prompt, an attacker can make LLMs produce an inappropriate or even harmful response. While researchers have studied several categories of jailbreak attacks, they have done so in isolation. To fill this gap, we present the first large-scale measurement of various jailbreak attack methods. We concentrate on 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs. Our extensive experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates and exhibit robustness across different LLMs. Some jailbreak prompt datasets available on the Internet can also achieve high attack success rates on many LLMs, such as ChatGLM3, GPT-3.5, and PaLM2. Despite the claims from many organizations regarding the coverage of violation categories in their policies, the attack success rates from these categories remain high, indicating the challenges of effectively aligning LLM policies and the ability to counter jailbreak attacks. We also discuss the trade-off between attack performance and efficiency, and show that the transferability of the jailbreak prompts is still viable, making it an option for black-box models. Overall, our research highlights the necessity of evaluating different jailbreak methods. We hope our study can provide insights for future research on jailbreak attacks and serve as a benchmark tool for practitioners evaluating them.
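The attack success rate (ASR) reported above can be illustrated with a minimal sketch. The record layout (`method`, `model`, `success`), the toy values, and the function name `attack_success_rate` are assumptions made for illustration and are not the paper's evaluation code. How a response is judged a success is outside the scope of this sketch.

```python
from collections import defaultdict

def attack_success_rate(records, key):
    """Return {group: fraction of successful attempts}, grouped by `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec["success"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical toy records; labels and outcomes are placeholders, not measured data.
records = [
    {"method": "A", "model": "m1", "success": True},
    {"method": "A", "model": "m2", "success": False},
    {"method": "B", "model": "m1", "success": True},
    {"method": "B", "model": "m2", "success": True},
]

asr_by_method = attack_success_rate(records, "method")
asr_by_model = attack_success_rate(records, "model")
```

Grouping by a different key (for example a violation category, if the records carried one) reuses the same function.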
- Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of Large Language Models. Yugeng Liu*, Tianshuo Cong*, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang (* equal contribution). Preprint. 2023.
Large Language Models (LLMs) have led to significant improvements in many tasks across various domains, such as code interpretation, response generation, and ambiguity handling. When upgraded, however, these LLMs primarily prioritize enhancing user experience while neglecting security, privacy, and safety implications. Consequently, unintended vulnerabilities or biases can be introduced. Previous studies have predominantly focused on specific versions of the models and disregarded the potential emergence of new attack vectors targeting the updated versions. Through the lens of adversarial examples within the in-context learning framework, this longitudinal study addresses this gap by conducting a comprehensive assessment of the robustness of successive versions of LLMs, specifically GPT-3.5. We conduct extensive experiments to analyze and understand the impact on robustness in two distinct learning categories: zero-shot learning and few-shot learning. Our findings indicate that, in comparison to earlier versions of LLMs, the updated versions do not exhibit the anticipated level of robustness against adversarial attacks. In addition, our study emphasizes the increased effectiveness of synergized adversarial queries in most zero-shot learning and few-shot learning cases. We hope that our study can lead to a more refined assessment of the robustness of LLMs over time and provide valuable insights into these models for both developers and users.
- Watermarking Diffusion Model. Preprint. 2023.
The availability and accessibility of diffusion models (DMs) have significantly increased in recent years, making them a popular tool for analyzing and predicting the spread of information, behaviors, or phenomena through a population. Particularly, text-to-image diffusion models (e.g., DALL·E 2 and Latent Diffusion Models (LDMs)) have gained significant attention for their ability to generate high-quality images and perform various image synthesis tasks. Despite their widespread adoption in many fields, DMs are often susceptible to various intellectual property violations. These can include not only copyright infringement but also more subtle forms of misappropriation, such as unauthorized use or modification of the model. Therefore, DM owners must be aware of these potential risks and take appropriate steps to protect their models. In this work, we are the first to protect the intellectual property of DMs. We propose a simple but effective watermarking scheme that injects the watermark into the DMs and can be verified by pre-defined prompts. In particular, we propose two different watermarking methods, namely NAIVEWM and FIXWM. The NAIVEWM method injects the watermark into the LDMs and activates it using a prompt containing the watermark. FIXWM, on the other hand, is more advanced and stealthy than NAIVEWM, as it activates the watermark only when the prompt contains a trigger in a fixed position. We conducted a rigorous evaluation of both approaches, demonstrating their effectiveness in watermark injection and verification with minimal impact on the LDM’s functionality.
- Network and Distributed System Security Symposium (NDSS). San Diego, CA, USA. Feb, 2023.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases, with both maintaining decent model utility. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
- Yugeng Liu*, Rui Wen*, Xinlei He, Ahmed Salem, Zhikun Zhang, Michael Backes, Emiliano De Cristofaro, Mario Fritz, Yang Zhang (* equal contribution)USENIX Security Symposium. Boston, MA, USA. Aug, 2022.
Inference attacks against Machine Learning (ML) models allow adversaries to learn information about training data, model parameters, and so on. While researchers have studied several kinds of attacks thoroughly, they have done so in isolation. As a result, we lack a comprehensive picture of the risks caused by the attacks, e.g., the different scenarios they can be applied to, the common factors that influence their performance, the relationship among them, or the effectiveness of defense techniques. In this paper, we fill this gap by presenting a first-of-its-kind holistic risk assessment of different inference attacks against machine learning models. We concentrate on four attacks – membership inference, model inversion, attribute inference, and model stealing – and establish a threat model taxonomy.
Our extensive experimental evaluation, conducted over five model architectures and four image datasets, shows that the complexity of the training dataset plays an important role with respect to the attack’s performance, while the effectiveness of model stealing and that of membership inference attacks are negatively correlated. We also show that defenses like DP-SGD and Knowledge Distillation can only mitigate some of the inference attacks. Our analysis relies on modular, reusable software, ML-DOCTOR, which enables ML model owners to assess the risks of deploying their models, and equally serves as a benchmark tool for researchers and practitioners.
- COSE'19. Computers & Security. 2019.
Third-party library (TPL) detection in Android has long been a hot topic for security researchers. A precise yet scalable detection of TPLs in applications can greatly facilitate other security activities such as TPL integrity checking, malware detection, and privacy leakage detection. Since specific versions of TPLs may exhibit their own security issues, identifying a TPL together with its concrete version can help assess the security of Android APPs. In reality, however, existing TPL detection approaches suffer from low efficiency, which makes their detection algorithms impractical, and from low accuracy due to insufficient analysis data, inappropriate features, or the disturbance from code obfuscation, shrinkage, and optimization.
In this paper, we present an automated approach, named PanGuard, to detect TPLs in an enormous number of Android APPs. We propose a novel combination of features, including both structural and content information for packages in APPs, to characterize TPLs. To address the difficulties caused by code obfuscation, shrinkage, and optimization, we identify the invariants that remain unchanged during mutation, separate TPLs from the primary code in APPs, and use these invariants to determine the contained TPLs as well as their versions. Extensive experiments show that PanGuard achieves high accuracy and scalability simultaneously in TPL detection. To accommodate optimized TPL detection, which has not been mentioned by previous work, we adopt set analysis, which also speeds up detection as a side effect.
PanGuard is implemented and deployed on an industrial edge computing platform, where it powers the identification of TPLs. Besides the fast detection algorithm, the edge computing deployment architecture makes the detection scalable to real-time detection on a large volume of emerging APPs. Based on the detection results from millions of Android APPs, we successfully identify over 800 TPLs with 12 versions on average. By investigating the differences among these versions, we identify over 10 security issues in TPLs and shed light on the significance of TPL detection and the harmful impact of these issues on the Android ecosystem.
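The idea of matching a package against known TPL versions through set analysis can be sketched as follows. The abstract does not say which set operation PanGuard uses, so Jaccard similarity here is a swapped-in assumption, and the feature strings, the `library_db` layout, the `0.8` threshold, and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def best_match(package_features, library_db, threshold=0.8):
    """Return the (name, version) whose feature set is most similar, or None."""
    scored = [
        (jaccard(package_features, feats), key)
        for key, feats in library_db.items()
    ]
    if not scored:
        return None
    score, key = max(scored)
    return key if score >= threshold else None

# Hypothetical database: each known TPL version maps to a set of invariant
# features (placeholder strings standing in for obfuscation-resistant hashes).
library_db = {
    ("libfoo", "1.0"): frozenset({"f1", "f2", "f3", "f4"}),
    ("libfoo", "2.0"): frozenset({"f1", "f2", "f5", "f6", "f7"}),
}

match = best_match({"f1", "f2", "f3", "f4"}, library_db)
no_match = best_match({"x", "y"}, library_db)
```

Because the comparison needs only set intersection and union, it does not depend on package or class names, which is one reason a set-based formulation suits obfuscated code.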
- ACM Conference on Computer and Communications Security (CCS). Toronto, Canada. Oct, 2018.
Smart home is an emerging technology for intelligently connecting a large variety of smart sensors and devices to facilitate automation of home appliances, lighting, heating and cooling systems, and security and safety systems. Our research revolves around Samsung SmartThings, a smart home platform with the largest number of apps among currently available smart home platforms. Previous research has revealed several security flaws in the design of SmartThings, which allow malicious smart home apps (or SmartApps) to possess more privileges than they were designed to have and to eavesdrop on or spoof events in the SmartThings platform. To address these problems, this paper leverages side-channel inference capabilities to design and develop a system, dubbed HoMonit, to monitor SmartApps from encrypted wireless traffic. To detect anomalies, HoMonit compares the SmartApp activities inferred from the encrypted traffic with their expected behaviors as dictated by their source code or user interfaces. To evaluate the effectiveness of HoMonit, we analyzed 181 official SmartApps and performed an evaluation on 60 malicious SmartApps, which either performed overprivileged accesses to smart devices or conducted event-spoofing attacks. The evaluation results suggest that HoMonit can effectively validate the working logic of SmartApps and achieve high accuracy in the detection of SmartApp misbehaviors.
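The comparison of inferred activities against expected behavior can be sketched with a small state machine. Modeling the expected behavior as a transition table is an assumption made for illustration. The state names, event names, and the function `first_violation` are hypothetical, and the step that infers events from encrypted traffic is not shown.

```python
def first_violation(transitions, start, events):
    """Return the index of the first event the model does not allow, else None."""
    state = start
    for index, event in enumerate(events):
        next_state = transitions.get((state, event))
        if next_state is None:
            return index  # event not expected in the current state
        state = next_state
    return None

# Hypothetical expected behavior of a motion-activated light app.
expected = {
    ("idle", "motion.active"): "triggered",
    ("triggered", "switch.on"): "lit",
    ("lit", "motion.inactive"): "pending",
    ("pending", "switch.off"): "idle",
}

normal = ["motion.active", "switch.on", "motion.inactive", "switch.off"]
suspicious = ["motion.active", "switch.on", "lock.unlock"]

normal_result = first_violation(expected, "idle", normal)
suspicious_result = first_violation(expected, "idle", suspicious)
```

In this toy model the third event of `suspicious` has no transition from the `lit` state, so it is flagged as a deviation from the expected behavior.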