
A Nostalgia Reading List for Beginners on AI Security

1. Adversarial Attacks

1.1. Adversarial Examples (AE) & Defenses

1.1.1. Survey

  • Wild patterns: Ten years after the rise of adversarial machine learning — A survey covering AI security research before 2018, focusing primarily on adversarial examples and poisoning attacks.

1.1.2. Attack Side

  • FGSM — the classic single-step gradient-sign attack, one of the earliest AE generation methods (a minimal FGSM/PGD sketch follows this list)
  • PGD — the standard iterative (projected gradient) AE generation algorithm
  • C&W — strong optimization-based attacks that systematize objective functions and distance metrics
  • TextBugger — AE attacks on NLP models
  • Black-box Adversarial Attacks on Commercial Speech Platforms with Minimal Information — AE on audio models
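
To make the first two attacks above concrete, here is a minimal PyTorch sketch of FGSM and PGD. It assumes a differentiable classifier `model`, inputs scaled to [0, 1], and an L-infinity budget `eps`; these are illustrative choices rather than the papers' exact settings (a common CIFAR-10 setting is eps = 8/255 with alpha = 2/255).

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps):
        # Single-step attack: move each pixel by eps along the sign of the loss gradient.
        x = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x), y).backward()
        return (x + eps * x.grad.sign()).clamp(0, 1).detach()

    def pgd(model, x, y, eps, alpha, steps=10):
        # Multi-step variant: repeat small gradient-sign steps and project back into the eps-ball.
        x_orig = x.clone().detach()
        x_adv = (x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)).clamp(0, 1)
        for _ in range(steps):
            x_adv = x_adv.clone().detach().requires_grad_(True)
            F.cross_entropy(model(x_adv), y).backward()
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)   # L-inf projection
            x_adv = x_adv.clamp(0, 1)
        return x_adv.detach()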

1.1.3. Empirical Defense

  • MagNet: a Two-Pronged Defense against Adversarial Examples (CCS'17) — manifold-based, unsupervised approach (a reconstruction-error detection sketch follows this list)
  • Obfuscated Gradients Give a False Sense of Security (ICML'18 Best Paper) — revisits the pre-2018 empirical defenses and breaks most of them by circumventing their obfuscated gradients
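
As a rough illustration of the manifold idea behind MagNet's detector, the sketch below flags inputs with a large autoencoder reconstruction error; the `autoencoder` and the `threshold` are placeholders the defender would train and calibrate on clean data.

    import torch

    def magnet_detect(autoencoder, x, threshold):
        # Inputs far from the learned data manifold reconstruct poorly and are flagged.
        with torch.no_grad():
            err = ((autoencoder(x) - x) ** 2).flatten(1).mean(dim=1)
        return err > threshold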

1.1.4. Certified Defense

  • SoK: Certified Robustness for Deep Neural Networks — a survey
  • Certified Adversarial Robustness via Randomized Smoothing (ICML'19) — an early work on randomized smoothing (a Monte-Carlo prediction sketch follows this list)
  • AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation (S&P'18) — deterministic certification based on ideas from program verification
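
A minimal Monte-Carlo sketch of the smoothed classifier used in randomized smoothing: classify many Gaussian-noised copies of the input and return the majority vote. The paper's certification step, which additionally needs a binomial confidence bound on the top-class probability to compute the certified radius, is omitted here.

    import torch

    def smoothed_predict(model, x, num_classes, sigma=0.25, n=1000, batch=100):
        # Estimate g(x) = argmax_c P( f(x + N(0, sigma^2 I)) = c ) by sampling.
        counts = torch.zeros(num_classes, dtype=torch.long)
        with torch.no_grad():
            for i in range(0, n, batch):
                b = min(batch, n - i)
                noise = sigma * torch.randn(b, *x.shape, device=x.device)
                preds = model(x.unsqueeze(0) + noise).argmax(dim=1)
                counts += torch.bincount(preds.cpu(), minlength=num_classes)
        return int(counts.argmax())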

1.2. Backdoor Attacks & Defenses

1.2.1. Survey

  • TrojanZoo — a unified, open-sourced evaluation framework for backdoor attacks and defenses, built with substantial engineering effort

1.2.2. Attack Side

  • TrojanNN (NDSS'18) — neuron-based injection
  • BadNets (IEEE Access) — data-based injection, the classic (a trigger-stamping sketch follows this list)
  • Latent Backdoor (CCS'19) — extending the backdoor attack to pretrained encoders, via feature alignment
  • Input-Aware Backdoor (NeurIPS'20) — the first dynamic trigger backdoor
  • Hidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation (Security'22) — the first dynamic backdoor on NLP models (Ours)
  • Towards Backdoor Attack on Deep Learning based Time Series Classification (ICDE'22) — the first effective backdoor attack on time series models (Ours)
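
A minimal sketch of BadNets-style data poisoning: stamp a fixed patch onto a fraction of the training images and relabel them with the attacker's target class. The patch location, size, and value below are illustrative choices.

    import torch

    def stamp_trigger(x, y, target_label, size=3, value=1.0):
        # Place a small solid square in the bottom-right corner and flip the label.
        x = x.clone()
        x[..., -size:, -size:] = value
        y = torch.full_like(y, target_label)
        return x, y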

1.2.3. Defense Side

  • Fine-pruning (RAID'18) — pruning and finetuning, based on the hypothesized differences in activation patterns
  • STRIP (ACSAC'19) — detection based on the hypothesis that predictions on triggered inputs stay stable under strong input perturbations (an entropy-based sketch follows this list)
  • Neural Cleanse (S&P'19) — reverse-engineers a small static trigger per class, exploiting the strong link between backdoor behavior and the static trigger pattern
  • ABS (CCS'19) — neuron-level inspection
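
The core of STRIP fits in a few lines: blend the suspicious input with random clean images and measure the entropy of the resulting predictions. A triggered input keeps being classified as the target class, so its average entropy is abnormally low. The `overlay_pool` of held-out clean images and the blending ratio are illustrative.

    import torch
    import torch.nn.functional as F

    def strip_entropy(model, x, overlay_pool, n=32, alpha=0.5):
        # Superimpose n random clean images onto x and average the prediction entropy.
        idx = torch.randint(len(overlay_pool), (n,))
        blended = alpha * x.unsqueeze(0) + (1 - alpha) * overlay_pool[idx]
        with torch.no_grad():
            p = F.softmax(model(blended), dim=1)
        entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
        return entropy.mean().item()   # low value suggests a trigger dominates the prediction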

1.3. Poisoning Attacks

1.3.1. Clean-Label Attacks

  • Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks (NIPS'18) — presents the idea of clean-label poisoning via feature-level alignment (a feature-collision sketch follows this list)
  • Bullseye Polytope (EuroS&P'21) — enhancing the attack effectiveness from a geometric view, simple yet effective
  • Label-consistent backdoor attack — another approach to clean-label attacks
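
A simplified, single-objective sketch of the Poison Frogs feature-collision idea (the paper itself uses a forward-backward splitting procedure): keep the poison close to a base image of the poison class in pixel space while pulling its features toward the target example. Here `feat` stands for the victim's frozen feature extractor.

    import torch

    def feature_collision(feat, target, base, beta=0.1, lr=0.01, steps=500):
        # Optimize the poison so that feat(poison) ~ feat(target) while poison ~ base.
        with torch.no_grad():
            target_feat = feat(target.unsqueeze(0))
        poison = base.clone().detach().requires_grad_(True)
        opt = torch.optim.Adam([poison], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((feat(poison.unsqueeze(0)) - target_feat) ** 2).sum() \
                   + beta * ((poison - base) ** 2).sum()
            loss.backward()
            opt.step()
            with torch.no_grad():
                poison.clamp_(0, 1)
        return poison.detach()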

1.4. Byzantine Attacks

  • Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent (Krum) (NIPS'17) — statistics-based defense (a Krum sketch follows this list)
  • Justinian’s GAAvernor: Robust Distributed Learning with Gradient Aggregation Agent (Security'20) — RL-based defense (Ours)
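
A compact sketch of Krum, the statistics-based aggregation rule mentioned above: each worker's update is scored by its total squared distance to its n - f - 2 nearest neighbors, and the lowest-scoring update is selected (this assumes n > f + 2 workers, at most f of them Byzantine).

    import torch

    def krum(grads, f):
        # grads: list of per-worker gradient tensors; f: assumed number of Byzantine workers.
        n = len(grads)
        flat = torch.stack([g.flatten() for g in grads])
        dists = torch.cdist(flat, flat) ** 2                      # pairwise squared distances
        scores = []
        for i in range(n):
            others = torch.cat([dists[i, :i], dists[i, i + 1:]])  # distances to the other workers
            scores.append(others.sort().values[: n - f - 2].sum())
        return grads[int(torch.stack(scores).argmin())]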

2. Privacy Attacks

2.1. Membership Inference

2.1.1. Survey

  • Membership Inference Attacks on Machine Learning: A Survey (ACM Computing Surveys)

2.1.2. Attack Side

  • Membership Inference Attacks against Machine Learning Models (S&P'17) — the earliest MIA, based on shadow models
  • ML-Leaks (NDSS'19) — a minimalist MIA that relaxes the shadow-model assumptions (a confidence-threshold sketch follows this list)
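
The simplest membership inference baseline, in the relaxed spirit of ML-Leaks: samples on which the model is unusually confident are guessed to be training members. The threshold is an illustrative parameter that would normally be calibrated, e.g., with shadow models.

    import torch
    import torch.nn.functional as F

    def confidence_mia(model, x, threshold=0.9):
        # Predict "member" when the top softmax confidence exceeds the threshold.
        with torch.no_grad():
            conf = F.softmax(model(x), dim=1).max(dim=1).values
        return conf > threshold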

2.1.3. Defense Side

  • MemGuard (CCS'19) — defense by logit-level obfuscation

2.2. Property Inference

2.2.1. Global Property

  • A CCS'18 paper on inferring global properties of the training data from the model weights

2.2.2. Individual Property

  • Exploiting Unintended Feature Leakage in Collaborative Learning (S&P'19) — feature inference from the embeddings or gradients exchanged in federated learning scenarios
  • Privacy risks of general-purpose language models (S&P'20) — reconstructing private semantics from the text embeddings of pretrained language models (Ours)

2.3. Data Reconstruction

2.3.1. Gradient-Based

  • Deep Leakage from Gradients (DLG) (NeurIPS'19) — the earliest gradient inversion attack (a gradient-matching sketch follows this list)
  • GradInversion (CVPR'21) — incorporates data priors into the reconstruction
  • Exploring the Security Boundary of Data Reconstruction via Neuron Exclusivity Analysis (Security'22) — equation-solving-based reconstruction, pixel-level reconstruction (Ours)
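
A stripped-down sketch of the DLG-style optimization: start from a random dummy input and label, and minimize the distance between the gradients they induce and the gradients leaked from the victim. The shapes and optimizer settings are illustrative; `x_shape` includes the batch dimension.

    import torch

    def dlg_reconstruct(model, true_grads, x_shape, num_classes, steps=300):
        # Optimize (dummy_x, dummy_y) so the induced gradients match the leaked ones.
        dummy_x = torch.randn(x_shape, requires_grad=True)
        dummy_y = torch.randn(1, num_classes, requires_grad=True)
        opt = torch.optim.LBFGS([dummy_x, dummy_y])

        def closure():
            opt.zero_grad()
            pred = model(dummy_x)
            loss = -(dummy_y.softmax(dim=-1) * pred.log_softmax(dim=-1)).sum()
            grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
            diff = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
            diff.backward()
            return diff

        for _ in range(steps):
            opt.step(closure)
        return dummy_x.detach(), dummy_y.softmax(dim=-1).detach()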

2.3.2. Weight-Based

  • Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures (CCS'15) — presents the idea of model inversion (a confidence-maximization sketch follows this list)
  • Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion (CVPR'20) — distilling (synthetic) training data from the pretrained model only
  • Extracting Training Data from Large Language Models (USENIX Security'21) — extracting training data from GPT-2 based on MIA
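
A toy sketch of the basic model inversion idea: gradient-ascend an input so that the model's confidence in the target class is maximized, yielding a class-representative reconstruction. The image priors and regularizers that make this work well in practice (as in DeepInversion) are omitted.

    import torch

    def invert_class(model, target_class, x_shape, lr=0.1, steps=200):
        # Optimize a blank input to maximize the target-class confidence.
        x = torch.zeros(x_shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = -model(x)[:, target_class].mean()
            loss.backward()
            opt.step()
            with torch.no_grad():
                x.clamp_(0, 1)
        return x.detach()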

2.4. Model Extraction/Stealing

2.4.1. Attack Side

  • Stealing machine learning models via prediction APIs (Security'16) — the earliest systematic model extraction attack against prediction APIs (a distillation-style sketch follows this list)
  • High Accuracy and High Fidelity Extraction of Neural Networks (Security'20) — proposes the notion of fidelity and exploits the ReLU property
  • Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model (TPAMI'24) — stealing by steganography (Ours)
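
A distillation-style sketch of learning-based model extraction: label attacker-chosen queries with the victim's posteriors, then train a surrogate to imitate them. Here `query_victim` stands for the remote prediction API and is assumed to return probability vectors.

    import torch
    import torch.nn.functional as F

    def extract(query_victim, surrogate, queries, epochs=10, lr=1e-3):
        # Train the surrogate to match the victim's output distribution on the queries.
        with torch.no_grad():
            victim_probs = query_victim(queries)
        opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = F.kl_div(F.log_softmax(surrogate(queries), dim=1),
                            victim_probs, reduction="batchmean")
            loss.backward()
            opt.step()
        return surrogate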

2.4.2. Defense Side

  • PRADA: Protecting against DNN Model Stealing Attacks (EuroS&P'19) — a classic detection defense based on the distribution of the attacker's queries

3. Model Intellectual Property Protection

3.1. Model Watermarking

3.1.1. Survey

  • SoK: How Robust is Image Classification Deep Neural Network Watermarking? (S&P'22) — a good survey
  • A Systematic Review on Model Watermarking for Neural Networks

3.1.2. White-box Watermarking

  • Embedding Watermarks into Deep Neural Networks (ICMR'17) — the earliest white-box watermarking scheme, embedding bits via a weight regularizer (a regularizer sketch follows this list)
  • Cracking White-box DNN Watermarks via Invariant Neuron Transforms (KDD'23) — Weight-based Obfuscation Attack (Ours)
  • Rethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation (USENIX Security'23) — Structure-based Obfuscation Attack (Ours)
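
The white-box embedding idea from the ICMR'17 paper reduces to one extra regularization term during training: a secret random projection of a chosen layer's (averaged) weights is pushed toward the owner's bit string. A minimal sketch, with `proj` and `bits` as the owner's secrets:

    import torch
    import torch.nn.functional as F

    def watermark_loss(layer_weight, proj, bits):
        # layer_weight: conv kernel (out, in, k, k); proj: (num_bits, in*k*k); bits: 0/1 tensor.
        w = layer_weight.mean(dim=0).flatten()          # average over output filters
        logits = proj @ w                               # secret linear projection
        return F.binary_cross_entropy_with_logits(logits, bits.float())

    # During training: total_loss = task_loss + lambda_wm * watermark_loss(conv.weight, proj, bits)
    # Verification: bits_hat = torch.sigmoid(proj @ conv.weight.mean(0).flatten()) > 0.5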

3.1.3. Black-box Watermarking

  • Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring (USENIX Security'18) — one of the earliest black-box watermarking schemes (a trigger-set verification sketch follows this list)
  • Neural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data (CCS'24) — cracking nine mainstream black-box watermarking schemes (Ours)
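
Verification for backdoor-based black-box watermarks is essentially an accuracy check on the owner's secret trigger set, issued through the suspect model's prediction API; the trigger set and the decision threshold below are illustrative.

    import torch

    def verify_watermark(suspect_model, trigger_x, trigger_y, threshold=0.8):
        # Ownership is claimed if the suspect model reproduces enough trigger-set labels.
        with torch.no_grad():
            preds = suspect_model(trigger_x).argmax(dim=1)
        return (preds == trigger_y).float().mean().item() >= threshold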

3.2. Model Fingerprinting

  • IPGuard: Protecting Intellectual Property of Deep Neural Networks via Fingerprinting the Classification Boundary (AsiaCCS'21) — one of the earliest fingerprinting algorithms for classifiers
  • TAFA: A Task-Agnostic Fingerprinting Algorithm for Neural Networks (ESORICS'21) — task-agnostic (Ours)
  • MetaV: A Meta-Verifier Approach to Task-Agnostic Model Fingerprinting (KDD'22) — more generic (Ours)

Instrumental Convergence: A Trigger for Self-Preservation via Self-Replication

Instrumental Convergence

Instrumental convergence is a concept popularized by the philosopher Nick Bostrom in his analysis of AI alignment, most prominently in his 2014 book Superintelligence: Paths, Dangers, Strategies (Oxford University Press). It suggests that most AIs, while pursuing diverse final goals, will converge on a common set of instrumental goals, such as self-preservation and resource acquisition, whose pursuit could contribute to an intelligence explosion.

The hypothesis posits that although different AIs may have widely varying long-term objectives (final goals), they tend to pursue a common set of instrumental values along the way, because doing so increases the probability of achieving those final goals: