A Nostalgia Reading List for Beginners on AI Security
We miss those innocent days, but we must move on to bigger challenges
Table of Contents

- 1. Adversarial Attacks
- 1.1. Adversarial Examples (AE) & Defenses
- 1.2. Backdoor Attacks & Defenses
- 1.3. Poisoning Attacks
- 1.4. Byzantine Attacks
- 2. Privacy Attacks
- 2.1. Membership Inference
- 2.2. Property Inference
- 2.3. Data Reconstruction
- 2.4. Model Extraction/Stealing
- 3. Copyright Protection
- 3.1. Model Watermarking
- 3.2. Model Fingerprinting
1. Adversarial Attacks
1.1. Adversarial Examples (AE) & Defenses
1.1.1. Survey
- Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning (a survey of AI security research before 2018, mainly on adversarial examples and poisoning attacks)
1.1.2. Attack Side
- FGSM (among the earliest AE generation algorithms, a one-step gradient attack)
- PGD (the de facto standard iterative AE generation algorithm)
- [C&W](https://ieeexplore.ieee.org/abstract/document/7958570) (systematization)
- TextBugger (AE on NLP models)
- Black-box Adversarial Attacks on Commercial Speech Platforms with Minimal Information (AE on audio models)
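To build intuition for the attacks above, here is a minimal FGSM sketch on a toy linear softmax classifier. Everything here (the model, dimensions, and epsilon) is illustrative and assumed, not taken from any of the papers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(W, x, y, eps):
    """One-step FGSM for a linear softmax classifier (logits = W @ x):
    perturb x by eps in the sign of the input-gradient of the loss."""
    p = softmax(W @ x)
    p[y] -= 1.0                    # d(cross-entropy)/d(logits)
    grad_x = W.T @ p               # chain rule back to the input
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))        # toy 3-class model on 8-dim inputs
x = rng.uniform(size=8)
y = int(np.argmax(W @ x))          # attack the model's own prediction
x_adv = fgsm(W, x, y, eps=0.3)
```

PGD is essentially this step applied repeatedly with a projection back into the epsilon-ball after each iteration.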
1.1.3. Empirical Defense
- MagNet: a Two-Pronged Defense against Adversarial Examples (CCS'17, manifold-based, unsupervised)
- Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples (ICML'18 Best Paper, which surveys the defenses before 2018 and BREAKS them)
1.1.4. Certified Defense
- SoK: Certified Robustness for Deep Neural Networks (a survey)
- Certified Adversarial Robustness via Randomized Smoothing (ICML'19, an early work on randomized smoothing)
- AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation (S&P'18, deterministic certification based on ideas from program verification)
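The prediction side of randomized smoothing can be sketched in a few lines: classify many Gaussian-noised copies of the input and take a majority vote. The toy base classifier below stands in for a real network, and the certified-radius computation via the Gaussian tail bound is omitted:

```python
import numpy as np
from collections import Counter

def smoothed_predict(f, x, sigma, n, rng):
    """Monte-Carlo estimate of the smoothed classifier
    g(x) = argmax_c P[f(x + N(0, sigma^2 I)) = c]."""
    votes = Counter(f(x + sigma * rng.normal(size=x.shape)) for _ in range(n))
    return votes.most_common(1)[0][0]

f = lambda x: int(np.mean(x) > 0)      # toy base classifier standing in for a DNN
rng = np.random.default_rng(1)
x = np.full(16, 0.5)                   # an input well inside class 1
pred = smoothed_predict(f, x, sigma=0.25, n=200, rng=rng)
```

The smoothed classifier g is provably robust within an L2 radius that grows with sigma and with the vote margin, which is what the ICML'19 paper certifies.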
1.2. Backdoor Attacks & Defenses
1.2.1. Survey
- TrojanZoo (a huge engineering effort with an open-source framework)
1.2.2. Attack Side
- TrojanNN (NDSS'18, neuron-based injection)
- BadNets (IEEE Access, data-poisoning-based injection, a classic)
- Latent Backdoor (CCS'19, extending the backdoor attack to pretrained encoders via feature alignment)
- Input-Aware Backdoor (NeurIPS'20, the first dynamic-trigger backdoor)
- Hidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation (Security'22, the first dynamic backdoor on NLP models, Ours)
- Towards Backdoor Attack on Deep Learning based Time Series Classification (ICDE'22, the first effective backdoor attack on time series models, Ours)
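The BadNets-style recipe behind many of these attacks is simple: stamp a static trigger patch onto a fraction of the training images and relabel them to the attacker's target class. A minimal sketch, with toy data and an assumed patch size and poisoning rate:

```python
import numpy as np

def stamp_trigger(img, value=1.0, size=3):
    """BadNets-style static trigger: overwrite a small corner patch."""
    poisoned = img.copy()
    poisoned[-size:, -size:] = value
    return poisoned

def poison_dataset(X, y, target_label, rate, rng):
    """Poison a fraction of the training set: stamp the trigger and flip the label."""
    Xp, yp = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(rate * len(X)), replace=False)
    for i in idx:
        Xp[i] = stamp_trigger(X[i])
        yp[i] = target_label
    return Xp, yp, idx

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 8, 8))            # toy 8x8 "images"
y = rng.integers(0, 10, size=100)
Xp, yp, idx = poison_dataset(X, y, target_label=7, rate=0.1, rng=rng)
```

A model trained on (Xp, yp) behaves normally on clean inputs but predicts the target class whenever the patch is present; the dynamic-trigger works above replace the static patch with an input-dependent generator.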
1.2.3. Defense Side
- Fine-Pruning (RAID'18, literally pruning plus fine-tuning, based on hypothesized differences in activation patterns)
- STRIP (ACSAC'19, detection based on the hypothesis that a triggered input's prediction is resilient to noise)
- Neural Cleanse (S&P'19, exploits the strong link between backdoor behavior, i.e., misclassification, and a static trigger pattern)
- ABS (CCS'19, neuron-level inspection)
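STRIP's core test is easy to sketch: blend the suspicious input with random clean images and measure the entropy of the model's predictions; a trigger survives the blending, so its prediction stays confident (low entropy). The stub backdoored model below is an assumption standing in for a real trojaned network:

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def strip_score(model, x, clean_pool, n, rng):
    """STRIP: average prediction entropy of x superimposed with random clean
    images. A low score means the prediction is unusually stable under
    perturbation, i.e., a trigger is likely present."""
    scores = []
    for _ in range(n):
        overlay = clean_pool[rng.integers(len(clean_pool))]
        scores.append(entropy(model(0.5 * x + 0.5 * overlay)))
    return float(np.mean(scores))

def stub_model(x):
    """Hypothetical backdoored model: a bright last feature acts as the trigger."""
    if x[-1] > 0.45:                       # trigger present -> confident target class
        p = np.full(10, 1e-3)
        p[7] = 1.0 - 9e-3
        return p
    return np.full(10, 0.1)                # otherwise maximally uncertain

rng = np.random.default_rng(0)
clean_pool = rng.uniform(0.0, 0.3, size=(50, 16))
benign = rng.uniform(0.0, 0.3, size=16)
trojaned = benign.copy()
trojaned[-1] = 1.0                         # stamp the trigger
s_benign = strip_score(stub_model, benign, clean_pool, n=30, rng=rng)
s_trojan = strip_score(stub_model, trojaned, clean_pool, n=30, rng=rng)
```

Thresholding the entropy score then separates trigger-carrying inputs from benign ones at test time.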
1.3. Poisoning Attacks
1.3.1. Clean-Label Attacks
- Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks (NeurIPS'18, presents the idea of clean-label attacks via feature-level alignment)
- Bullseye Polytope (EuroS&P'21, enhances the effectiveness of the above attack from a geometric view; simple yet effective)
- Label-Consistent Backdoor Attacks (another approach to clean-label attacks)
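The Poison Frogs feature-collision objective can be sketched with a linear feature map standing in for a network's penultimate layer: craft a poison that stays close to a base image in input space while colliding with the target in feature space. The map, step size, and weights below are all assumptions for illustration:

```python
import numpy as np

def feature_collision(A, target, base, beta=0.1, lr=0.05, steps=500):
    """Poison Frogs-style feature collision for a linear feature map phi(x) = A @ x:
    minimize ||phi(p) - phi(target)||^2 + beta * ||p - base||^2 by gradient descent."""
    p = base.copy()
    phi_t = A @ target
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ p - phi_t) + 2.0 * beta * (p - base)
        p -= lr * grad
    return p

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 16)) / 4.0       # toy 16-dim inputs, 4-dim features
target = rng.uniform(size=16)            # the sample the attacker wants misclassified
base = rng.uniform(size=16)              # an innocuous-looking carrier image
poison = feature_collision(A, target, base)
```

Because the poison keeps its (correct) base label, a human labeler sees nothing wrong, yet a model fine-tuned on it pulls the target's feature region into the base class.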
1.4. Byzantine Attacks
- Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent (Krum) (NIPS'17, statistics-based defense)
- Justinian's GAAvernor: Robust Distributed Learning with Gradient Aggregation Agent (Security'20, RL-based defense, Ours)
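Krum's selection rule is compact enough to sketch directly: score each submitted gradient by its summed squared distance to its n-f-2 nearest peers, and keep the lowest-scoring one, so colluding outliers far from the honest cluster are rejected. The worker counts and values below are toy assumptions:

```python
import numpy as np

def krum(grads, f):
    """Krum: return the index of the gradient closest (in summed squared
    distance) to its n - f - 2 nearest neighbors."""
    n = len(grads)
    scores = []
    for i in range(n):
        d = sorted(np.sum((grads - grads[i]) ** 2, axis=1))
        scores.append(sum(d[1 : n - f - 1]))   # skip self (d[0] = 0)
    return int(np.argmin(scores))

rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(8, 5))   # 8 honest workers
byzantine = np.full((2, 5), 50.0)                      # 2 colluding outliers
grads = np.vstack([honest, byzantine])
chosen = krum(grads, f=2)                              # index of the kept gradient
```

The GAAvernor line of work replaces such fixed statistical rules with a learned aggregation policy.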
2. Privacy Attacks
2.1. Membership Inference
2.1.1. Survey
- Membership Inference Attacks on Machine Learning: A Survey (ACM Computing Surveys)
2.1.2. Attack Side
- Membership Inference Attacks against Machine Learning Models (S&P'17, the earliest MIA)
- ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models (NDSS'19, a minimalist attack that relaxes the assumptions of the above)
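The minimalist end of this attack family needs only the victim's confidence vector: members of the training set tend to be classified more confidently, so a simple threshold on the top probability already separates them. The synthetic confidence vectors below are an assumption standing in for real victim outputs:

```python
import numpy as np

def confidence_attack(probs, tau):
    """ML-Leaks-style single-query MIA: predict 'member' when the model's
    top confidence exceeds a threshold tau."""
    return np.max(probs, axis=1) > tau

rng = np.random.default_rng(0)
# toy victim outputs: sharply peaked on members, nearly uniform on non-members
members = rng.dirichlet(alpha=np.full(10, 0.1), size=200)
nonmembers = rng.dirichlet(alpha=np.full(10, 5.0), size=200)
tau = 0.8
tpr = float(confidence_attack(members, tau).mean())       # true-positive rate
fpr = float(confidence_attack(nonmembers, tau).mean())    # false-positive rate
```

The S&P'17 attack instead trains shadow models to learn this member/non-member boundary rather than hand-picking a threshold.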
2.1.3. Defense Side
- MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples (CCS'19, defense by logit-level obfuscation)
2.2. Property Inference
2.2.1. Global Property
- https://dl.acm.org/doi/pdf/10.1145/3243734.3243834 (CCS'18, inference based on the model weights)
2.2.2. Individual Property
- Exploiting Unintended Feature Leakage in Collaborative Learning (S&P'19, feature inference based on the embedding or gradient in federated learning scenarios)
- Privacy Risks of General-Purpose Language Models (S&P'20, reconstructing private semantics from the text embeddings of pretrained language models, Ours)
2.3. Data Reconstruction
2.3.1. Gradient-Based
- Deep Leakage from Gradients (DLG, NeurIPS'19, the earliest)
- GradInversion (CVPR'21, incorporates data priors into the reconstruction)
- Exploring the Security Boundary of Data Reconstruction via Neuron Exclusivity Analysis (Security'22, equation-solving-based, pixel-level reconstruction, Ours)
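Why gradients leak data at all is visible in the simplest case. For a scalar linear model with squared loss, the gradient with respect to the weights is (w·x - y)x, a scalar multiple of the private input, so its direction can be read off directly; this is the equation-solving view, and DLG generalizes it by optimizing a dummy input to match the observed gradient. A toy sketch under those assumptions (the label is chosen so the residual is a fixed 2.0, avoiding a degenerate zero gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)                 # current model weights
x_true = rng.normal(size=8)            # the private training input
y_true = w @ x_true - 2.0              # chosen so the residual r = 2.0 (illustrative)

# Gradient a federated-learning server observes for loss = 0.5 * (w.x - y)^2:
g = (w @ x_true - y_true) * x_true     # = r * x_true, a scalar multiple of the input

# The gradient's direction reveals the input exactly (up to scale):
x_dir = g / np.linalg.norm(g)
cos = abs(x_dir @ x_true) / np.linalg.norm(x_true)
```

For deep networks no such closed form exists, which is where the optimization-based (DLG, GradInversion) and exclusivity-analysis approaches come in.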
2.3.2. Weight-Based
- Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures (CCS'15, presents the idea of model inversion)
- Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion (CVPR'20, distilling (synthetic) training data from the pretrained model alone)
- Extracting Training Data from Large Language Models (USENIX Security'21, extracting training data from GPT-2 based on MIA)
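The shared mechanism behind model inversion and DeepInversion is gradient ascent on the input: synthesize an input that maximizes a class score under some prior, using only the weights. A minimal sketch on a toy linear classifier with an L2 prior (the model, regularizer, and step size are assumptions for illustration):

```python
import numpy as np

def invert_class(W, c, lam=0.5, lr=0.1, steps=300):
    """Gradient-ascent model inversion from weights alone: find an input that
    maximizes the class-c logit under an L2 prior, W[c] @ x - lam * ||x||^2."""
    x = np.zeros(W.shape[1])
    for _ in range(steps):
        x += lr * (W[c] - 2.0 * lam * x)    # gradient of the objective w.r.t. x
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 12))                # toy 5-class linear classifier
x_inv = invert_class(W, c=2)                # "dreamed" exemplar of class 2
```

DeepInversion swaps the L2 prior for batch-norm statistics stored in the pretrained network, which is what makes its synthetic images look natural.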
2.4. Model Extraction/Stealing
2.4.1. Attack Side
- Stealing Machine Learning Models via Prediction APIs (Security'16, the earliest extraction attack, based on distillation)
- High Accuracy and High Fidelity Extraction of Neural Networks (Security'20, proposes the notion of fidelity and exploits the ReLU property)
- Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model (TPAMI'24, stealing by steganography, Ours)
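The distillation-style attack can be sketched end to end: query the black-box victim, then fit a surrogate on its answers. The hidden linear victim and the logistic-regression surrogate below are toy assumptions standing in for a real API and a real student model:

```python
import numpy as np

W_SECRET = np.array([2.0, -1.0, 0.5])       # hidden inside the victim

def victim(X):
    """Black-box victim API: returns hard labels only."""
    return (X @ W_SECRET > 0).astype(float)

def extract(query_fn, dim, n_queries, rng, lr=0.5, epochs=500):
    """Extraction a la Tramer et al.: query the victim, then train a
    logistic-regression surrogate on the (query, answer) pairs."""
    X = rng.normal(size=(n_queries, dim))
    y = query_fn(X)
    w = np.zeros(dim)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / n_queries
    return w

rng = np.random.default_rng(0)
w_sur = extract(victim, dim=3, n_queries=500, rng=rng)
X_test = rng.normal(size=(2000, 3))
agreement = float(np.mean((X_test @ w_sur > 0) == (victim(X_test) > 0)))
```

The Security'20 paper distinguishes this accuracy-style goal from fidelity (matching the victim's every output, including its mistakes), which needs stronger techniques than plain distillation.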
2.4.2. Defense Side
- PRADA: Protecting against DNN Model Stealing Attacks (EuroS&P'19, a classic one)
3. Copyright Protection
3.1. Model Watermarking
3.1.1. Survey
- SoK: How Robust is Image Classification Deep Neural Network Watermarking? (S&P'22, a good survey)
- A Systematic Review on Model Watermarking for Neural Networks
3.1.2. White-box Watermarking
- Embedding Watermarks into Deep Neural Networks (ICME'17, the earliest white-box watermarking scheme)
- Cracking White-box DNN Watermarks via Invariant Neuron Transforms (KDD'23, weight-based obfuscation attack)
- Rethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation (USENIX Security'23, structure-based obfuscation attack)
3.1.3. Black-box Watermarking
- Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring (USENIX Security'18, one of the earliest black-box watermarking schemes)
- Neural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data (CCS'24, cracking nine mainstream black-box watermarking schemes, Ours)
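Backdoor-based black-box watermarking reduces to a simple ownership test: the owner keeps a secret trigger set with deliberately wrong labels; a model that reproduces those labels far above chance almost certainly derives from the watermarked one. A sketch with stub models standing in for real watermarked and scrubbed networks (all names and thresholds are assumptions):

```python
import numpy as np

def verify_watermark(model_fn, trigger_set, trigger_labels, threshold=0.9):
    """Black-box ownership test: does the model reproduce the secret
    trigger-set labels at a rate far above chance?"""
    preds = np.array([model_fn(x) for x in trigger_set])
    return bool(np.mean(preds == trigger_labels) >= threshold)

rng = np.random.default_rng(0)
trigger_set = rng.uniform(size=(20, 8))          # secret key samples
trigger_labels = rng.integers(0, 10, size=20)    # their assigned (wrong) labels

memory = {x.tobytes(): int(y) for x, y in zip(trigger_set, trigger_labels)}
watermarked = lambda x: memory.get(x.tobytes(), 0)   # stub: memorized the key
scrubbed = lambda x: 3                               # stub: watermark removed

owned = verify_watermark(watermarked, trigger_set, trigger_labels)
not_owned = verify_watermark(scrubbed, trigger_set, trigger_labels)
```

Removal attacks like Neural Dehydration aim exactly at driving this trigger-set accuracy back down to chance while preserving clean accuracy.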
3.2. Model Fingerprinting
- IPGuard: Protecting Intellectual Property of Deep Neural Networks via Fingerprinting the Classification Boundary (AsiaCCS'21, one of the earliest fingerprinting algorithms for classifiers)
- TAFA: A Task-Agnostic Fingerprinting Algorithm for Neural Networks (ESORICS'21, task-agnostic, Ours)
- MetaV: A Meta-Verifier Approach to Task-Agnostic Model Fingerprinting (KDD'22, more generic, Ours)
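Unlike watermarking, fingerprinting modifies nothing: it characterizes the model it already has, e.g. by probing points near its decision boundary, where a stolen copy agrees with the victim but an independently trained model does not. A toy sketch of the verification step with assumed linear models:

```python
import numpy as np

def fingerprint_match(victim_fn, suspect_fn, probes):
    """IPGuard-style verification: agreement rate on probe points sampled near
    the victim's decision boundary, where unrelated models tend to disagree."""
    v = np.array([victim_fn(x) for x in probes])
    s = np.array([suspect_fn(x) for x in probes])
    return float(np.mean(v == s))

rng = np.random.default_rng(0)
w_victim = np.array([1.0, -1.0])
victim = lambda x: int(x @ w_victim > 0)

# probes hugging the victim's boundary x1 = x2, plus tiny jitter
t = rng.normal(size=32)
probes = np.stack([t, t], axis=1) + rng.normal(scale=1e-3, size=(32, 2))

stolen = victim                                     # exact copy of the victim
unrelated = lambda x: int(x @ np.array([1.0, 1.0]) > 0)
m_stolen = fingerprint_match(victim, stolen, probes)
m_unrelated = fingerprint_match(victim, unrelated, probes)
```

Task-agnostic schemes like TAFA and MetaV extend this idea beyond classifiers by choosing probes and verifiers that do not depend on the downstream task.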