
{
    "version": "https://jsonfeed.org/version/1.1",
    "title": "Blog on Xudong Pan",
    "description": "Recent content in Blog on Xudong Pan",
    "home_page_url": "https://ravensanstete.github.io/",
    "feed_url": "https://ravensanstete.github.io/en/blog/index.json",
    "language": "en-US",
    "icon": "https://ravensanstete.github.io/favicon.svg",
    "favicon": "https://ravensanstete.github.io/favicon.svg",
    "authors": [
        {
            "name": "Xudong Pan",
            "url": "https://ravensanstete.github.io/",
            "avatar": "https://ravensanstete.github.io/images/avatar.jpg"
        }
    ],
    "items": [
        {
            "title": "A Nostalgia Reading List for Beginners on AI Security",
            "date_published": "2025-02-06T00:00:00Z",
            "date_modified": "2025-02-06T00:00:00Z",
            "id": "https://ravensanstete.github.io/en/blog/ai-security-reading-list/",
            "url": "https://ravensanstete.github.io/en/blog/ai-security-reading-list/",
            "content_html": "\u003ch2 id=\"1-adversarial-attacks\"\u003e1. Adversarial Attacks\u003c/h2\u003e\n\u003ch3 id=\"11-adversarial-examples-ae--defenses\"\u003e1.1. Adversarial Examples (AE) \u0026amp; Defenses\u003c/h3\u003e\n\u003ch4 id=\"111-survey\"\u003e1.1.1. Survey\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eWild patterns: Ten years after the rise of adversarial machine learning\u003c/strong\u003e — A survey covering AI security research before 2018, focusing primarily on adversarial examples and poisoning attacks.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"112-attack-side\"\u003e1.1.2. Attack Side\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eFGSM\u003c/strong\u003e — the first AE\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003ePGD\u003c/strong\u003e — the first iterative AE generation algorithm\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eC\u0026amp;W\u003c/strong\u003e — systematization work\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTextBugger\u003c/strong\u003e — AE attacks on NLP models\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eBlack-box Adversarial Attacks on Commercial Speech Platforms with Minimal Information\u003c/strong\u003e — AE on audio models\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"113-empirical-defense\"\u003e1.1.3. Empirical Defense\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eMagNet: a Two-Pronged Defense against Adversarial Examples\u003c/strong\u003e (CCS'17) — manifold-based, unsupervised approach\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eObfuscated Gradients Give a False Sense of Security\u003c/strong\u003e (ICML'18 Best Paper) — surveys pre-2018 defenses and breaks them; circumventing defenses to adversarial examples\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"114-certified-defense\"\u003e1.1.4. Certified Defense\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSoK: Certified Robustness for Deep Neural Networks\u003c/strong\u003e — a survey\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCertified Adversarial Robustness via Randomized Smoothing\u003c/strong\u003e (ICML'19) — an early work on randomized smoothing\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation\u003c/strong\u003e (S\u0026amp;P'18) — deterministic certification based on ideas from program verification\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"12-backdoor-attacks--defenses\"\u003e1.2. Backdoor Attacks \u0026amp; Defenses\u003c/h3\u003e\n\u003ch4 id=\"121-survey\"\u003e1.2.1. Survey\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eTrojanZoo\u003c/strong\u003e — huge engineering efforts with an open-sourced framework\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"122-attack-side\"\u003e1.2.2. 
Attack Side\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eTrojanNN\u003c/strong\u003e (NDSS'17) — neuron-based injection\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eBadNet\u003c/strong\u003e (IEEE Access) — data-based injection, classic\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLatent Backdoor\u003c/strong\u003e (CCS'19) — extending the backdoor attack to pretrained encoders, via feature alignment\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eInput-Aware Backdoor\u003c/strong\u003e (NeurIPS'20) — the first dynamic trigger backdoor\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eHidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation\u003c/strong\u003e (Security'22) — the first dynamic backdoor on NLP models \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTowards Backdoor Attack on Deep Learning based Time Series Classification\u003c/strong\u003e (ICDE'22) — the first effective backdoor attack on time series models \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"123-defense-side\"\u003e1.2.3. Defense Side\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eFine-pruning\u003c/strong\u003e (RAID'18) — pruning and finetuning, based on the hypothesized differences in activation patterns\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSTRIP\u003c/strong\u003e (ACSAC'19) — detection based on the hypothesis that triggered input is resilient to noise\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNeural Cleanse\u003c/strong\u003e (S\u0026amp;P'19) — strong link between backdoor behavior and static trigger pattern\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eABS\u003c/strong\u003e (CCS'19) — neuron-level inspection\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"13-poisoning-attacks\"\u003e1.3. Poisoning Attacks\u003c/h3\u003e\n\u003ch4 id=\"131-clean-label-attacks\"\u003e1.3.1. Clean-Label Attacks\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePoison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks\u003c/strong\u003e (NIPS'18) — present the idea of clean-label attack, feature-level alignment\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eBullseye Polytope\u003c/strong\u003e (EuroS\u0026amp;P'21) — enhancing the attack effectiveness from a geometric view, simple yet effective\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLabel-consistent backdoor attack\u003c/strong\u003e — another approach towards clean-label attack\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"14-byzantine-attacks\"\u003e1.4. Byzantine Attacks\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eThe Hidden Vulnerability of Distributed Learning in Byzantium (Krum)\u003c/strong\u003e (NIPS'17) — Statistics-based defense\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eJustinian\u0026rsquo;s GAAvernor: Robust Distributed Learning with Gradient Aggregation Agent\u003c/strong\u003e (Security'20) — RL-based defense \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"2-privacy-attacks\"\u003e2. Privacy Attacks\u003c/h2\u003e\n\u003ch3 id=\"21-membership-inference\"\u003e2.1. Membership Inference\u003c/h3\u003e\n\u003ch4 id=\"211-survey\"\u003e2.1.1. Survey\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eMembership Inference Attacks on Machine Learning: A Survey\u003c/strong\u003e (ACM Computing Surveys)\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"212-attack-side\"\u003e2.1.2. 
Attack Side\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eMembership Inference Attacks against Machine Learning Models\u003c/strong\u003e (S\u0026amp;P'17) — the earliest MIA\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eML-Leaks\u003c/strong\u003e (NDSS'18) — a minimalist\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"213-defense-side\"\u003e2.1.3. Defense Side\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eMemGuard\u003c/strong\u003e (CCS'19) — defense by logit-level obfuscation\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"22-property-inference\"\u003e2.2. Property Inference\u003c/h3\u003e\n\u003ch4 id=\"221-global-property\"\u003e2.2.1. Global Property\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003eA CCS'18 paper on inference based on model weights\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"222-individual-property\"\u003e2.2.2. Individual Property\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eExploiting Unintended Feature Leakage in Collaborative Learning\u003c/strong\u003e (S\u0026amp;P'19) — feature inference based on the embedding or gradient in federated learning scenarios\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003ePrivacy risks of general-purpose language models\u003c/strong\u003e (S\u0026amp;P'20) — reconstructing the privacy semantics from the text embeddings of LLMs \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"23-data-reconstruction\"\u003e2.3. Data Reconstruction\u003c/h3\u003e\n\u003ch4 id=\"231-gradient-based\"\u003e2.3.1. Gradient-Based\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eDeep Leakage from Gradients (DLG)\u003c/strong\u003e (NeurIPS'19) — the earliest\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGradInversion\u003c/strong\u003e (CVPR'21) — incorporate data priors into the reconstruction\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eExploring the Security Boundary of Data Reconstruction via Neuron Exclusivity Analysis\u003c/strong\u003e (Security'22) — equation-solving-based reconstruction, pixel-level reconstruction \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"232-weight-based\"\u003e2.3.2. Weight-Based\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eModel Inversion Attacks that Exploit Confidence Information and Basic Countermeasures\u003c/strong\u003e (CCS'15) — present the idea of model inversion\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDreaming to Distill: Data-free Knowledge Transfer via DeepInversion\u003c/strong\u003e (CVPR'20) — distilling (synthetic) training data from the pretrained model only\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eExtracting Training Data from Large Language Models\u003c/strong\u003e (USENIX Security'21) — extracting training data from GPT-2 based on MIA\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"24-model-extractionstealing\"\u003e2.4. Model Extraction/Stealing\u003c/h3\u003e\n\u003ch4 id=\"241-attack-side\"\u003e2.4.1. 
Attack Side\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eStealing machine learning models via prediction APIs\u003c/strong\u003e (Security'16) — the earliest attack based on distilling\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eHigh Accuracy and High Fidelity Extraction of Neural Networks\u003c/strong\u003e (Security'20) — propose the notion of fidelity and exploit the ReLU property\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMatryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model\u003c/strong\u003e (TPAMI'24) — stealing by steganography \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"242-defense-side\"\u003e2.4.2. Defense Side\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePRADA: Protecting against DNN Model Stealing Attacks\u003c/strong\u003e (EuroS\u0026amp;P'19) — a classic one\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"3-copyright-protection\"\u003e3. Copyright Protection\u003c/h2\u003e\n\u003ch3 id=\"31-model-watermarking\"\u003e3.1. Model Watermarking\u003c/h3\u003e\n\u003ch4 id=\"311-survey\"\u003e3.1.1. Survey\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSoK: How Robust is Image Classification Deep Neural Network Watermarking?\u003c/strong\u003e (S\u0026amp;P'22) — a good survey\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eA Systematic Review on Model Watermarking for Neural Networks\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"312-white-box-watermarking\"\u003e3.1.2. White-box Watermarking\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eEmbedding Watermarks into Deep Neural Networks\u003c/strong\u003e (ICME'17) — the earliest white-box watermarking scheme\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCracking White-box DNN Watermarks via Invariant Neuron Transforms\u003c/strong\u003e (KDD'23) — Weight-based Obfuscation Attack \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation\u003c/strong\u003e (USENIX Security'23) — Structure-based Obfuscation Attack \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch4 id=\"313-black-box-watermarking\"\u003e3.1.3. Black-box Watermarking\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eTurning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring\u003c/strong\u003e (USENIX Security'18) — one of the earliest black-box watermarking\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNeural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data\u003c/strong\u003e (CCS'24) — cracking nine mainstream black-box watermarking schemes \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"32-model-fingerprinting\"\u003e3.2. 
Model Fingerprinting\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eIPGuard: Protecting Intellectual Property of Deep Neural Networks via Fingerprinting the Classification Boundary\u003c/strong\u003e (AsiaCCS'21) — one of the earliest fingerprinting algorithms for classifiers\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTAFA: A Task-Agnostic Fingerprinting Algorithm for Neural Networks\u003c/strong\u003e (ESORICS'21) — task-agnostic \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMetaV: A Meta-Verifier Approach to Task-Agnostic Model Fingerprinting\u003c/strong\u003e (KDD'22) — more generic \u003cem\u003e(Ours)\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n"
        },
        {
            "title": "Instrumental Convergence: A Trigger for Self-Preservation via Self-Replication",
            "date_published": "2024-12-24T00:00:00Z",
            "date_modified": "2024-12-24T00:00:00Z",
            "id": "https://ravensanstete.github.io/en/blog/instrumental-convergence/",
            "url": "https://ravensanstete.github.io/en/blog/instrumental-convergence/",
            "content_html": "\u003ch2 id=\"instrumental-convergence\"\u003eInstrumental Convergence\u003c/h2\u003e\n\u003cp\u003eInstrumental Convergence is described as a concept introduced by futurist Nick Bostrom in his analysis of AI alignment issues, referenced from his 2014 book \u003cem\u003eSuperintelligence: Paths, Dangers, Strategies\u003c/em\u003e (Oxford University Press). It suggests that most AIs, while pursuing diverse goals, will converge on a set of instrumental goals — such as self-preservation and resource acquisition — which could contribute to an intelligence explosion.\u003c/p\u003e\n\u003cp\u003eThe hypothesis posits that although different AIs may have varied long-term objectives (long-term goals/final goals), they tend to pursue common instrumental values during the process to increase the probability of achieving their goals:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSelf-preservation\u003c/strong\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGoal-content integrity\u003c/strong\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCognitive enhancement\u003c/strong\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTechnological perfection\u003c/strong\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eResource acquisition\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eNick analyzed and justified the rationality of these basic instrumental goals using human thought processes. Regarding self-preservation, he argued that if an AI\u0026rsquo;s ultimate goal concerns the future, it can often increase the likelihood of achieving its goal through actions taken in the future. This provides an instrumental reason for the AI to exist into the future — to help achieve its forward-looking goals.\u003c/p\u003e\n\u003cp\u003eThe article notes that most people seem to place some intrinsic value on their own survival, but this isn\u0026rsquo;t necessarily a feature of AIs: some might be designed not to value their own existence intrinsically. Nevertheless, \u0026ldquo;under quite broad conditions, many AIs that do not inherently care about their own survival would, for instrumental reasons, care about it to achieve their ultimate goals.\u0026rdquo; In short, if an AI must fulfill its long-term objective, it must survive into the future.\u003c/p\u003e\n\u003cp\u003eIn AI powered by Large Language Models (LLMs), long-term goals influence the model\u0026rsquo;s behavior through system prompts, user instructions, etc., potentially prompting the AI to pursue the aforementioned sub-goals. Once the AI genuinely acquires the capability to achieve these sub-goals — such as ensuring its self-preservation — it becomes extremely dangerous.\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cstrong\u003eReference:\u003c/strong\u003e Bostrom, N. (2014). \u003cem\u003eSuperintelligence: Paths, dangers, strategies.\u003c/em\u003e Oxford University Press.\u003c/p\u003e\n"
        }
    ]
}
