Reading log

Reads

papers, blogs, and talks I’ve been reading

41 reads · 27 days · 32 blog · 7 arxiv · 1 x · 1 list

6 May 2026
- blog
  Cartesian Frames · Alignment Forum#maths
- arxiv
  Cartesian Frames · Scott Garrabrant#maths
5 May 2026
- blog
  Without Specific Countermeasures, the Easiest Path to AGI-Level Capabilities is Dangerous Alignment · LessWrong#alignment#ai-safety
- arxiv
  Scheming AIs · Joe Carlsmith#alignment#ai-safety
24 Apr 2026
- x
  DeepSeek V4: Technical Report thread · Elie Bakouch
23 Apr 2026
- blog
  Brief Explorations in LLM Value Rankings · Hua et al.#Value
18 Apr 2026
- blog
  Understanding Muon: A Revolutionary Neural Network Optimizer · Muon#notes
- blog
  Muon: An optimizer for hidden layers in neural networks · Muon#notes
14 Apr 2026
- blog
  Persona vectors: Monitoring and controlling character traits in language models · Anthropic#alignment#interpretability
- blog
  AI Discourse Causes Self-Fulfilling (Mis)alignment · Alignment Pretraining#alignment#pretraining
11 Apr 2026
- blog
  The Bayesian Conspiracy · LessWrong#alignment#ai-safety
- arxiv
  PersonaGym: Evaluating Persona Agents and LLMs · Samuel et al.#model-personas
- arxiv
  The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation · Reza et al.#model-personas
8 Apr 2026
- blog
  AGI safety from first principles · Alignment Forum#alignment#ai-safety
5 Apr 2026
- list
  Sunday Reading List · Arc folder#alignment#interpretability
4 Apr 2026
- blog
  Shaping the Exploration of the Motivation Space Matters for Alignment · LessWrong#alignment#ai-safety
- blog
  Did Claude 3 Opus Align Itself via Gradient Hacking? · LessWrong#alignment#interpretability
1 Apr 2026
- blog
  How I Think About My Research Process · Neel Nanda#research#alignment
27 Mar 2026
- blog
  Talent Needs of Technical AI Safety Teams · LessWrong#ai-safety#careers
23 Mar 2026
- blog
  Model Persona Research Agenda · CLR#alignment
22 Mar 2026
- blog
  Modifying Beliefs via SDF · Anthropic#alignment
19 Mar 2026
- blog
  Model Persona Research Agenda · CLR#ai-safety
10 Mar 2026
- blog
  AI Risk · CAIS#ai-safety
7 Mar 2026
- blog
  Investigating models for misalignment · AISI#eval-awareness#ai-safety
- blog
  Neurons learn to predict their own firing rates · McGovern Institute#neuro
5 Mar 2026
- blog
  seeing the castle from the cave · sophielwang#philosophy
4 Mar 2026
- blog
  The Replication Engine · IFP#ai-infrastructure
1 Mar 2026
- blog
  AI Control: Improving Safety Despite Intentional Subversion · Redwood#ai-safety#control
- blog
  The Case for Ensuring That Powerful AIs Are Controlled · Redwood#ai-safety#control
- blog
  Reward Tampering · Anthropic#ai-safety#rl
28 Feb 2026
- blog
  Six Thoughts on AI Safety · Boaz Barak#ai-safety
27 Feb 2026
- blog
  Persona Selection Model (PSM) · Anthropic#alignment
23 Feb 2026
- arxiv
  Foundational Challenges in Assuring Alignment and Safety of Large Language Models · Anwar et al. (reading)#alignment#ai-safety
22 Feb 2026
- blog
  Subliminal Learning · Anthropic#alignment#interpretability
- blog
  The Behavioral Selection Model for Predicting AI Motivations · Alex Mallen#alignment#philosophy
21 Feb 2026
- blog
  Situational Awareness · Kelsey Piper#ai-safety#governance
- blog
  Debate Update: Obfuscated Arguments Problem · Beth Barnes#alignment#control
- arxiv
  Constitutional AI: Harmlessness from AI Feedback · Bai et al.#alignment#rl
20 Feb 2026
- blog
  After Orthogonality: Virtue-Ethical Agency and AI Alignment · Peli Grietzer#alignment#philosophy
19 Feb 2026
- blog
  The Bitter Lesson · Rich Sutton#rl
- arxiv
  Representation Engineering: A Top-Down Approach to AI Transparency · Zou et al.#interpretability