Bookmarks (710)

To be legible, evidence of misalignment probably has to be behavioral — LessWrong

lesswrong.com

AISN #51: AI Frontiers — LessWrong

lesswrong.com

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI — LessWrong

lesswrong.com

OpenAI #13: Altman at TED and OpenAI Cutting Corners on Safety Testing — LessWrong

lesswrong.com

3M Subscriber YouTube Account 'Channel 5' Reporting On Rationalism — LessWrong

lesswrong.com

The real reason AI benchmarks haven’t reflected economic impacts — LessWrong

lesswrong.com

ASI existential risk: reconsidering alignment as a goal — LessWrong

lesswrong.com

Published on April 15, 2025 1:36 PM GMT

Map of AI Safety v2 — LessWrong

lesswrong.com

Can SAE steering reveal sandbagging? — LessWrong

lesswrong.com

Risers for Foot Percussion — LessWrong

lesswrong.com

Intro to Multi-Agent Safety — LessWrong

lesswrong.com

Vestigial reasoning in RL — LessWrong

lesswrong.com

Four Types of Disagreement — LessWrong

lesswrong.com

How I switched careers from software engineer to AI policy operations — LessWrong

lesswrong.com

Steelmanning heuristic arguments — LessWrong

lesswrong.com

MONA: Three Month Later - Updates and Steganography Without Optimization Pressure — LessWrong

lesswrong.com

The Era of the Dividual—are we falling apart? — LessWrong

lesswrong.com

Commitment Races are a technical problem ASI can easily solve — LessWrong

lesswrong.com

Experts have it easy — LessWrong

lesswrong.com

Luna Lovegood and the Chamber of Secrets, Part 3 — LessWrong

lesswrong.com

Is the ethics of interaction with primitive peoples already solved? — LessWrong

lesswrong.com

Can LLMs learn Steganographic Reasoning via RL? — LessWrong

lesswrong.com

My day in 2035 — LessWrong

lesswrong.com

Youth Lockout — LessWrong

lesswrong.com

OpenAI Responses API changes models' behavior — LessWrong

lesswrong.com

Weird Random Newcomb Problem — LessWrong

lesswrong.com

On Google’s Safety Plan — LessWrong

lesswrong.com

Luna Lovegood and the Chamber of Secrets, Part 2 — LessWrong

lesswrong.com

Paper — LessWrong

lesswrong.com

Why are neuro-symbolic systems not considered when it comes to AI Safety? — LessWrong

lesswrong.com

Nuanced Models for the Influence of Information — LessWrong

lesswrong.com

Published on April 10, 2025 6:28 PM GMT

Playing in the Creek — LessWrong

lesswrong.com

The Three Boxes: A Simple Model for Spreading Ideas — LessWrong

lesswrong.com

Reactions to METR task length paper are insane — LessWrong

lesswrong.com

Existing Safety Frameworks Imply Unreasonable Confidence — LessWrong

lesswrong.com

Arguments for and against gradual change — LessWrong

lesswrong.com

Disempowerment spirals as a likely mechanism for existential catastrophe — LessWrong

lesswrong.com

AI #111: Giving Us Pause — LessWrong

lesswrong.com

Forging A New AGI Social Contract — LessWrong

lesswrong.com

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions — LessWrong

lesswrong.com

Thinking Machines — LessWrong

lesswrong.com

Digital Error Correction and Lock-In — LessWrong

lesswrong.com

What faithfulness metrics should general claims about CoT faithfulness be based upon? — LessWrong

lesswrong.com

AI 2027: Responses — LessWrong

lesswrong.com

The first AI war will be in your computer — LessWrong

lesswrong.com

Who wants to bet me $25k at 1:7 odds that there won't be an AI market crash in the next year? — LessWrong

lesswrong.com

A Pathway to Fully Autonomous Therapists — LessWrong

lesswrong.com

Misinformation is the default, and information is the government telling you your tap water is safe to drink — LessWrong

lesswrong.com

Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks — LessWrong

lesswrong.com
