www.lesswrong.com | Bookmarks (664)
-
Improving Model-Written Evals for AI Safety Benchmarking — LessWrong
Published on October 15, 2024 6:25 PM GMT. This post was written as part of the summer...
-
Anthropic's updated Responsible Scaling Policy — LessWrong
Published on October 15, 2024 4:46 PM GMT. Today we are publishing a significant update to our...
-
When is reward ever the optimization target? — LessWrong
Published on October 15, 2024 3:09 PM GMT. Alright, I have a question stemming from TurnTrout's post...
-
An Opinionated Evals Reading List — LessWrong
Published on October 15, 2024 2:38 PM GMT. While you can make a lot of progress in...
-
Anthropic's first RSP update — LessWrong
Published on October 15, 2024 2:25 PM GMT. I am actively editing this post. Consider reading it...
-
[Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder — LessWrong
Published on October 15, 2024 1:31 PM GMT. 5.1 Post summary / Table of contents. This is the...
-
Economics Roundup #4 — LessWrong
Published on October 15, 2024 1:20 PM GMT. Previous Economics Roundups: #1, #2, #3. Fun With Campaign...
-
Is School of Thought related to the Rationality Community? — LessWrong
Published on October 15, 2024 12:41 PM GMT. If so, who are they? Link: https://yourbias.is/ At a...
-
Inverse Problems In Everyday Life — LessWrong
Published on October 15, 2024 11:42 AM GMT. There’s a class of problems broadly known as inverse problems...
-
Thinking LLMs: General Instruction Following with Thought Generation — LessWrong
Published on October 15, 2024 9:21 AM GMT. Authors: Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao,...
-
The AGI Entente Delusion — LessWrong
Published on October 13, 2024 5:00 PM GMT. As humanity gets closer to Artificial General Intelligence (AGI),...
-
Parental Writing Selection Bias — LessWrong
Published on October 13, 2024 2:00 PM GMT. In general I'd like to see a lot...
-
Personal Philosophy — LessWrong
Published on October 13, 2024 3:01 AM GMT. This is a rough outline of my philosophical framework...
-
AI Compute governance: Verifying AI chip location — LessWrong
Published on October 12, 2024 5:36 PM GMT. TL;DR: In this post I discuss a recently proposed...
-
Contagious Beliefs—Simulating Political Alignment — LessWrong
Published on October 13, 2024 12:27 AM GMT. Humans are social animals, and as such we are...
-
How Should We Use Limited Time to Maximize Long-Term Impact? — LessWrong
Published on October 12, 2024 8:02 PM GMT. I've been reflecting on how researchers—particularly those with limited...
-
Binary encoding as a simple explicit construction for superposition — LessWrong
Published on October 12, 2024 9:18 PM GMT. Superposition is the possibility of storing more than n...
-
A Percentage Model of a Person — LessWrong
Published on October 12, 2024 5:55 PM GMT. The standard psychological questionnaire for depression doctors have given...
-
Geoffrey Hinton on the Past, Present, and Future of AI — LessWrong
Published on October 12, 2024 4:41 PM GMT. Introduction: Geoffrey Hinton is a famous AI researcher who is...
-
AI research assistants competition 2024Q3: Tie between Elicit and You.com — LessWrong
Published on October 12, 2024 3:10 PM GMT. Summary: I make a large part of my...
-
Rationality Quotes - Fall 2024 — LessWrong
Published on October 10, 2024 6:37 PM GMT. Once upon a time, there were posts where people...
-
why won't this alignment plan work? — LessWrong
Published on October 10, 2024 3:44 PM GMT. the idea: we give the AI a massive list of...
-
AI #85: AI Wins the Nobel Prize — LessWrong
Published on October 10, 2024 1:40 PM GMT. Both Geoffrey Hinton and Demis Hassabis were given the...
-
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming — LessWrong
Published on October 10, 2024 1:36 PM GMT. One strategy for mitigating risk from schemers (that is,...