Andrea Loehr

Andrea Loehr

Astrophysics, Oncology, AI Safety

AI Safety Projects

Exploring Gaps in Model Safety Evaluation: Findings from Red-Teaming the SALAD-Bench Benchmark for Large Language Models

Here, we explore limitations in current large language model (LLM) safety evaluation frameworks and examine how prompt style can affect the safety classification of LLM outputs. We use the SALAD-Bench and its MD-Judge evaluator to classify ChatGPT3.5-turbo responses to over 21,000 harmful prompts across 6 major harm categories into safe or unsafe responses using one simple directive and one Chain-of-Thought prompt. The simple directive and CoT prompts resulted in 7% and 16% unsafe responses, respectively. Analyzing individual responses, we identified large-scale patterns of false safe classification. This misclassification gives a false sense of security and can potentially further unsafe LLM behavior when future models are trained to meet benchmarking goals.

View on GitHub

My Journey in AI Safety

Course: AGI Strategy by BlueDot Impact (Oct. 2025)

Course: AI Safety, Ethics and Society by the Center for AI Safety (May 2025)

Certificate: ChatGPT Prompt Engineering and Advanced Data Analysis (Coursera, 2023)

About Me

I am a physicist by training, with two decades of experience spanning research, leadership, data analysis, and software engineering in two fields: astrophysics and oncology. As I have been learning more about AI safety, I have developed a passionate sense of urgency to contribute to the safe and responsible development of AI. I bring a wide range of transferrable skills and a proven track record of peer-reviewed publications, patents, and FDA approvals.


I am committed to donating 20+ pro bono hours of my focused, productive work time to an AI project with an organization that moves the needle in the field.


I am open to new opportunities and collaborations. Feel free to reach out!