DJ Strouse
I am a Member of Technical Staff at OpenAI in San Francisco, where I work on large-scale reinforcement learning for reasoning.
Previously, I was a Research Scientist at DeepMind in New York, where I worked on improving the reasoning capabilities of frontier models with the Blueshift team. I did my PhD in Physics at Princeton University, advised by David Schwab and Bill Bialek, and funded by a Hertz Fellowship and DOE Computational Science Graduate Fellowship. Before that, I did a master's at the University of Cambridge with Mate Lengyel as a Churchill Scholar and studied physics and mathematics at the University of Southern California, where I worked with Bartlett Mel and Paolo Zanardi and had a blog. Throughout my studies, I interned at DeepMind with Matt Botvinick, Stanford University with Kwabena Boahen, the Institute for Quantum Computing with Andrew Childs, and Spotify NYC with their machine learning team. I also enjoy a good puzzle.
Email | Twitter | CV | Scholar
|
|
Research
I'm broadly interested in improving reasoning capabilities in frontier models. Previously, I've worked on a variety of topics in reinforcement learning, including exploration and training agents to play cooperative games with humans. My PhD thesis ("Optimization of MILES") focused on applications of the information bottleneck (IB) across supervised, unsupervised, and reinforcement learning, and definitely not on collecting airline miles. In past lives, I've also worked on quantum information theory and computational neuroscience.
|
Select Publications
|
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team
arxiv, 2024
arxiv |
website |
tweet
We trained a math-specialized version of Gemini 1.5 Pro that was the first model to publicly exceed 90% on Hendrycks MATH, a benchmark of difficult competition-level high school math problems. See Section 7 of the report for more details, or the tweets from Oriol, Jeff, Behnam, and Sundar.
@misc{geminiteam2024gemini1p5,
title = {Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context},
author = {Gemini Team},
year = {2024},
eprint = {2403.05530},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2403.05530},
}
|
|
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Aaditya Singh,
DJ Strouse
arxiv, 2024
arxiv |
github |
tweet
We show that frontier LLMs (GPT-3.5 and GPT-4) are better at doing addition when numbers are tokenized right-to-left (consistent with the direction we do addition), rather than the default left-to-right. We show that the errors models make with left-to-right number tokenization are highly stereotyped, suggesting systematic rather than random issues with the addition algorithm models learn. We also find evidence that the effects of number tokenization are scale-dependent, with larger models (GPT-4) exhibiting weaker effects than presumably smaller models (GPT-4 Turbo).
Our results were subsequently extended to newer models (e.g. Llama 3) by this blog post. Additionally, Claude 3, released after our work and with SOTA math capabilities, also notably uses right-to-left number tokenization.
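As a rough illustration of the two tokenization directions, the sketch below splits a digit string into chunks of up to three digits, starting either from the left or from the right. It is a simplification, not a real tokenizer: the three-digit chunk size and the function names are assumptions made for this example. Formatting a number with comma separators is shown as one simple way to obtain right-aligned groups.

```python
def chunk_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Greedy chunks starting from the most significant digit."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_right_to_left(digits: str, size: int = 3) -> list[str]:
    """Chunks aligned to the least significant digit, like thousands groups."""
    first = len(digits) % size  # length of the short leading chunk, if any
    head = [digits[:first]] if first else []
    return head + [digits[i:i + size] for i in range(first, len(digits), size)]


def with_commas(n: int) -> str:
    """Comma-separated form, whose groups match the right-to-left chunks."""
    return f"{n:,}"


if __name__ == "__main__":
    print(chunk_left_to_right("1234567"))  # chunks misaligned with place value
    print(chunk_right_to_left("1234567"))  # chunks aligned with place value
    print(with_commas(1234567))
```

With right-to-left chunks, the last chunk of every number covers the ones, tens, and hundreds places, so chunks of two addends line up the same way digits do in column addition.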
@misc{singh2024tokenization,
title = {Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs},
author = {Aaditya K. Singh and DJ Strouse},
year = {2024},
eprint = {2402.14903},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2402.14903},
}
|
|
Confronting reward model overoptimization with constrained RLHF
Ted Moskovitz,
Aaditya Singh,
DJ Strouse,
Tuomas Sandholm,
Ruslan Salakhutdinov,
Anca D. Dragan,
Stephen McAleer
International Conference on Learning Representations (ICLR), 2024 (Spotlight)
arxiv |
openreview |
tweet
We use tools from constrained optimization to combat overoptimization during reinforcement learning from human feedback (RLHF) against multiple reward models.
@inproceedings{moskovitz2023crlhf,
title = {Confronting Reward Model Overoptimization with Constrained RLHF},
author = {Ted Moskovitz and Aaditya K. Singh and DJ Strouse and Tuomas Sandholm and Ruslan Salakhutdinov and Anca D. Dragan and Stephen McAleer},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2024},
}
|
|
In-context Reinforcement Learning with Algorithm Distillation
Michael Laskin,
Luyu Wang,
Junhyuk Oh,
Emilio Parisotto,
Stephen Spencer,
Richie Steigerwald,
DJ Strouse,
Steven Hansen,
Angelos Filos,
Ethan Brooks,
Maxime Gazeau,
Himanshu Sahni,
Satinder Singh,
Vlad Mnih
International Conference on Learning Representations (ICLR), 2023 (Oral)
arxiv |
openreview |
tweet
We demonstrate that it is possible to distill entire reinforcement learning (RL) algorithms into the in-context learning abilities of a Transformer, by training models to do supervised prediction of multi-episodic trajectories from RL agents.
@inproceedings{laskin2023ad,
title = {In-context Reinforcement Learning with Algorithm Distillation},
author = {Michael Laskin and Luyu Wang and Junhyuk Oh and Emilio Parisotto and Stephen Spencer and Richie Steigerwald and DJ Strouse and Steven Stenberg Hansen and Angelos Filos and Ethan Brooks and Maxime Gazeau and Himanshu Sahni and Satinder Singh and Volodymyr Mnih},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2023},
}
|
|
Semantic Exploration from Language Abstractions and Pretrained Representations
Allison Tam, Neil Rabinowitz, Andrew Lampinen, Nicholas Roy, Stephanie Chan, DJ Strouse, Jane Wang, Andrea Banino, Felix Hill
Neural Information Processing Systems (NeurIPS), 2022
arxiv |
openreview |
neurips |
tweet
Exploration in RL traditionally encouraged agents to visit random unexplored states. Taking advantage of improvements in multimodal frontier models, we show how to improve exploration by guiding agents towards semantically novel states, greatly speeding up learning.
@inproceedings{tam2022exploration,
title = {Semantic Exploration from Language Abstractions and Pretrained Representations},
author = {Tam, Allison and Rabinowitz, Neil and Lampinen, Andrew and Roy, Nicholas A. and Chan, Stephanie and Strouse, DJ and Wang, Jane and Banino, Andrea and Hill, Felix},
booktitle = {Neural Information Processing Systems (NeurIPS)},
year = {2022},
}
|
|
Learning more skills through optimistic exploration
DJ Strouse*,
Kate Baumli,
David Warde-Farley,
Vlad Mnih,
Steven Hansen*
International Conference on Learning Representations (ICLR), 2022 (Spotlight)
arxiv |
openreview |
github |
tweet
We highlight the inherent pessimism towards exploration in a popular family of variational unsupervised skill learning methods. To curb this pessimism, we propose an ensemble-uncertainty-based exploration bonus that we call discriminator disagreement intrinsic reward, or DISDAIN. We show that DISDAIN improves skill learning in both a gridworld and the Atari57 suite. Thus, we encourage researchers to treat pessimism with DISDAIN.
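A minimal sketch of an ensemble disagreement bonus, assuming each ensemble member outputs a categorical distribution over skills for the current state. The bonus here is the entropy of the averaged prediction minus the average entropy of the individual predictions, a standard disagreement measure. The function names and toy inputs are hypothetical, and the exact form used in the paper may differ in detail.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def disagreement_bonus(ensemble_probs):
    """Entropy of the mean prediction minus the mean of the entropies.

    ensemble_probs: one probability vector over skills per ensemble member.
    The result is zero when all members agree and grows as they disagree.
    """
    n = len(ensemble_probs)
    k = len(ensemble_probs[0])
    mean_probs = [sum(member[j] for member in ensemble_probs) / n for j in range(k)]
    mean_entropy = sum(entropy(member) for member in ensemble_probs) / n
    return entropy(mean_probs) - mean_entropy


if __name__ == "__main__":
    agree = [[0.5, 0.5], [0.5, 0.5]]
    disagree = [[1.0, 0.0], [0.0, 1.0]]
    print(disagreement_bonus(agree))     # members agree: no bonus
    print(disagreement_bonus(disagree))  # members disagree: positive bonus
```

The bonus is high only where ensemble members disagree, which is the kind of state where more data would be informative, rather than where all members are merely uncertain in the same way.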
@inproceedings{strouse2022disdain,
title = {Learning more skills through optimistic exploration},
author = {Strouse, DJ and Baumli, Kate and Warde-Farley, David and Mnih, Vlad and Hansen, Steven},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2022},
}
|
|
Collaborating with Humans without Human Data
DJ Strouse*,
Kevin R. McKee,
Matt Botvinick,
Edward Hughes,
Richard Everett*
Neural Information Processing Systems (NeurIPS), 2021 (Spotlight)
arxiv |
neurips |
openreview |
tweet |
alignment newsletter
We introduce Fictitious Co-Play (FCP), a simple and intuitive training method for producing agents capable of zero-shot coordination with humans in Overcooked. FCP works by training an agent as the best response to a frozen pool of self-play agents and their past checkpoints. Notably, FCP exhibits robust generalization to humans, despite not using any human data during training.
@inproceedings{strouse2021fcp,
title = {Collaborating with Humans without Human Data},
author = {Strouse, DJ and McKee, Kevin R. and Botvinick, Matt and Hughes, Edward and Everett, Richard},
booktitle = {Neural Information Processing Systems (NeurIPS)},
year = {2021},
}
|
|
Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning
Natasha Jaques,
Angeliki Lazaridou,
Edward Hughes,
Caglar Gulcehre,
Pedro A. Ortega,
DJ Strouse,
Joel Z. Leibo,
Nando de Freitas
International Conference on Machine Learning (ICML), 2019 (Best Paper Honorable Mention)
arxiv |
icml |
openreview
We reward agents for influencing the actions of other agents, and show that this gives rise to better cooperation and more meaningful emergent communication protocols.
@inproceedings{jaques2019influence,
title = {Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning},
author = {Jaques, Natasha and Lazaridou, Angeliki and Hughes, Edward and Gulcehre, Caglar and Ortega, Pedro and Strouse, DJ and Leibo, Joel Z. and De Freitas, Nando},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2019},
}
|
|
InfoBot: Transfer and Exploration via the Information Bottleneck
Anirudh Goyal,
Riashat Islam,
DJ Strouse,
Zafarali Ahmed,
Hugo Larochelle,
Matt Botvinick,
Sergey Levine,
Yoshua Bengio
International Conference on Learning Representations (ICLR), 2019
arxiv |
openreview
We train agents in multi-goal environments with an information bottleneck between their goal and policy. This encourages agents to develop useful "habits" that generalize across goals. We identify the states where agents must deviate from their habits to solve a task as "decision states" and show that they are useful targets for an exploration bonus.
@inproceedings{goyal2019infobot,
title={InfoBot: Transfer and Exploration via the Information Bottleneck},
author={Anirudh Goyal and Riashat Islam and DJ Strouse and Zafarali Ahmed and Matthew Botvinick and Hugo Larochelle and Yoshua Bengio and Sergey Levine},
booktitle={International Conference on Learning Representations (ICLR)},
year = {2019},
}
|
|
Learning to share and hide intentions using information regularization
DJ Strouse,
Max Kleiman-Weiner,
Josh Tenenbaum,
Matt Botvinick,
David Schwab
Neural Information Processing Systems (NeurIPS), 2018
arxiv |
nips |
code
We train agents to cooperate / compete by regularizing the reward-relevant information they share with other agents, enabling agents trained alone to nevertheless perform well in a multi-agent setting.
@inproceedings{strouse2018intentions,
title={Learning to share and hide intentions using information regularization},
author = {Strouse, DJ and Kleiman-Weiner, Max and Tenenbaum, Josh and Botvinick, Matt and Schwab, David J},
booktitle = {Neural Information Processing Systems (NeurIPS)},
year = {2018},
}
|
|
The information bottleneck and geometric clustering
DJ Strouse,
David Schwab
Neural Computation (NECO), 2019
pdf |
neco |
arxiv |
code
We show how to use the (deterministic) information bottleneck to perform geometric clustering, introducing a novel information-theoretic model selection criterion. We show how this relates to and generalizes k-means and Gaussian mixture models (GMMs).
@article{strouse2019clustering,
title = {The Information Bottleneck and Geometric Clustering},
author = {Strouse, DJ and Schwab, David J.},
journal = {Neural Computation},
year = {2019},
volume = {31},
number = {3},
pages = {596-612},
}
|
|
The deterministic information bottleneck
DJ Strouse,
David Schwab
Neural Computation (NECO), 2017 & Uncertainty in Artificial Intelligence (UAI), 2016
pdf |
arxiv |
code |
uai |
neco
We introduce the deterministic information bottleneck (DIB), an alternative formulation of the information bottleneck that uses entropy instead of mutual information to measure compression. This results in a hard clustering algorithm with a built-in preference for using fewer clusters.
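For reference, the two objectives can be written side by side. The notation is an assumption made for this sketch: X is the input, Y the relevant variable, T the compressed representation, q(t|x) the encoder, and β the tradeoff parameter.

```latex
% Information bottleneck (IB): compression measured by mutual information
\min_{q(t \mid x)} \; I(X;T) - \beta \, I(T;Y)

% Deterministic information bottleneck (DIB): compression measured by entropy
\min_{q(t \mid x)} \; H(T) - \beta \, I(T;Y)

% The two differ only by the encoder noise term, since
I(X;T) = H(T) - H(T \mid X)
```

Because the DIB objective gives no credit for the noise term H(T|X), nothing is gained from a stochastic encoder, which is consistent with the hard cluster assignments described above.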
@article{strouse2017dib,
title = {The Deterministic Information Bottleneck},
author = {Strouse, DJ and Schwab, David J.},
journal = {Neural Computation},
year = {2017},
volume = {29},
number = {6},
pages = {1611-1630},
}
|
|