Maarten Bosma

AI researcher focused on large language models.

About

I focus on data, evaluation, and post-training for large language models. I enjoy starting new research directions and building systems to support them.

My career in LLM research began with a 20% project at Google Brain, where I helped train the team's first large-scale autoregressive language model. I have since been a Research Engineer at Google Brain, a founding team member at Inflection, and a Principal Member of Technical Staff at Microsoft AI.

Since October 2024 I have been on sabbatical. I have been traveling, including a stay at a forest monastery in Thailand, where I was temporarily ordained as a monk. I have also been working on side projects, practicing meditation, and learning Thai.

Selected Work

  • Instruction tuning

    I started the FLAN project at Google, which introduced instruction tuning, and co-led the first academic publication on the concept. I trained the initial model and organized the research effort to show that supervised fine-tuning on instruction-formatted datasets improves performance on unseen tasks. Instruction tuning has since become a standard first step in modern LLM post-training.

  • Reasoning

    Co-author of the paper that introduced chain-of-thought prompting. We showed that prompting models to work through problems in explicit intermediate steps improves accuracy on difficult tasks. This work helped establish methods used in reasoning-focused models and in scaling inference-time compute.

  • Code generation

    I created the Mostly Basic Python Problems (MBPP) dataset, released in August 2021 contemporaneously with HumanEval. It became an early benchmark that researchers still widely use to evaluate Python program synthesis.

  • Foundational model building

    While at Google Brain, I was a core contributor to PaLM and worked across LaMDA and GLaM, building pretraining data pipelines, running data-quality experiments, and setting up evaluation systems. At Inflection, I designed and ran pretraining, post-training, and safety experiments, and built pretraining and evaluation pipelines that took us from zero to shipping Inflection-2.

See Google Scholar for a complete list of publications.

Industry Experience

  • Microsoft AI

    Mar 2024 – Oct 2024

    Principal Member of Technical Staff. Worked at Microsoft and on-site at OpenAI to launch GPT-4o realtime voice and integrate it into Microsoft Copilot.

  • Inflection AI

    Apr 2022 – Mar 2024

    Founding Member of Technical Staff. Designed and ran post-training and safety experiments; built data and evaluation pipelines for pretraining and post-training; helped ship Inflection-1, Inflection-2, and Pi.

  • Google

    Aug 2016 – Apr 2022

    Research Engineer at Google Brain. Created pretraining data pipelines, ran data-quality ablations, and designed evaluation systems for LaMDA, GLaM, and PaLM. Core contributor to PaLM.

Before working on LLM research, I worked as an ML Engineer and Data Scientist at Google Ads and Whisper.