About
I am a PhD student in computer science at Georgia Tech, advised by Mark Riedl and co-advised by Sudheer Chava. My work focuses on data provenance, interpretability, and evaluation for language models in finance, wargaming, law, healthcare, and other high-stakes settings.

The throughline across my work is simple: what do language models actually learn about people, and how can we tell when that understanding is real instead of plausible-looking imitation?
Most of my projects ask three linked questions: what in the data teaches models about beliefs, incentives, and roles; how that structure is represented inside the model; and which evaluations are strong enough to distinguish robust behavior from surface fluency.
I prefer domains where misunderstanding people has concrete costs. Finance, wargaming, law, and healthcare make it easier to test whether a model is tracking institutions and incentives rather than just continuing text in the right style.
Today my work sits at the intersection of machine learning research, high-stakes evaluation, and institutional questions about how these systems should be deployed.
- PhD student in computer science, advised by Mark Riedl at the Machine Learning Center and co-advised by Sudheer Chava in finance.
- Participant in the MATS program and Fellowship Facilitator for AISI, with ongoing work on post-AGI safety policy.
- Published at ACL, COLM, EMNLP, AAAI, and NeurIPS, with support from DARPA, NSF, Together AI, Modal Labs, and Thinking Machines Lab.
Before returning to academia, I worked on production NLP, healthcare analytics, credit risk, and recommendation systems. That experience still shapes how I think about evaluation, failure modes, and what counts as useful model behavior.
- Built NLP pipelines over millions of clinical records, where model quality had to survive messy language, operational constraints, and real downstream decisions.
- Worked on rare disease detection and real-world evidence from claims data, with an emphasis on extracting signal from noisy observational settings.
- Developed credit risk models for underserved borrowers, which made the relationship between model outputs, incentives, and human consequences impossible to ignore.
- Built recommendation systems at production scale, where personalization had to be both technically sound and legible enough to operate in practice.
If you are working on model behavior, interpretability, or high-stakes evaluation, I would be glad to compare notes.