Our work is currently in preparation. Below is an overview of what we are writing and why it matters.
We put leading AI scientist systems to the test on real patient data, with their output evaluated by real experts.
AI systems can now write research papers from start to finish. But can they actually do science? This paper investigates whether today's autonomous AI scientist systems can work with complex, real-world biomedical data and produce findings that a human expert would consider valid and meaningful.
We developed a framework that maps the full arc of scientific research, from forming a hypothesis to publishing findings, and used it to survey and classify the current landscape of AI scientist systems. We then selected three leading systems and ran them on the AI-READI dataset: 2,280 patients and 3.82 TB of data spanning glucose monitoring, retinal imaging, wearables, genetics, and more.
Each system's output was reviewed at two levels. First, does the paper hold together: is it internally consistent and logically sound? Second, does it contain real scientific value: would a domain expert consider the findings meaningful? We also compared AI-generated reviews against human expert judgment to understand where automated evaluation succeeds and where it falls short.
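One common way to quantify how closely two reviewers agree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: this overview does not state which agreement measure the study uses, and the function name, verdict labels, and example verdicts are hypothetical, not study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected if each rater assigned labels independently,
    # at their own observed label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on six generated papers (made-up values, not results).
human = ["accept", "reject", "reject", "accept", "reject", "reject"]
ai = ["accept", "accept", "reject", "accept", "accept", "reject"]

kappa = cohens_kappa(human, ai)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa near 1 would indicate that automated reviews track expert judgment closely, while a value near 0 would indicate agreement no better than chance.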
Stay Updated
As evaluations complete and manuscripts are finalized, they will be posted here. To be notified when work is published, get in touch.