Why AI Hasn't Cured Cancer
For AI to reach its potential in biology we need massive, longitudinal, patient-derived datasets
The AI bubble is expanding to include biology. Startups developing models for drug discovery and development are raising unprecedented amounts of capital at ever-increasing valuations. Industry leaders like Sam Altman and Dario Amodei frequently emphasize the transformative potential of AI-driven biological models to tackle some of humanity’s most daunting medical challenges: curing cancer, reversing aging, and more. While some of these claims are undoubtedly made to attract more capital for their war chests, they nonetheless contain a lot of truth.
The opportunities for AI in biology are immense, yet historically the results of biological models haven't lived up to the excitement around their potential. When I was exploring starting Valinor, I wanted to dig into why that was the case – why hasn’t AI come close to curing cancer?
Problem 1: Generalizability
The first constraint that biological models have compared to standard foundation models is the degree to which a biological model can be applied to out-of-distribution drugs, diseases, and patients. For example, a standard image classification model that can distinguish between dogs and cats will likely adapt well to related tasks like distinguishing between horses and zebras. This is because the visual domain is coherent: the features a model learns from one set of images (edges, textures, shapes) transfer to almost any other set of images.
Biology doesn’t work like this. The moment the biological context shifts, the latent space crumbles. A model trained on thousands of cancer cell lines will struggle to outperform a simple linear regression model on non-oncology indications if the regression model has even a small amount of relevant training data. Until we find a way to generalize ML models across indications, we will need to build a custom foundation model for each therapeutic area (oncology, neurology, cardiology, etc.), which will then need to be further fine-tuned for specific disease indications and drug modalities.
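To make this failure mode concrete, here is a fully synthetic toy sketch (invented numbers, no real biology): a model fit on one context and applied zero-shot to another, versus a simple ridge baseline given a small amount of in-domain data.

```python
# Fully synthetic sketch of the generalization gap described above: a model
# fit on one "biological context" transfers poorly to a new one, while a
# simple ridge baseline with a little in-domain data does fine.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_features = 50

def make_data(weights, n, noise=0.1):
    X = rng.normal(size=(n, n_features))
    y = X @ weights + rng.normal(scale=noise, size=n)
    return X, y

w_source = rng.normal(size=n_features)   # the "oncology" context the big model saw
w_target = rng.normal(size=n_features)   # an unrelated "non-oncology" context

# Large pretraining set from the source context.
X_src, y_src = make_data(w_source, 10_000)
pretrained = Ridge(alpha=1.0).fit(X_src, y_src)

# A small in-domain training set and a held-out test set from the target context.
X_small, y_small = make_data(w_target, 100)
X_test, y_test = make_data(w_target, 1_000)
baseline = Ridge(alpha=1.0).fit(X_small, y_small)

# The out-of-context model transfers poorly; the tiny in-domain model does fine.
print("pretrained model, zero-shot R^2:",
      round(r2_score(y_test, pretrained.predict(X_test)), 3))
print("small in-domain ridge R^2:     ",
      round(r2_score(y_test, baseline.predict(X_test)), 3))
```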
Problem 2: Data Volume
The sheer amount of data available to train biological models is dwarfed by the text and image corpora used to train standard foundation models. GPT-4 is rumored to have been trained on roughly 13 trillion tokens from the open internet; GPT-3’s training corpus was roughly 500 billion tokens. By contrast, the largest single-cell perturbation dataset currently available, Tahoe-100M from Tahoe Therapeutics, contains only around 200-300 billion datapoints.
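Using only the rough figures above, the gap is easy to quantify:

```python
# Back-of-the-envelope comparison using only the figures cited above,
# all of which are rough or rumored.
gpt4_tokens = 13e12        # ~13 trillion tokens (rumored)
tahoe_datapoints = 250e9   # midpoint of the ~200-300 billion estimate

print(f"~{gpt4_tokens / tahoe_datapoints:.0f}x more training data for GPT-4 "
      "than the largest single-cell perturbation dataset")
```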
If pharma wants to see the same type of rapid improvements in biological models as we’re seeing in general foundation models, we need exponentially more data. Fortunately, the industry recognizes and is actively addressing this bottleneck. Companies like Parse Biosciences in transcriptomics, Olink in proteomics, and Renew in cfDNA methylation are developing cutting-edge technologies to dramatically enhance the amount of data captured from biological samples. For instance, Parse's GigaLab technology can analyze over 10 million cells in a single sequencing run—representing a tenfold improvement over previous state-of-the-art methods.
Problem 3: Data Quality
The amount of data isn’t the only data-related bottleneck. Where the data comes from is equally critical. The vast majority of biological data used today relies on immortalized cell lines like HeLa or K562: cells that adapt quickly to petri dishes and are easy to culture. These immortalized cell lines, while easier to work with, lack natural cell-to-cell variability; you are essentially cloning the same cell over and over to generate a dataset. Their responses to different modulations are cheap to automate, but they are biological caricatures of the tissues and systems they are meant to represent. Primary cells, biopsies, and organoids capture in vivo (literally “in the living”) biology far better than in vitro (“in glass”, i.e. cell-line derived) methods, but they are more expensive and harder to standardize.
There has been a lot of hype around “virtual cell” models – ML models trained on large amounts of cell-line derived, usually transcriptomic, data in an effort to computationally model the inner workings of a cell. Cell lines are a poor medium for genuinely virtual models because cell-line experiments are fundamentally not translatable to clinical efficacy. Roughly 90% of clinical trials fail, yet every program that reached the clinic first had to demonstrate promising results in cell lines before being approved for human testing. So why would in silico virtual cell models be any different? Even if you reach 100% accuracy on cell-line experiments, you are still only approaching 10% clinical translatability. Patient-derived data is needed.
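A one-line calculation makes the ceiling explicit (the ~90% failure rate is the figure cited above; treating the virtual cell as a perfect replica is the generous assumption):

```python
# Making the translatability ceiling explicit. The ~90% failure rate is the
# figure cited above; everything else is a hypothetical best case.
p_trial_success_given_cell_line_success = 0.10   # ~90% of trials fail
virtual_cell_accuracy = 1.00                     # a *perfect* in-silico replica

# A virtual cell can at best reproduce what the wet-lab cell-line experiment
# would have shown, so its clinical hit rate inherits the same ~10% ceiling.
ceiling = virtual_cell_accuracy * p_trial_success_given_cell_line_success
print(f"clinical translatability ceiling: {ceiling:.0%}")   # -> 10%
```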
Problem 4: Longitudinal Data
Yet another disadvantage of cell-line derived datasets is that most of them capture only static snapshots: an artificial single-cell “before” and “after” in response to a modulation (chemical introduction, CRISPR knockout, etc.).
Unfortunately, this isn’t how biological systems work. Physiological processes are far more gradual: neurodegeneration creeps in over decades, immune responses shift over the course of hours, and tumors evolve over weeks in response to therapeutic pressure. Despite the clear impact of time on biological datasets, longitudinal datasets, where samples are collected from the same patients over time, are exceedingly rare. When they do exist, privacy regulations, inconsistent data collection standards, and other covariates make harmonization a project in itself. Without high-fidelity longitudinal datasets it will be impossible to train models that accurately capture how diseases and therapeutic responses unfold in individual patients over time.
Problem 5: Model Architecture
If the amount and quality of data set hard limits on biological models, architecture introduces soft limits. Transformer blocks, diffusion samplers, and residual connections were originally designed for language processing and image generation – not biology. It is therefore likely that novel architectures will be needed to accommodate biology’s contextual nature: proteins fold in three dimensions; organs communicate via endocrine highways on timescales ranging from milliseconds to months; cellular responses vary drastically with drug modality, dosage, and exposure time. The inductive biases of standard foundation model architectures ignore these domain-specific constraints. Simply borrowing architectures from LLMs and vision models forgoes the opportunity to build in biological and physico-chemical principles like thermodynamics, quantum and stereochemistry, biomechanics, mass balance, and cell-state lineage, which could provide the scaffolding for models that truly generalize.
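As a deliberately crude illustration – not a proposed architecture – here is one way a physical prior such as mass balance could be injected into an otherwise standard training loss; the model, data, and penalty are all toys:

```python
# Toy sketch: augmenting a standard regression loss with a conservation-style
# penalty as one crude way to inject a physical prior (e.g. mass balance).
import torch
import torch.nn as nn

# Toy model mapping a pre-perturbation state to a predicted post-perturbation state.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

def loss_fn(pred_state, true_state, conservation_weight=0.1):
    # Standard data-fit term.
    mse = nn.functional.mse_loss(pred_state, true_state)
    # Crude stand-in for a mass-balance prior: penalize predictions whose
    # total abundance drifts from the observed total.
    mass_gap = (pred_state.sum(dim=-1) - true_state.sum(dim=-1)).abs().mean()
    return mse + conservation_weight * mass_gap

x = torch.randn(32, 64)            # pretend pre-perturbation abundances
y = x + 0.1 * torch.randn(32, 64)  # pretend post-perturbation abundances
loss = loss_fn(model(x), y)
loss.backward()
print(float(loss))
```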
Looking Forward
While biological modeling has historically faced real constraints, I’m confident that the future of ML in biology will be extremely exciting. That said, for us to achieve the same rapid scaling and performance gains we’ve seen with general‑purpose foundation models, we should focus on the following priorities:
Longitudinal Patient-Derived Multi-Omics Datasets
This should be table stakes. We aren’t going to revolutionize drug development by automating preclinical tests that aren’t translatable to clinical efficacy in actual patients. We need to be collecting samples from patients across indications and drug modalities at varying time points. Ideally these samples will be collected as frequently as standard clinical trial cycles, such as every 14 or 21 days for oncology.
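For illustration, a longitudinal, patient-derived dataset might be organized something like this (the column names, assays, and values are hypothetical, not a real schema):

```python
# Hypothetical layout for a longitudinal, patient-derived dataset; the
# column names, assays, and values are illustrative only.
import pandas as pd

samples = pd.DataFrame({
    "patient_id":    ["P001", "P001", "P001", "P002", "P002"],
    "timepoint_day": [0, 14, 28, 0, 21],   # aligned to 14- or 21-day treatment cycles
    "assay":         ["scRNA-seq", "scRNA-seq", "cfDNA-methylation",
                      "scRNA-seq", "proteomics"],
    "treatment":     ["drug_A", "drug_A", "drug_A", "drug_B", "drug_B"],
    "response":      [None, "partial", "partial", None, "progression"],
})

# Models need the full trajectory per patient, not isolated snapshots.
ordered = samples.sort_values(["patient_id", "timepoint_day"])
for patient, visits in ordered.groupby("patient_id"):
    print(patient, list(zip(visits["timepoint_day"], visits["assay"])))
```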
Biology-Native Architectures
We need biology‑native ML architectures that capture the dynamic and interconnected nature of biological data. Interesting ideas that follow this priority include spatiotemporal graph neural networks for evolving cellular interactions and multimodal transformer models that can reason across diverse patient‑derived data.
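As a sketch of what the second idea could look like in code – purely illustrative, with invented dimensions and modalities, and without the masking a real model would need for variable-length trajectories:

```python
# Illustrative sketch (not a proposed design): fuse two assay modalities per
# visit, then run a small transformer over a patient's time-ordered visits so
# the model can reason across the whole trajectory.
import torch
import torch.nn as nn

class LongitudinalMultimodalEncoder(nn.Module):
    def __init__(self, rna_dim=2000, protein_dim=300, d_model=128):
        super().__init__()
        self.rna_proj = nn.Linear(rna_dim, d_model)
        self.protein_proj = nn.Linear(protein_dim, d_model)
        self.time_embed = nn.Linear(1, d_model)   # encode days since baseline
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)         # e.g. a predicted response score

    def forward(self, rna, protein, days):
        # rna: (batch, visits, rna_dim); protein: (batch, visits, protein_dim);
        # days: (batch, visits, 1)
        tokens = self.rna_proj(rna) + self.protein_proj(protein) + self.time_embed(days)
        hidden = self.temporal(tokens)            # attend across visits
        return self.head(hidden[:, -1])           # predict from the latest visit

model = LongitudinalMultimodalEncoder()
out = model(torch.randn(8, 4, 2000), torch.randn(8, 4, 300),
            torch.rand(8, 4, 1) * 60)
print(out.shape)  # torch.Size([8, 1])
```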
Industry-Standard Benchmarks
Industry-wide standard benchmarks and evaluation protocols must be established. Currently, the biological modeling community operates with fragmented standards: each new model ships with bespoke evaluation metrics, which limits meaningful comparison across efforts. Industry-standard benchmarks analogous to the ARC-AGI test for general-purpose AI or MoleculeNet for chemical properties should be broadly adopted. Specific benchmarks tailored to clinical endpoints, such as predicting patient outcomes from longitudinal clinical assays or accurately forecasting disease progression, would substantially accelerate collaborative innovation.
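To make this concrete, here is a sketch of what a shared evaluation harness could look like; the Spearman metric and patient-grouped splits are illustrative choices, not an existing standard:

```python
# Sketch of a shared evaluation protocol: every model is scored the same way,
# on splits held out by patient so that memorizing an individual cannot
# inflate the score. Function names, metric, and data are all hypothetical.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold

def benchmark(fit_predict, X, y, patient_ids, n_splits=5):
    """fit_predict(X_train, y_train, X_test) -> predictions; returns mean/std Spearman rho."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=patient_ids):
        preds = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        rho, _ = spearmanr(preds, y[test_idx])
        scores.append(rho)
    return float(np.mean(scores)), float(np.std(scores))

# Toy demonstration with synthetic "patients" and a ridge baseline.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, 0] + rng.normal(scale=0.5, size=500)
patients = rng.integers(0, 50, size=500)   # 50 synthetic patients

def ridge_fn(X_tr, y_tr, X_te):
    return Ridge().fit(X_tr, y_tr).predict(X_te)

print("ridge baseline (mean, std Spearman):", benchmark(ridge_fn, X, y, patients))
```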
Getting the evals right for biological ML, particularly with regard to the clinical translatability of predictive models, is something we’ve thought a lot about at Valinor, and it is what led us to partner with the Theis Lab to create industry-standard benchmarks for perturbation models.
Conclusion
Biological modeling stands at an exciting yet pivotal crossroads. The scale and complexity of biology mean there are significant hurdles still to overcome. However, by addressing these challenges head-on through expanded patient-derived datasets, novel architectures explicitly designed for biology, and robust industry-wide benchmarks, we can unlock the true potential of AI in medicine and finally develop the cures humanity is waiting for.
Note: I am the founder of Valinor, a machine learning startup seeking to solve all of the bottlenecks I mentioned above by training clinical automation models on longitudinal, patient-derived datasets.
Thank you to Zhanel Nugmanova and Joseph Pacini for their feedback on this article.

