
Bioptimus launches H-Optimus-1: a state-of-the-art foundation model for pathology

We have recently released H-Optimus-1, a new foundation model (FM) for pathology that reaches state-of-the-art performance on a large variety of downstream tasks, including the HEST benchmark1.

H-Optimus-1 is a 1.1 billion parameter vision transformer trained with self-supervised learning on an extensive proprietary dataset consisting of billions of histology images, sampled from over 1 million slides from more than 800,000 patients.

The model can be accessed for academic research purposes here.

H-Optimus-1 pre-training dataset

A crucial component in developing a strong FM is the quality and diversity of the dataset used for training the model. 

H-Optimus-1 was trained on an extensive collection of over 1 million H&E-stained histology slides of more than 50 organs digitized with 3 scanner types across more than 4,000 clinical centers.

Importantly, the dataset used to train H-Optimus-1 is, to the best of our knowledge, the most patient-diverse dataset ever used to train a pathology FM, including histology slides of more than 800,000 patients2 with various diseases. This patient diversity enables the model to learn from various histology patterns and diseases during training, ultimately resulting in rich and generalizable features that are useful for solving complex tasks.

Model evaluation

Results

H-Optimus-1 was benchmarked on 13 downstream tasks encompassing 15 datasets at both the slide level and tile level, including the HEST benchmark [Jaume et al. 2025].

HEST

This task consists of predicting gene expression from histology images in nine different organs. More details about this benchmark can be found here.

The metric used is Pearson’s correlation coefficient (higher is better). The models are ordered by decreasing average performance. Standard deviations are reported in parentheses. Bold indicates the highest score in a column.
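As an illustration, the average per-gene Pearson correlation used by HEST can be sketched as follows (the function names and the toy data below are ours, not from the HEST codebase):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = x - x.mean()
    y = y - y.mean()
    return float((x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum()))

def hest_score(pred, target):
    """Average per-gene Pearson r over (n_spots, n_genes) matrices."""
    return float(np.mean([pearson_r(pred[:, g], target[:, g])
                          for g in range(target.shape[1])]))

# Toy check: predictions linearly related to targets give r = 1 per gene.
rng = np.random.default_rng(0)
target = rng.normal(size=(100, 5))
pred = 2.0 * target + 1.0  # perfect linear relationship
print(round(hest_score(pred, target), 4))  # -> 1.0
```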

Slide-level tasks

We have benchmarked H-Optimus-1 and other leading pathology FMs on a diverse set of slide-level downstream tasks using multiple instance learning:

  • META-BC: Identification of metastasis in breast cancer lymph nodes.
  • MSI-GC: Prediction of the microsatellite instability (MSI) status in gastric cancer.
  • MSI-CRC: Prediction of the MSI status in colorectal cancer.
  • KRAS-CRC: Prediction of the KRAS mutation status in colorectal cancer.
  • BRAF-CRC: Prediction of the BRAF mutation status in colorectal cancer.
  • HER2-BC: Prediction of the HER2 status in breast cancer.
  • ER-BC: Prediction of the ER status in breast cancer.
  • PR-BC: Prediction of the PR status in breast cancer.

The metric used is the area under the ROC curve (higher is better). More details about the evaluation methodology can be found in the ‘Slide-level tasks evaluation methodology’ section below. The models are ordered by decreasing average performance and standard deviations are reported in parentheses. Bold indicates the highest score in a row.
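For reference, the ROC AUC can be computed directly from the rank statistic of the positive-class scores (equivalent to the Mann-Whitney U statistic). The sketch below is our own illustration, not the evaluation code used for these benchmarks:

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic (ties get average ranks)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average the ranks of tied scores.
    for s in np.unique(scores):
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))

print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # -> 0.75
```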

Tile-level tasks

We have also benchmarked the different pathology FMs on tile-level tasks using linear probing. These tasks are:

  • MHIST: classification of colorectal polyps as hyperplastic polyp or sessile serrated adenoma.
  • TCGA-UNIFORM: pan-cancer tumor tissue classification task (32 cancer types).
  • CAM17-WILDS: identification of tumor on histology patches of lymph nodes of patients diagnosed with breast cancer.
  • CRC-NO-NORM: classification of colorectal cancer histology images as one of nine tissue types.

The metric used is the top-1 accuracy (higher is better). More details about the evaluation methodology can be found in the ‘Tile-level tasks evaluation methodology’ section below. The models are ordered by decreasing average performance and standard deviations are reported in parentheses. Bold indicates the highest score in a column.

Additional information
Models benchmarked

We list in the table below the characteristics of the models benchmarked. For each model, the [CLS] token embedding was used for the downstream evaluations.
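To make "using the [CLS] token embedding" concrete: a vision transformer maps each tile to a sequence of tokens in which, by convention, the first token is the [CLS] summary token. A minimal sketch with hypothetical shapes (the token dimension of 1,536 below is illustrative, not a stated property of every benchmarked model):

```python
import numpy as np

# Hypothetical encoder output: a batch of tiles mapped to sequences of
# (1 + num_patches) tokens of dimension `dim`.
batch, num_patches, dim = 8, 256, 1536
tokens = np.random.default_rng(0).normal(size=(batch, 1 + num_patches, dim))

# The [CLS] token is the first token in the sequence; its embedding is
# the per-tile feature vector fed to the downstream classifiers.
cls_embeddings = tokens[:, 0, :]
print(cls_embeddings.shape)  # -> (8, 1536)
```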

Slide-level evaluation tasks 

We list in the table below the different tasks defined for the slide evaluation benchmark, and the datasets used to define these tasks.

FR-CRC-Bio is an internal dataset consisting of 727 CRC biopsies from multiple French hospitals. TCGA datasets were retrieved from https://portal.gdc.cancer.gov/

Tile-level evaluation tasks

We list in the table below the different tasks used for the tile-level evaluation benchmark and their corresponding datasets. For MHIST, CAM17-WILDS and CRC-NO-NORM/CRC-VAL-HE-7K, we used the official train/test splits. For TCGA-UNIFORM, we designed a train/test split stratified according to the label categories, as no official split is available.

HEST evaluation methodology

We used the exact same procedure as [Jaume et al. 2025]; we refer to their paper for the training details.

Slide-level tasks evaluation methodology

For each task, we train 10 ABMIL models [Ilse et al. 2018] by minimizing the binary cross-entropy loss with Adam [Kingma et al. 2014], using a batch size of 32 and a constant learning rate of 0.0001.
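A minimal sketch of the attention-based pooling at the core of ABMIL (non-gated variant; shapes and parameter values are illustrative, and the classifier head, BCE loss, and Adam optimizer described above are omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def abmil_pool(H, V, w):
    """Attention-based MIL pooling (Ilse et al. 2018, non-gated variant).

    H: (n_tiles, dim) tile embeddings for one slide.
    V: (hidden, dim) and w: (hidden,) are learned attention parameters.
    Returns the attention weights and the pooled slide embedding.
    """
    scores = w @ np.tanh(V @ H.T)  # (n_tiles,) unnormalized attention
    a = softmax(scores)            # weights sum to 1 over the tiles
    z = a @ H                      # (dim,) attention-weighted slide embedding
    return a, z

rng = np.random.default_rng(0)
H = rng.normal(size=(3000, 1536))        # e.g. a 3,000-tile training subset
V = rng.normal(size=(128, 1536)) * 0.01  # small init to keep tanh unsaturated
w = rng.normal(size=128)
a, z = abmil_pool(H, V, w)
print(round(a.sum(), 6), z.shape)  # -> 1.0 (1536,)
```

The slide embedding z would then be passed to a small classifier head trained end-to-end with the attention parameters.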

We select the number of training steps by 5-fold cross validation while minimizing the binary cross-entropy. The maximum number of training steps is 1000 for all models except for CONCH where it is set to 4000 steps to ensure convergence.

For the sake of robustness, the above procedure is repeated 5 times with different PyTorch seeds. The values reported in the table are the average metrics of the 5*10=50 ABMIL models. Standard deviations are computed across the 5 per-seed average metrics.
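This aggregation scheme can be made concrete with synthetic numbers (the AUC values below are random placeholders, not real results):

```python
import numpy as np

# Hypothetical AUCs for 5 seeds x 10 ABMIL models (shape: seeds, models).
rng = np.random.default_rng(0)
aucs = 0.9 + 0.02 * rng.normal(size=(5, 10))

reported_mean = aucs.mean()        # average over all 50 models
per_seed_mean = aucs.mean(axis=1)  # one average per seed
reported_std = per_seed_mean.std() # spread across the 5 seed-averages
print(reported_mean.round(3), reported_std.round(4))
```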

For the sake of speed, a random subset of 3,000 tiles per slide is selected during training. For inference, a subset of 8,000 tiles is randomly selected.

Tile-level tasks evaluation methodology

For each task, we learn a linear classifier by minimizing the cross-entropy loss with SGD, with a constant learning rate and a batch size of 256. We select the following hyperparameters by 5-fold cross validation while minimizing the cross-entropy:

  • Learning rate in {1e-2, 5e-3, 2e-3, 1e-3, 1e-4}
  • Number of training steps in [100, 200, …, 12500] 
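The selection procedure above can be sketched on toy data as follows (a binary logistic probe standing in for the multi-class case; all names, data and the reduced grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, lr, steps, batch=256):
    """Linear probe trained with constant-LR SGD on the logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=min(batch, len(X)))
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))
        g = p - y[idx]
        w -= lr * (X[idx].T @ g) / len(idx)
        b -= lr * g.mean()
    return w, b

def cv_log_loss(X, y, lr, steps, k=5):
    """Mean held-out cross-entropy over k folds."""
    folds = np.array_split(rng.permutation(len(X)), k)
    losses = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = train_logreg(X[trn], y[trn], lr, steps)
        p = np.clip(1.0 / (1.0 + np.exp(-(X[val] @ w + b))), 1e-7, 1 - 1e-7)
        losses.append(-np.mean(y[val] * np.log(p)
                               + (1 - y[val]) * np.log(1 - p)))
    return float(np.mean(losses))

# Toy separable data standing in for tile embeddings.
X = rng.normal(size=(1000, 16))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(float)

grid = [1e-2, 5e-3, 2e-3, 1e-3, 1e-4]
best_lr = min(grid, key=lambda lr: cv_log_loss(X, y, lr, steps=200))
w, b = train_logreg(X, y, best_lr, steps=200)
acc = float(np.mean(((X @ w + b) > 0) == (y > 0.5)))
print(best_lr, acc)
```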

To ensure convergence, a different set of learning rates is used for CONCH: {5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 1e-3}, while keeping the same number of training steps.

For the sake of robustness, the above procedure is repeated 3 times with different PyTorch seeds. The values reported in the table are the average metrics of the 3*5=15 linear classifiers. Standard deviations are computed across the 3 per-seed average metrics.

Acknowledgments

This project was partially supported by computational and storage resources from the GENCI at IDRIS, thanks to the grant 2024-GC011015442 on the supercomputer Jean Zay's H100 partition.

The results published here are partly based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

Part of the data used in this report was generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC): 

National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Colon Adenocarcinoma Collection (CPTAC-COAD) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.YZWQ-ZZ63

The following datasets from TCIA were used in the benchmarks:

  • Campanella, G., Hanna, M. G., Brogi, E., & Fuchs, T. J. (2019). Breast Metastases to Axillary Lymph Nodes [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/tcia.2019.3xbn2jcc
  • Farahmand, Saman, Fernandez, Aileen I, Ahmed, Fahad Shabbir, Rimm, David L., Chuang, Jeffrey H., Reisenbichler, Emily, & Zarringhalam, Kourosh. (2022). HER2 and trastuzumab treatment response H&E slides with tumor ROI annotations (Version 3) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/E65C-AM96

Regarding the PAIP2020 dataset: De-identified pathology images and annotations used in this research were prepared and provided by the Seoul National University Hospital by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316).

1 On average, benchmarked against all other leading foundation models that were available at the time of the writing of this blog post.

2 Number of patients in the training set: UNI2-h: <350k, Virchow2: 225k, Hibou: 306k, ATLAS: 490k, Phikon-v2: <58k.