ICML 2026 · EPFL NeuroAI
Multimodal scaling laws for task & data-optimized models of visual cortex. 8 datasets, 600+ models, 4 modalities — measuring which resources improve model-to-brain alignment.
A brain model = a vision backbone + a learned mapping to multimodal recordings. We scale three controllable resources — pretraining, neural fine-tuning, and the mapping itself.
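As a rough illustration of the "backbone + learned mapping" idea, the sketch below fits a ridge-regularized linear readout from backbone features to one recorded response channel. Ridge regression, the `lam` penalty, and the toy numbers are assumptions made for illustration; this page does not specify the mapping used for each result.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge_fit(features, responses, lam=1.0):
    """Closed-form ridge: solve (X^T X + lam * I) w = X^T y."""
    n_feat = len(features[0])
    gram = [
        [sum(row[i] * row[j] for row in features) + (lam if i == j else 0.0)
         for j in range(n_feat)]
        for i in range(n_feat)
    ]
    xty = [sum(row[i] * y for row, y in zip(features, responses))
           for i in range(n_feat)]
    return solve_linear_system(gram, xty)


def predict(features, weights):
    """Apply the fitted linear readout to backbone features."""
    return [sum(f * w for f, w in zip(row, weights)) for row in features]


# Toy data (made up): 3 stimuli, 2 backbone features, 1 response channel.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
responses = [1.0, 2.0, 3.0]
weights = ridge_fit(features, responses, lam=1.0)
unregularized = ridge_fit(features, responses, lam=0.0)
```

With `lam=0.0` the fit recovers the exact weights for this toy problem; a positive `lam` shrinks them toward zero, which is the usual trade-off when paired stimulus–response samples are scarce.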
Builds on prior work. This project extends our earlier study, "Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream," which compared neural vs. behavioral alignment and different learning signals across datasets and models. Here we extend that analysis to multimodal brain benchmarks, neural fine-tuning, and readout scaling.
More data, larger models, and more compute all improve alignment, but each pretraining resource shows clear saturation; eventually, standard pretraining enters diminishing returns.
data · params · FLOPs
Hybrid task + neural supervision yields small but consistent gains that transfer across datasets and recording modalities.
hybrid supervision
Increasing paired stimulus–response samples for the model-to-brain readout produces the largest and most reliable gains of all three axes.
largest gains
Explore how pretraining, neural fine-tuning, and mapping shape alignment.
Data, model size, and compute all improve alignment, with diminishing returns at larger scales.
Brain alignment rises with pretraining compute, dataset size, and model parameters, but it plateaus across every benchmark and recording modality. Scaling data yields larger gains than scaling parameters. Curves are fit on the spvvs model set from our prior work; held-out timm models (crosses) follow the same trajectories at larger scales.
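To make the "fit on one model set, check on held-out models" procedure concrete, here is a minimal sketch that fits a saturating power law and extrapolates it to a larger scale. The functional form `S(x) = s_inf - a * x**(-b)`, the grid search over `b`, and the synthetic points are all assumptions for illustration; the form actually fitted in the paper is not stated on this page.

```python
def fit_saturating_power_law(xs, ys, exponents=None):
    """Fit y ~ s_inf - a * x**(-b).

    For each candidate exponent b, (s_inf, a) follow from ordinary least
    squares of y on z = x**(-b); the b with the lowest squared error wins.
    """
    if exponents is None:
        exponents = [0.01 * k for k in range(1, 201)]  # b from 0.01 to 2.00
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for b in exponents:
        z = [x ** (-b) for x in xs]
        z_mean = sum(z) / n
        z_var = sum((zi - z_mean) ** 2 for zi in z)
        if z_var == 0.0:
            continue
        slope = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, ys)) / z_var
        intercept = y_mean - slope * z_mean
        sse = sum((intercept + slope * zi - yi) ** 2 for zi, yi in zip(z, ys))
        if best is None or sse < best[0]:
            best = (sse, intercept, -slope, b)
    _, s_inf, a, b = best
    return s_inf, a, b


def predict_alignment(x, s_inf, a, b):
    """Evaluate the fitted curve at scale x."""
    return s_inf - a * x ** (-b)


# Synthetic, noiseless points (made up) in relative compute units.
compute = [1.0, 10.0, 100.0, 1000.0, 10000.0]
alignment = [0.6 - 0.3 * c ** (-0.5) for c in compute]

s_inf, a, b = fit_saturating_power_law(compute, alignment)
held_out_prediction = predict_alignment(1e5, s_inf, a, b)
```

Here `s_inf` plays the role of the plateau: as `x` grows, the predicted alignment approaches it but never exceeds it, which is the diminishing-returns behavior described above.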
Scaling results with additional alignment metrics (RSA and CKA) are in the appendix.
Noise-ceiled alignment (Pearson r) S increases with (a) pretraining FLOPs C, (b) samples D, and (c) parameters N, but ultimately plateaus. Points are individual models; solid curves are fitted scaling laws with uncertainty bands.
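For readers unfamiliar with noise-ceiled scores, the sketch below shows one common convention: divide the raw model-to-brain Pearson r by a reliability ceiling estimated from split-half responses with a Spearman-Brown correction. Whether the paper uses exactly this ceiling estimate is an assumption; the function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(split_half_r):
    """Correct a split-half correlation to the reliability of the full data."""
    return 2.0 * split_half_r / (1.0 + split_half_r)


def noise_ceiled_alignment(predicted, half_a, half_b):
    """Raw r against the mean response, divided by the reliability ceiling."""
    mean_response = [(a + b) / 2.0 for a, b in zip(half_a, half_b)]
    raw_r = pearson_r(predicted, mean_response)
    ceiling = spearman_brown(pearson_r(half_a, half_b))
    return raw_r / ceiling


# Toy responses (made up): two halves of repeated trials for four stimuli.
half_a = [1.0, 2.0, 3.0, 4.0]
half_b = [1.2, 1.8, 3.3, 3.9]
predicted = [0.9, 2.5, 2.7, 4.2]
score = noise_ceiled_alignment(predicted, half_a, half_b)
```

Because the ceiling is at most 1, the normalized score is never smaller in magnitude than the raw correlation, so noisy recordings are not penalized for variance no model could explain.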
At small data and compute budgets, architectures differ sharply in alignment; those gaps shrink as pretraining scales up. Training trajectories of different model sizes converge to similar saturation levels — so widely used architectures reach comparable alignment to far more expensive models.
Across model sizes and architecture families, alignment gaps are largest at small scale and shrink as pretraining data and compute increase.
More accurate object-recognition models are more brain-aligned. The highest-accuracy models, however, cluster at similar alignment levels — the same saturation we see under compute. The relationship holds for both ImageNet- and Ecoset-pretrained models, suggesting it reflects general properties of good recognition rather than dataset-specific labels.
Per-ROI accuracy–alignment plots across datasets are in the appendix.
Fine-tuning a backbone with a hybrid task + neural objective improves alignment as paired neural samples grow, with no clear saturation at current dataset sizes. Gains are largest for the fMRI datasets and T-MEG; T-EEG1, which has the lowest signal-to-noise ratio, is the main exception.
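A hybrid task + neural objective can be sketched as a weighted sum of a classification loss and a neural prediction loss. The choice of cross-entropy and mean squared error, and the `neural_weight` parameter, are assumptions for illustration; the exact loss terms and weighting used in the paper are not given on this page.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[label]


def mean_squared_error(predicted, target):
    """MSE between predicted and recorded neural responses."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def hybrid_loss(logits, label, neural_pred, neural_target, neural_weight=1.0):
    """Task loss plus a weighted neural-prediction loss (hypothetical weighting)."""
    task_term = cross_entropy(logits, label)
    neural_term = mean_squared_error(neural_pred, neural_target)
    return task_term + neural_weight * neural_term


# Toy example (made up): 2-way classification and 2 recorded channels.
loss = hybrid_loss(
    logits=[0.0, 0.0],
    label=0,
    neural_pred=[1.0, 2.0],
    neural_target=[1.0, 4.0],
    neural_weight=0.5,
)
```

Setting `neural_weight=0.0` recovers plain task training, so the weight controls how strongly the paired neural samples pull on the backbone.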
Per-dataset, per-benchmark curves with more architectures and permutation tests are in the appendix.
We fine-tune ViT-S on one dataset and evaluate it on all benchmarks. Most transfers are positive across electrophysiology, fMRI, EEG, and MEG. The gains are modest but consistent (~0.01–0.02 absolute), and some cross-modal transfers are stronger than within-modality ones.
Each panel fine-tunes ViT-S on one dataset (panel title); bars show the change in brain alignment relative to the pretrained baseline across all benchmarks.
The alignment gains from fine-tuning come with a slight drop in task accuracy. For ViT-S, full-data fine-tuning moves accuracy 0.783 → 0.769 while mean alignment rises 0.529 → 0.535; the same direction holds for ViT-B/L, ConvNeXt-S, and ResNet-50, with the largest alignment gain for ViT-L (0.539 → 0.547).
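The trade-off is easier to compare as absolute changes. The short snippet below computes them from the numbers quoted above; only the variable names are introduced here.

```python
# (before, after) pairs copied from the text above.
before_after = {
    "ViT-S task accuracy": (0.783, 0.769),
    "ViT-S mean alignment": (0.529, 0.535),
    "ViT-L mean alignment": (0.539, 0.547),
}

# Absolute change for each quantity, rounded to three decimals.
deltas = {
    name: round(after - before, 3)
    for name, (before, after) in before_after.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

For ViT-S this is an accuracy change of -0.014 against an alignment change of +0.006, and for ViT-L an alignment change of +0.008.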
Increasing the number of stimulus–neural pairs used to fit the model-to-brain mapping yields consistent, roughly log-linear gains, the largest of any axis. For most datasets the fitted curves do not saturate over the observed range, indicating alignment remains mapping-data-limited.
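A roughly log-linear trend means alignment grows by a near-constant amount each time the number of mapping samples is multiplied by a fixed factor. The sketch below fits `S(n) = intercept + slope * log10(n)` by ordinary least squares; the data points are synthetic and the function name is illustrative, not taken from the paper's code.

```python
import math


def fit_log_linear(sample_counts, scores):
    """Least-squares fit of score = intercept + slope * log10(sample_count)."""
    xs = [math.log10(n) for n in sample_counts]
    count = len(xs)
    x_mean = sum(xs) / count
    y_mean = sum(scores) / count
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Synthetic points (made up): alignment at three mapping-set sizes.
sample_counts = [100, 1000, 10000]
scores = [0.30, 0.40, 0.50]

intercept, slope = fit_log_linear(sample_counts, scores)

# Under this fit, doubling the mapping data adds slope * log10(2) alignment.
gain_per_doubling = slope * math.log10(2)
```

A fit like this has no plateau term, which matches the observation that most curves do not saturate over the observed range; it should not be extrapolated far beyond the data.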
Attention-readout scaling and implementation details are in the appendix.
The appendix adds further methodological detail — including data preprocessing — and additional analysis, such as the consistency of layer commitment across datasets.
Beyond the scaling analyses, we introduce a readout that shares a cross-attention block across subjects while keeping lightweight subject- and ROI-specific output heads. The multi-subject version matches or exceeds per-subject linear decoding on TVSD-EP, T-fMRI, NSD-fMRI, and T-EEG1 (linear decoding remains ahead on T-EEG2 and T-MEG) and reaches the best average alignment overall, while using ~10× fewer parameters than the linear baseline.
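The structure of such a readout can be sketched in a few lines: one attention step whose parameters are shared by every subject, followed by a small linear head per subject. This is a deliberately simplified, single-query, single-head version without key/value projections; the class name, the per-subject head layout, and the toy numbers are assumptions, not the paper's implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(query, tokens):
    """One learned query attends over backbone tokens (tokens act as keys and values)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, token) / scale for token in tokens])
    dim = len(tokens[0])
    return [sum(w * token[i] for w, token in zip(weights, tokens)) for i in range(dim)]


class SharedAttentionReadout:
    """Shared attention pooling plus lightweight subject-specific linear heads."""

    def __init__(self, shared_query, subject_heads):
        self.shared_query = shared_query    # parameters shared across subjects
        self.subject_heads = subject_heads  # subject id -> one weight vector per output unit

    def predict(self, tokens, subject):
        pooled = cross_attention_pool(self.shared_query, tokens)
        return [dot(weights, pooled) for weights in self.subject_heads[subject]]


# Toy setup (made up): 2 backbone tokens of width 2, two subjects.
readout = SharedAttentionReadout(
    shared_query=[0.0, 0.0],
    subject_heads={
        "subject_1": [[2.0, 0.0]],
        "subject_2": [[1.0, 1.0], [0.0, 4.0]],
    },
)
tokens = [[1.0, 0.0], [0.0, 1.0]]
pred_1 = readout.predict(tokens, "subject_1")
pred_2 = readout.predict(tokens, "subject_2")
```

Because the attention parameters are stored once and reused for every subject, adding a subject only adds the cost of its head, which is one plausible reason a shared readout can need far fewer parameters than a separate full linear map per subject.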
Noise-ceiled Pearson r, averaged across subjects, ROIs, and three seeds. SS = single-subject, MS = multi-subject. Best per row in bold.
| Benchmark | Linear (SS) | Shallow MLP (SS) | Shallow MLP (MS) | Low-rank (SS) | Attention (SS) | Attention (MS) |
|---|---|---|---|---|---|---|
| TVSD-EP | 0.801 | 0.665 | 0.673 | 0.656 | 0.844 | **0.846** |
| NSD-fMRI | 0.693 | 0.625 | 0.658 | 0.623 | 0.677 | **0.723** |
| T-fMRI | 0.458 | 0.390 | 0.403 | 0.370 | 0.461 | **0.487** |
| T-EEG1 | 0.245 | 0.220 | 0.217 | 0.216 | 0.234 | **0.257** |
| T-EEG2 | **0.447** | 0.367 | 0.395 | 0.365 | 0.390 | 0.434 |
| T-MEG | **0.453** | 0.374 | 0.386 | 0.339 | 0.410 | 0.414 |
| Average | 0.516 | 0.440 | 0.455 | 0.428 | 0.503 | **0.527** |
| Avg. #Params (×10⁷) | 23.03 | 2.53 | 2.06 | 3.31 | 2.53 | 2.06 |
Dataset abbreviations: TVSD-EP = THINGS Ventral-stream Spiking Dataset electrophysiology; NSD-fMRI = Natural Scenes Dataset fMRI; T-fMRI = THINGS-fMRI; T-EEG1/2 = THINGS-EEG; T-MEG = THINGS-MEG.
Hosted PDF
Forum and metadata
Training, extraction, evaluation & curve fitting
Published result tables on Hugging Face
Poster and virtual session
@inproceedings{gokce2026multimodal_brain_scaling,
  title     = {Multimodal Scaling Laws for Task \& Data-Optimized Models of Visual Cortex},
  author    = {Abdulkadir Gokce and Yingtian Tang and Martin Schrimpf},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=OQ6jQHJPTT}
}