// ICML 2026 · EPFL NeuroAI

scaling brain models

Multimodal scaling laws for task & data-optimized models of visual cortex. 8 datasets, 600+ models, 4 modalities — measuring which resources improve model-to-brain alignment.

Abdulkadir Gokce · Yingtian Tang · Martin Schrimpf

Overview: resource scaling, task pretraining, neural fine-tuning, and the mapping procedure

A brain model = a vision backbone + a learned mapping to multimodal recordings. We scale three controllable resources — pretraining, neural fine-tuning, and the mapping itself.
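As a rough illustration of the "backbone + learned mapping" idea, the sketch below fits a linear readout from features to responses and scores it with a noise-ceiled Pearson r. It assumes a closed-form ridge regression and a per-unit ceiling that simply divides the raw correlation; the page does not specify either choice, and the function names and synthetic data are hypothetical stand-ins for real backbone features and recordings.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def pearson_per_column(A, B):
    """Pearson r between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return (A * B).sum(axis=0) / (np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0))

def alignment_score(f_train, r_train, f_test, r_test, ceiling, alpha=1.0):
    """Fit the mapping on training pairs, then average ceiling-normalized r on held-out pairs."""
    W = fit_ridge(f_train, r_train, alpha)
    r = pearson_per_column(f_test @ W, r_test)
    return float(np.mean(r / ceiling))  # assumed convention: divide r by the per-unit ceiling

# Synthetic stand-ins (not real data): 400 stimuli, 32 features, 5 recorded units.
rng = np.random.default_rng(0)
feats = rng.normal(size=(400, 32))
resp = feats @ rng.normal(size=(32, 5)) + 0.5 * rng.normal(size=(400, 5))

score = alignment_score(feats[:300], resp[:300], feats[300:], resp[300:], ceiling=np.ones(5))
print(f"alignment = {score:.3f}")
```

In this toy setup the ceiling is set to 1 for every unit, so the score is just the mean held-out correlation.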

Builds on prior work. This project extends our study Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream, which compared neural and behavioral alignment and different learning signals across datasets and models. Here we extend that analysis to multimodal brain benchmarks, neural fine-tuning, and readout scaling.

Three controllable axes of alignment

01

Pretraining

More data, larger models, and more compute all improve alignment, but every pretraining resource saturates: standard pretraining eventually yields diminishing returns.

data · params · FLOPs

02

Neural fine-tuning

Hybrid task + neural supervision yields small but consistent gains that transfer across datasets and recording modalities.

hybrid supervision

03

Mapping

Increasing paired stimulus–response samples for the model-to-brain readout produces the largest and most reliable gains of all three axes.

largest gains

What the curves show

Finding 01 · Pretraining

Pretraining helps, then saturates

Brain alignment rises with pretraining compute, dataset size, and model parameters, but it plateaus across every benchmark and recording modality. Scaling data yields larger gains than scaling parameters. Curves are fit on the spvvs model set from our prior work; held-out timm models (crosses) follow the same trajectories at larger scales.

Scaling results with additional alignment metrics (RSA and CKA) are in the appendix.

Alignment vs pretraining FLOPs, samples, and parameters — all saturate

Noise-ceiled alignment (Pearson r) S increases with (a) pretraining FLOPs C, (b) samples D, and (c) parameters N, but ultimately plateaus. Points are individual models; solid curves are fitted scaling laws with uncertainty bands.
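A saturating curve of this kind can be sketched with a simple fit. The example below assumes the form S(C) = S_inf - a * C^(-alpha), which is one common way to express a plateau; the exact parameterization used for the fitted curves is not given on this page, and the data points here are generated from the assumed form rather than taken from any model.

```python
import math

def fit_saturating_power_law(xs, ys, n_grid=200):
    """Fit y = s_inf - a * x**(-alpha) by grid search over s_inf.

    For a fixed s_inf, log(s_inf - y) = log(a) - alpha * log(x) is linear,
    so a and alpha follow from ordinary least squares in log space.
    """
    lx = [math.log(x) for x in xs]
    mean_lx = sum(lx) / len(lx)
    sxx = sum((u - mean_lx) ** 2 for u in lx)
    y_max = max(ys)
    best = None
    for i in range(1, n_grid + 1):
        s_inf = y_max + 0.5 * i / n_grid  # candidate plateaus above the best observed score
        lz = [math.log(s_inf - y) for y in ys]
        mean_lz = sum(lz) / len(lz)
        slope = sum((u - mean_lx) * (v - mean_lz) for u, v in zip(lx, lz)) / sxx
        a, alpha = math.exp(mean_lz - slope * mean_lx), -slope
        sse = sum((s_inf - a * x ** (-alpha) - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, s_inf, a, alpha)
    return best[1:]

# Synthetic points generated from the assumed form (not measured results).
flops = [10 ** k for k in range(12, 21)]
scores = [0.55 - 3.0 * c ** (-0.1) for c in flops]

s_inf, a, alpha = fit_saturating_power_law(flops, scores)
print(f"s_inf={s_inf:.3f}, a={a:.2f}, alpha={alpha:.3f}")
```

The fitted s_inf is the estimated plateau: the alignment the curve approaches as compute grows without bound.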

Finding 02 · Pretraining

Architectural differences narrow with scale

At small data and compute budgets, architectures differ sharply in alignment; those gaps shrink as pretraining scales up. Training trajectories of different model sizes converge to similar saturation levels, so widely used architectures reach alignment comparable to that of far more expensive models.

Average alignment vs pretraining FLOPs, colored by model size

Average alignment vs pretraining samples across architecture families

Across model sizes and architecture families, alignment gaps are largest at small scale and shrink as pretraining data and compute increase.

Finding 03 · Pretraining

Task accuracy tracks alignment, with diminishing returns

More accurate object-recognition models are more brain-aligned. The highest-accuracy models, however, cluster at similar alignment levels — the same saturation we see under compute. The relationship holds for both ImageNet- and Ecoset-pretrained models, suggesting it reflects general properties of good recognition rather than dataset-specific labels.

Per-ROI accuracy–alignment plots across datasets are in the appendix.

Mean alignment vs validation accuracy across architecture families and pretraining datasets

Finding 04 · Fine-tuning

Neural fine-tuning scales with data

Fine-tuning a backbone with a hybrid task + neural objective improves alignment as paired neural samples grow, with no clear saturation at current dataset sizes. Gains are largest for the fMRI datasets and T-MEG; T-EEG1, which has the lowest signal-to-noise ratio, is the main exception.

Per-dataset, per-benchmark curves with more architectures and permutation tests are in the appendix.

Average alignment vs neural training samples for ViT-S fine-tuned per dataset

Finding 05 · Fine-tuning

Fine-tuning transfers across modalities

We fine-tune ViT-S on one dataset and evaluate it on all benchmarks. Most transfers are positive across electrophysiology, fMRI, EEG, and MEG. The gains are modest but consistent (~0.01–0.02 absolute), and some cross-modal transfers are stronger than within-modality ones.

Change in alignment across all benchmarks after fine-tuning on each dataset

Each panel fine-tunes ViT-S on one dataset (panel title); bars show the change in brain alignment relative to the pretrained baseline across all benchmarks.

Finding 06 · Fine-tuning

A small accuracy cost for alignment

The alignment gains from fine-tuning come with a slight drop in task accuracy. For ViT-S, full-data fine-tuning moves accuracy 0.783 → 0.769 while mean alignment rises 0.529 → 0.535; the same direction holds for ViT-B/L, ConvNeXt-S, and ResNet-50, with the largest alignment gain for ViT-L (0.539 → 0.547).
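The size of this trade-off is easy to work out from the numbers quoted above. The snippet below only restates those figures as absolute and relative changes; the variable names are illustrative.

```python
# Values quoted in the text (ViT-S, full-data fine-tuning).
acc_before, acc_after = 0.783, 0.769
align_before, align_after = 0.529, 0.535

acc_delta = round(acc_after - acc_before, 3)        # absolute change in task accuracy
align_delta = round(align_after - align_before, 3)  # absolute change in mean alignment
vit_l_gain = round(0.547 - 0.539, 3)                # ViT-L alignment gain quoted in the text

acc_rel = round(100 * acc_delta / acc_before, 1)        # relative change, in percent
align_rel = round(100 * align_delta / align_before, 1)  # relative change, in percent

print(acc_delta, align_delta, vit_l_gain)  # -0.014 0.006 0.008
print(acc_rel, align_rel)                  # -1.8 1.1
```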

Task accuracy vs mean alignment between baseline and fine-tuned states

Finding 07 · Mapping

Mapping data yields the largest gains

Increasing the number of stimulus–neural pairs used to fit the model-to-brain mapping yields consistent, roughly log-linear gains, the largest of any axis. For most datasets the fitted curves do not saturate over the observed range, indicating alignment remains mapping-data-limited.
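"Roughly log-linear" means alignment grows by about the same amount each time the number of mapping samples is multiplied by a fixed factor. A minimal sketch of such a fit, assuming the form score = a + b * log10(n), is shown below; the sample counts and scores are made-up illustrative values, not results from any dataset.

```python
import math

def fit_log_linear(ns, scores):
    """Least-squares fit of score = a + b * log10(n)."""
    xs = [math.log10(n) for n in ns]
    mx = sum(xs) / len(xs)
    my = sum(scores) / len(scores)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Illustrative numbers only (hypothetical mapping-set sizes and alignment scores).
ns = [100, 300, 1000, 3000, 10000]
scores = [0.20, 0.25, 0.29, 0.34, 0.38]

a, b = fit_log_linear(ns, scores)
gain_per_doubling = b * math.log10(2)  # expected gain from doubling the mapping data
print(f"slope per decade = {b:.3f}, gain per doubling = {gain_per_doubling:.3f}")
```

A curve that is still linear in log10(n) at the largest available n is what "mapping-data-limited" looks like in practice: there is no visible plateau yet.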

Attention-readout scaling and implementation details are in the appendix.

Within-dataset alignment vs neural samples used to fit a linear mapping

The appendix adds further methodological detail — including data preprocessing — and additional analysis, such as the consistency of layer commitment across datasets.

Subject-shared cross-attention readout

Beyond the scaling analyses, we introduce a readout that shares a cross-attention block across subjects while keeping lightweight subject- and ROI-specific output heads. It matches or exceeds per-subject linear decoding on TVSD-EP, T-fMRI, NSD-fMRI, and T-EEG1, and it reaches the best average alignment overall while using ~10× fewer parameters than the linear baseline.

  • A shared cross-attention block pools structure across subjects
  • Subject- & ROI-specific heads preserve individual differences
  • Substantially fewer parameters than per-subject linear readouts
  • A switchable plug-in on top of frozen backbone features
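The design in the list above can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the learned query vectors, single-head attention, key/value projections, and flatten-then-linear heads are all guesses at one plausible layout, the class and argument names are hypothetical, and the weights are random and untrained.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class SharedCrossAttentionReadout:
    """Shared cross-attention block plus one linear head per (subject, ROI)."""

    def __init__(self, d_model, n_queries, units_per_head, seed=0):
        rng = np.random.default_rng(seed)
        s = d_model ** -0.5
        # Shared across all subjects: learned queries and key/value projections.
        self.queries = rng.normal(scale=s, size=(n_queries, d_model))
        self.w_k = rng.normal(scale=s, size=(d_model, d_model))
        self.w_v = rng.normal(scale=s, size=(d_model, d_model))
        # Subject- and ROI-specific: one lightweight linear head per key.
        self.heads = {
            key: rng.normal(scale=s, size=(n_queries * d_model, n_units))
            for key, n_units in units_per_head.items()
        }

    def forward(self, tokens, key):
        """tokens: (batch, n_tokens, d_model) features from a frozen backbone."""
        k = tokens @ self.w_k
        v = tokens @ self.w_v
        scores = np.einsum("qd,btd->bqt", self.queries, k) / np.sqrt(k.shape[-1])
        attn = softmax(scores, axis=-1)   # each query attends over the backbone tokens
        pooled = attn @ v                 # (batch, n_queries, d_model)
        return pooled.reshape(len(tokens), -1) @ self.heads[key]

# Hypothetical subjects/ROIs and random features, for shape checking only.
readout = SharedCrossAttentionReadout(
    d_model=16, n_queries=4,
    units_per_head={("sub-01", "V1"): 10, ("sub-02", "IT"): 7},
)
tokens = np.random.default_rng(1).normal(size=(8, 49, 16))
pred = readout.forward(tokens, ("sub-01", "V1"))
print(pred.shape)  # (8, 10)
```

Because the queries and projections are shared, adding a subject only adds one head, which is where the parameter savings over per-subject linear readouts would come from in a layout like this.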

Readout comparison across benchmarks

Noise-ceiled Pearson r, averaged across subjects, ROIs, and three seeds. SS = single-subject, MS = multi-subject. The best alignment in each row is marked with *.

Benchmark             Linear SS  Shallow MLP SS  Shallow MLP MS  Low-rank SS  Attention SS  Attention MS
TVSD-EP               0.801      0.665           0.673           0.656        0.844         0.846*
NSD-fMRI              0.693      0.625           0.658           0.623        0.677         0.723*
T-fMRI                0.458      0.390           0.403           0.370        0.461         0.487*
T-EEG1                0.245      0.220           0.217           0.216        0.234         0.257*
T-EEG2                0.447*     0.367           0.395           0.365        0.390         0.434
T-MEG                 0.453*     0.374           0.386           0.339        0.410         0.414
Average               0.516      0.440           0.455           0.428        0.503         0.527*
Avg. #Params (×10^7)  23.03      2.53            2.06            3.31         2.53          2.06

Dataset abbreviations: TVSD-EP = THINGS Ventral-stream Spiking Dataset electrophysiology; NSD-fMRI = Natural Scenes Dataset fMRI; T-fMRI = THINGS-fMRI; T-EEG1/2 = THINGS-EEG; T-MEG = THINGS-MEG.
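As a quick consistency check on the readout comparison table, the snippet below recomputes the Average row from the six benchmark rows and the parameter ratio between the linear baseline and the multi-subject attention readout. All numbers are copied from the table; only the variable names are new.

```python
# Column order: Linear SS, Shallow MLP SS, Shallow MLP MS, Low-rank SS, Attention SS, Attention MS
rows = {
    "TVSD-EP":  [0.801, 0.665, 0.673, 0.656, 0.844, 0.846],
    "NSD-fMRI": [0.693, 0.625, 0.658, 0.623, 0.677, 0.723],
    "T-fMRI":   [0.458, 0.390, 0.403, 0.370, 0.461, 0.487],
    "T-EEG1":   [0.245, 0.220, 0.217, 0.216, 0.234, 0.257],
    "T-EEG2":   [0.447, 0.367, 0.395, 0.365, 0.390, 0.434],
    "T-MEG":    [0.453, 0.374, 0.386, 0.339, 0.410, 0.414],
}
params_e7 = [23.03, 2.53, 2.06, 3.31, 2.53, 2.06]  # average parameter counts, in units of 10^7

# Column-wise mean over the six benchmarks, rounded like the table.
averages = [round(sum(col) / len(col), 3) for col in zip(*rows.values())]
ratio = params_e7[0] / params_e7[-1]  # Linear SS vs Attention MS

print(averages)
print(f"Linear SS uses about {ratio:.1f}x the parameters of Attention MS")
```

The recomputed means agree with the Average row, and the ratio of roughly 11 is in line with the "~10× fewer parameters" figure quoted above.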

Citation

@inproceedings{
  gokce2026multimodal_brain_scaling,
  title     = {Multimodal Scaling Laws for Task \& Data-Optimized Models of Visual Cortex},
  author    = {Abdulkadir Gokce and Yingtian Tang and Martin Schrimpf},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=OQ6jQHJPTT}
}