Pingchuan Ma* · Xiaopei Yang* · Yusong Li
Ming Gui · Felix Krause · Johannes Schusterbauer · Björn Ommer
CompVis Group @ LMU Munich · Munich Center for Machine Learning (MCML)
* equal contribution
📄 ICCV 2025
🔥 News
- [06.2026] This repository now also contains the official code for the paper: catFM: Contrastive-Augmented Flow Matching for Style-Content Disentanglement, a follow-up work currently under review at TPAMI.
- [10.2025] Released training code and dataset splits.
- [10.2025] Released the full 512px image dataset.
- [08.2025] Released inference code and pretrained checkpoints.
- [08.2025] ICCV paper available on arXiv.
Important
The original SCFlow (ICCV 2025) implementation remains the default training and inference pipeline. This repository additionally includes the implementation of catFM, a follow-up method currently under review at TPAMI.
This repository contains the official implementation of the paper "SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models". We propose a flow-matching framework that learns an invertible mapping between style-content mixtures and their separate representations, avoiding explicit disentanglement objectives. Together with the method, we curated a synthetic dataset of 510k samples, consisting of 10k content instances and 51 distinct styles.
Create the environment with conda:
conda create -n scflow python=3.10
conda activate scflow
pip install -r requirements.txt
The environment was tested on Ubuntu 22.04.5 LTS with CUDA 12.1. You can optionally install jupyter-notebook to run the notebook provided in notebooks.
Download the model checkpoints:
mkdir ckpts
cd ckpts
# model checkpoint
wget https://huggingface.co/CompVis/SCFlow/resolve/main/scflow_last.ckpt
# unclip checkpoint for visualization
wget https://huggingface.co/CompVis/SCFlow/resolve/main/sd21-unclip-l.ckpt
Download the training and test splits of the dataset:
# return to parent dir
cd ..
mkdir dataset
cd dataset
# training split with metadata, e.g., content and style idx, content description, etc.
wget https://huggingface.co/CompVis/SCFlow/resolve/main/train.h5
# test split with metadata, e.g., content and style idx, content description, etc.
wget https://huggingface.co/CompVis/SCFlow/resolve/main/test.h5
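Before launching the scripts, it can help to confirm that the downloads landed where the commands above put them. The following is a minimal sketch and not part of the repository: the helper name `missing_files` is hypothetical, and the paths simply mirror the mkdir/wget commands above, assuming they were run from the repository root.

```python
from pathlib import Path

# Relative paths produced by the download commands above
# (assumes they were run from the repository root).
EXPECTED_FILES = [
    "ckpts/scflow_last.ckpt",
    "ckpts/sd21-unclip-l.ckpt",
    "dataset/train.h5",
    "dataset/test.h5",
]


def missing_files(root="."):
    """Return the expected files that are absent or empty under `root`."""
    root = Path(root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # An empty file usually points to an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_files():
        print(f"missing or empty: {rel}")
```

An empty result means all four files are present and non-empty; the check does not validate their contents.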
The following bash scripts are simple wrappers for a quick start. You can adjust the arguments by calling training.py and inference.py directly.
Inference forward (merge content and style)
bash scripts/inference_forward.sh
Inference reverse (disentangle content and style from a given reference)
bash scripts/inference_reverse.sh
For training, you would need ~22GB with the default setting.
bash scripts/training.sh
This repository additionally includes the implementation of catFM, a follow-up work built upon SCFlow and currently under review at TPAMI.
Compared to SCFlow, catFM introduces:
- contrastive regularization on style and content embeddings,
- multiple endpoint prediction objectives,
- improved style-content disentanglement and retrieval performance.
The original SCFlow pipeline remains the default. To train catFM, use:
bash scripts/catfm_training.sh
You can also customize the training configuration directly from the command line:
python training.py --config configs/catfm_training.yaml train.dml_type=MultiSimilarity train.predict_x0x1=True
For catFM metric losses (train.dml_type != null), install the optional dependency:
pip install pytorch-metric-learning
catFM checkpoints can be used by the same inference script:
python inference.py \
--model_type catfm \
--config configs/inference.yaml \
--resume_checkpoint path/to/catfm.ckpt \
--image_c_path path/to/content.jpg \
--image_s_path path/to/style.jpg \
--unclip_ckpt ckpts/sd21-unclip-l.ckpt
We host the dataset on HF (currently only the CLIP embeddings and their corresponding metadata, due to the space limit). You can download them as instructed in the section above. The file train.h5 (the same holds for test.h5) is an HDF5 dataset storing embeddings and metadata useful for training. You can load it in Python with:
import h5py
train = h5py.File("./dataset/train.h5", "r")
The main groups inside are:
- images: Contains CLIP embeddings with shape (357000, 768), representing feature vectors for training samples.
- metadata: Contains descriptive information with keys:
  - content_description
  - content_idx
  - style_idx
  - style_name
Note: Some metadata entries can be duplicated because there are 7000 content variations for training and 3000 for testing. This means the same content with different styles will have identical content_description and content_idx.
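To check a downloaded split against the numbers above, one option is to walk the file and collect every dataset with its shape. The sketch below is an illustration rather than repository code: `dataset_shapes` is a hypothetical helper, and it makes no assumption about the exact group layout beyond what h5py reports. The expected row counts follow from the figures quoted above: 7000 training contents times 51 styles gives the 357000 rows of the training embeddings.

```python
import h5py

N_STYLES = 51            # distinct styles in the dataset
N_TRAIN_CONTENTS = 7000  # content variations in the training split
N_TEST_CONTENTS = 3000   # content variations in the test split

# Every content appears once per style.
EXPECTED_TRAIN_ROWS = N_TRAIN_CONTENTS * N_STYLES  # 357000
EXPECTED_TEST_ROWS = N_TEST_CONTENTS * N_STYLES    # 153000


def dataset_shapes(h5_file):
    """Map each dataset path in an open h5py.File to its shape."""
    shapes = {}

    def visit(name, obj):
        # visititems also yields groups; keep only datasets.
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape

    h5_file.visititems(visit)
    return shapes
```

Calling `dataset_shapes` on the opened train.h5 lists every stored array, so the first dimension of the embedding array can be compared with `EXPECTED_TRAIN_ROWS`. The two expected counts sum to the 510k samples of the full dataset.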
We hosted the original images on HF. You should be able to download them by calling:
# The zip file is around 36.5 GB.
wget https://huggingface.co/CompVis/SCFlow/resolve/main/raw_512px.zip
The archive is organized by style, then by content id, e.g., Cubism/00001.jpg ... 10000.jpg, where the content ids are consistent across different styles.
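Because content ids are shared across styles, the same content in two styles can be located by changing only the directory name. A minimal sketch under stated assumptions: the archive is extracted to a directory named raw_512px (the actual name after unzipping may differ), ids are zero-padded to five digits as in the example above, and `image_path` is a hypothetical helper, not a repository function.

```python
from pathlib import Path


def image_path(style, content_id, root="raw_512px"):
    """Build the path of one image; `root` is an assumed extraction directory."""
    if not 1 <= content_id <= 10000:
        raise ValueError("content ids run from 1 to 10000")
    # Five-digit zero padding, following the Cubism/00001.jpg example.
    return Path(root) / style / f"{content_id:05d}.jpg"


# The same content id addresses the same content in every style directory.
print(image_path("Cubism", 1).as_posix())  # raw_512px/Cubism/00001.jpg
```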
If you use this codebase or dataset, or find our work valuable, please cite our paper:
@inproceedings{ma2025scflow,
author = {Ma, Pingchuan and Yang, Xiaopei and Li, Yusong and Gui, Ming and Krause, Felix and Schusterbauer, Johannes and Ommer, Bj\"orn},
title = {SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {14919-14929}
}