Patch-Discontinuity Mining for Generalized Deepfake Detection

1Shanghai Jiao Tong University, 2Chinese Academy of Military Science
Code (Coming Soon) arXiv

Abstract

Overview image

The advancement of generative artificial intelligence has led to increasingly diverse and realistic fake facial images, posing serious threats to personal privacy and contributing to the spread of misinformation. Existing deepfake detection methods usually exploit prior knowledge about forgery clues to design complex modules, achieving excellent performance in intra-domain settings. However, their performance usually declines significantly on unseen forgery patterns. It is thus desirable to develop a generalized deepfake detection method with a neat network structure. In this paper, we propose a simple yet efficient framework, namely the generalized deepfake detection framework (GenDF), to transfer a powerful large-scale vision model such as ViT to the downstream deepfake detection task. Concretely, we first propose a deepfake-specific representation learning (DSRL) scheme to learn, in a low-dimensional space, the discontinuity patterns across patches within a fake facial image and the continuity between patches within a real counterpart. To further alleviate the distribution mismatch between generic real images and human facial images consisting of both real and fake samples, we introduce a feature space redistribution (FSR) scheme that separately optimizes the distributions of the real and fake feature spaces, enabling the model to learn more distinctive representations. Furthermore, to enhance generalization to unseen forgery patterns produced by constantly evolving facial manipulation techniques and by diverse variations of real faces, we propose a class-invariant feature augmentation (CIFAug) function without trainable parameters. Extensive experiments demonstrate that our method achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings with only 0.28M trainable parameters.
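To make the idea behind CIFAug more concrete, the sketch below shows one way a parameter-free, class-invariant feature augmentation can be realized in PyTorch: each feature vector is interpolated with a randomly selected feature of the same class, so the label is preserved while the representation is perturbed. This is only an illustration under our own assumptions (the function name cif_aug, the Beta-distributed mixing coefficient, and the (batch, dim) feature layout are not taken from the paper), not the exact CIFAug formulation.

import torch

def cif_aug(features: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Illustrative parameter-free, class-invariant feature augmentation.

    `features` is a (batch, dim) tensor and `labels` a (batch,) tensor of
    class ids. Each feature is mixed with a randomly chosen feature that
    shares its label, so the class assignment stays valid while the
    representation varies. A sketch of the general idea, not the paper's
    exact CIFAug.
    """
    augmented = features.clone()
    for cls in labels.unique():
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        if idx.numel() < 2:
            continue  # nothing to mix within a singleton class
        perm = idx[torch.randperm(idx.numel(), device=idx.device)]
        lam = torch.distributions.Beta(alpha, alpha).sample((idx.numel(), 1)).to(features.device)
        augmented[idx] = lam * features[idx] + (1.0 - lam) * features[perm]
    return augmented

Because the mixing uses only random coefficients and existing features, this kind of augmentation introduces no trainable parameters, matching the abstract's description of CIFAug as parameter-free.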

Method

Method overview

The pipeline of the proposed GenDF framework. The facial images (real or fake) first pass through a ViT backbone to obtain representation embeddings in a low-dimensional space. Then, we optimize the feature distributions to learn more discriminative features. Next, the real and fake features go through the class-invariant feature augmentation (CIFAug) procedure to improve the generalization ability of our method.
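For a more concrete picture of this pipeline, the sketch below shows one plausible arrangement in PyTorch: a frozen pre-trained ViT backbone, a small trainable projection into a low-dimensional space, and a binary real/fake classifier, consistent with the small trainable parameter budget reported in the abstract. The module name GenDFHead, the use of timm, and the layer sizes are our own assumptions for illustration; the feature-distribution optimization and CIFAug steps defined in the paper would act on the low-dimensional features produced here.

import timm
import torch
import torch.nn as nn

class GenDFHead(nn.Module):
    """Illustrative head on top of a frozen ViT backbone (not the official code)."""

    def __init__(self, backbone_name: str = "vit_base_patch16_224",
                 embed_dim: int = 768, low_dim: int = 128):
        super().__init__()
        # Transfer a large-scale pre-trained ViT without fine-tuning it.
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.project = nn.Linear(embed_dim, low_dim)  # low-dimensional embedding
        self.classifier = nn.Linear(low_dim, 2)       # real vs. fake

    def forward(self, images: torch.Tensor):
        with torch.no_grad():
            feats = self.backbone(images)             # pooled ViT features
        z = self.project(feats)                       # low-dimensional representation
        # During training, the distribution optimization and the class-invariant
        # feature augmentation described above would operate on `z` here.
        return self.classifier(z), z

With this layout only the projection and classifier are trained, so the backbone's general visual knowledge is preserved while the lightweight head specializes to deepfake detection.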

Experimental Results

Quantitative results table

Comparison of intra-domain performance on the FF++, Celeb-DF, DFDC, and DFD datasets. GenDF consistently surpasses all existing competitors, whether they are custom-made, pre-trained, or fine-tuned methods.

Quantitative results table

Cross-domain evaluations by training on FF++(HQ) and evaluating on other datasets. The results demonstrate that GenDF achieves state-of-the-art generalization performance with the fewest trainable parameters among all existing deepfake detection methods, validating that our framework is remarkably simple yet highly effective.

Qualitative results

Grad-CAM activation maps of the real and fake representations learned by the ViT backbone and our GenDF on the FF++(HQ) dataset. GenDF precisely captures the discriminative patterns of both authentic facial images and forged ones generated by various manipulation techniques, outperforming the vanilla ViT model.

Qualitative results

t-SNE visualization of the feature distributions of real and forged facial images obtained by the baseline model (ViT) and the proposed method (GenDF) under the intra-testing and cross-testing settings. (a) ViT on the intra-testing dataset (FF++). (b) Ours+ViT on the intra-testing dataset (FF++). (c) Ours+ViT on the cross-testing dataset (DFD). (d) Ours+ViT without the class-invariant feature augmentation (CIFAug) function on the cross-testing dataset (DFD).

BibTeX

@article{Yuan2025Patch,
  title={Patch-Discontinuity Mining for Generalized Deepfake Detection},
  author={Huanhuan Yuan and Yang Ping and Zhengqin Xu and Junyi Cao and Shuai Jia and Chao Ma},
  journal={IEEE Transactions on Multimedia},
  year={2025},
  url={https://arxiv.org/abs/2512.22027}
}