GenWorld: Towards Detecting AI-generated Real-world Simulation Videos

Tsinghua University
arXiv
* Project Leader. † Corresponding author.

We propose GenWorld, a high-quality dataset of AI-generated videos that simulate real-world scenes, and SpannDetector, an efficient detection model that leverages multi-view consistency.


Abstract

The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection.

Figure 1. Most existing AI-generated video datasets include cartoon videos even among their “real” data and lack a clear definition of authenticity. This paper proposes a high-quality dataset that contains only real and generated videos of real-world scenarios (e.g., driving, navigation, manipulation). GenWorld features three key characteristics: 1) Real-world Simulation, 2) High Quality, and 3) Cross-prompt Diversity. It can serve as a foundation for AI-generated video detection research with practical significance.


Method

Pipeline and motivation of our SpannDetector. SpannDetector is designed based on an in-depth analysis of multi-view consistency in real and AI-generated videos. It integrates a stereo reconstruction model with a temporal memory module to make consistency detection more efficient. An authenticity scorer evaluates the stereo features, and the final video authenticity is determined by averaging these scores across the entire video.
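
The final averaging step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the authenticity scorer has already produced one score per segment in the range [0, 1], with higher values meaning "more likely real". The function name `video_authenticity` and the 0.5 decision threshold are hypothetical choices made for illustration only.

```python
from statistics import fmean
from typing import Sequence, Tuple


def video_authenticity(
    scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, bool]:
    """Average per-segment authenticity scores into one video-level score.

    Assumption: each score lies in [0, 1] and higher means "more likely real".
    The threshold is a hypothetical default, not a value from the paper.
    """
    if not scores:
        raise ValueError("at least one per-segment score is required")
    video_score = fmean(scores)  # mean over the entire video
    return video_score, video_score >= threshold


# Example: three per-segment scores from a hypothetical scorer.
score, is_real = video_authenticity([0.9, 0.8, 0.7])
print(f"video score = {score:.2f}, judged real = {is_real}")
```

Averaging over all segments makes the video-level decision less sensitive to a single noisy segment than taking the minimum or maximum score would be.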


Results

Comparison with state-of-the-art methods in F1 score (F1) and average precision (AP) on the Train-Test Evaluation.
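
For readers unfamiliar with the two metrics, the following Python sketch shows one common way to compute them. It is not the evaluation code used in the paper. It assumes binary labels in which 1 marks the positive class (here, hypothetically, AI-generated videos), and it uses the non-interpolated form of AP, which averages the precision at the rank of each positive sample and does not treat tied scores specially. The function names are illustrative.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP: mean precision at the rank of each positive."""
    # Rank samples from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0
```

F1 depends on a fixed decision threshold because it uses hard predictions, whereas AP summarizes ranking quality across all thresholds, which is why the two are often reported together.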

BibTeX


@misc{chen2025genworlddetectingaigeneratedrealworld,
  title={GenWorld: Towards Detecting AI-generated Real-world Simulation Videos},
  author={Weiliang Chen and Wenzhao Zheng and Yu Zheng and Lei Chen and Jie Zhou and Jiwen Lu and Yueqi Duan},
  year={2025},
  eprint={2506.10975},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.10975},
}