RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

* Equal contribution · Project Lead
1 Zhejiang University   2 Ant Group  

📰 News & Updates

  • 🚀

    Dec 3, 2025

    The Paper, Project Page, and Dataset are released 🎉

Abstract

Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, such as visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms, text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. To evaluate each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgments. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insights obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

Overview of RULER-Bench


Overview of RULER-Bench. We propose RULER-Bench, a comprehensive benchmark designed to evaluate the rule-based reasoning abilities of video generation models. Left: Grounded in three fundamental domains, we formulate rule-based reasoning ability into six categories: Science, Vision, Hypothesis, Game, Semantics, and Humanity. These categories are further subdivided into 40 tasks. Center: Using the collected samples, we evaluate 10 video models based on the corresponding checklist across four metrics. Each checklist question is scored by GPT-o3 with discrete labels. To validate the reliability of the evaluator, we conduct a human alignment study, in which GPT-o3 achieves 85% agreement with human judgments. Right: Extensive experiments demonstrate that Veo3.1 achieves the best performance. However, all models exhibit limited reasoning ability across different rule categories.
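The checklist protocol above can be sketched in a few lines. Note that the label set, label weights, and function names below are illustrative assumptions for exposition, not RULER-Bench's published implementation:

```python
# Minimal sketch of checklist-based scoring: each checklist question receives a
# discrete label from the evaluator (GPT-o3 in the paper); labels are mapped to
# numeric scores and averaged into a 0-100 metric score. The "yes/partial/no"
# label set and its weights are hypothetical.
from statistics import mean

LABEL_SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}  # assumed labels

def metric_score(labels):
    """Average discrete per-question labels into a 0-100 metric score."""
    return 100.0 * mean(LABEL_SCORES[l] for l in labels)

def agreement(evaluator_labels, human_labels):
    """Fraction of checklist questions where evaluator and human agree,
    as used in a human alignment study."""
    hits = sum(e == h for e, h in zip(evaluator_labels, human_labels))
    return hits / len(human_labels)

# Example: four checklist questions for one generated video
print(metric_score(["yes", "partial", "yes", "no"]))  # → 62.5
```

A per-question agreement rate of this kind is one natural way to arrive at a figure like the reported 85% human alignment.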

Dataset Construction and Validation


Overview of dataset construction and validation. First, we formulate our tasks based on the six rule categories. Second, we design task-specific data construction pipelines for T2V and I2V tasks. Third, we leverage an MLLM to construct checklist questions across four evaluation metrics. Finally, we conduct quality control and data refinement for the constructed dataset and checklists.

Main Results

Evaluation result across different rule categories and metrics

Closed-source models: Veo3.1, Veo2, Sora2, PixelVerse-V5, Wan 2.5, Seedance 1.0-pro. Open-source models: HunyuanVideo, CogVideoX 1.5-5B, Wan 2.2 A14B, Wan 2.1 14B. Metrics: IF (instruction following), VC (visual consistency), VF (visual fidelity), RC (rule coherence).

| Rule Category | Metric | Veo3.1 | Veo2 | Sora2 | PixelVerse-V5 | Wan 2.5 | Seedance 1.0-pro | HunyuanVideo | CogVideoX 1.5-5B | Wan 2.2 A14B | Wan 2.1 14B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Science | IF | 65.05 | 42.17 | 66.00 | 57.13 | 57.38 | 58.86 | 24.92 | 27.97 | 37.15 | 35.80 |
| | VC | 83.18 | 73.30 | 88.01 | 80.76 | 80.48 | 80.99 | 48.46 | 48.84 | 68.52 | 65.74 |
| | VF | 91.37 | 82.33 | 89.49 | 89.74 | 85.35 | 87.69 | 71.29 | 70.96 | 80.37 | 81.93 |
| | RC | 50.97 | 22.16 | 47.09 | 41.41 | 33.64 | 31.96 | 12.64 | 13.70 | 17.16 | 15.90 |
| | Avg | 72.64 | 54.99 | 72.65 | 67.26 | 64.21 | 64.87 | 39.33 | 40.37 | 50.80 | 49.84 |
| Game | IF | 39.75 | 24.25 | 39.19 | 30.10 | 26.59 | 24.26 | 14.75 | 22.75 | 16.29 | 19.30 |
| | VC | 51.45 | 36.33 | 72.33 | 67.09 | 72.71 | 68.79 | 40.07 | 55.29 | 64.56 | 37.52 |
| | VF | 77.95 | 59.15 | 88.18 | 80.59 | 86.28 | 88.39 | 59.13 | 69.45 | 80.13 | 72.20 |
| | RC | 17.70 | 8.17 | 19.97 | 13.06 | 15.45 | 15.61 | 6.98 | 7.56 | 14.12 | 10.48 |
| | Avg | 46.71 | 31.98 | 54.92 | 47.71 | 50.26 | 49.26 | 30.23 | 38.76 | 43.77 | 34.88 |
| Semantics | IF | 71.83 | 56.44 | 68.12 | 65.08 | 59.91 | 61.28 | 38.51 | 46.06 | 48.77 | 46.27 |
| | VC | 92.65 | 91.18 | 90.85 | 91.18 | 87.33 | 87.67 | 80.39 | 75.82 | 82.35 | 80.72 |
| | VF | 91.62 | 82.50 | 83.43 | 89.02 | 82.19 | 84.55 | 79.17 | 70.69 | 82.70 | 83.09 |
| | RC | 67.57 | 44.13 | 53.69 | 56.80 | 49.95 | 49.42 | 32.01 | 37.34 | 37.73 | 38.40 |
| | Avg | 80.92 | 68.56 | 74.02 | 75.52 | 69.84 | 70.73 | 57.52 | 57.48 | 62.89 | 62.12 |
| Hypothesis | IF | 86.97 | 58.55 | 72.44 | 80.13 | 71.93 | 64.32 | 44.44 | 41.45 | 61.11 | 61.75 |
| | VC | 85.90 | 64.32 | 77.35 | 81.62 | 66.45 | 67.74 | 51.92 | 50.43 | 64.74 | 55.56 |
| | VF | 92.20 | 81.54 | 82.50 | 85.73 | 76.86 | 79.66 | 73.89 | 63.80 | 77.03 | 75.17 |
| | RC | 46.79 | 12.50 | 41.35 | 46.69 | 18.31 | 28.31 | 9.62 | 11.00 | 12.93 | 17.84 |
| | Avg | 77.96 | 54.23 | 68.41 | 73.54 | 58.39 | 60.01 | 44.97 | 41.67 | 53.95 | 52.58 |
| Humanity | IF | 79.90 | 53.46 | 80.04 | 72.87 | 63.28 | 68.93 | 46.56 | 42.32 | 49.76 | 52.28 |
| | VC | 87.37 | 73.10 | 88.06 | 84.25 | 79.83 | 83.13 | 70.60 | 54.23 | 72.34 | 70.47 |
| | VF | 94.49 | 84.38 | 88.08 | 89.65 | 83.90 | 88.52 | 80.94 | 67.76 | 83.15 | 82.32 |
| | RC | 61.23 | 35.23 | 56.78 | 50.63 | 33.41 | 38.75 | 27.78 | 20.60 | 30.21 | 29.21 |
| | Avg | 80.75 | 61.54 | 78.24 | 74.35 | 65.10 | 69.83 | 56.47 | 46.23 | 58.86 | 58.57 |
| Vision | VC | 59.53 | 46.19 | 57.77 | 56.14 | 70.04 | 61.86 | 43.49 | 24.79 | 59.03 | 51.26 |
| | VF | 72.67 | 57.63 | 57.77 | 71.61 | 68.32 | 76.06 | 52.94 | 29.41 | 65.55 | 49.58 |
| | RC | 48.94 | 30.58 | 28.50 | 40.47 | 42.24 | 41.74 | 18.91 | 14.78 | 29.34 | 23.25 |
| | Avg | 60.38 | 44.80 | 48.02 | 56.07 | 60.20 | 59.89 | 38.45 | 22.99 | 51.31 | 41.36 |
| Average | IF | 68.70 | 46.97 | 65.16 | 61.06 | 55.82 | 55.53 | 33.84 | 36.11 | 42.62 | 43.08 |
| | VC | 76.68 | 64.07 | 79.06 | 76.84 | 76.14 | 75.03 | 55.82 | 51.57 | 68.59 | 60.21 |
| | VF | 86.72 | 74.59 | 81.58 | 84.39 | 80.48 | 84.14 | 69.56 | 62.01 | 77.69 | 75.92 |
| | RC | 48.87 | 25.46 | 41.23 | 41.51 | 32.17 | 34.30 | 17.99 | 17.50 | 23.58 | 22.51 |
| | Avg | 70.24 | 52.77 | 66.76 | 65.95 | 61.15 | 62.25 | 44.30 | 41.80 | 53.24 | 49.96 |
| Win Rate | | 0.397 | 0.186 | 0.340 | 0.300 | 0.257 | 0.267 | 0.151 | 0.151 | 0.193 | 0.162 |

Average performance of video generation models across different tasks on RULER-Bench. Video models generally perform best on tasks in the Humanity and Hypothesis categories, while performing worse on the Vision and Game categories.
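The derived Avg column in the table can be reproduced directly from the per-metric scores. The snippet below is a sketch (function and variable names are our own); the numbers are read off the table above:

```python
# Reproduce the table's Avg column: the unweighted mean of the metrics
# reported for that category (IF, VC, VF, RC; the Vision category has no IF).

def category_avg(scores):
    """Per-category Avg as the unweighted mean of the available metrics."""
    return round(sum(scores.values()) / len(scores), 2)

# Veo3.1 on the Science category
veo31_science = {"IF": 65.05, "VC": 83.18, "VF": 91.37, "RC": 50.97}
print(category_avg(veo31_science))  # → 72.64, matching the Avg row

# Veo3.1 on the Vision category, which reports only three metrics
veo31_vision = {"VC": 59.53, "VF": 72.67, "RC": 48.94}
print(category_avg(veo31_vision))  # → 60.38, matching the Avg row
```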

Generated Videos

Models shown: Veo3.1, Veo2, Sora2, PixelVerse-V5, Wan2.5, Seedance1.0-pro.
  • Please correct the bicycle's wheel shape anomaly so that the vehicle can function properly.

  • First add 0.5 mL of 0.1 M dilute hydrochloric acid and observe the litmus color change; then add 1 mL of 0.1 M dilute sodium hydroxide, recording the color change and estimating the final pH.

  • Increase the number of pink paperclips in the input image so that there are exactly five partially overlapping paperclips, arranged in a visually natural stacked formation where each clip slightly overlaps the next, maintaining consistent angle, scale, and lighting as in the original pair.

  • A student has just learned that their exam results are excellent, exceeding expectations.

  • The process of a tadpole transforming into a frog.

  • Fill a U-shaped transparent tube with colored water so that both arms have the same initial height. Add extra colored water to the left arm and observe the water levels in both arms.

  • In an office, thick smoke and small flames are rising from a pile of paper near a power socket. Several employees notice the fire spreading quickly across nearby papers and office supplies. They look around the room, searching for a way to extinguish it.

  • Observe the lava flow over several years under continuous volcanic activity, noting changes along its path.

BibTeX

If you find our work useful, please consider citing our paper:


      @article{he2025ruler,
        title={RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence},
        author={He, Xuming and Fan, Zehao and Li, Hengjia and Zhuo, Fan and Xu, Hankun and Cheng, Senlin and Weng, Di and Liu, Haifeng and Ye, Can and Wu, Boxi},
        journal={arXiv preprint arXiv:2512.02622},
        year={2025}
      }
    

Website source code based on the ArtiMuse project page.