
Dr.DocBench
DR.DOCBENCH is a difficulty-aware benchmark for expert-level document parsing.
We share the details of how our datasets are constructed and explain how our work helps your model perform on critical evaluations. Explore the articles to go beyond the surface.


KINA is a high-density knowledge benchmark encompassing 261 fine-grained disciplines and the first to incorporate disciplinary representativeness as a core metric. It features a reusable, game-theoretic data collection pipeline that mitigates annotation vulnerabilities.

SuperGPQA is a large-scale and highly challenging benchmark created to evaluate the advanced reasoning capabilities of Large Language Models (LLMs). Its purpose is to test model performance on expert-level, graduate qualification questions across an unprecedented 285 academic and professional disciplines.

Discover VeriWeb, a pioneering benchmark for long-horizon web agents. It offers a reproducible environment and 302 real-world tasks with subtask-level verification, advancing research in complex information-seeking.

OmniDocBench is a comprehensive benchmark for evaluating AI in document parsing and content extraction.

Discover PIN, a new data format and two large-scale datasets (PIN-200M & PIN-14M) designed to help LMMs understand complex, knowledge-intensive multimodal documents.

EDITREWARD is trained with our new large-scale human preference dataset of over 200K preference pairs, meticulously annotated by trained experts following a rigorous protocol.

VideoScore2 is a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales.

Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench) is designed to evaluate models' intrinsic reasoning and planning abilities by minimizing interference from pretrained knowledge.

MMAR (Massive Multi-disciplinary Audio Reasoning) is a new and challenging benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs).

OmniHD-Scenes is a massive multimodal autonomous driving dataset featuring 450K+ synchronized frames of 128-beam LiDAR, 6-view camera, and 4D imaging radar data. It includes high-quality 3D bounding boxes and semantic occupancy for complex urban scenarios, rainy weather, and night scenes. Download the 1.3TB dataset now.

Chain-of-Agents (CoA) is a novel framework for training end-to-end agent foundation models (AFM) using multi-agent distillation and agentic reinforcement learning.

YuE is a family of open foundation models based on the LLaMA2 architecture that generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies.

AFM-Datasets is the official training dataset released with the research paper, "Chain-of-Agents," and is specifically designed for building Agent Foundation Models (AFMs).

TaskCraft is a multi-modal benchmark dataset featuring tasks ranging from simple (1-step) to expert-level (4-step+).

COIG-Writer is a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts.

AutoKaggle is a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system.

Matrix is a massive, open-source pretraining dataset containing approximately 4.7 trillion tokens of bilingual text in English and Chinese.

CriticLeanBench is a specialized benchmark designed to evaluate the critical reasoning of AI models, specifically on the task of validating the translation of natural language mathematics into formal Lean 4 theorem statements.

FormalMATH is a large-scale benchmark designed to evaluate and advance the capabilities of Large Language Models in the challenging domain of formal mathematical reasoning.

COIG-P (Chinese Open Instruction Generalist - Preference) is a high-quality, large-scale Chinese preference dataset designed for aligning Large Language Models (LLMs) with human values.