2077AI - Open-source Innovation Foundation

Datasets

Here, we go beyond the surface. We share the details of how each dataset was constructed and explain how our work helps your model perform on critical evaluations. Explore the articles to see how we are defining the next peak of performance.

VLM

Dr.DocBench

Dr.DocBench is a difficulty-aware benchmark for expert-level document parsing.

Reasoning

KINA

KINA is a high-density knowledge benchmark spanning 261 fine-grained disciplines and the first to incorporate disciplinary representativeness as a core metric. It features a reusable, game-theoretic data collection pipeline that mitigates annotation vulnerabilities.

Reasoning

SuperGPQA

SuperGPQA is a large-scale and highly challenging benchmark created to evaluate the advanced reasoning capabilities of Large Language Models (LLMs). Its purpose is to test model performance on expert-level, graduate qualification questions across an unprecedented 285 academic and professional disciplines.

Agent

VeriWeb

Discover VeriWeb, a pioneering benchmark for long-horizon web agents. It offers a reproducible environment and 302 real-world tasks with subtask-level verification, advancing research in complex information-seeking.

VLM

OmniDocBench

OmniDocBench is a comprehensive benchmark for evaluating AI in document parsing and content extraction.

VLM

PIN Dataset

Discover PIN, a new data format and two large-scale datasets (PIN-200M & PIN-14M) designed to help LMMs understand complex, knowledge-intensive multimodal documents.

VLM

EditReward

EditReward is trained on our new large-scale human preference dataset, which contains over 200K preference pairs meticulously annotated by trained experts following a rigorous protocol.

Video

VideoScore2

VideoScore2 is a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales.

Reasoning

KOR-BENCH

Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench) is designed to evaluate models' intrinsic reasoning and planning abilities by minimizing interference from pretrained knowledge.

Audio

MMAR

MMAR (Massive Multi-disciplinary Audio Reasoning) is a new and challenging benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs).

Multimodal

OmniHD-Scenes

OmniHD-Scenes is a massive multimodal autonomous driving dataset featuring 450K+ synchronized frames of 128-beam LiDAR, 6-view camera, and 4D imaging radar data. It includes high-quality 3D bounding boxes and semantic occupancy for complex urban scenarios, rainy weather, and night scenes. The 1.3TB dataset is available for download.

Agent

Chain-of-Agents

Chain-of-Agents (CoA) is a novel framework for training end-to-end agent foundation models (AFM) using multi-agent distillation and agentic reinforcement learning.

Audio

YuE

YuE is a family of open foundation models based on the LLaMA2 architecture that generate up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies.

Agent

AFM-Datasets

AFM-Datasets is the official training dataset released with the research paper "Chain-of-Agents" and is specifically designed for building Agent Foundation Models (AFMs).

Multimodal

TaskCraft

TaskCraft is a multi-modal benchmark dataset featuring tasks ranging from simple (1-step) to expert-level (4-step+).

Reasoning

COIG-Writer

COIG-Writer is a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts.

Agent

AutoKaggle

AutoKaggle is a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system.

Multimodal

M-A-P Matrix

Matrix is a massive, open-source pretraining dataset containing approximately 4.7 trillion tokens of bilingual text in English and Chinese.

Reasoning

CriticLeanBench

CriticLeanBench is a specialized benchmark designed to evaluate the critical reasoning of AI models, specifically on the task of validating the translation of natural language mathematics into formal Lean 4 theorem statements.

Reasoning

FormalMATH

FormalMATH is a large-scale benchmark designed to evaluate and advance the capabilities of Large Language Models in the challenging domain of formal mathematical reasoning.

Multimodal

COIG-P

COIG-P (Chinese Open Instruction Generalist - Preference) is a high-quality, large-scale Chinese preference dataset designed for aligning Large Language Models (LLMs) with human values.
