EnterpriseOps-Gym
Environments and Evaluations for Stateful
Agentic Planning and Tool Use

A large-scale agentic benchmark for enterprise evaluating LLM agents on 1,150 expert-curated tasks across 8 enterprise domains in fully interactive, containerized environments.

Shiva Krishna Reddy Malay* Shravan Nayak* Jishnu Sethumadhavan Nair Aman Tiwari Sathwik Tejaswi Madhusudhan Sagar Davasam Sridhar Krishna Nemala Srinivas Sunkara Sai Rajeswar

^*Equal Contribution ServiceNow Research Mila – Quebec AI Institute Université de Montréal

Paper GitHub 🤗 Dataset

🛠️

Tools

📋

Tasks

🗄️

DB Tables

🏢

Domains

Abstract

What is EnterpriseOps-Gym?

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments — specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols.

We introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. It features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Agents are evaluated on 1,150 expert-curated tasks across 8 mission-critical verticals including Customer Service, HR, and IT.

Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Providing oracle human plans improves performance by 14–35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves only 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment.

Key Contributions

What Makes EnterpriseOps-Gym Unique

🗄️

Unprecedented Ecosystem Scale

A containerized sandbox with 512 functional tools and 164 relational tables — an order of magnitude larger than prior enterprise benchmarks — providing a true stress test for navigating high-density schemas and massive toolkits.

✅

SQL-Based Outcome Verification

Instead of soft LLM-as-a-judge or action-trace matching, hidden expert-authored SQL scripts verify the final database state — ensuring agents achieve the correct business outcome while maintaining data integrity.

🔒

Policy-Governed Operations

Tasks embed real internal business policies — data access controls, approval workflows, and privacy rules — shifting evaluation from simple tool use to governed autonomous operations typical of regulated industries.

�

Cross-Domain Orchestration

Unique Hybrid scenarios require agents to coordinate across 8 interconnected ecosystems — e.g., resolving a CSM ticket by checking HR records and updating a Teams channel — reflecting real enterprise fragmentation.

�

Infeasibility & Refusal Testing

A dedicated subset of infeasible tasks evaluates an agent's safety reasoning — agents must identify and refuse requests that violate organizational policies or system constraints, preventing harmful actions or data leaks.

�

Long-Horizon Stateful Trajectories

Tasks require up to 34 sequential steps in a stateful ReAct loop where every action permanently affects the shared database, demanding committed, multi-step planning rather than greedy tool calls.

Cite Our Work

Citation

@misc{malay2026enterpriseopsgymenvironmentsevaluationsstateful,
      title={EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings}, 
      author={Shiva Krishna Reddy Malay and Shravan Nayak and Jishnu Sethumadhavan Nair and Sagar Davasam and Aman Tiwari and Sathwik Tejaswi Madhusudhan and Sridhar Krishna Nemala and Srinivas Sunkara and Sai Rajeswar},
      year={2026},
      eprint={2603.13594},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.13594}, 
}

EnterpriseOps-Gym
Environments and Evaluations for Stateful
Agentic Planning and Tool Use

What is EnterpriseOps-Gym?

What Makes EnterpriseOps-Gym Unique

Unprecedented Ecosystem Scale

SQL-Based Outcome Verification

Policy-Governed Operations

Cross-Domain Orchestration

Infeasibility & Refusal Testing

Long-Horizon Stateful Trajectories

Architecture & Task Flow

Leaderboard

Oracle Mode — Task Success Rate (%)

Citation

EnterpriseOps-Gym Environments and Evaluations for StatefulAgentic Planning and Tool Use

What is EnterpriseOps-Gym?

What Makes EnterpriseOps-Gym Unique

Unprecedented Ecosystem Scale

SQL-Based Outcome Verification

Policy-Governed Operations

Cross-Domain Orchestration

Infeasibility & Refusal Testing

Long-Horizon Stateful Trajectories

Architecture & Task Flow

Leaderboard

Oracle Mode — Task Success Rate (%)

Citation

EnterpriseOps-Gym
Environments and Evaluations for Stateful
Agentic Planning and Tool Use