Intelligent Agent-Based Control Structures for Automatic Failure Repair in Remote Computing Environments with Improved Stability
Abstract
Remote computing environments, including distributed cloud infrastructures and geographically dispersed cyber-physical systems, are increasingly exposed to complex failure modes driven by extreme environmental disturbances, cascading dependencies, and high system heterogeneity. Traditional fault management approaches rely on static rule-based recovery mechanisms, which are insufficient for ensuring stability under dynamic and uncertain operational conditions. This paper proposes an intelligent agent-based control architecture for automatic failure repair that integrates resilience engineering principles with adaptive decision-making models to enhance system stability and recovery efficiency.
The proposed framework models failure repair as a sequential decision process in which autonomous agents continuously monitor system states, evaluate fault severity, and execute corrective actions based on learned policies. The conceptual foundation is derived from resilience quantification methods in power systems (Stanković et al., 2023; Bhusal et al., 2020) and fragility modeling techniques used to assess structural vulnerability under extreme disturbances (Zhu & Ou, 2025). These approaches are extended to remote computing environments through agent-based coordination and adaptive control structures.
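To make the sequential-decision framing concrete, the following is a minimal sketch of a single repair agent trained with tabular Q-learning on a toy three-state node. The state labels, action names, rewards, and transition probabilities are all illustrative assumptions and are not taken from the proposed framework; Q-learning stands in here for whichever policy-learning method the architecture actually uses.

```python
import random
from collections import defaultdict

# Hypothetical state and action labels, chosen only for illustration.
STATES = ["healthy", "degraded", "failed"]
ACTIONS = ["no_op", "restart_service", "failover"]


def step(state, action, rng):
    """Toy environment returning (next_state, reward). All dynamics are assumed."""
    if state == "healthy":
        # Unnecessary interventions on a healthy node carry a cost.
        reward = 1.0 if action == "no_op" else -1.0
        # Faults appear at random with an assumed probability of 0.2.
        next_state = "degraded" if rng.random() < 0.2 else "healthy"
        return next_state, reward
    if state == "degraded":
        if action == "restart_service":
            return "healthy", -1.0   # cheap, sufficient repair
        if action == "failover":
            return "healthy", -3.0   # works, but more expensive
        return "failed", -5.0        # ignoring the fault lets it escalate
    # state == "failed": only a failover restores service in this toy model
    if action == "failover":
        return "healthy", -3.0
    return "failed", -10.0


def train(episodes=2000, horizon=20, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn Q-values with epsilon-greedy exploration and the standard Q-learning update."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state = "healthy"
        for _ in range(horizon):
            # Monitor: observe the state. Decide: explore or exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            # Act: apply the corrective action and collect the feedback signal.
            next_state, reward = step(state, action, rng)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Map each state to the action with the highest learned Q-value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


if __name__ == "__main__":
    print(greedy_policy(train()))
```

Under these assumed rewards, the learned policy tends to leave healthy nodes alone, restart degraded ones, and reserve failover for outright failures. A realistic agent would need a much richer state representation and a coordination layer between agents, neither of which is modeled here.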
The system further incorporates psychological resilience constructs, such as cognitive flexibility and emotion regulation capacity (Dennis & Vander Wal, 2010; Gratz & Roemer, 2004), as conceptual analogues for decision robustness in computational agents. Reinforcement learning-inspired adaptation mechanisms allow the system to refine its recovery strategies over time in response to feedback signals.
A conceptual analysis suggests that the proposed architecture can improve fault isolation accuracy, reduce mean recovery time, and enhance system stability under cascading failure conditions. Integrating graph-based vulnerability assessment with resilience-driven control policies is intended to mitigate system-level disruptions proactively rather than reactively.
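One simple way to picture graph-based vulnerability assessment is to score each component by the share of the system that would go down if it failed. The sketch below assumes a hypothetical service dependency map and a hard-dependency rule (a service fails as soon as any service it depends on fails); both are illustrative choices rather than details of the proposed framework.

```python
from collections import deque

# Hypothetical dependency map: each service depends on the services in its list.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "worker": ["db", "queue"],
    "db": [],
    "cache": [],
    "queue": [],
}


def invert(depends_on):
    """Build the reverse map: for each service, the services that depend on it."""
    dependents = {node: [] for node in depends_on}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(node)
    return dependents


def cascade(failed, dependents):
    """Return every service that goes down when `failed` fails (breadth-first search)."""
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents[current]:
            if nxt not in down:
                down.add(nxt)
                queue.append(nxt)
    return down


def vulnerability_ranking(depends_on):
    """Rank services by the fraction of the system lost if each one fails."""
    dependents = invert(depends_on)
    total = len(depends_on)
    scores = {n: len(cascade(n, dependents)) / total for n in depends_on}
    # Highest impact first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for name, score in vulnerability_ranking(DEPENDS_ON):
        print(f"{name}: {score:.2f}")
```

In this toy graph the database ranks highest because its failure propagates to four of the six services. A proactive controller could use such a ranking to decide where to place redundancy or which components to monitor most closely.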
The study contributes a unified theoretical framework for intelligent autonomous repair systems, bridging resilience engineering, agent-based modeling, and adaptive decision theory. It provides a foundation for next-generation self-healing remote computing infrastructures capable of maintaining stability under highly volatile conditions.