Journal: IEEE Transactions on Parallel and Distributed Systems

Learning Spatiotemporal Failure Dependencies for Resilient Edge Computing Services

Abstract

Edge computing services are exposed to infrastructural failures due to geographical dispersion, ad hoc deployment, and rudimentary support systems. Two unique characteristics of the edge computing paradigm necessitate a novel failure resilience approach. First, edge servers, unlike their cloud counterparts with reliable data center networks, are typically connected via ad hoc networks. Thus, link failures need more attention to ensure truly resilient services. Second, network delay is a critical factor for the deployment of edge computing services. This restricts replication decisions to geographical proximity and necessitates joint consideration of delay and resilience. In this article, we propose a novel machine-learning-based mechanism that evaluates the failure resilience of a service deployed redundantly on the edge infrastructure. Our approach learns the spatiotemporal dependencies between edge server failures and combines them with the topological information to incorporate link failures. Ultimately, we infer the probability that a certain set of servers fails or disconnects concurrently during service runtime. Furthermore, we introduce Dependency- and Topology-aware Failure Resilience (DTFR), a two-stage scheduler that minimizes either failure probability or redundancy cost, while maintaining low network delay. Extensive evaluation with various real-world failure traces and workload configurations demonstrates superior performance in terms of availability, number of failures, network delay, and cost with respect to the state-of-the-art schedulers.
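
To make the quantity named in the abstract concrete, the minimal Python sketch below computes the probability that every replica of a service is either failed or disconnected at the same time. It is not the article's learned model: it assumes that server and link failures are independent, which is precisely the simplification that learning spatiotemporal dependencies is meant to avoid. The function names, the `gateway` entry node, and all probability values are hypothetical and chosen only for illustration.

```python
from itertools import product


def reachable(source, up_nodes, up_links):
    """Return the working nodes reachable from `source` over working links."""
    if source not in up_nodes:
        return set()
    seen, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for a, b in up_links:
            # Links are undirected, so check both orientations.
            for x, y in ((a, b), (b, a)):
                if x == u and y in up_nodes and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen


def outage_probability(node_fail, link_fail, source, replicas):
    """Exact probability that no replica is both up and reachable from `source`.

    ASSUMPTION: all failures are independent. Every up/down state of the
    nodes and links is enumerated, so this only suits very small topologies.
    """
    nodes, links = list(node_fail), list(link_fail)
    total = 0.0
    for node_state in product([True, False], repeat=len(nodes)):
        for link_state in product([True, False], repeat=len(links)):
            p = 1.0
            for n, up in zip(nodes, node_state):
                p *= (1 - node_fail[n]) if up else node_fail[n]
            for l, up in zip(links, link_state):
                p *= (1 - link_fail[l]) if up else link_fail[l]
            up_nodes = {n for n, up in zip(nodes, node_state) if up}
            up_links = [l for l, up in zip(links, link_state) if up]
            if not (reachable(source, up_nodes, up_links) & set(replicas)):
                total += p
    return total


# Hypothetical star topology: a gateway that never fails, two edge servers.
node_fail = {"gateway": 0.0, "a": 0.1, "b": 0.1}
link_fail = {("gateway", "a"): 0.2, ("gateway", "b"): 0.2}

print(outage_probability(node_fail, link_fail, "gateway", ["a"]))
print(outage_probability(node_fail, link_fail, "gateway", ["a", "b"]))
```

Under these assumed numbers, a single replica is unavailable whenever its server or its only link is down, and adding a second replica on a separate link lowers the outage probability. Placing the second replica behind the first on a chain instead would make both depend on the same link, which is the kind of topological effect the abstract refers to when it combines server failures with link failures.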