Streaming high-quality video over dynamic radio networks is challenging. Dynamic adaptive streaming over HTTP (DASH) is a standard for delivering video in segments and adapting video quality to a changing and limited network bandwidth. We present a machine learning-based predictive pre-fetching and caching approach for DASH video streaming, implemented at the multi-access edge computing server. We use ensemble methods for machine learning (ML) based segment request prediction and an integer linear programming (ILP) technique for pre-fetching decisions. Our approach reduces video segment access delay with a cache-hit ratio of 60% and alleviates transport network load by reducing backhaul link utilization by 69%. We validate the ML model and the pre-fetching algorithm, and present the trade-offs involved in pre-fetching and caching for resource-constrained scenarios.
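As a rough illustration of how an ILP-based pre-fetching decision could be set up (the objective, constraints, and all names below are illustrative assumptions, not the formulation used in the paper), one can maximise the expected number of cache hits over the segments an ML model predicts will be requested, subject to the edge cache capacity:

```python
# Illustrative sketch only: select DASH segments to pre-fetch into the edge
# cache by solving a small ILP with PuLP. The request probabilities stand in
# for the output of the ML-based segment request predictor; a backhaul-budget
# constraint could be added in the same way as the capacity constraint.
import pulp

def plan_prefetch(request_prob, segment_size, cache_capacity):
    segments = list(request_prob)
    prob = pulp.LpProblem("prefetch_plan", pulp.LpMaximize)
    x = {s: pulp.LpVariable(f"x_{s}", cat="Binary") for s in segments}

    # Objective: maximise the expected number of cache hits.
    prob += pulp.lpSum(request_prob[s] * x[s] for s in segments)
    # Cache capacity constraint (same unit as segment sizes, e.g. MB).
    prob += pulp.lpSum(segment_size[s] * x[s] for s in segments) <= cache_capacity

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [s for s in segments if x[s].value() == 1]

# Three candidate segments with hypothetical predicted request probabilities.
print(plan_prefetch({"seg1": 0.9, "seg2": 0.4, "seg3": 0.7},
                    {"seg1": 2.0, "seg2": 2.0, "seg3": 2.0},
                    cache_capacity=4.0))
```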
We describe the implementation and deployment of a software decision support tool for the maintenance planning of gas turbines. The tool is used to plan the maintenance for turbines manufactured and maintained by Siemens Industrial Turbomachinery AB (SIT AB), with the goal of reducing the direct maintenance costs and the often very costly production losses during maintenance downtime. The optimization problem is formally defined, and we argue that its feasibility version is NP-complete. We outline a heuristic algorithm that can quickly solve the problem for practical purposes, and validate the approach on a real-world scenario based on an oil production facility. We also compare the performance of our algorithm with results from using mixed integer linear programming, and discuss the deployment of the application. The experimental results indicate that downtime reductions of up to 65% can be achieved, compared to traditional preventive maintenance. In addition, using our tool is expected to improve availability by up to 1% and reduce the number of planned maintenance days by 12%. Compared to a mixed integer programming approach, our algorithm is not optimal, but it is orders of magnitude faster and produces results that are useful in practice. Our test results and SIT AB's estimates based on operational use both indicate that significant savings can be achieved by using our software tool, compared to maintenance plans with fixed intervals.
Preventive maintenance schedules occurring in industry are often suboptimal with regard to maintenance co-allocation, loss-of-production costs and availability. We describe the implementation and deployment of a software decision support tool for the maintenance planning of gas turbines, with the goal of reducing the direct maintenance costs and the often costly production losses during maintenance downtime. The optimization problem is formally defined, and we argue that the feasibility version is NP-complete. We outline a heuristic algorithm that can quickly solve the problem for practical purposes and validate the approach on a real-world scenario based on an oil production facility. We also compare the performance of our algorithm with results from using integer programming, and discuss the deployment of the application. The experimental results indicate that downtime reductions of up to 65% can be achieved, compared to traditional preventive maintenance. In addition, the use of our tool is expected to improve availability by up to 1% and reduce the number of planned maintenance days by 12%. Compared to an integer programming approach, our algorithm is not optimal, but it is much faster and produces results that are useful in practice. Our test results and Siemens Industrial Turbomachinery AB's (SIT AB) estimates based on operational use both indicate that significant savings can be achieved by using our software tool, compared to maintenance plans with fixed intervals.
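To make the co-allocation idea concrete, the toy sketch below greedily pulls maintenance activities with nearby deadlines into shared stops so that one downtime window covers several activities. It only illustrates the general principle under simplifying assumptions (parallel execution within a stop, a fixed pull-ahead horizon) and is not SIT AB's actual heuristic.

```python
# Minimal greedy sketch of maintenance co-allocation, assuming each activity
# has a latest allowed start time ("deadline") and a downtime duration, and
# that activities placed in the same stop can be performed in parallel.
from dataclasses import dataclass

@dataclass
class Activity:
    name: str
    deadline: float   # latest allowed start (e.g. operating hours)
    duration: float   # downtime needed (hours)

def plan_stops(activities, pull_ahead=500.0):
    """Open a stop at the earliest remaining deadline and pull in every
    activity whose deadline lies within `pull_ahead` hours, so one downtime
    window covers them all."""
    pending = sorted(activities, key=lambda a: a.deadline)
    stops = []
    while pending:
        stop_time = pending[0].deadline
        batch = [a for a in pending if a.deadline <= stop_time + pull_ahead]
        pending = [a for a in pending if a not in batch]
        downtime = max(a.duration for a in batch)  # parallel execution assumed
        stops.append((stop_time, downtime, [a.name for a in batch]))
    return stops

acts = [Activity("A", 4000, 24), Activity("B", 4200, 48), Activity("C", 8000, 24)]
for time, downtime, names in plan_stops(acts):
    print(f"stop at {time} h, downtime {downtime} h, activities {names}")
```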
In many diagnosis situations it is desirable to perform a classification in an iterative and interactive manner. Not all relevant information may be available initially; some must be acquired manually or at a cost. The matter is often complicated by very limited amounts of knowledge and examples when a new system to be diagnosed is initially brought into use. Here, we describe how to create an incremental classification system based on a statistical model that is trained from empirical data, and show how the limited available background information can still be used initially for a functioning diagnosis system.
We describe a novel incremental diagnostic system based on a statistical model that is trained from empirical data. The system guides the user by calculating what additional information would be most helpful for the diagnosis. We show that our diagnostic system can produce satisfactory classification rates using only small amounts of available background information, such that the need for collecting vast quantities of initial training data is reduced. Further, we show that incorporating inconsistency-checking mechanisms in our diagnostic system reduces the number of incorrect diagnoses caused by erroneous input.
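The guiding principle, choosing the next observation that is expected to be most informative, can be sketched as follows with a naive Bayes style model and expected entropy reduction; the probability tables, class names, and observation names are invented for illustration and do not come from the paper.

```python
# Toy sketch of the guiding idea: rank the not-yet-acquired observations by
# how much they are expected to reduce the entropy of the fault-class
# posterior. All numbers below are made up.
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def posterior(prior, likelihood, evidence):
    """P(class | evidence) for binary observations under naive Bayes."""
    post = {}
    for c, pc in prior.items():
        l = pc
        for obs, val in evidence.items():
            p = likelihood[c][obs]
            l *= p if val else (1.0 - p)
        post[c] = l
    z = sum(post.values())
    return {c: v / z for c, v in post.items()}

def expected_gain(prior, likelihood, evidence, obs):
    """Expected entropy reduction from acquiring observation `obs` next."""
    cur = posterior(prior, likelihood, evidence)
    h0 = entropy(cur)
    gain = 0.0
    for val in (True, False):
        p_val = sum(cur[c] * (likelihood[c][obs] if val else 1 - likelihood[c][obs])
                    for c in cur)
        h = entropy(posterior(prior, likelihood, {**evidence, obs: val}))
        gain += p_val * (h0 - h)
    return gain

prior = {"bearing": 0.5, "sensor": 0.5}
likelihood = {"bearing": {"vibration": 0.9, "noise": 0.6},
              "sensor": {"vibration": 0.2, "noise": 0.5}}
best = max(["vibration", "noise"],
           key=lambda o: expected_gain(prior, likelihood, {}, o))
print("ask for:", best)   # the most informative next observation
```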
Although there is consensus that software defined networking and network functions virtualization overhaul service provisioning and deployment, the community still lacks a definite answer on how carrier-grade operations praxis needs to evolve. This article presents what lies beyond the first evolutionary steps in network management, identifies the challenges in service verification, observability, and troubleshooting, and explains how to address them using our Service Provider DevOps (SP-DevOps) framework. We compendiously cover the entire process from design goals to tool realization and employ an elastic version of an industry-standard use case to show how on-the-fly verification, software-defined monitoring, and automated troubleshooting of services reduce the cost of fault management actions. We assess SP-DevOps with respect to key attributes of software-defined telecommunication infrastructures both qualitatively and quantitatively, and demonstrate that SP-DevOps paves the way toward carrier-grade operations and management in the network virtualization era.
Technology trends such as Cloud, SDN, and NFV are transforming the telecommunications business, promising higher service flexibility and faster deployment times. They also allow for increased programmability of the infrastructure layers. We propose to split selected monitoring control functionality onto node-local control planes, thereby taking advantage of processing capabilities on programmable nodes. Our software defined monitoring approach provides telecom operators with a way to handle the trade-off between high-granular monitoring information versus network and computation loads at central control and management layers. To illustrate the concept, a link rate monitoring function is implemented using node-local control plane components. Furthermore, we introduce a messaging bus for simple and flexible communication between monitoring function components as well as control and management systems. We investigate scalability gains with a numerical analysis, demonstrating that our approach would generate a thousandfold less monitoring traffic while providing similar information granularity as a naive SNMP implementation or an OpenFlow approach.
Network Service Chaining (NSC) is a service deployment concept that promises increased flexibility and cost efficiency for future carrier networks. NSC has received considerable attention in the standardization and research communities lately. However, NSC is largely undefined in the peer-reviewed literature. In fact, a literature review reveals that the role of NSC enabling technologies is up for discussion, and so are the key research challenges lying ahead. This paper addresses these topics by motivating our research interest towards advanced dynamic NSC and detailing the main aspects to be considered in the context of carrier-grade telecommunication networks. We present design considerations and system requirements alongside use cases that illustrate the advantages of adopting NSC. We detail prominent research challenges during the typical lifecycle of a network service chain in an operational telecommunications network, including service chain description, programming, deployment, and debugging, and summarize our security considerations. We conclude this paper with an outlook on future work in this area.
Deployment of 100 Gigabit Ethernet (GbE) links challenges the packet processing limits of commodity hardware used for Network Functions Virtualization (NFV). Moreover, realizing chained network functions (i.e., service chains) necessitates the use of multiple CPU cores, or even multiple servers, to process packets from such high speed links. Our system Metron jointly exploits the underlying network and commodity servers' resources: (i) to offload part of the packet processing logic to the network, (ii) by using smart tagging to setup and exploit the affinity of traffic classes, and (iii) by using tag-based hardware dispatching to carry out the remaining packet processing at the speed of the servers' cores, with zero inter-core communication. Moreover, Metron transparently integrates, manages, and load balances proprietary "blackboxes" together with Metron service chains. Metron realizes stateful network functions at the speed of 100GbE network cards on a single server, while elastically and rapidly adapting to changing workload volumes. Our experiments demonstrate that Metron service chains can coexist with heterogeneous blackboxes, while still leveraging Metron's accurate dispatching and load balancing. In summary, Metron has (i) 2.75-8× better efficiency, up to (ii) 4.7× lower latency, and (iii) 7.8× higher throughput than OpenBox, a state-of-the-art NFV system.
In this paper we present Metron, a Network Functions Virtualization (NFV) platform that achieves high resource utilization by jointly exploiting the underlying network and commodity servers' resources. This synergy allows Metron to: (i) offload part of the packet processing logic to the network, (ii) use smart tagging to setup and exploit the affinity of traffic classes, and (iii) use tag-based hardware dispatching to carry out the remaining packet processing at the speed of the servers' fastest cache(s), with zero inter-core communication. Metron also introduces a novel resource allocation scheme that minimizes the resource allocation overhead for large-scale NFV deployments. With commodity hardware assistance, Metron deeply inspects traffic at 40 Gbps and realizes stateful network functions at the speed of a 100 GbE network card on a single server. Metron has 2.75-6.5x better efficiency than OpenBox, a state-of-the-art NFV system, while ensuring key requirements such as elasticity, fine-grained load balancing, and flexible traffic steering.
Network service providers are facing challenges in deploying new services, mainly due to the growing complexity of software architecture and development processes. Moreover, the recent architectural innovation of network systems such as Network Function Virtualization (NFV), Software-Defined Networking (SDN), and Cloud computing increases the development and operation complexity yet again. One of the emerging solutions to this problem is a novel software development concept, namely DevOps, that is widely employed by major Internet software companies. Although the goals of DevOps in data centers are well-suited for the demands of agile service creation, additional requirements specific to the virtualized and software-defined network environment must be addressed from the perspective of modern network carriers.
Efficient coordination among network elements and optimal resource utilization in heterogeneous mobile networks (HMNs) is a key factor for the success of future 5G systems. The COHERENT project focuses on developing an innovative programmable control and coordination framework which is aware of the underlying network topology, radio environment and traffic conditions, and can efficiently coordinate available spectrum resources. In this paper, we provide a set of scenarios and use cases that the COHERENT project intends to address.
The exponential growth of mobile data traffic remains an important challenge for mobile network operators. In response, the 5G scene needs to couple fast connectivity and optimized spectrum usage with cloud networking and high processing power, optimally combined in a converged environment. In this paper, we investigate two 5G research projects: SESAME [1] and COHERENT [2]. We consider the proposed 5G architectures and the corresponding key network components, in order to highlight the common aspects towards the 5G architecture design.
We propose a highly scalable statistical method for modelling the monitored traffic rate in a network node and suggest a simple method for detecting increased risk of congestion at different monitoring time scales. The approach is based on parameter estimation of a lognormal distribution using the method of moments. The proposed method is computationally efficient and requires only two counters for updating the parameter estimates between consecutive inspections. Evaluation using a naive congestion detector with a success rate of over 98% indicates that our model can be used to detect episodes of high congestion risk at a 0.3 s time scale using estimates captured at 5-minute intervals.
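A minimal sketch of the two-counter bookkeeping and the method-of-moments fit is given below; the counter names, the risk threshold, and the example rates are our illustrative choices.

```python
# Between inspections, keep only the running sum and sum of squares of the
# observed rate samples, then fit a lognormal by the method of moments and
# read off a tail (congestion) risk.
import math

class LognormalRateModel:
    def __init__(self):
        self.n = 0
        self.sum_x = 0.0      # counter 1: sum of rate samples
        self.sum_x2 = 0.0     # counter 2: sum of squared rate samples

    def update(self, rate):
        self.n += 1
        self.sum_x += rate
        self.sum_x2 += rate * rate

    def fit(self):
        """Method-of-moments estimates of (mu, sigma) for the lognormal."""
        m = self.sum_x / self.n
        var = self.sum_x2 / self.n - m * m
        sigma2 = math.log(1.0 + var / (m * m))
        mu = math.log(m) - sigma2 / 2.0
        return mu, math.sqrt(sigma2)

    def congestion_risk(self, capacity):
        """P(rate > capacity) under the fitted lognormal."""
        mu, sigma = self.fit()
        z = (math.log(capacity) - mu) / sigma
        return 0.5 * math.erfc(z / math.sqrt(2.0))

model = LognormalRateModel()
for r in (40e6, 55e6, 62e6, 48e6, 70e6):   # observed rates in bit/s
    model.update(r)
print(f"risk of exceeding 80 Mbit/s: {model.congestion_risk(80e6):.3f}")
```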
Managing and balancing load in distributed systems remains a challenging problem in resource management, especially in networked systems where scalability concerns favour distributed and dynamic approaches. Distributed methods can also integrate well with centralised control paradigms if they provide high-level usage statistics and control interfaces for supporting and deploying centralised policy decisions. We present a general method to compute target values for an arbitrary metric on the local system state and show that autonomous rebalancing actions based on the target values can be used to reliably and robustly improve the balance for metrics based on probabilistic risk estimates. To balance the trade-off between balancing efficiency and cost, we introduce two methods of deriving rebalancing actuations from the computed targets, depending on parameters that directly affect the trade-off. This enables policy-level control of the distributed mechanism based on collected metric statistics from network elements. Evaluation results based on cellular radio access network simulations indicate that load balancing based on probabilistic overload risk metrics provides more robust balancing solutions with fewer handovers compared to a baseline setting based on average load.
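A rough sketch of one possible actuation loop is shown below, assuming (for illustration only) that the target value for a cell is the neighbourhood average of the metric and that handovers are triggered while the local metric exceeds the target by a configurable margin; the margin plays the role of a trade-off parameter between balancing efficiency and handover cost.

```python
# Hedged sketch, not the paper's actuation methods: compute a target from the
# neighbourhood and hand over users until the local metric is near the target.
def target_value(local_metric, neighbour_metrics):
    values = [local_metric] + list(neighbour_metrics)
    return sum(values) / len(values)

def rebalance(user_loads, neighbour_metrics, margin=0.05):
    """Return the users to hand over; the metric here is simply summed load."""
    handovers = []
    metric = sum(user_loads.values())
    target = target_value(metric, neighbour_metrics)
    while metric > target * (1 + margin) and user_loads:
        user = max(user_loads, key=user_loads.get)   # biggest reduction per handover
        handovers.append(user)
        metric -= user_loads.pop(user)
    return handovers

print(rebalance({"u1": 0.4, "u2": 0.3, "u3": 0.2}, [0.5, 0.4]))
```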
Current trends strongly indicate a transition towards large-scale programmable networks with virtual network functions. In such a setting, deployment of distributed control planes will be vital for guaranteed service availability and performance. Moreover, deployment strategies need to be computed quickly in order to respond flexibly to varying network conditions. We propose an effective optimization approach that automatically decides on the needed number of controllers, their locations, control regions, and traffic routes, producing a plan which fulfills control flow reliability and routability requirements, including bandwidth and delay bounds. The approach is also fast: the algorithms for bandwidth and delay bounds can reduce the running time by factors of roughly 50x and 500x, respectively, compared to state-of-the-art and direct solvers such as CPLEX. Altogether, our results indicate that a deployment plan adhering to predetermined performance requirements over network topologies of various sizes can be produced in seconds and minutes, rather than hours and days. Such fast allocation of resources that guarantees reliable connectivity and service quality is fundamental for elastic and efficient use of network resources.
We present an architecture for cloud networking, the provision of virtual infrastructure in a multi-administrative domain scenario, where data centre and network operators interact through defined interfaces to provide seamless virtual infrastructure. To support this scenario we introduce flash network slices, dynamic elastic network connections that compose to form integrated cross-domain networks. Flash network slices support decomposition of virtual infrastructure into partitions that can be managed independently but seamlessly interconnected across administrative boundaries. The approach supports limited information disclosure about implementation details on behalf of the providers, scalability, heterogeneity, and a migration path from currently deployed infrastructure technologies to future network implementations. The resulting infrastructure services are suited to on-demand deployment of emerging cloud services such as content distribution, social networks and cloud-based IT applications.
Software Defined Networking (SDN) and Network Functions Virtualization facilitate, with their advanced programmability features, the design of automated dynamic service creation platforms. Applying DevOps principles to service design can further reduce service creation times and support continuous operation. Monitoring, troubleshooting, and other DevOps tools can have different roles within virtualised networks, depending on virtualization level, type of instantiation, and user intent. We have implemented and integrated four key DevOps tools that are useful in their own right, but also showcase an integrated scenario, where they form the basis for a more complete and realistic DevOps toolkit. The current set of tools includes a message bus, a rudimentary configuration tool, a probabilistic congestion detector, and a watchpoint mechanism. The demo also presents potential roles and use-cases for the tools.
Recent control plane solutions in a software-defined network (SDN) setting assume physically distributed but logically centralized control instances: a distributed control plane (DCP). As networks become more heterogeneous, with an increasing amount and diversity of network resources, DCP deployment strategies must be both fast and flexible to cope with varying network conditions whilst fulfilling constraints. However, many approaches are too slow for practical applications and often address only bandwidth or delay constraints, while control-plane reliability is overlooked and control-traffic routability is not guaranteed. We demonstrate the capabilities of our optimization framework [1]-[3] for fast deployment of DCPs, guaranteeing routability in line with control service reliability, bandwidth and latency requirements. We show that our approach produces robust deployment plans under changing network conditions. Compared to state-of-the-art solvers, our approach is orders of magnitude faster, enabling deployment of DCPs within minutes and seconds, rather than days and hours.
Development towards 5G has introduced difficult challenges in effectively managing and operating heterogeneous infrastructures under highly varying network conditions. Enabling, for example, unified coordination and management of radio resources across coexisting, multiple radio access technologies (multi-RAT) requires efficient representation using high-level abstractions of the radio network performance and state. Without such abstractions, users and networks cannot harvest the full potential of increased resource density and connectivity options, resulting in failure to meet the ambitions of 5G. We present a generic probabilistic approach for unified estimation of performance variability based on attainable throughput of UDP traffic in multi-RATs, and evaluate the applicability in an interface selection control case (involving WiFi and LTE) based on obtaining probabilistic user performance guarantees. From simulations we observe that both users and operators can significantly benefit from this improved service availability at low network cost. Initial results indicate 1) 116% fewer performance violations and 2) 20% fewer performance violations with a reduction by 35 times in the number of handovers, compared to naive and state-of-the-art baselines, respectively.
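As a hedged sketch of the selection principle, the snippet below estimates, per interface, the probability that the attainable throughput meets the user's requirement and only triggers a handover when the current interface no longer satisfies a target probability; the normal fit, sample values, and thresholds are illustrative assumptions rather than the estimator used in the paper.

```python
# Illustrative interface selection with probabilistic performance guarantees.
import statistics
import math

def prob_meets_requirement(samples, requirement):
    """P(throughput >= requirement), assuming a normal fit to the samples."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    z = (requirement - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def select_interface(current, samples_per_if, requirement, target_prob=0.95):
    probs = {name: prob_meets_requirement(s, requirement)
             for name, s in samples_per_if.items()}
    # Hysteresis: stay on the current interface while it still meets the
    # target, which keeps the number of handovers down.
    if probs[current] >= target_prob:
        return current
    return max(probs, key=probs.get)

samples = {"wifi": [18.0, 22.0, 25.0, 9.0, 30.0],   # Mbit/s, made-up samples
           "lte":  [14.0, 15.0, 16.0, 15.5, 14.5]}
print(select_interface("wifi", samples, requirement=12.0))
```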
We propose a novel distributed leader election algorithm to deal with the controller and control service availability issues in programmable networks, such as Software Defined Networks (SDN) or programmable Radio Access Network (RAN). Our approach can deal with a wide range of network failures, especially intermittent network partitions, where splitting and merging of a network repeatedly occur.
In contrast to traditional leader election algorithms that mainly focus on the (eventual) consensus on one leader, the proposed algorithm aims at optimizing control service availability and stability, and at reducing the controller state synchronization effort during intermittent network partitioning situations. To this end, we design a new framework that enables dynamic leader election based on real-time estimates acquired from statistical monitoring. With this framework, the proposed leader election algorithm can be flexibly configured to achieve different optimization objectives, while adapting to various failure patterns. Compared with two existing algorithms, our approach can significantly reduce the synchronization overhead (up to 12x) due to controller state updates, and keep up to twice as many nodes under a controller.
For large-scale programmable networks, flexible deployment of distributed control planes is essential for service availability and performance. However, existing approaches only focus on placing controllers, whereas the consequent control traffic is often ignored. In this paper, we propose a black-box optimization framework offering the additional steps for quantifying the effect of the consequent control traffic when deploying a distributed control plane. Evaluating different implementations of the framework over real-world topologies shows that close-to-optimal solutions can be achieved. Moreover, experiments indicate that running a method for controller placement without considering the control traffic causes excessive bandwidth usage (in the worst cases between 20.1% and 50.1% more) and congestion, compared to our approach.
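The effect of accounting for control traffic can be illustrated with the brute-force toy example below (not the proposed framework): each candidate controller placement is scored by the bandwidth its control flows consume on the links, rather than by controller-to-switch distance alone. The topology, demands, and number of controllers are made up.

```python
# Brute-force illustration: score controller placements by induced control
# traffic on the links, using shortest-path routing to the closest controller.
import itertools
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("b", "d")])
control_demand = {n: 1.0 for n in G}      # control traffic per switch (Mbit/s)
k = 2                                      # number of controllers to place

def placement_cost(controllers):
    """Total link bandwidth consumed when every switch sends its control
    traffic to its closest controller over a shortest path."""
    load = 0.0
    for switch in G:
        ctrl = min(controllers,
                   key=lambda c: nx.shortest_path_length(G, switch, c))
        path = nx.shortest_path(G, switch, ctrl)
        load += control_demand[switch] * (len(path) - 1)  # hops traversed
    return load

best = min(itertools.combinations(G.nodes, k), key=placement_cost)
print("controllers:", best, "control-traffic cost:", placement_cost(best))
```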
Technical advances in network communication systems (e.g. radio access networks) combined with evolving concepts based on virtualization (e.g. clouds) require new management algorithms in order to handle the increasing complexity in network behavior and variability in the network environment. Current network management operations are primarily centralized and deterministic, and are carried out via automated scripts and manual interventions, which work for mid-sized and fairly static networks. The next generation of communication networks and systems will be of significantly larger size and complexity, and will require scalable and autonomous management algorithms in order to meet operational requirements on reliability, failure resilience, and resource-efficiency. A promising approach to address these challenges includes the development of probabilistic management algorithms, following three main design goals. The first goal relates to all aspects of scalability, ranging from efficient usage of network resources to computational efficiency. The second goal relates to adaptability in maintaining the models up-to-date for the purpose of accurately reflecting the network state. The third goal relates to reliability in the algorithm performance in the sense of improved performance predictability and simplified algorithm control. This thesis is about probabilistic approaches to fault management that follow the concepts of probabilistic network management (PNM). An overview of existing network management algorithms and methods in relation to PNM is provided. The concepts of PNM and the implications of employing PNM algorithms are presented and discussed. Moreover, some of the practical differences of using a probabilistic fault detection algorithm compared to a deterministic method are investigated. Further, six probabilistic fault management algorithms that implement different aspects of PNM are presented. The algorithms are highly decentralized, adaptive and autonomous, and cover several problem areas, such as probabilistic fault detection and controllable detection performance; distributed and decentralized change detection in modeled link metrics; root-cause analysis in virtual overlays; event-correlation and pattern mining in data logs; and probabilistic failure diagnosis. The probabilistic models (for a large part based on Bayesian parameter estimation) are memory-efficient and can be used and re-used for multiple purposes, such as performance monitoring, detection, and self-adjustment of the algorithm behavior.
We present a distributed adaptive fault-handling algorithm applied in networked systems. The probabilistic approach that we use makes the proposed method capable of adaptively detecting and localizing network faults by the use of simple end-to-end test transactions. Our method operates in a fully distributed manner, such that each network element detects faults using locally extracted information as input. This allows for fast autonomous adaptation to local network conditions in real-time, with a significantly reduced need for manual configuration of algorithm parameters. Initial results from a small synthetically generated network indicate that satisfactory algorithm performance can be achieved, with respect to the number of detected and localized faults, detection time and false alarm rate.
We present the extension of a distributed adaptive fault-detection algorithm applied in networked systems. In previous work, we developed an approach to probabilistic detection of communication faults based on measured probe response delays and packet drops. The algorithm is here extended to detect network latency shifts and adapt to long-term changes of the expected probe response delay. Initial performance tests indicate that detected latency shifts and communication faults can successfully be localised to links and nodes. Further, the amount of network traffic produced by the algorithm scales linearly with the network size.
We present a statistical approach to distributed detection of local latency shifts in networked systems. For this purpose, response delay measurements are performed between neighbouring nodes via probing. The expected probe response delay on each connection is statistically modelled via parameter estimation. Adaptation to drifting delays is accounted for by the use of overlapping models, such that previous models are partially used as input to future models. Based on the symmetric Kullback-Leibler divergence metric, latency shifts can be detected by comparing the estimated parameters of the current and previous models. In order to reduce the number of detection alarms, thresholds for divergence and convergence are used. The method that we propose can be applied to many types of statistical distributions, and requires only constant memory compared to e.g., sliding window techniques and decay functions. Therefore, the method is applicable in various kinds of network equipment with limited capacity, such as sensor networks, mobile ad hoc networks etc. We have investigated the behaviour of the method for different model parameters. Further, we have tested the detection performance in network simulations, for both gradual and abrupt shifts in the probe response delay. The results indicate that over 90% of the shifts can be detected. Undetected shifts are mainly the effects of long convergence processes triggered by previous shifts. The overall performance depends on the characteristics of the shifts and the configuration of the model parameters.
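A minimal sketch of the detection step is shown below for Gaussian delay models (the method itself is distribution-agnostic); the parameter values and the divergence threshold are illustrative.

```python
# Compare a "previous" and a "current" model of the probe response delay and
# raise an alarm when their symmetric Kullback-Leibler divergence exceeds a
# divergence threshold.
import math

def kl_gauss(mu_p, sig_p, mu_q, sig_q):
    """KL(P || Q) for two univariate Gaussians."""
    return (math.log(sig_q / sig_p)
            + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sig_q ** 2) - 0.5)

def symmetric_kl(mu_p, sig_p, mu_q, sig_q):
    return kl_gauss(mu_p, sig_p, mu_q, sig_q) + kl_gauss(mu_q, sig_q, mu_p, sig_p)

def shift_detected(prev, curr, divergence_threshold=1.0):
    """prev/curr are (mu, sigma) estimates of the probe response delay (ms)."""
    return symmetric_kl(*prev, *curr) > divergence_threshold

# Previous model: ~10 ms delays; current model drifted towards ~14 ms.
print(shift_detected(prev=(10.0, 1.5), curr=(14.0, 1.8)))   # True
```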
We investigate the effects of employing a probabilistic fault detection approach relative to the performance of a deterministic network monitoring method. The approach has its foundation in probabilistic network management, in which performance limits and thresholds are specified in terms of e.g. probabilities or belief values. When combined with adaptive mechanisms, probabilistic approaches can potentially offer improved controllability, adaptivity and reliability, compared to deterministic monitoring methods. Results from synthetically generated and real network QoS measurements indicate that the probabilistic approach generally can perform at least as well as a deterministic algorithm, with a higher degree of predictable performance and resource-efficiency. Due to the stochastic nature of the algorithm, worse performance than expected is sometimes observed. Nevertheless, the results give additional support to some of the practical benefits expected in using probabilistic approaches for network management purposes.
We present a statistical probing-approach to distributed fault-detection in networked systems, based on autonomous configuration of algorithm parameters. Statistical modelling is used for detection and localisation of network faults. A detected fault is isolated to a node or link by collaborative fault-localisation. From local measurements obtained through probing between nodes, probe response delay and packet drop are modelled via parameter estimation for each link. Estimated model parameters are used for autonomous configuration of algorithm parameters, related to probe intervals and detection mechanisms. Expected fault-detection performance is formulated as a cost instead of specific parameter values, significantly reducing configuration efforts in a distributed system. The benefit offered by using our algorithm is fault-detection with increased certainty based on local measurements, compared to other methods not taking observed network conditions into account. We investigate the algorithm performance for varying user parameters and failure conditions. The simulation results indicate that more than 95% of the generated faults can be detected with few false alarms. At least 80% of the link faults and 65% of the node faults are correctly localised. The performance can be improved by parameter adjustments and by using alternative paths for communication of algorithm control messages.
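One plausible realisation of the probabilistic detection step alone (localisation and the autonomous probe-interval configuration are omitted) is sketched below: a Beta posterior over each link's probe-drop probability raises an alarm when the probability that the drop rate exceeds a limit crosses a belief threshold. The prior, limit, and belief level are illustrative choices, not the paper's parameters.

```python
# Hedged sketch of probabilistic drop-rate fault detection per link.
from scipy.stats import beta

class LinkDropDetector:
    def __init__(self, drop_limit=0.05, belief=0.95):
        self.alpha, self.beta = 1.0, 1.0    # uniform Beta(1, 1) prior
        self.drop_limit = drop_limit
        self.belief = belief

    def observe(self, dropped):
        if dropped:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def fault_suspected(self):
        """Does P(drop rate > drop_limit | observations) exceed the belief level?"""
        p_exceeds = beta.sf(self.drop_limit, self.alpha, self.beta)
        return p_exceeds > self.belief

det = LinkDropDetector()
for outcome in [False] * 40 + [True] * 8:    # 8 drops in the last 48 probes
    det.observe(outcome)
print(det.fault_suspected())
```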
Scalable and automated monitoring processes for testing, debugging, and operation of VNFs and service-chains are crucial components towards achieving the aims of network softwarization, i.e., cheaper, faster, and shorter service deployment and network management processes. In this paper we present a decentralized monitoring approach aimed at supporting automated deployment and operation of VNFs and service-chains. The approach is inspired by network tomography and is designed to address observability limitations and scalability issues that arise from performing measurements from an SDN controller. From successive end-to-end measurements, link metrics are derived via in-network parameter estimation with no need to forward raw measurements to the controller, which significantly reduces the measurement overhead compared to monitoring individual links explicitly from an SDN controller.
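The tomography idea behind the approach can be illustrated with a tiny example: end-to-end path measurements are linear combinations of per-link metrics (additive delays here), so link values can be estimated by solving a least-squares problem without exporting raw per-link measurements. The 3-link, 3-path instance below is made up.

```python
# Estimate per-link delays from end-to-end path delays via least squares.
import numpy as np

# Routing matrix: rows = monitored end-to-end paths, columns = links.
A = np.array([[1, 1, 0],    # path 1 traverses links 0 and 1
              [0, 1, 1],    # path 2 traverses links 1 and 2
              [1, 0, 1]])   # path 3 traverses links 0 and 2

# Measured end-to-end delays for the three paths (ms).
y = np.array([7.1, 9.0, 6.2])

# Least-squares estimate of the per-link delays.
link_delay, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(link_delay, 2))   # roughly [2.15, 4.95, 4.05]
```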
The development toward programmatic operation and control of 5G networks is a compelling and technically challenging task. A fundamental component is the capability of exposing and controlling the network state across heterogeneous equipment in a unified manner. In this paper, we outline the COHERENT approach to network abstractions enabling unified representation and programming of heterogeneous radio access networks.
5G promises to usher in the Industry 4.0 era. In that era, intricately managed autonomous industrial sites with, for example, remotely controlled equipment and autonomous units promise previously unseen levels of efficiency. Although such scenarios are still elusive, they come with strict, long-established safety requirements. To uphold such requirements, intelligent industrial 5G networks that actively take into account prevailing conditions and dynamics of the workers on the site, the equipment, and the network are needed. Little is known about the dynamics of actual industrial 5G networks and the interplay between network performance and QoE. In this paper, as a step towards intelligent industrial 5G networks, we measure network performance for an industrial 5G network, and conduct QoE experiments with remotely controlled industrial equipment on an operational site. The results revealed unexpected relationships between QoE and network performance that show how important domain-specific knowledge is when researching intelligent industrial 5G networks.
Existing approaches to solving combinatorial optimization problems on graphs suffer from the need to engineer each problem algorithmically, even though practical problems recur in many instances, and the practical side of theoretical computer science, such as computational complexity, then needs to be addressed for each of them. Relevant developments in machine learning research on graphs are surveyed for this purpose. We organize and compare the structures involved in learning to solve combinatorial optimization problems, with a special eye on the telecommunications domain and its continuous development of live and research networks.