ALTO WG                                                         Q. Xiang
Internet-Draft                                    Tongji/Yale University
Intended status: Informational                                 H. Newman
Expires: September 14, 2017           California Institute of Technology
                                                            G. Bernstein
                                                       Grotto Networking
                                                               A. Mughal
                                                               J. Balcas
                                      California Institute of Technology
                                                                J. Zhang
                                                       Tongji University
                                                                   H. Du
                                                                 Y. Yang
                                                  Tongji/Yale University
                                                          March 13, 2017


         Traffic Optimization for ExaScale Science Applications
         draft-xiang-alto-exascale-network-optimization-01.txt

Abstract

   Massive datasets continue to be acquired, simulated, processed and analyzed by globally distributed scientific collaborations, and the volume of this data is growing exponentially. These datasets need to be exchanged through a global network infrastructure. Applications that manage and analyze such massive data volumes can benefit substantially from information about the networking, computing and storage resources at each member site, and more directly from network-resident services that optimize and load balance resource usage among multiple data transfer and analytic requests, and achieve better utilization of multiple resources in clusters.

   The Application-Layer Traffic Optimization (ALTO) protocol can, via extensions, provide network information about different clusters/sites to both users and proactive network management services where applicable, with the goal of improving both application performance and network resource utilization. However, it has been verified in both science networks and commercial data center networks that the network is in many cases not the bottleneck limiting the efficiency of large dataset transfers and data-intensive analytics. To achieve greater overall efficiency of the science programs' workflows, information about different resources, such as computing, storage and networking, should be provided to data-intensive applications simultaneously.

   In this document, we propose that it is feasible to use existing ALTO services to provide not only network information, but also information about other resources in science networks including computing and storage. We introduce an Exascale Science Application Orchestrator (ExaO), which achieves efficient multi-resource allocation to support low-latency dataset transfer and data-intensive analytics in exascale science networks. ExaO provides simple APIs for users to submit and manage dataset transfer and analytic requests and to monitor the status of each request, along with fine-grained local and global network and site state information in real-time. It collects cluster information from multiple ALTO services utilizing topology extensions and leverages emerging SDN control capabilities to orchestrate the resource allocation for dataset transfers and analytic tasks, leading to improved transfer and analytic latency as well as more efficient utilization of multiple resources in clusters/sites.

Xiang, et al.          Expires September 14, 2017               [Page 1]

Internet-Draft        ExaScale Network Optimization           March 2017

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 14, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Changes Since Version -00
   4.  Problem Settings
     4.1.  Motivation
     4.2.  Challenges
   5.  Basic Idea
     5.1.  Using ALTO topology services to provide multi-resource information
     5.2.  Example: encode storage bandwidth into path vector
   6.  Key Issues
   7.  Exascale Dataset Transfer Orchestrator Framework
     7.1.  Architecture
     7.2.  Request parser
       7.2.1.  User API
     7.3.  ALTO Client
       7.3.1.  Query Mode
     7.4.  ALTO Server
     7.5.  Dataset Transfer Agents
     7.6.  Request Execution Agents
     7.7.  Multi-Resource Orchestrator
       7.7.1.  Orchestration Algorithms
       7.7.2.  Online, Dynamic Orchestration
       7.7.3.  Example: A Max-Min Fairness Resource Allocation Algorithm
   8.  Discussion
     8.1.  Deployment
     8.2.  Benefiting From ALTO Extension Topology Services
     8.3.  Constraints of the MFRA Algorithm
   9.  Security Considerations
   10. IANA Considerations
   11. Acknowledgments
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   Scientific innovation continues to exponentially increase the production of valuable research data. Exchange of this information typically involves the worldwide network infrastructure. One leading example is the Large Hadron Collider (LHC) high energy physics (HEP) program, which aims to find new particles and interactions in a previously inaccessible range of energies. The scientific collaborations that have built and operate large HEP experimental facilities at the LHC, such as the Compact Muon Solenoid (CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have more than 300 petabytes of data under management at hundreds of sites around the world, and this volume is expected to grow to one exabyte by approximately 2018. With such an increasing data volume, how to manage the storage and analytics of these data in a globally distributed infrastructure has become an increasingly challenging issue.
   Applications such as the Production ANd Distributed Analysis system (PanDA) in ATLAS and the Physics Experiment Data Export system (PhEDEx) in CMS have been developed to manage the data transfers among different cluster sites. Given a data transfer request, these applications make data transfer decisions based on the availability of dataset replicas at different sites and initiate retransmission from a different replica if the original transmission fails or is excessively delayed. In addition, HTCondor is deployed to achieve coarse-grained data analytics parallelization across these sites. When a data analytic task is submitted, HTCondor adopts a match-making process to assign the task to a certain set of servers in one site, based on a coarse-grained description of resource availability, such as the number of cores, the size of memory, the size of hard disk, etc. However, neither dataset transfers nor data analytic task parallelization takes fine-grained information about cluster resources, such as data locality, memory speed, network delay, network bandwidth, etc., into account, leading to high data transfer and analytic latency and underutilization of cluster resources.

   The Application-Layer Traffic Optimization (ALTO) services defined in [RFC7285] provide network information with the goal of improving network resource utilization while maintaining or improving application performance. Though ALTO is not designed to provide information about other resources, such as computing and storage resources, in cluster networks, in this document we propose that exascale science networks can leverage the existing ALTO services defined in [RFC7285] and ALTO topology extension services, such as network graph [DRAFT-NETGRAPH], path vector [DRAFT-PV], routing state abstraction [DRAFT-RSA], multi-cost [DRAFT-MC] and cost-calendar [DRAFT-CC],
   to encode information about multiple types of resources in science networks, such as memory I/O speed, CPU utilization and network bandwidth, and provide such information to orchestration applications to improve the performance of dataset transfer and data analytic tasks, including throughput and latency.

   This document introduces a centralized resource orchestration service, the Exascale Science Application Orchestrator (ExaO), which provides efficient multi-resource allocation to support low-latency dataset transfer and data-intensive analytics in exascale science networks. ExaO provides a set of simple APIs for authorized users to submit, update and delete dataset transfer requests and data-intensive analytic requests. One important proposal we make in this document is that it is feasible to use ALTO services to provide not only network information, but also information on other resources in science networks including computing and storage.

   An ExaO prototype with the dataset transfer scheduling component has been implemented on a single-domain Caltech SDN development testbed, where the ALTO OpenDaylight controller is used to collect topology information. We are currently designing the resource orchestration components to achieve low-latency data-intensive analytics.

   This document is organized as follows: Section 3 summarizes the changes to this document since version -00. Section 4 elaborates on the motivation and challenges for coordinating storage, computing and network resources in a globally distributed science network infrastructure. Section 5 discusses the basic idea of encoding multi-resource information into ALTO path vector and abstraction services and gives an example. Section 6 lists several key issues to address in order to realize the proposal of providing multi-resource information through ALTO topology services.
   Section 7 gives the details of the ExaO architecture for orchestrating exascale dataset transfer. Section 8 discusses the current development progress of ExaO and next steps.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3.  Changes Since Version -00

   o  Add the basic idea of ExaO, i.e., use ALTO topology service to provide a multi-resource abstraction of clusters/sites in science networks.

   o  Add an example to show the feasibility of encoding storage resources in existing ALTO services, e.g., ALTO path-vector.

   o  Add the data analytic component in the orchestration framework.

4.  Problem Settings

4.1.  Motivation

   Exascale science programs usually involve the participation of countries and sites all over the world. The CMS experiment in the LHC physics program is a typical example. The site located at the LHC laboratory is called a Tier-0 site, which processes the data selected and stored locally by the online systems that select and record the data in real-time as it comes off the particle detector, archives it and transfers it to over 10 Tier-1 sites around the globe. Raw datasets and processed datasets from Tier-1 sites are then transferred to over 160 Tier-2 sites around the world based on users' requests. Different sites have different resources and belong to different administrative domains. With the exponentially increasing data volume in the CMS experiment, the management of large data transfers and data-intensive analytics in such a global multi-domain science network has become an increasingly challenging issue.
   Allocating resources in different clusters to fulfill different users' dataset transfer requests and data analytic requests requires careful orchestration, as different requests compete for limited storage, computation and network resources.

4.2.  Challenges

   Orchestrating exascale dataset transfers and analytics in a globally distributed science network is non-trivial because it must cope with two challenges.

   o  Different sites in this network belong to different administrative domains. Sharing raw site/cluster information would violate sites' privacy constraints. Orchestrating data transfers and analytic requests based on highly abstracted, non-real-time network information may lead to suboptimal scheduling decisions. Hence the orchestrating framework must be able to collect sufficient resource information about different clusters/sites in real-time as well as over the longer term, to allow reasonably optimized network resource utilization without violating sites' privacy requirements.

   o  Different science programs tend to adopt different software infrastructures for managing dataset transfers and analytics, and may impose different requirements. Hence the orchestrating framework must be modular so that it can support different dataset management systems and different orchestrating algorithms. The orchestrating framework must support the interaction between the multi-resource orchestration module, the dataset transfer module, and the data analytic execution module. The key information to be exchanged between modules includes dataset information, the resource state of different clusters and sites, the transfer and analytic requests in progress, as well as trends and network-segment and site performance from the network point of view.
      Such interaction ensures that (1) the various programs can adapt their own data transfer and analytic systems to be multi-resource-aware, and more efficient in achieving their goals; and (2) the various orchestrating algorithms can achieve a reasonably optimized utilization of not only network resources but also computing and storage resources.

5.  Basic Idea

5.1.  Using ALTO topology services to provide multi-resource information

   The ALTO protocol is designed to provide network information to applications so that applications can achieve better performance. Different ALTO topology services, including path vector, routing state abstraction, multi-cost, cost calendar, etc., have been proposed to provide fine-grained network information to applications. In this document, we propose that ALTO can provide not only network information about different cluster sites, but also information about multiple resources, including computing and storage resources. To this end, the basic "one-big-switch" abstraction provided by the base ALTO protocol is not sufficient. Several examples have already been given in [DRAFT-PV] and [DRAFT-RSA] to demonstrate that.

   There has been a similar proposal before about using ALTO to provide resource information of data centers [DRAFT-DC]. However, that proposal requires a new information model for clusters or data centers, which may affect the compatibility of ALTO. The solution proposed here is simpler. Its basic idea is that each computing node and storage node can be seen as a "network element" or an "abstract network element" as defined in ALTO-path-vector [DRAFT-PV]. In this way, ExaO can fully reuse all existing ALTO services by introducing only one cost-mode (pv) and two cost-metrics (ne and ane), instead of introducing a new information model.

5.2.  Example: encode storage bandwidth into path vector

   We use the same dumbbell topology as in [DRAFT-RSA] as an example to show the feasibility of using ALTO topology service to provide multi-resource information. In this topology, we assume the bandwidth of each network cable is 1Gbps, including the cables connecting end hosts to switches. Consider a dataset transfer request which needs to schedule the traffic among a set of end host source-destination pairs, say eh1 -> eh2 and eh3 -> eh4. Assume that the transfer application learns from the ALTO Cost Map service that both eh1 -> eh2 and eh3 -> eh4 have bandwidth 1Gbps. In [DRAFT-RSA], it is shown that whether each of the two traffic flows can receive 1Gbps bandwidth depends on whether the routes of the two flows share a bottleneck link. Path vector and routing state abstraction services provide additional information about network state encoded in abstract network elements. If the returned state is ane1 + ane2 <= 1Gbps, the two flows cannot each get 1Gbps bandwidth at the same time. If the returned state is ane1 <= 1Gbps and ane2 <= 1Gbps, the two flows can each get 1Gbps bandwidth.

                             +------+
                             |      |
                           --+ sw6  +--
                          /  |      |  \
   PID1 +-----+          /   +------+   \          +-----+ PID2
   eh1__|     |_        /                \     ____|     |__eh2
        | sw1 | \   +--+---+         +---+--+ /    | sw2 |
        +-----+  \  |      |         |      |/     +-----+
                  \_| sw5  +---------+ sw7  |
   PID3 +-----+   / |      |         |      |\     +-----+ PID4
   eh3__|     |__/  +------+         +------+ \____|     |__eh4
        | sw3 |                                    | sw4 |
        +-----+                                    +-----+

   Beyond network resources, assume that in this topology eh1 and eh3 are equipped with commodity hard disk drives (HDDs) while eh2 and eh4 are equipped with SSDs. Because the bandwidth of an HDD is typically 0.8Gbps and that of an SSD is typically 3Gbps, even if the returned routing state is ane1 <= 1Gbps and ane2 <= 1Gbps, the actual bottleneck of each traffic flow is the storage I/O bandwidth at the source host.
   As a result, the total bandwidth of both traffic flows can only reach 1.6Gbps. It has been verified in the CMS experiment, and in several studies of commercial data centers, that the network is not always the bottleneck of large dataset transfers and data analytics. Many have reported that storage and computing resources become the bottleneck in a fairly large percentage of dataset transfers and data analytic tasks in science networks and commercial data centers.

   In this example, if we see the end hosts as network elements, the storage I/O bandwidth of each host can also be encoded as an abstract element into the path-vector. Under the storage and route settings above, the returned cluster state would be ane1 <= 0.8Gbps and ane2 <= 0.8Gbps, which provides a more accurate capacity region for the requested traffic flows.

6.  Key Issues

   The previous section describes the basic idea of using ALTO topology services to provide multi-resource information and gives an example to demonstrate its feasibility. Next we list and discuss several key issues to address in this proposal.

   o  Can ALTO topology services provide data locality information? Existing ALTO topology services do not provide such information. Many studies have pointed out that such information plays a vital role in reducing the latency of data-intensive analytics. If ALTO topology services can encode such information together with information about other resources, data-intensive applications can benefit a great deal in terms of information aggregation and communication overhead.

   o  How to quickly map applications' resource allocation decisions on the abstract multi-resource view back to the physical multi-resource view of clusters/sites? Fine-grained resource information can be encoded into abstract network elements to reduce overhead and provide some privacy protection for clusters.
      Such information can be highly compressed (see the dumbbell example used in this document as well as in [DRAFT-PV] and [DRAFT-RSA]). In preliminary evaluations on RSA, the network element compression ratio can be as high as 80 percent. This ratio is expected to be even higher in large-scale data center or cluster settings, e.g., a fat-tree topology with k=48. Therefore a fast mapping from the resource orchestration decisions on the abstract view back to the physical view is needed to satisfy the stringent latency requirement of large dataset transfers and data-intensive analytics.

   o  How much private information, including key resource configurations, raw topology, intra-cluster scheduling policy, etc., will be exposed? Compared with the "one-big-switch" abstraction, other ALTO topology services such as path-vector [DRAFT-PV] and routing state abstraction [DRAFT-RSA] provide fine-grained resource information to applications. Even if such information can be encoded into abstract network elements, it still risks exposing private information of different clusters/sites. The current Internet-Drafts of these services do not provide any formal privacy analysis or performance measurement. This is one key issue this document plans to investigate in the future.

   o  How do current ALTO services such as path-vector and RSA scale when they are used to provide abstract information about multiple resources in clusters? Another issue along this line is how to balance the liveness of fine-grained resource information and the corresponding information delivery overhead.
      Although encoding information about network elements into abstract network elements can achieve a very competitive information compression ratio, a large dataset transfer or analytic application always involves many network elements in multiple clusters/sites, and the absolute number of involved network elements keeps increasing as the scale of clusters increases. In addition, when resource information in a cluster changes, the ALTO services need to inform all related applications. In either case, delivering fine-grained resource information would cause high communication overhead. An analytical or experimental understanding of the scalability of the path-vector and RSA services is still lacking.

7.  Exascale Dataset Transfer Orchestrator Framework

7.1.  Architecture

   This section describes the design details of key components of the ExaO framework: the request parser, the ALTO client, the ALTO servers, the multi-resource orchestrator, the dataset transfer agents and the request execution agents. Among these modules, the request parser provides a set of simple APIs for authorized users to submit, update and cancel dataset transfer requests and data-intensive analytic requests. Depending on the programming model of each request, e.g., Map-Reduce, the parser decomposes it into multiple smaller sub-requests. The ALTO client collects information about multiple resources in different clusters from ALTO servers deployed at different cluster sites. Both the decomposed sub-requests and the collected information about different clusters are sent to the multi-resource orchestrator, which makes dataset transfer scheduling decisions, including replica selection, routing path computation and bandwidth allocation, and request parallelization decisions, such as the cluster in which each sub-request should be placed. These decisions are then sent to data transfer agents and request execution agents, which act on these decisions on behalf of ExaO.
   Figure 1 shows the whole process.

                             .----------.
                             |  Users   |
                             '----------'
                                  | submit data transfer
                                  | and analytic requests
   .- - - - - - - - - - - - - - --|- - - - - - - - - - - - - - - - - .
   |                         .----------.                            |
   | ExaO                    | Request  |                            |
   |                         | Parser   |                            |
   |                         '----------'                            |
   |                              | parse and                        |
   |                              | decompose requests               |
   |                     .----------------.               .--------. |
   |                     | Multi-Resource |---------------| ALTO   | |
   |                     | Allocator      |    collect    | Client | |
   |                     '----------------'   resource    '--------' |
   |                        /     |             state       |query   |
   |      dataset transfer /      | request                 |multi-  |
   |      scheduling      /       | parallelization         |resource|
   |      decisions      /        | decisions               |state   |
   |                    /         |                         |        |
   |            .---------.  .----------.              .---------.   |
   |          .-| Dataset |  | Request  |-.          .-| ALTO    |   |
   |          | | Transfer|  | Execution| |          | | Servers |   |
   |          | | Agents  |  | Agents   | |          | '---------'   |
   |          | '---------'  '----------' |          '----------'    |
   |          '----------'    '-----------'                          |
   |               |                |                     |          |
   |               +---+------------+-------------+-------+          |
   |                   |                          |                  |
   |             .-----------.              .-----------.            |
   |             | Cluster 1 |    . . .     | Cluster N |            |
   |             '-----------'              '-----------'            |
   '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'

   The benefits of ExaO include:

   o  It provides a set of simple APIs for authorized users to submit and manage dataset transfer requests and data-intensive analytic requests, and enables real-time monitoring of request status.

   o  It achieves better resource utilization and low latency for dataset transfer and data-intensive analytics in science networks, by collecting the resource information of different clusters/sites and orchestrating the resource allocation for all submitted requests in a centralized framework.
   o  The architecture of ExaO is modular to support different resource allocation algorithms, data transfer agents and request execution agents. It also supports the deployment of different ALTO servers.

7.2.  Request parser

   The request parser is the front end of ExaO, and is responsible for collecting dataset transfer requests and data-intensive analytic requests from users and passing them to the multi-resource orchestrator for further processing. It provides a set of simple APIs for users to submit and manage requests, and to track the status of requests in real-time.

7.2.1.  User API

   o  submitReq(request, [options])

      This API allows users to submit a request and specify corresponding options. The request can be a data transfer request or a data analytic request. Request options include priority, delay, etc. It returns a request identifier (requestID) that allows users to update or delete this request, or to track its status. The additional options may or may not be approved, and the relative priorities may be modified by the resource orchestrator depending on the role of users (regular users or administrators at different levels), the resource availability and the status of other ongoing requests.

   o  updateReq(requestID, [options])

      This API allows users to update the options of requests. It returns SUCCESS if the new options are received by the request parser. However, these new options may or may not be approved, and may be modified by the resource orchestrator depending on the role of users (regular users or administrators), the resource availability and the status of other ongoing requests.

   o  deleteReq(requestID)

      This API allows users to delete a request by passing the corresponding requestID. A completed request cannot be deleted. An ongoing request will be stopped and the output data will be deleted.
   o  getReqStatus(requestID)

      This API allows users to query the status of a request by specifying the corresponding requestID. The returned status information includes whether the request has started, the assigned priority, the percentage of finished sub-requests, transmission statistics, the expected remaining time to finish, etc.

7.3.  ALTO Client

   The ALTO client is in the back end of ExaO and is responsible for retrieving cluster resource information by querying ALTO servers deployed at different sites. The resource information needed in ExaO includes the topology, link bandwidth, computing node memory I/O speed, computing node CPU utilization, etc. The base ALTO protocol [RFC7285] provides an extreme single-node abstraction for this information, which only allows the multi-resource orchestrator to make coarse-grained resource allocation decisions. To enable fine-grained multi-resource orchestration for dataset transfer and data analytics in cluster networks, ALTO topology extension services such as routing state abstraction (RSA) [DRAFT-RSA], path vector [DRAFT-PV], network graph [DRAFT-NETGRAPH], multi-cost [DRAFT-MC] and cost-calendar [DRAFT-CC] are needed to provide fine-grained information about different types of resources in clusters.

7.3.1.  Query Mode

   The ALTO client should operate in different query modes depending on the implementation of ALTO servers. If an ALTO server does not support incremental updates using server-sent events (SSE) [DRAFT-SSE], the ALTO client sends queries to this server periodically to get the latest resource information. If the cluster state changes after one query, the ALTO client will not be aware of the change until the next query. If an ALTO server supports SSE, the ALTO client only sends one query to the ALTO server to get the initial cluster information.
   When the resource state changes, the ALTO client will be notified by the ALTO server through SSE.

7.4.  ALTO Server

   ALTO servers are deployed at different sites around the world, and at strategic locations in the network itself, to provide information about different types of resources in the cluster networks in response to queries from the ALTO client deployed in ExaO. Such information includes topology, link bandwidth, memory I/O speed and CPU utilization at computing nodes, storage constraints at storage nodes, etc. Each ALTO server must provide basic information services as specified in [RFC7285] such as network map, cost map, endpoint cost service (ECS), etc. To support the fine-grained multi-resource allocation in ExaO, each ALTO server should also provide more fine-grained information about different resources in clusters through ALTO extension services such as the routing state abstraction [DRAFT-RSA], path vector [DRAFT-PV], network graph [DRAFT-NETGRAPH], multi-cost [DRAFT-MC] and cost-calendar [DRAFT-CC] services.

7.5.  Dataset Transfer Agents

   Dataset transfer agents are deployed at each site and in the network as needed, and are responsible for the following functions:

   o  Receive and process instructions from the multi-resource orchestrator, e.g., starting a new transfer, aborting a running transfer and adjusting transfer parameters such as the transfer rate and the number of connections.

   o  Monitor the status of dataset transfers and send the updated status to the multi-resource orchestrator.

   Different systems can adopt different dataset transfer agents, or different structured agent subsystems, depending on specific needs. For instance, in the CMS experiment, these agents are PhEDEx distributed agents.

7.6.  Request Execution Agents

   Request execution agents are deployed at each site and are responsible for the following functions:

   o  Receive and process instructions from the multi-resource orchestrator, e.g., placing and executing data analytic sub-requests and aborting running analytic tasks.

   o  Monitor the status of data analytic tasks and send the updated status to the multi-resource orchestrator.

   Depending on the supporting data analytic frameworks, different request execution agents may be deployed in each cluster. For instance, in the CMS experiment, both MPI and Hadoop execution agents are deployed.

7.7.  Multi-Resource Orchestrator

   The multi-resource orchestrator takes the decomposed dataset transfer and data analytic requests from the request parser and the cluster resource information collected by the ALTO client as input. It then makes (1) dataset transfer scheduling decisions, including dataset replica selection, path selection, and bandwidth allocation, for all data transfer requests, and (2) request parallelization decisions, such as the cluster in which each sub-request should be placed. These decisions are sent to the data transfer agents and the request execution agents at different clusters for execution.

7.7.1.  Orchestration Algorithms

   The modular design of ExaO allows the adoption of different orchestration algorithms and methodologies, depending on the specific performance requirements. In Section 7.7.3, a max-min fairness resource allocation algorithm for dataset transfer is described as an example.

7.7.2.  Online, Dynamic Orchestration

   The multi-resource orchestrator should adjust the resource allocation decisions based on the progress of ongoing requests and the utilization and dynamics of cluster resources.
In normal cases, the multi-resource orchestrator periodically collects such information and executes the orchestration algorithm. When it is notified of events such as request status updates or cluster state updates, the orchestrator also executes the orchestration algorithm to adjust resource allocations.

7.7.3.  Example: A Max-Min Fairness Resource Allocation Algorithm

In this section, we describe a max-min fair resource allocation (MFRA) scheduling algorithm, which aims to minimize the maximal time to complete a dataset transfer subject to a set of constraints. To make resource allocation decisions, MFRA requires sufficient network information, including topology, link bandwidth and, in some cases, recent historical information. In a small-scale single-domain network, an SDN controller can provide the raw, complete topology information for the MFRA algorithm. However, in a large-scale multi-domain science network such as CMS, providing the raw network topology is infeasible because (1) it would incur significant communication overhead; and (2) it would violate the privacy constraints of some sites. Several ALTO topology extension services, including Abstract Path Vector [DRAFT-PV], Network Graphs [DRAFT-NETGRAPH] and RSA [DRAFT-RSA], can provide fine-grained yet aggregated/abstract topology information that lets MFRA utilize bandwidth resources in the network efficiently. Ongoing pre-production deployment efforts of ExaO in the CMS network involve the implementation of the RSA service. Besides topology information, the additional input of the MFRA algorithm is the priority of each class of flows, expressed in terms of upper and lower limits on the allocated bandwidth between the source and the destination for each data transfer request.

The basic idea of the MFRA algorithm is to iteratively maximize the volume of data that can be transferred subject to the constraints. It works in quantized time intervals, scheduling the network paths and data volumes to be transferred in each time slot. When the DTR scheduler is notified of events such as the cancellation of a DTR, the completion of a DTR or network state changes, the MFRA algorithm is also invoked to make updated network path and bandwidth allocation decisions.

In each execution cycle, MFRA first marks all transfers as unsaturated. It then solves a linear programming model to find the common minimum transfer satisfaction rate (i.e., the ratio of the data volume transferred in a time interval to the whole data volume of the request) that can be satisfied by all transfer requests. With this common rate found, MFRA randomly selects an unsaturated request in each iteration and increases its transfer rate as much as possible, by finding residual paths available in the network or by increasing the allocated bandwidth along an existing path, until the rate reaches its upper limit or cannot otherwise be increased further, at which point the request is saturated. At each iteration, newly saturated requests are removed from the subsequent process by fixing their corresponding rate values, and completed transfers are removed from further consideration. After all data transfer rates are saturated in the given time slot, a feasible set of data volumes scheduled to be transferred in the slot across each link in the network can be derived.

The MFRA algorithm yields a full utilization of limited network resources such as bandwidth so that all DTRs can be completed in a timely manner. It allocates network resources fairly so that no DTR suffers starvation. It also achieves load balance among the sites and the network paths crossing a complex network topology so that no site and no network link is oversubscribed.
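
The saturation idea can be illustrated with a minimal Python sketch. It is not the MFRA algorithm itself: instead of solving a linear program and searching for residual paths, it assumes that every request uses one fixed path and applies plain progressive filling, raising all unsaturated requests at the same pace until a link fills up or a request reaches its upper limit. Lower limits, time slots and data volumes are left out. The function name max_min_fair, the link and request names, and the bandwidth figures are hypothetical.

```python
def max_min_fair(capacity, paths, upper):
    """Progressive-filling max-min fair rates on fixed single paths.

    capacity: {link: bandwidth}
    paths:    {request: [links on its fixed path]}
    upper:    {request: finite upper limit on its rate}
    Links missing from `capacity` are treated as unconstrained.
    """
    rate = {r: 0.0 for r in paths}
    residual = {l: float(c) for l, c in capacity.items()}
    # Tolerance scaled to the largest quantity involved.
    tol = 1e-9 * max(list(capacity.values()) + list(upper.values()) + [1.0])
    unsaturated = set(paths)
    while unsaturated:
        # Number of unsaturated requests crossing each link.
        load = {l: sum(1 for r in unsaturated if l in paths[r]) for l in residual}
        # Largest equal increment that violates no link and no upper limit.
        step = min(
            [residual[l] / n for l, n in load.items() if n > 0]
            + [upper[r] - rate[r] for r in unsaturated]
        )
        step = max(step, 0.0)
        for r in unsaturated:
            rate[r] += step
        for l, n in load.items():
            residual[l] -= step * n
        # A request is saturated once it hits its limit or crosses a full link.
        full = {l for l, n in load.items() if n > 0 and residual[l] <= tol}
        unsaturated = {
            r for r in unsaturated
            if upper[r] - rate[r] > tol and not any(l in full for l in paths[r])
        }
    return rate


if __name__ == "__main__":
    capacity = {"A-B": 10, "B-C": 4}                      # hypothetical, in Gbps
    paths = {"t1": ["A-B"], "t2": ["A-B", "B-C"], "t3": ["B-C"]}
    upper = {"t1": 100, "t2": 100, "t3": 100}
    print(max_min_fair(capacity, paths, upper))
```

In this example, link B-C fills first, which fixes t2 and t3 at 2 each; t1 then takes the remaining 8 on link A-B. Lowering the upper limit of t1 to 5 would cap it there and leave capacity unused on A-B.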
Moreover, MFRA can handle the case where particular routing constraints are specified, e.g., where all routes are fixed ahead of time, or where each transfer request uses only a single path in each time slot, by introducing an additional set of linear constraints.

8.  Discussion

8.1.  Deployment

The ExaO framework is the first step towards a new class of intelligent, SDN-driven global systems for data-intensive science programs involving a worldwide ensemble of sites and networks, such as CMS and ATLAS. ExaO relies heavily on the ALTO services for collecting and expressing abstract, up-to-date cluster information, and on the centralized control capability of SDN to orchestrate the flows of dataset transfer requests and data analytic requests. It aims to provide a new operational paradigm in which science programs can use complex network and computing infrastructures with high throughput, while allowing for coexistence with other network traffic.

A prototype case study implementation of ExaO has been demonstrated on the Caltech/StarLight/Michigan/Fermilab SDN development testbed. Because this testbed is a single-domain network, the current ExaO prototype leverages the ALTO OpenDaylight controller to collect topology information. The CMS experiment is currently exploring pre-production deployments of ExaO, looking towards future widespread production use. To achieve this goal, it is imperative to collect sufficient topology information from the various sites in the multi-domain CMS network without causing any privacy leak. To this end, the ALTO RSA service [DRAFT-RSA] is under development. Furthermore, as discussed next, other ALTO topology extension services can also substantially improve the performance of ExaO.

8.2.  Benefiting From ALTO Extension Topology Services

The current ALTO base protocol [RFC7285] exposes network topology using the extreme "my-Internet-view" representation, which abstracts a whole network as a single node with a set of access ports, where each port connects to a set of end hosts called endpoints. Such an extreme abstraction leads to a significant loss of information about the network topology [DRAFT-PV], which is key information for ExaO to make dynamic scheduling and resource allocation decisions. Though ExaO can still allocate resources for data transfer and analytic requests on this abstract view, the resulting resource allocation decisions are suboptimal. Alternatively, feeding the raw, complete network topology of each site to ExaO is not desirable either. First, this would violate the privacy constraints of different sites. Second, a raw network topology would significantly increase the problem space and the solution space of the orchestration algorithm, leading to a long computation time. Hence, ExaO needs an ALTO topology service that provides just enough fine-grained topology information.

Several ALTO topology extension services, including Abstract Path Vector [DRAFT-PV], Network Graphs [DRAFT-NETGRAPH] and RSA [DRAFT-RSA], are potential candidates for providing fine-grained abstract network information to ExaO. In addition, we propose that these services can also be used to provide information about the computing and storage resources of different clusters/sites by viewing each computing node and storage node as a network element or abstract network element. For instance, the path vector service supports the capacity region query, which accepts multiple concurrent data flows as input and returns information about the bottleneck resources, which could be a set of links, computing devices or storage devices, for the given set of concurrent flows.
This information can be interpreted as a set of linear constraints for the multi-resource orchestrator, which can help data transfer and analytic requests better utilize multiple types of resources in different clusters.

8.3.  Constraints of the MFRA Algorithm

The first constraint of the MFRA algorithm is computation overhead. The execution of MFRA involves solving linear programming problems repeatedly in every time slot. The computation time is acceptable for small sets of dataset transfer requests, but may increase significantly for large sets of requests, e.g., hundreds of transfer requests. Current efforts to address this issue include exploring the feasibility of incremental computation of scheduling policies, and reducing the problem scale by finding the minimal equivalent set of constraints of the linear programming model. The latter approach can benefit substantially from the ALTO RSA service [DRAFT-RSA].

The second constraint is that the current version of MFRA does not involve dataset replica selection. Simply expressing replica selection as a set of binary constraints would significantly increase the computational complexity of the scheduling process. Current efforts focus on finding efficient algorithms for dataset replica selection.

9.  Security Considerations

This document does not introduce any privacy or security issue not already present in the ALTO protocol.

10.  IANA Considerations

This document does not define any new media type or introduce any new IANA consideration.

11.  Acknowledgments

The authors thank Kai Gao, Linghe Kong, Xiao Lin, Xin Wang, Y. Richard Yang and Jingxuan Zhang for their discussions.

12.  References

12.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

12.2.  Informative References

[DRAFT-CC]  Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N. Schwan, "ALTO Cost Calendar", 2017.

[DRAFT-DC]  Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO Extensions for Collecting Data Center Resource Information", 2014.

[DRAFT-MC]  Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost ALTO", 2017.

[DRAFT-NETGRAPH]  Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015.

[DRAFT-PV]  Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. Yang, "ALTO Extension: Abstract Path Vector as a Cost Mode", 2015.

[DRAFT-RSA]  Gao, K., Wang, X., Yang, Y., and G. Chen, "ALTO Extension: A Routing State Abstraction Service Using Declarative Equivalence", 2015.

[DRAFT-SSE]  Roome, W. and Y. Yang, "ALTO Incremental Updates Using Server-Sent Events (SSE)", 2015.

[RFC7285]  Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S., Previdi, S., Roome, W., Shalunov, S., and R. Woundy, "Application-Layer Traffic Optimization (ALTO) Protocol", RFC 7285, DOI 10.17487/RFC7285, September 2014.

Authors' Addresses

Qiao Xiang
Tongji/Yale University
51 Prospect Street
New Haven, CT
USA

Email: qiao.xiang@cs.yale.edu

Harvey Newman
California Institute of Technology
1200 California Blvd.
Pasadena, CA
USA

Email: newman@hep.caltech.edu

Greg Bernstein
Grotto Networking
Fremont, CA
USA

Email: gregb@grotto-networking.com

Azher Mughal
California Institute of Technology
1200 California Blvd.
Pasadena, CA
USA

Email: azher@hep.caltech.edu

Justas Balcas
California Institute of Technology
1200 California Blvd.
Pasadena, CA
USA

Email: justas.balcas@cern.ch

Jingxuan Jensen Zhang
Tongji University
4800 Cao'an Hwy
Shanghai 201804
China

Email: jingxuan.n.zhang@gmail.com

Haizhou Du
Tongji/Yale University
51 Prospect Street
New Haven, CT
USA

Email: duhaizhou@gmail.com

Y. Richard Yang
Tongji/Yale University
51 Prospect Street
New Haven, CT
USA

Email: yry@cs.yale.edu