Most high-performance computing resource managers only allow applications to request a static allocation of resources. However, evolving applications have resource requirements which change (evolve) during their execution. Currently, such applications are forced to request an allocation based on their peak resource requirements, which leads to inefficient resource usage. This paper studies whether it makes sense for resource managers to support evolving applications. It focuses on scheduling fully-predictably evolving applications on homogeneous resources, for which it proposes several algorithms and evaluates them using simulations. Results show that resource usage and application response time can be significantly improved with short scheduling times.

This work was supported by the French ANR COOP project, no ANR-09-COSI-001.

This paper presents an initial study of whether it is valuable for resource management systems (RMSs) to support evolving applications. It focuses on fully-predictably evolving applications. While we agree that such an idealized case might be of limited practical use, it is still worth studying for two reasons. First, it paves the way to supporting marginally-predictably evolving applications. If little gain can be made with fully-predictably evolving applications, where the system has complete information, it makes little sense to support marginally-predictable ones. Second, the developed algorithms might be extensible to the marginally- and non-predictable cases: each time an application submits a change to the RMS, the scheduling algorithm for fully-predictable applications could be re-run with updated information.

The contribution of this paper is threefold. First, it presents a novel scheduling problem: dealing with evolving applications. Second, it proposes a solution based on a list scheduling algorithm. Third, it evaluates the algorithm and shows that significant gains can be made. Therefore, we argue that RMSs should be extended to take into account evolving resource requirements.

The remainder of this article is structured as follows. Section 2 presents related work. Section 3 gives a few definitions and notations used throughout the paper and formally introduces the problem. Section 4 proposes algorithms to solve the stated problem, which are evaluated using simulations in Section 5. Finally, Section 6 concludes this paper and opens up perspectives.

Increasing interest has been devoted to dynamically allocating resources to applications, as this has been shown to improve resource utilization [7]. If the RMS can change an allocation during run-time, the job is called malleable. How to write malleable applications [8,9] and how to add RMS support for them [10,11] have been extensively studied.

However, supporting evolving applications is different from malleability. In the latter case, it is the RMS that decides when an application has to grow/shrink, whereas in the former case, it is the application that requests more/fewer resources, due to some internal constraints.

The Moab Workload Manager supports so-called "dynamic" jobs [12]: the RMS regularly queries each application about its current load, then decides how resources are allocated. This feature can be used to dynamically allocate resources to interactive workloads, but is not suitable for batch workloads. For example, let us assume that there are two evolving applications in the system, each using half of the platform. If, at one point, both of them require additional resources, a deadlock occurs, as each application is waiting for the requested resources. Instead, the two applications should be launched one after the other.

In the context of Cloud computing, resources may be acquired on-the-fly. Unfortunately, this abstraction is insufficient for large-scale deployments, such as those required by HPC applications, because "out-of-capacity" errors may be encountered [13]. Thus, the applications' requirements cannot be guaranteed.

To accurately define the problem studied in this paper, let us first introduce some mathematical definitions and notations.

Let an evolution profile (EP) be a sequence of steps, each step being characterized by a duration and a node-count. Formally, $ep = \{(d_1, n_1), (d_2, n_2), \ldots, (d_N, n_N)\}$, where $N$ is the number of steps, $d_i$ is the duration and $n_i$ is the node-count during Step $i$.

An evolution profile can be used to represent three distinct concepts. First, a resource EP represents the resource occupation of a system. For example, if 10 nodes are busy for 1200 s and 20 nodes are busy for the following 3600 s, then $ep_{res} = \{(1200, 10), (3600, 20)\}$.

Second, a requested EP represents application resource requests. For
example, epreq = {(

Third, a scheduled EP represents the number of nodes actually allocated to an application. For example, an allocation of nodes to the previous two-step application might be eps = {(2000, 0), (

We define the expanded and delayed EPs of $ep = \{(d_1, n_1), \ldots, (d_N, n_N)\}$ as follows: $ep' = \{(d'_1, n_1), \ldots, (d'_N, n_N)\}$ is an expanded EP of $ep$ if $\forall i \in \{1, \ldots, N\}$, $d'_i > d_i$; $ep'' = \{(d_0, 0), (d_1, n_1), \ldots, (d_N, n_N)\}$ is a delayed EP of $ep$ if $d_0 > 0$.

For manipulating EPs, we use the following helper functions:
- $ep(t)$ returns the number of nodes at time coordinate $t$, i.e., $ep(t) = n_1$ for $t \in [0, d_1)$, $ep(t) = n_2$ for $t \in [d_1, d_1 + d_2)$, etc.
- $max(ep, t_0, t_1)$ returns the maximum number of nodes between $t_0$ and $t_1$, i.e., $max(ep, t_0, t_1) = \max_{t \in [t_0, t_1)} ep(t)$, and 0 if $t_0 = t_1$.
- $loc(ep, t_0, t_1)$ returns the end-time of the last step containing the maximum, restricted to $[t_0, t_1]$, i.e., $loc(ep, t_0, t_1) = t \Rightarrow max(ep, t_0, t) = max(ep, t_0, t_1) > max(ep, t, t_1)$.
- $delay(ep, t_0)$ returns an evolution profile that is delayed by $t_0$.
- $ep_1 + ep_2$ is the sum of the two EPs, i.e., $\forall t, (ep_1 + ep_2)(t) = ep_1(t) + ep_2(t)$.
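The helper functions above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: it assumes an EP is stored as a list of (duration, node-count) tuples, and the names `ep_at`, `ep_max`, `ep_loc`, `delay` and `ep_sum` are hypothetical. For `ep_loc`, the case where the maximum is 0 is not covered by the definition; the sketch then simply returns a time within the queried interval.

```python
from typing import List, Tuple

# Assumption: an EP is a list of (duration, node_count) steps.
EP = List[Tuple[float, int]]


def ep_at(ep: EP, t: float) -> int:
    """ep(t): node-count at time t (0 before time 0 and after the last step)."""
    start = 0.0
    for d, n in ep:
        if start <= t < start + d:
            return n
        start += d
    return 0


def ep_max(ep: EP, t0: float, t1: float) -> int:
    """max(ep, t0, t1): maximum node-count over [t0, t1), 0 if the interval is empty."""
    if t0 >= t1:
        return 0
    best, start = 0, 0.0
    for d, n in ep:
        end = start + d
        if start < t1 and end > t0:  # step overlaps [t0, t1)
            best = max(best, n)
        start = end
    return best


def ep_loc(ep: EP, t0: float, t1: float) -> float:
    """loc(ep, t0, t1): end-time of the last step holding the maximum, capped at t1."""
    peak, loc, start = ep_max(ep, t0, t1), t1, 0.0
    for d, n in ep:
        end = start + d
        if start < t1 and end > t0 and n == peak:
            loc = min(end, t1)
        start = end
    return loc


def delay(ep: EP, t0: float) -> EP:
    """delay(ep, t0): prepend an idle step of duration t0 (no-op when t0 is 0)."""
    return ([(t0, 0)] if t0 > 0 else []) + list(ep)


def ep_sum(a: EP, b: EP) -> EP:
    """ep1 + ep2: pointwise sum, cut at every step boundary of either EP."""
    cuts = {0.0}
    for ep in (a, b):
        t = 0.0
        for d, _ in ep:
            t += d
            cuts.add(t)
    times = sorted(cuts)
    return [(t1 - t0, ep_at(a, t0) + ep_at(b, t0)) for t0, t1 in zip(times, times[1:])]


# 10 nodes busy for 1200 s, then 20 nodes busy for 3600 s
res = [(1200, 10), (3600, 20)]
print(ep_at(res, 1200), ep_max(res, 0, 1200), ep_loc(res, 0, 2000))
```

Because EPs are piecewise constant, `ep_sum` only needs to evaluate both profiles at the start of each interval between consecutive step boundaries. Adjacent steps with equal node-counts are deliberately left unmerged to keep the sketch short.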

To give a better understanding of the core problem we are interested in, this section briefly describes how fully-predictably evolving applications could be scheduled in practice.

Let us consider that the platform consists of a homogeneous cluster of $n_{nodes}$ computing nodes, managed by a centralized RMS. Fully-predictably evolving applications are submitted to the system. Each application $i$ expresses its resource requirements by submitting a requested EP $ep^{(i)}$ (see footnote 1), with $ep^{(i)}(t) \le n_{nodes}, \forall t$. The RMS is responsible for deciding when and which nodes are allocated to applications, so that their evolving resource requirements are met.

During run-time, each application maintains a session with the RMS. If from one step to another the application increases its resource requirements, it keeps the currently allocated nodes and has to wait for the RMS to allocate additional nodes to it. Note that the RMS can delay the allocation of additional nodes, i.e., it is allowed to expand a step of an application. However, we assume that during the wait period the application cannot make any useful computations: the resources currently allocated to the application are wasted. Therefore, the scheduled EP (the EP representing the resources effectively allocated to the application) must be equal to the requested EP, optionally expanded and/or delayed.

If from one step to another the node-count decreases, the application has to release some nodes to the system (the application may choose which ones). The application is assumed to be fully predictable; therefore, it is not allowed to contract or expand any of its steps on its own initiative.

A practical solution to the above problem would have to deal with several related issues. An RMS-Application protocol would have to be developed. Protocol violations should be detected and handled, e.g., an application which does not release nodes when it is required to should be killed. However, these issues are outside the scope of this paper.

Instead, this paper is a preliminary study of whether it is meaningful to develop such a system. For simplicity, we are interested in an offline scheduling algorithm that operates on the queued applications and decides how nodes are allocated to them. It can easily be shown that such an algorithm does not need to operate on node IDs: if, for each application, a scheduled EP is found such that the sum of all scheduled EPs never exceeds available resources, a valid mapping can be computed at run-time. The next section formally defines the problem.

Based on the previous definitions and notations, the problem can be stated as follows. Let $n_{nodes}$ be the number of nodes in a homogeneous cluster, and let $n_{apps}$ applications, with requested EPs $ep^{(i)}$ ($i = 1 \ldots n_{apps}$), be queued in the system ($\forall i, \forall t, ep^{(i)}(t) \le n_{nodes}$). The problem is to compute for each application $i$ a scheduled EP $ep_s^{(i)}$, such that the following conditions are simultaneously met:
C1: $ep_s^{(i)}$ is equal to $ep^{(i)}$ or a delayed/expanded version of $ep^{(i)}$ (for the reasons given above);
C2: resources are never over-allocated ($\forall t, \sum_{i=1}^{n_{apps}} ep_s^{(i)}(t) \le n_{nodes}$).
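As an illustration of conditions C1 and C2, the following Python sketch checks a candidate schedule. It is a sketch under stated assumptions: EPs are encoded as lists of (duration, node-count) tuples, the function names are hypothetical, and C1 is tested with a non-strict comparison so that a schedule in which only some steps are expanded is accepted.

```python
from typing import List, Tuple

EP = List[Tuple[float, int]]  # assumed encoding: list of (duration, node_count)


def ep_at(ep: EP, t: float) -> int:
    """Node-count of ep at time t (0 outside the profile)."""
    start = 0.0
    for d, n in ep:
        if start <= t < start + d:
            return n
        start += d
    return 0


def satisfies_c1(ep_s: EP, ep_req: EP) -> bool:
    """C1: ep_s equals ep_req, optionally delayed and/or expanded.

    Assumption: a single leading zero-node step is read as the delay, and
    each remaining step must keep its node-count and be at least as long
    as requested.
    """
    body = ep_s
    if len(ep_s) == len(ep_req) + 1 and ep_s[0][1] == 0:
        body = ep_s[1:]  # drop the delay step
    return len(body) == len(ep_req) and all(
        n_s == n_r and d_s >= d_r
        for (d_s, n_s), (d_r, n_r) in zip(body, ep_req)
    )


def satisfies_c2(scheduled: List[EP], n_nodes: int) -> bool:
    """C2: the sum of all scheduled EPs never exceeds n_nodes.

    EPs are piecewise constant, so checking the start of every step of
    every EP is sufficient.
    """
    cuts = {0.0}
    for ep in scheduled:
        t = 0.0
        for d, _ in ep:
            t += d
            cuts.add(t)
    return all(sum(ep_at(ep, t) for ep in scheduled) <= n_nodes for t in cuts)


req = [(10, 2), (5, 4)]
sched = [[(10, 2), (5, 4)], [(15, 0), (3, 4)]]
print(satisfies_c1([(3, 0), (12, 2), (5, 4)], req), satisfies_c2(sched, 4))
```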

Application completion time and resource usage should be optimized.

Footnote 1: Note that this is in contrast to traditional parallel job scheduling, where resource requests only consist of a node-count and a wall-time duration.

This section aims at solving the above problem in two stages. First, a list-scheduling algorithm is presented, which transforms requested EPs into scheduled EPs. It requires a fit function which operates on two EPs at a time. Second, several algorithms for computing a fit function are described.

Algorithm 1 is an offline scheduling algorithm that solves the stated problem. It starts by initializing $ep_r$, the resource EP, representing how resource occupation evolves over time, to the empty EP. Then, it considers each requested EP in turn, potentially expanding and delaying it using a helper fit function. The resulting scheduled EP $ep_s^{(i)}$ is added to $ep_r$, effectively updating the resource occupation.

The fit function takes as input the number of nodes in the system $n_{nodes}$, a requested EP $ep_{req}$ and a resource EP $ep_{res}$, and returns a time coordinate $t_s$ and $ep_x$, an expanded version of $ep_{req}$, such that $\forall t, ep_{res}(t) + delay(ep_x, t_s)(t) \le n_{nodes}$. A very simple fit implementation consists in delaying $ep_{req}$ such that it starts after $ep_{res}$ ends.
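The list-scheduling loop and the very simple fit function described above can be sketched as follows. This is a minimal Python illustration, assuming EPs are lists of (duration, node-count) tuples; `fit_after` and `list_schedule` are hypothetical names, and the fit shown here never expands a step: it only delays the request until the resource EP has ended.

```python
from typing import Callable, List, Tuple

EP = List[Tuple[float, int]]  # assumed encoding: list of (duration, node_count)


def ep_at(ep: EP, t: float) -> int:
    """Node-count of ep at time t (0 outside the profile)."""
    start = 0.0
    for d, n in ep:
        if start <= t < start + d:
            return n
        start += d
    return 0


def delay(ep: EP, t0: float) -> EP:
    """Prepend an idle step of duration t0 (no-op when t0 is 0)."""
    return ([(t0, 0)] if t0 > 0 else []) + list(ep)


def ep_sum(a: EP, b: EP) -> EP:
    """Pointwise sum of two EPs, cut at every step boundary of either one."""
    cuts = {0.0}
    for ep in (a, b):
        t = 0.0
        for d, _ in ep:
            t += d
            cuts.add(t)
    times = sorted(cuts)
    return [(t1 - t0, ep_at(a, t0) + ep_at(b, t0)) for t0, t1 in zip(times, times[1:])]


def fit_after(ep_req: EP, ep_res: EP, n_nodes: int) -> Tuple[float, EP]:
    """Simplest fit: start the request once ep_res has ended, without expansion."""
    return sum(d for d, _ in ep_res), list(ep_req)


def list_schedule(requested: List[EP], n_nodes: int,
                  fit: Callable[[EP, EP, int], Tuple[float, EP]] = fit_after):
    """Turn requested EPs into scheduled EPs, one application at a time."""
    ep_r: EP = []                                # resource EP, initially empty
    scheduled = []
    for ep in requested:
        t_s, ep_x = fit(ep, ep_r, n_nodes)       # where to start, how to expand
        ep_s = delay(ep_x, t_s)                  # scheduled EP of this application
        scheduled.append(ep_s)
        ep_r = ep_sum(ep_r, ep_s)                # update resource occupation
    return scheduled, ep_r


scheduled, ep_r = list_schedule([[(10, 2), (5, 4)], [(3, 4)]], n_nodes=4)
print(scheduled)
print(ep_r)
```

With this fit function the applications simply run back to back, so the node limit holds trivially as long as each request fits on the cluster by itself. A fit function that interleaves applications can be passed through the `fit` parameter without changing the loop.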

Throughout the whole algorithm, the condition $\forall t, ep_r(t) \le n_{nodes}$ is guaranteed by the post-conditions of the fit function. Since at the end of the algorithm $ep_r = \sum_{i=1}^{n_{apps}} ep_s^{(i)}$, resources will not be over-allocated.

The core of the scheduling algorithm is the fit function, which expands a requested EP over a resource EP. It returns a scheduled EP, so that the sum of the resource EP and scheduled EP does not exceed available resources.

Because it can expand an EP, the fit function is a key factor in the efficiency of a schedule. On one hand, a step can be expanded so as to interleave applications, potentially reducing their response time. On the other hand, when a step is expanded, the application cannot perform useful computations, thus resources are wasted. Hence, there is a trade-off between the resource usage, the application's start time and its completion time.

In order to evaluate the impact of expansion, the proposed fit algorithm takes as parameter the expand limit. This parameter expresses how many times the duration of a step may be increased. For example, if the expand limit is 2, a step may not be expanded to more than twice its original duration. Having an expand limit of 1 means applications will not be expanded, while an infinite expand limit does not impose any limit on expansion.
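The expand limit can be read as a bound on step durations. Writing $d_i$ for the requested duration of Step $i$, $d'_i$ for its scheduled duration and $l \ge 1$ for the expand limit (the primed symbol is introduced here only for illustration), the constraint and a small numerical instance are:

```latex
d_i \;\le\; d'_i \;\le\; l \cdot d_i
\qquad \text{e.g. } l = 2,\ d_i = 1200\,\mathrm{s}
\;\Rightarrow\; 1200\,\mathrm{s} \le d'_i \le 2400\,\mathrm{s}
```

With $l = 1$ both bounds coincide, so no step is expanded; letting $l \to \infty$ removes the upper bound.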

Algorithm 1. Offline scheduling algorithm for evolving applications

Input: ep^(i), i = 1 . . . n_apps, requested EP of application i;
       n_nodes, number of nodes in the system;
       fit(ep_src, ep_dst, n_nodes) → (t_s, ep_s), a fit function
Output: ep_s^(i), scheduled EP of application i

1  ep_r ← empty EP ;
2  for i = 1 to n_apps do
3      t_s^(i), ep_x^(i) ← fit(ep^(i), ep_r, n_nodes) ;
4      ep_s^(i) ← delay(ep_x^(i), t_s^(i)) ;
5      ep_r ← ep_r + ep_s^(i) ;

Algorithm 2. Base fit Algorithm

Input: ep_req = {(d_req^(1), n_req^(1)), . . . , (d_req^(N_req), n_req^(N_req))}, EP to expand;
       ep_res = {(d_res^(1), n_res^(1)), . . . , (d_res^(N_res), n_res^(N_res))}, destination EP;
       n_nodes : number of nodes in the system;
       l : maximum allowed expansion (l ≥ 1);
       i : index of step from ep_req to start with (initially 1);
       t_0 : first moment of time where ep_req is allowed to start (initially 0)
Output: ep_x : expanded ep_req;
        t_s : time when ep_x starts or time when expansion failed

1   if i > N_req then
2       t_s ← t_0 ; ep_x ← empty EP ; return
3   d ← d_req^(i) ; n ← n_req^(i) ;      /* duration and node-count of current step */
4   t_s ← t_0 ;
5   while True do
6       if n_nodes - max(ep_res, t_s, t_s + d) < n then
7           t_s ← loc(ep_res, t_s, t_s + d) ; continue
8       if i > 1 then
9           t_eas ← t_s - l · d_req^(i-1) ;      /* earliest allowed start of previous step */
10          if t_eas > t_0 - d_req^(i-1) then
11              t_s ← t_eas ; ep_x ← ∅ ; return
12          else if n_nodes - max(ep_res, t_0, t_s) < n_req^(i-1) then
13              t_s ← loc(ep_res, t_0, t_s) ; ep_x ← ∅ ; return
14      t_s^tail, ep_x ← fit(ep_req, ep_res, n_nodes, i + 1, t_s + d) ;
15      if ep_x = ∅ then
16          t_s ← t_s^tail ; continue
17      if i > 1 then prepend (t_s^tail - t_s, n) to ep_x ;
18      else
19          prepend (d, n) to ep_x ;
20          t_s ← t_s^tail - d ;
21      return

Base fit Algorithm. Algorithm 2 aims at efficiently computing the fit function, while allowing different expand limits to be chosen. It operates recursively for each step in ep_req as follows:
1. find t_s, the earliest time coordinate when the current step can be placed, so that n_nodes is not exceeded (lines 4-7);
2. test if this placement forces an expansion on the previous step, which exceeds the expand limit (lines 8-11) or exceeds n_nodes (lines 12-13);
3. recursively try to place the next step in ep_req, starting at the completion time of the current step (line 14);
4. prepend the expanded version of the current step to ep_x (line 17). The first step is delayed (i.e., t_s is increased) instead of being expanded (line 20).
The recursion ends when all steps have been successfully placed (lines 1-2).
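The recursive procedure described above can be sketched in Python. This is an illustrative transcription under several assumptions rather than the authors' code: EPs are lists of (duration, node-count) tuples, the step index is zero-based, a failed expansion is signalled by returning `None` in place of the expanded EP, and the first step of the request is the one that is delayed instead of expanded. The helper names `ep_max`, `ep_loc` and `base_fit` are hypothetical.

```python
from typing import List, Optional, Tuple

EP = List[Tuple[float, int]]  # assumed encoding: list of (duration, node_count)


def ep_max(ep: EP, t0: float, t1: float) -> int:
    """Maximum node-count of ep over [t0, t1), 0 if the interval is empty."""
    if t0 >= t1:
        return 0
    best, start = 0, 0.0
    for d, n in ep:
        if start < t1 and start + d > t0:
            best = max(best, n)
        start += d
    return best


def ep_loc(ep: EP, t0: float, t1: float) -> float:
    """End-time of the last step holding the maximum over [t0, t1), capped at t1."""
    peak, loc, start = ep_max(ep, t0, t1), t1, 0.0
    for d, n in ep:
        if start < t1 and start + d > t0 and n == peak:
            loc = min(start + d, t1)
        start += d
    return loc


def base_fit(ep_req: EP, ep_res: EP, n_nodes: int, limit: float = 1.0,
             i: int = 0, t0: float = 0.0) -> Tuple[float, Optional[EP]]:
    """Return (t_s, ep_x); ep_x is None when the expansion failed at time t_s."""
    if i >= len(ep_req):                       # all steps placed
        return t0, []
    d, n = ep_req[i]                           # current step
    t_s = t0
    while True:
        # 1. earliest placement of the current step within the node limit
        if n_nodes - ep_max(ep_res, t_s, t_s + d) < n:
            t_s = ep_loc(ep_res, t_s, t_s + d)
            continue
        # 2. does this placement over-expand the previous step?
        if i > 0:
            d_prev, n_prev = ep_req[i - 1]
            t_eas = t_s - limit * d_prev       # earliest allowed start of previous step
            if t_eas > t0 - d_prev:            # expand limit exceeded
                return t_eas, None
            if n_nodes - ep_max(ep_res, t0, t_s) < n_prev:   # no room to expand
                return ep_loc(ep_res, t0, t_s), None
        # 3. place the remaining steps after the current one
        t_tail, ep_x = base_fit(ep_req, ep_res, n_nodes, limit, i + 1, t_s + d)
        if ep_x is None:
            t_s = t_tail                       # retry from the suggested time
            continue
        # 4. prepend the current step: expanded, or delayed if it is the first
        if i > 0:
            ep_x.insert(0, (t_tail - t_s, n))
        else:
            ep_x.insert(0, (d, n))
            t_s = t_tail - d
        return t_s, ep_x


res = [(100, 8)]                     # 8 of 10 nodes busy for 100 time units
req = [(20, 1), (30, 2), (50, 6)]    # three-step request
print(base_fit(req, res, 10, limit=2))
print(base_fit(req, res, 10, limit=1))
```

In this example both calls lead to the same completion time (150). With an expand limit of 2 the application starts at time 20 and its second step is stretched from 30 to 60, whereas with an expand limit of 1 it starts at time 50 and no step is stretched.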

Placement of a step is first attempted at time coordinate $t_0$, which is 0 for the first step, or the value computed on line 14 for the other steps. After every failed operation (placement or expansion) the time coordinate $t_s$ is increased so that the same failure does not repeat:
- if placement failed, jump to the time after the encountered maximum (line 7);
- if expansion failed due to the expand limit, jump to the first time which avoids excessive expansion (computed on line 11, used on line 16);
- if expansion failed due to insufficient resources, jump to the time after the encountered maximum (computed on line 13, used on line 16).

Since each step, except the first, is individually placed at the earliest possible time coordinate and the first step is placed so that the other steps are not delayed, the algorithm guarantees that the application has the earliest possible completion time. However, resource usage is not guaranteed to be optimal.

Post-processing Optimization (Compacting). In order to reduce resource waste, while maintaining the guarantee that the application completes as early as possible, a compacting post-processing phase can be applied. After a first solution is found by the base fit algorithm, the expanded EP goes through a compacting phase: the last step of the application is placed so that it ends at the completion time found by the base algorithm. Then, the other steps are placed from right (last) to left (first), similarly to the base algorithm. In the worst case, no compacting occurs and the same EP is returned after the compacting phase.

The base fit algorithm with compacting first optimizes completion time, then start time (it is optimal from the expansion point of view). However, because it acts in a greedy way, it might expand steps with a high node-count, so it is not always optimal with respect to resource waste.

This section has presented a solution to the problem stated in Section 3.3. The presented strategies attempt to minimize both completion time and resource waste. However, these strategies treat applications in a pre-determined order and do not attempt a global optimization. This makes the algorithm easier to adapt to an online context in future work, for two reasons. First, list scheduling algorithms are known to be fast, which is required in a scalable RMS implementation. Second, since the algorithms treat applications in order, starvation cannot occur.

This section evaluates the benefits and drawbacks of taking into account evolving resource requirements of applications. It is based on a series of experiments done with a home-made simulator developed in Python. The experiments are first described, then the results are analyzed.

The experiments compare two kinds of scheduling algorithms: rigid, which does not take into account evolution, and variations of Algorithm 1. Applications are seen by the rigid algorithm as non-evolving: the requested node-count is the maximum node-count of all steps and the duration is the sum of the durations of all steps. Then, rigid schedules the resulting jobs in a CBF-like manner.
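The way the rigid algorithm sees an evolving application can be illustrated with a short Python sketch. EPs are assumed to be lists of (duration, node-count) tuples and the function names are hypothetical. The second function computes the node-time reserved beyond what the steps request; it is shown only to illustrate why peak-based allocation is inefficient and is not claimed to be one of the metrics measured in the experiments.

```python
from typing import List, Tuple

EP = List[Tuple[float, int]]  # assumed encoding: list of (duration, node_count)


def to_rigid(ep: EP) -> Tuple[int, float]:
    """Rigid view of an EP: (maximum node-count, sum of step durations)."""
    return max(n for _, n in ep), sum(d for d, _ in ep)


def over_allocation(ep: EP) -> float:
    """Node-time reserved by the rigid job but not requested by any step."""
    nodes, duration = to_rigid(ep)
    return nodes * duration - sum(d * n for d, n in ep)


ep = [(500, 10), (3600, 20)]
print(to_rigid(ep), over_allocation(ep))
```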

Five versions of Algorithm 1 are considered to evaluate the impact of its options: base fit with no expansion (noX), base fit with expand limit of 2 without compacting (2X) and with compacting (2X+c), base fit with infinite expansion without compacting (infX) and with compacting (infX+c).

Two kinds of metrics are measured: system-centric and user-centric. The
five system-centric metrics considered are: (

The five user-centric metrics considered are: (

As we are not aware of any public archive of evolving application workloads, we created synthetic test cases. A test case is made of a uniform random choice of the number of applications, their number of steps, as well as the duration and requested node-count of each step. We tried various combinations that gave similar results. Tables 1 and 2 respectively present the results for the system- and user-centric metrics of an experiment made of 1000 tests. The number of applications per test is within [15, 20], the number of steps within [1, 10], a step duration within [500, 3600] and the node-count per step within [1, 75].

Administrator's Perspective. rigid is outperformed by all other strategies. They improve effective resource utilization, reduce makespan and drastically reduce resource waste within reasonable scheduling time. Compared to rigid, all algorithms reduce resource utilization. We consider this to be a desired effect, as it means that, instead of allocating computing nodes to applications which do not effectively use them, these nodes are released to the system. The RMS could, for example, shut these nodes down to save energy.
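The synthetic test cases described earlier in this section could be generated along the following lines. This Python sketch uses the stated ranges, but several details are assumptions: values are drawn as integers with inclusive bounds, EPs are lists of (duration, node-count) tuples, and the function name and the seed are arbitrary.

```python
import random
from typing import List, Tuple

EP = List[Tuple[int, int]]  # assumed encoding: list of (duration, node_count)


def random_test_case(rng: random.Random) -> List[EP]:
    """Draw one test case: a list of requested EPs, one per application."""
    n_apps = rng.randint(15, 20)                 # applications per test
    case = []
    for _ in range(n_apps):
        n_steps = rng.randint(1, 10)             # steps per application
        case.append([(rng.randint(500, 3600),    # step duration
                      rng.randint(1, 75))        # node-count of the step
                     for _ in range(n_steps)])
    return case


rng = random.Random(42)  # fixed seed so that a run can be repeated
tests = [random_test_case(rng) for _ in range(1000)]
print(len(tests), len(tests[0]))
```

Passing an explicit `random.Random` instance keeps the generator independent of the global random state, which makes it easier to compare scheduling algorithms on exactly the same workloads.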

There is a trade-off between resource waste and makespan (especially when looking at maximum values). However, makespan differs less between algorithms than waste. If maintaining resources is expensive, an administrator may choose the noX algorithm, whereas to favour throughput, she would choose 2X+c.

User's Perspective. When compared to rigid, the proposed algorithms always improve both per-application resource waste and average completion time. When looking at maximum values, the trade-off between expansion / waste vs. completion time is again highlighted. Algorithms which favor stretching (infX, infX+c) reduce average waiting time, but not necessarily average completion time.

The results show that waste is not equally split among applications; instead, a few applications are expanded a lot. Since most cluster / grid systems are subject to accounting (i.e., in a way, users pay for the resources that are allocated to them), using the infX and infX+c algorithms (which do not guarantee an upper bound on the waste) should be avoided. Regarding algorithms which limit expansion, the benefits of using 2X+c instead of noX are small, and come at the expense of significant per-application resource waste. Therefore, users might prefer not to expand their applications at all.

Global Perspective. From both perspectives, expanding applications has limited benefit. Therefore, the noX algorithm seems to be the best choice. Taking into account the evolving requirements of applications improves all metrics compared to an algorithm that does not take evolution into consideration.

Some applications, such as adaptive mesh refinement simulations, can exhibit evolving resource requirements. As it may be difficult to obtain accurate information about this evolution, this paper studied whether the effort would be worthwhile from the system and user perspectives. The paper has presented the problem of scheduling fully-predictably evolving applications, for which it has proposed an offline scheduling algorithm, with various options. Experiments show that taking into account the evolution of resource requirements leads to improvements in all measured metrics, such as resource utilization and completion time. However, the considered expansion strategies do not appear valuable.

Future work can be divided into two directions. First, the algorithm has to be adapted to online scheduling. Second, as real applications are not fully predictable, this assumption has to be relaxed and the resulting problem needs to be studied.