Introduction 

Dynamic Workload Placement controls the placement of virtual machines within a cluster. It allows both the system administrator (the master billing entity, or MBE) and end users to influence where virtual machines are placed. This flexibility permits use cases such as virtual private clouds. For examples of how to configure the Dynamic Workload Placement system for various use cases, see Dynamic Workload Placement Examples.

When Dynamic Workload Placement is used

Dynamic Workload Placement is used under two circumstances:

  • When a virtual machine is started by a customer; and
  • When a virtual machine is live migrated by the system administrator, and the system administrator specifies ‘Best node’ as a target.

However, Dynamic Workload Placement is not used if the cluster has local storage enabled, because in that case there is only one viable choice of node: the node providing the primary disk. In clusters with local storage, placement is instead performed when a new virtual machine is created (as opposed to started).

Dynamic Workload Placement and the Affinity Algorithm

The task of Dynamic Workload Placement is to place a virtual machine within a cluster on the best node possible to run the workload. This involves reconciling the following factors:

  • The size of the virtual machine (amount of RAM and number of virtual CPU cores);
  • The amount of free RAM on each node;
  • The number of virtual CPU cores allocated to each node;
  • The load of each node;
  • Policy specified by the system administrator; and
  • Policy specified by the user.

Often these factors push in contradictory directions. Further, it is desirable to hide from each customer both the topology of the cluster and the usage of other customers.

In order to reconcile these competing factors, Flexiant Cloud Orchestrator uses an algorithm called Affinity. Broadly speaking it uses the following steps:

  1. Constraint Processing: Determine a set of nodes capable of running the Virtual Machine (if none, then the machine will not start);
  2. VM Key Compilation: Determine a set of Placement Keys for the Virtual Machine to be placed;
  3. Node Key Compilation: Determine a set of Placement Keys for each candidate node;
  4. System Key Processing: From this set, determine a set of nodes acceptable to the system administrator to run the Virtual Machine (if none, then the machine will not start);
  5. Customer Key Processing: From this set, determine which node is most acceptable to the end user.

These are each detailed below. Steps 2 – 5 above use Placement Keys. It is thus useful to explain what these are and how they are set.

Placement Keys

Placement Key Classes

Placement Keys fall into two classes:

  • Resource Placement Keys; and
  • Node Placement Keys.

Note that Placement Keys prefixed with underscores or hash signs have special internal meanings. The UI hides this from end users, so this is only relevant to API callers.

Resource Placement Keys

Resource Placement Keys are a type of resource key (see Keys). For a resource key to constitute a Placement Key, it must:

  • Have numeric values;
  • Have a non-zero weight;
  • Either be a system key, a billing entity key, or a customer key;
  • Be set on an appropriate resource (see below).

Placement keys can either be system keys, billing entity keys, or customer keys. As customer keys are set by end users, they cannot be set on the master billing entity, customer, VDC, or any of the product offers. Note that the master billing entity can set system keys on any customer resource.

A resource placement key which is a system key is referred to as a system resource placement key. A resource placement key which is a customer key or billing entity key is referred to as a customer resource placement key. 

Keys can be set when managing the following resources:

  • Clusters.
  • Billing entities.
  • Product offers.
  • Customers.
  • Images.
  • VDCs.
  • Servers.
  • Disks.
  • Networks.
  • NICs.

Node Placement Keys

Placement keys can also be set on nodes. In this instance they do not normally have a weight associated with them. All nodes within the cluster must have the same set of placement keys, though the values may be different.

Node placement keys can be set using the Manage Cluster dialog box (to change the list of keys) and the Manage Node dialog (to change the values of keys associated with a given node).

Key Types

Keys fall into five different types:

  • Normal Keys (not Placement Keys at all)
  • Placement Keys
  • Sticky Placement Keys
  • Reserved Placement Keys
  • Special Placement Keys

Normal Keys

These are not placement keys at all, and are not used in Dynamic Workload Placement. A key is a normal key if:

  • It has a weight of zero; and
  • It is not a node placement key (i.e. it is not set on a node)

Placement Keys

A key is a placement key if:

  • It is set on a node, or
  • It is a system, BE or customer key, and has a non-zero weight

Note that a placement key set on a resource other than those listed above will have no effect on workload placement.

Sticky Placement Keys

A key is a sticky placement key if:

  • It is set on a node; and
  • It has the sticky flag set

Internally, the sticky flag is represented by the inheritToServer flag. Unlike other node placement keys, sticky placement keys have an associated weight. The effect of stickiness is that a server is (internally) given a copy of the sticky key when it starts on a node. The next time the server is started, it will still carry this sticky key, which can be used to make the server more likely to start on the same node or group of nodes. This is useful if, for example, multiple VMware clusters are being used with different datastores, as it minimises copying of servers between datastores.
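
As an illustration only, the following Python sketch (with hypothetical key and server structures, not the real FCO data model) shows a sticky node key being copied to a server at start time so that it can influence the next placement:

    # Minimal sketch of sticky key behaviour; the data structures are
    # illustrative only and do not reflect the real FCO implementation.

    def start_server_on_node(server_keys, node_keys):
        """Copy any sticky node keys onto the server at start time."""
        for key in node_keys:
            if key.get("sticky"):  # internally the inheritToServer flag
                # The server keeps its own copy, so the next placement run
                # scores nodes carrying the same key more highly.
                server_keys[key["name"]] = {
                    "value": key["value"],
                    "weight": key.get("weight", 0),
                }
        return server_keys

    # Example: a node tagged with a sticky datastore key.
    node_keys = [{"name": "DATASTORE", "value": 3, "weight": 50, "sticky": True}]
    print(start_server_on_node({}, node_keys))
    # {'DATASTORE': {'value': 3, 'weight': 50}}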

Reserved Placement Keys

A placement key is a reserved placement key if it begins with an underscore ('_'), which causes inheritToNode to be set within Tigerlily. This prefix is hidden by the UI, so to the user it appears to be a different key type.

Reserved placement keys can be:

  • Set on the node; or
  • Set as BE or customer keys

They cannot be set as system keys.

Reserved placement keys are set on nodes by the MBE and are compared against keys set by users or BEs.

Special Placement Keys

Special Placement Keys are set:

  • Internally (by the system) on nodes, and the values are calculated automatically
  • By the MBE on resource objects (normally clusters)

Special Placement Keys always begin with the '#' character, though this is hidden from the user.

The three special placement keys in FCO are:

  • #RAM: a key with value between 0 and 1 indicating the RAM contention on a node;
  • #CPU: a key with value between 0 and 1 indicating the CPU contention of a node;
  • #LOAD: a key with value between 0 and 1 indicating the load of a node.

As the special placement keys on nodes are fixed in nature and are not editable, they do not appear in the UI.

The special placement keys on clusters are added by default. They always have value 0 (as no load, no CPU contention and no RAM contention are always the target). Their weight may be edited. They may be set lower down the resource hierarchy in order to place specific servers differently.

When displayed by the UI, these are represented as three different key types. The names are not editable, and appear as a translated term (e.g. 'RAM contention') in the system keys list on resources. They do not appear in the node placement key dropdowns at all.

Summary of Key Classes and Types

 

Key Type                | Node Placement Keys | System Keys | BE Keys | Customer Keys | Notes
Normal Keys             |                     |             |         |               | Weight is zero AND is not a node placement key
Placement Keys          |                     | W           | W       | W             | Weight is non-zero OR is a node placement key
Sticky Placement Keys   | W                   |             |         |               | Sticky flag set (inheritToServer)
Reserved Placement Keys |                     |             | W       | W             | Begins with '_' (Tigerlily will set inheritToNode)
Special Keys            | Not displayed       | W           |         |               | #RAM, #CPU and #LOAD each displayed as a different key type in the UI

A 'W' indicates that a weight is displayed.

Step 1: Constraint Processing

Constraint processing is the process of determining which nodes could accept the Virtual Machine if the affinity algorithm did not apply. For clusters that do not use local storage, a node passes constraint processing if the following criteria are met:

  • The node must be in a running state;
  • The node must have sufficient uncontended free RAM;
  • The node must have sufficient contended free RAM.

To understand the last two points, it is necessary to understand what contended and uncontended RAM are. On some hypervisors (such as KVM), a virtual machine allocated a certain amount of RAM (1GB, for example) may not use the full 1GB of physical RAM: KVM page sharing, or pages of zeros, will in general result in less RAM being used than the full allocation. A 64GB node could therefore (assuming each VM uses on average less than its full 1GB) run more than 64 such 1GB virtual machines. Doing so carries a risk: if all the VMs suddenly started to use all of their RAM, the node would not have enough memory to service them all, and some virtual machines would crash. However, as RAM is often the limiting factor in packing more virtual machines onto the same hardware (CPUs merely slow down), some licensees may wish to take advantage of this tighter packing. By raising the RAM contention ratio on the node (through the Manage Node dialog), the node is treated for placement purposes as if it has more RAM than it actually has; for instance, with a RAM contention ratio of 2, a 64GB node is treated as having 128GB of RAM available for allocation. This tends to be safer in environments where the memory footprint of the largest virtual machine is much smaller than the RAM of the node, for instance in hosting environments seeking to pack a large number of small virtual machines into a large node. For more information, see Managing Compute Nodes.

Uncontended and contended free RAM are thus measured differently. Uncontended free RAM is measured as the total RAM in the node multiplied by the RAM contention ratio, less the sum of the RAM allocated to each virtual machine (irrespective of the fact that each may actually use less). Contended free RAM is measured as the RAM actually free on the node (i.e. after the effect of the contention between VMs).

In order to place a VM on a node, both the uncontended and the contended free RAM must exceed the size of the VM plus an overhead (defaulting to 1GB) representing the size of the node control software and a margin for error.
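
As an illustration, a minimal Python sketch of this constraint check might look as follows; the field names and figures are assumptions, with only the 1GB overhead and the RAM definitions taken from the text above:

    # Illustrative sketch of the RAM constraint check; all field names are
    # assumptions, not the real FCO data model. Sizes are in GB.

    OVERHEAD_GB = 1  # default allowance for node control software and margin

    def passes_constraints(node, vm_ram_gb):
        if node["state"] != "RUNNING":
            return False
        # Uncontended free RAM: total RAM scaled by the contention ratio,
        # less the RAM allocated to every VM already on the node.
        uncontended_free = (node["total_ram_gb"] * node["ram_contention_ratio"]
                            - sum(node["allocated_ram_gb"]))
        # Contended free RAM: the RAM actually free on the node right now.
        contended_free = node["actual_free_ram_gb"]
        needed = vm_ram_gb + OVERHEAD_GB
        return uncontended_free >= needed and contended_free >= needed

    node = {"state": "RUNNING", "total_ram_gb": 64, "ram_contention_ratio": 2,
            "allocated_ram_gb": [1] * 80, "actual_free_ram_gb": 20}
    print(passes_constraints(node, 4))  # True: 128 - 80 = 48 and 20 both exceed 5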

Step 2: VM Key Compilation

Before processing, the resource placement keys are compiled. A separate key compilation process occurs for the system resource placement keys (on the one hand) and the customer resource placement keys (i.e. those from the Customer Keys and Billing Entity Keys sections) on the other. For each key name, the value at the lowest level of the following hierarchy takes precedence:

  • Cluster.
  • Billing entity.
  • Customer product offer.
  • Customer.
  • Image product offer.
  • Image.
  • VDC product offer.
  • VDC.
  • Server product offer.
  • Server.
  • Product offer of first disk on a server.
  • First disk on a server.
  • Product offer of network attached to first NIC on a server.
  • Network attached to first NIC on a server.
  • Product offer of first NIC on a server.
  • First NIC on a server.

Thus if there is a key ‘MYKEY’ set on the cluster to 1, on the VDC to 2, and on the product offer for the server to 3, the compiled value of ‘MYKEY’ will be 3, as the lowest of these levels at which the key is set is the server product offer. In this way default values can be set at the cluster level and overridden further down.

After compilation, there are thus a set of compiled System Resource Placement Keys, and a set of compiled customer resource placement keys (for this purpose the compiled customer resource placement keys include those from both the Billing Entity Keys and the Customer Keys sections). Note each key has a value and a weight.
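
A minimal Python sketch of this compilation rule, using the ‘MYKEY’ example above, is shown below; the level names and key layout are illustrative assumptions, not the FCO API:

    # Sketch of resource placement key compilation: walk the hierarchy from
    # cluster down to first NIC, letting lower levels override higher ones.
    # The level names and key layout are illustrative assumptions.

    HIERARCHY = ["cluster", "billing_entity", "customer_product_offer", "customer",
                 "image_product_offer", "image", "vdc_product_offer", "vdc",
                 "server_product_offer", "server", "first_disk_product_offer",
                 "first_disk", "first_nic_network_product_offer",
                 "first_nic_network", "first_nic_product_offer", "first_nic"]

    def compile_keys(keys_by_level):
        """keys_by_level maps a level name to {key_name: (value, weight)}."""
        compiled = {}
        for level in HIERARCHY:                        # highest to lowest level
            compiled.update(keys_by_level.get(level, {}))  # lower levels win
        return compiled

    keys = {"cluster": {"MYKEY": (1, 100)},
            "vdc": {"MYKEY": (2, 100)},
            "server_product_offer": {"MYKEY": (3, 100)}}
    print(compile_keys(keys))   # {'MYKEY': (3, 100)} - the server product offer wins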

Step 3: Node Key Compilation

The Placement Keys on the node are again compiled separately for user and system keys. Node Placement Keys do not have associated weights.

The compiled customer node placement keys for a node consist of the compiled customer resource placement keys of each VM running on the node, ignoring the weights. As calculating this on demand would take too long, the compiled customer resource placement keys are added to the node when a VM is started and removed when it is stopped; consequently, any changes to a VM’s affinity settings while it is running will not affect any subsequently placed VMs. There may be multiple instances of a key name in the compiled node affinity list: for instance, if two running VMs each have a compiled value for ‘MYKEY’ (perhaps with different values), both will be present in the compiled list. To this list are added any Reserved Placement Keys (i.e. any Placement Keys on the node whose names begin with underscores); this allows the licensee to expose characteristics of nodes to customers.
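
As a rough illustration (with hypothetical data structures, not the real implementation), the customer key list held against a node could be maintained as follows:

    # Sketch of how the compiled customer node placement key list is built:
    # the compiled customer keys of each running VM (weights dropped), plus
    # any reserved ('_'-prefixed) keys set on the node itself.

    def compiled_customer_node_keys(running_vm_compiled_keys, node_keys):
        node_list = []
        for vm_keys in running_vm_compiled_keys:          # one entry per running VM
            for name, (value, _weight) in vm_keys.items():
                node_list.append((name, value))           # duplicates are kept
        for name, value in node_keys.items():
            if name.startswith("_"):                      # reserved placement keys
                node_list.append((name, value))
        return node_list

    vms = [{"MYKEY": (2, 100)}, {"MYKEY": (5, 50), "TIER": (1, 10)}]
    node = {"_SSD": 1, "RACK": 7}
    print(compiled_customer_node_keys(vms, node))
    # [('MYKEY', 2), ('MYKEY', 5), ('TIER', 1), ('_SSD', 1)]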

The compiled System Node Placement Keys on the node do not reference compiled System Resource Placement Keys on the VMs in the same way as customer keys. Instead they consist of:

  • The Node Placement keys explicitly set on the node using Manage Node / Manage Cluster, other than Reserved Placement Keys (those starting with an underscore, as above); and
  • the Special Placement Keys for that node.

The Special Placement Keys consist of three internally generated keys (#RAM, #LOAD, and #CPU). Each of these is automatically set in real time to a value between 0 and 1 depending on, respectively, the percentage of RAM allocated (RAM allocated divided by the product of the physical RAM and the RAM contention ratio), the load factor, and the vCPU contention ratio of the node (the number of vCPUs allocated divided by the product of the number of physical CPUs and the CPU contention ratio).
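
The following Python sketch shows how the three values could be derived from the figures described above; the field names are assumptions, not the real node data model:

    # Illustrative calculation of the #RAM, #CPU and #LOAD special keys,
    # each normalised to the range 0..1. Field names are assumptions.

    def special_keys(node):
        ram = node["allocated_ram_gb"] / (node["physical_ram_gb"]
                                          * node["ram_contention_ratio"])
        cpu = node["allocated_vcpus"] / (node["physical_cpus"]
                                         * node["cpu_contention_ratio"])
        load = node["load_factor"]          # assumed already normalised to 0..1
        clamp = lambda x: max(0.0, min(1.0, x))
        return {"#RAM": clamp(ram), "#CPU": clamp(cpu), "#LOAD": clamp(load)}

    node = {"allocated_ram_gb": 80, "physical_ram_gb": 64, "ram_contention_ratio": 2,
            "allocated_vcpus": 48, "physical_cpus": 32, "cpu_contention_ratio": 4,
            "load_factor": 0.25}
    print(special_keys(node))   # {'#RAM': 0.625, '#CPU': 0.375, '#LOAD': 0.25}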

Step 4: System Key Processing

In this step the system attempts to find a set of nodes which match the policy requirements of the licensee.

A number of rounds of the affinity calculation are performed using the compiled System Resource Placement Keys and the System Node Placement Keys.

Each round works as follows: for each key on the node, a comparison is made between the compiled list from the VM to be placed and the compiled list on the node. An affinity proximity value is calculated as follows: if the difference between the values of the two keys is 1.0 or more, the proximity is zero; if the difference is less than 1.0, the proximity is 1.0 less the difference. So if the two values are identical, the proximity is 1.0, and as the values move apart the proximity falls towards zero, reaching zero where they differ by 1.0 or more. Which value is greater does not matter. The proximity is then multiplied by the weight attached to that Placement Key on the VM to be placed (the weight may be negative). These values are summed to give an affinity score. The higher the score, the more likely it is that the node will be used; the lower the score, the less likely.
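
To make the arithmetic concrete, the following Python sketch scores a single node against a VM; the data structures (a dict of key name to (value, weight) for the VM, and a list of (name, value) pairs for the node) are illustrative assumptions rather than the real FCO representation:

    # Sketch of the affinity score for one node: proximity falls linearly
    # from 1.0 (equal values) to 0.0 (values differing by 1.0 or more), and
    # each proximity is multiplied by the weight the VM attaches to the key.

    def affinity_score(vm_keys, node_keys):
        """vm_keys: {name: (value, weight)}; node_keys: list of (name, value)."""
        score = 0.0
        for name, node_value in node_keys:
            if name not in vm_keys:
                continue
            vm_value, weight = vm_keys[name]
            proximity = max(0.0, 1.0 - abs(vm_value - node_value))
            score += weight * proximity        # weight may be negative
        return score

    vm = {"#RAM": (0.0, 100), "ZONE": (2, 50)}
    node = [("#RAM", 0.4), ("ZONE", 2)]
    print(affinity_score(vm, node))   # 100*0.6 + 50*1.0 = 110.0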

Up to 10 rounds are performed; this can be controlled by the local.cfg variable MSAV_STEPS (for information about how to edit the configuration files, see Configuration Customisations). On each round, the affinity scores are compared with a threshold. That threshold starts at 80 (controlled by the variable INITIAL_MSAV) and is reduced evenly across the steps until it reaches -10 (controlled by the variable FINAL_MSAV). At the end of each round, if one or more candidate nodes are found whose score exceeds the threshold, the algorithm proceeds to the next stage. Otherwise, the threshold is reduced and the round is repeated. If no nodes are found when the threshold reaches FINAL_MSAV, the attempt to allocate a node has failed.
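
The stepped threshold search could then look roughly like the sketch below, which reuses the affinity_score function from the previous example; the defaults match the values quoted above, and the node and VM structures remain illustrative assumptions:

    # Sketch of the stepped-threshold selection, reusing affinity_score above.
    # Default values taken from the text; node/VM structures are assumptions.

    MSAV_STEPS, INITIAL_MSAV, FINAL_MSAV = 10, 80.0, -10.0

    def candidate_nodes(vm_system_keys, nodes):
        """nodes: {node_id: compiled system node key list}."""
        scores = {node_id: affinity_score(vm_system_keys, keys)
                  for node_id, keys in nodes.items()}
        step = (INITIAL_MSAV - FINAL_MSAV) / (MSAV_STEPS - 1)
        for round_no in range(MSAV_STEPS):
            threshold = INITIAL_MSAV - round_no * step
            passing = [n for n, s in scores.items() if s > threshold]
            if passing:
                return passing                 # roughly equally attractive nodes
        return []                              # allocation fails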

At this point the algorithm has found a set of one or more nodes which, from the licensee’s point of view, have roughly the same attractiveness. The virtue of the stepping system above is that it allows two nodes which are roughly similar from the system administrator’s point of view to be treated as identical, so that the allocation decision between them is based on the customer keys.

Step 5: Customer Key Processing 

Finally, analysis of the customer keys is performed. This uses a single scoring round, calculated as above for the system keys, but this time comparing the compiled customer resource placement keys with the compiled customer node placement keys. Note that the compiled customer node placement keys contain the compiled customer placement keys of each VM running on the node. The node with the highest score is chosen, and in the event of a tie a node is chosen at random.
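
Again purely as an illustration (reusing the affinity_score sketch above, with hypothetical structures), the final selection amounts to:

    # Sketch of the customer key round: a single scoring pass over the nodes
    # that survived step 4, with ties broken at random.
    import random

    def pick_node(vm_customer_keys, candidate_customer_keys):
        """candidate_customer_keys: {node_id: compiled customer node key list}."""
        scores = {node_id: affinity_score(vm_customer_keys, keys)
                  for node_id, keys in candidate_customer_keys.items()}
        best = max(scores.values())
        return random.choice([n for n, s in scores.items() if s == best])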

Step 6: Sticky Key Processing 

When a server is started, any Sticky Placement Keys set on the node are copied to the VM.

 
