
Flexiant Quorum Protocol (FQP) allows more accurate determination of the state of connections between nodes or cluster control servers and the rest of the cluster. This helps to ensure that live recovery is performed for a virtual machine when appropriate.

FQP is not used in VMware clusters.

For more information about FQP, see the sections below:

Key concepts in FQP

The following concepts are key to understanding how FQP works:

  • A peer is a participant in FQP. This is either a compute node or a cluster controller.

    In order for FQP to work, peers must:

    • Run NTP.
    • Be 3 or more in number.
    • Have inbound and outbound TCP/IP connectivity between them.
  • Peers have a state, which determines whether or not live recovery of VMs should be initiated for a particular node. For a list of states and how they are determined, see States.

  • The generation is a monotonic (always increasing) counter which the FQP peers themselves never increment. 

    • An external FQP speaker may provide a new generation counter in order to supply new parameters or a new peer list, in which case peer entries relating to older generations will be discarded.
    • If an FQP command is received without a generation counter, it is treated as referring to the current generation.
    • If an FQP peer is mentioned without a generation counter in the peer list, it is treated as referring to the generation of the packet.
  • The FQP process has a size, which is the number of peers it contains.

  • The order of the FQP process is log2(size), rounded up to the nearest whole number. This is used to calculate the interval between a node losing contact with the rest of the cluster and the initiation of VM shutdown and live recovery.
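As a quick illustration of the size and order definitions above, the following Python sketch computes the order for a given number of peers. The function name `fqp_order` is hypothetical and is not part of FQP itself.

```python
def fqp_order(size: int) -> int:
    """Return log2(size), rounded up to the nearest whole number."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # For positive integers, (size - 1).bit_length() equals ceil(log2(size))
    # and avoids any floating-point rounding.
    return (size - 1).bit_length()


for size in (3, 4, 5, 16):
    print(size, fqp_order(size))
```

With 3 peers this gives an order of 2, which agrees with the order value shown in the three-peer example output later on this page.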

FQP provides three abilities:

  • A quorum feature, where a peer can determine whether or not it is part of a quorum, i.e. a group of more than half the peers which are in contact (directly or indirectly).

    The need for a quorum means that in order for FQP to work, a minimum of 3 peers are required. If fewer than 3 peers exist, the FQP algorithm is disabled.

  • An availability feature, which determines whether another peer is available, based on the last time it has been heard of by the quorum.
  • A state distribution feature, which obviates the need for each node to talk to every other node. It works over L3 routing protocols without multicast, and can be shown statistically to work in instances of severe network disruption and node outage. This means that communication between peers and the rest of the cluster grows logarithmically with the number of peers, rather than quadratically as it would if every peer contacted every other peer.

Essentially, a node shuts down its VMs if it is out of the quorum, meaning it has not heard from a majority of the peers for a given time. That node then becomes unavailable as far as the cluster controller is concerned. At some later time it is ready for live recovery. 

States

FQP gives peers one of four states to determine whether or not it is safe to initiate live recovery of VMs on a particular node. These are as follows:

  1. U - Unknown: we cannot tell the state of this peer / process according to available information.
  2. R - Running: the peer / process is running.
  3. S - Shutdown: the peer / process should have been instructed to shut down its servers.
  4. L - Live recover: it is safe to live recover VMs running on the peer / someone else may have assumed it is safe to live recover VMs running on the node on which this process runs.

The states for the peers are calculated as follows:

  1. If the peer has not yet been heard from, or FQP is disabled, the peer state is unknown (U); else
  2. If the shutdown time plus the number of intervals for live recovery have passed, the peer state is live recovery (L); else
  3. If the number of intervals for shutdown has passed, the peer state is shutdown (S); else
  4. The peer state is running (R).
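The ordering of these checks can be sketched in Python as follows. This is only an illustration: this page does not give the exact arithmetic that turns the FQP parameters into time thresholds, so the sketch takes both thresholds as precomputed inputs. The names `peer_state`, `shutdown_after` and `live_recover_after` are hypothetical, and treating "have passed" as "elapsed time is at least the threshold" is an assumption.

```python
def peer_state(enabled, heard_from, elapsed, shutdown_after, live_recover_after):
    """Return 'U', 'L', 'S' or 'R' for one peer.

    elapsed            -- seconds since the peer was last heard from (assumed unit)
    shutdown_after     -- hypothetical threshold for the shutdown state
    live_recover_after -- hypothetical threshold for the live recovery state
    """
    # Rule 1: nothing can be said if FQP is off or the peer was never heard from.
    if not enabled or not heard_from:
        return "U"
    # Rule 2: the longer threshold is checked first.
    if elapsed >= live_recover_after:
        return "L"
    # Rule 3: the shutdown threshold.
    if elapsed >= shutdown_after:
        return "S"
    # Rule 4: otherwise the peer is running.
    return "R"


print(peer_state(True, True, 12.0, shutdown_after=50.0, live_recover_after=130.0))
```

Checking the live recovery threshold before the shutdown threshold mirrors the order of the rules above, so a peer that has been silent long enough for both is reported as L rather than S.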

The FQP process state is calculated as follows:

  1. First, a quorum is calculated: (number of peers + 1) divided by two, rounded up, i.e. the lowest integer strictly greater than half the number of peers. For example, 3 peers give a quorum of 2, and 4 peers give a quorum of 3.
  2. If the number of peers is smaller than 3, then the state is unknown (U); else
  3. If the number of peers not in L state is less than the quorum, the process state becomes L; else
  4. If the number of peers not in L or S state is less than the quorum, the process state becomes S; else
  5. If the number of peers not in L, S or U state is less than the quorum, the process state becomes U; else
  6. The process state is R.
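The steps above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the list of peer states includes the local peer, as the three-peer example output on this page suggests.

```python
def quorum(num_peers: int) -> int:
    """Lowest integer strictly greater than half the number of peers."""
    return num_peers // 2 + 1


def process_state(peer_states):
    """Derive the FQP process state from a list of peer states (U, R, S, L)."""
    n = len(peer_states)
    if n < 3:
        return "U"  # FQP needs at least 3 peers
    q = quorum(n)
    not_l = sum(1 for s in peer_states if s != "L")
    if not_l < q:
        return "L"
    not_l_or_s = sum(1 for s in peer_states if s not in ("L", "S"))
    if not_l_or_s < q:
        return "S"
    # With only four states, "not in L, S or U" means "in R".
    running = sum(1 for s in peer_states if s == "R")
    if running < q:
        return "U"
    return "R"


print(process_state(["R", "R", "R"]))  # all peers running
print(process_state(["R", "S", "L"]))  # only one peer is neither L nor S
```

The integer expression `num_peers // 2 + 1` yields the same value as (number of peers + 1) / 2 rounded up, for both odd and even peer counts.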

Hence the process state is R if the peer is 'connected' to the cluster. If the process state is S or L, running VMs should be shut down. The cluster controller should initiate live recovery on a node if the cluster controller's process state is R but the node's peer state is L.
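The two decisions described in this paragraph can be written as a pair of small predicates. This is a minimal sketch with hypothetical function names, not the product's actual implementation.

```python
def should_shut_down_vms(process_state: str) -> bool:
    """A node shuts down its running VMs when its own process state is S or L."""
    return process_state in ("S", "L")


def should_live_recover(controller_process_state: str, node_peer_state: str) -> bool:
    """The cluster controller recovers a node's VMs only when the controller
    itself is in R state and it sees the node's peer state as L."""
    return controller_process_state == "R" and node_peer_state == "L"


print(should_shut_down_vms("S"))
print(should_live_recover("R", "L"))
print(should_live_recover("U", "L"))
```

Requiring the controller's own process state to be R means a controller that has itself dropped out of the quorum does not start live recovery on the basis of its own isolated view.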

Reading the state of an FQP cluster

To read the state of an FQP cluster:

  1. SSH to the management server.
     
  2. Type the following command:

    telnet 127.0.0.1 754
  3. Once you are connected to 127.0.0.1, type fqp. This will return an XML document similar to the following: 

    In the example document below, asterisks have been used to mark important lines.

    <?xml version="1.0"?>
    <fqp>
      <parameters>
        <generation>8</generation>
        <enabled>1</enabled>
        <heartbeatinterval>10</heartbeatinterval>
        <numpeerstopoll>2</numpeerstopoll>
        <numintervalsforshutdown>5</numintervalsforshutdown>
        <numintervalsforlr>10</numintervalsforlr>
        <shutdowntime>30</shutdowntime>
        <maxtimedelta>8</maxtimedelta>
      </parameters>
      <fqpperf>
        <mydest>10.157.128.1</mydest>
        <state U="0" R="3" S="0" L="0">R</state> *
        <order>2</order>
        <inboundtimedelta>0.003</inboundtimedelta>
        <outboundtimedelta>0.003</outboundtimedelta>
        <inboundrate>0.186</inboundrate>
        <outboundrate>0.205</outboundrate>
        <inboundhealth>93</inboundhealth>
        <outboundhealth>102</outboundhealth>
      </fqpperf>
      <time>1416565558192730</time>
      <peer d="10.157.128.1" t="1416565558192730" g="8" s="R"/> *
      <peer d="10.157.128.51" t="1416565554463633" g="8" s="R" x="1416565561661803"/> *
      <peer d="10.157.128.52" t="1416565556855745" g="8" s="R" x="1416565566631206"/> *
    </fqp>
  4. The cluster state is here:

    <state U="0" R="3" S="0" L="0">R</state>

    This line says that there are 0 speakers in U state, 3 in R state, 0 in S state and 0 in L state (U="0" R="3" S="0" L="0"). This means that the cluster itself is in R state (>R</state>).

    The main thing to check for here is that the cluster itself is in R state, which it should always be (after 30 seconds or so) if you have 2 or more nodes running.

     

  5. Each speaker (CCM and each node) has a line like one of these: 

    <peer d="10.157.128.1" t="1416565558192730" g="8" s="R"/>
    <peer d="10.157.128.51" t="1416565554463633" g="8" s="R" x="1416565561661803"/>
    <peer d="10.157.128.52" t="1416565556855745" g="8" s="R" x="1416565566631206"/>

    The d attribute is the IP address of the peer. The s attribute is the peer's state. The x attribute is present only if this is a peer that the node in question is talking to directly.
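If you need to check this output in a script rather than by eye, the document can be parsed with a standard XML library. The Python sketch below works on a trimmed copy of the example above, with the asterisk markers removed because they are annotations on this page rather than part of the XML. The function name `summarise` and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example output, without the asterisk markers.
SAMPLE = """<?xml version="1.0"?>
<fqp>
  <fqpperf>
    <mydest>10.157.128.1</mydest>
    <state U="0" R="3" S="0" L="0">R</state>
  </fqpperf>
  <peer d="10.157.128.1" t="1416565558192730" g="8" s="R"/>
  <peer d="10.157.128.51" t="1416565554463633" g="8" s="R" x="1416565561661803"/>
  <peer d="10.157.128.52" t="1416565556855745" g="8" s="R" x="1416565566631206"/>
</fqp>
"""


def summarise(xml_text):
    """Return (cluster state, per-state counts, list of peers)."""
    root = ET.fromstring(xml_text)
    state = root.find("fqpperf/state")
    counts = {name: int(value) for name, value in state.attrib.items()}
    peers = [
        {
            "address": peer.get("d"),
            "state": peer.get("s"),
            # x is only present for peers this speaker talks to directly
            "direct": peer.get("x") is not None,
        }
        for peer in root.findall("peer")
    ]
    return state.text, counts, peers


cluster_state, counts, peers = summarise(SAMPLE)
print("cluster state:", cluster_state)
print("counts:", counts)
for peer in peers:
    print(peer["address"], peer["state"], "direct" if peer["direct"] else "indirect")
```

A monitoring check built on this would typically alert when the cluster state is anything other than R.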

 
