A common cause of frustration is an inability to get compute nodes up and running. Before you read this tutorial, we suggest you read Network interfaces for nodes fully.
This section applies to router nodes on all hypervisors, and to compute nodes on KVM and Xen4 only. It does not apply to Virtuozzo (formerly PCS), VMware or Hyper-V compute nodes.
Assumptions
Throughout this section, we assume a node with the following network interfaces:
Interface | MAC address | Function |
---|---|---|
eth0 | 52:54:00:12:34:56 | Node network and storage network |
eth1 | 52:54:00:12:34:57 | PVIP network (upstream IP) |
eth2 | 52:54:00:12:34:58 | Unused |
eth3 | 52:54:00:12:34:59 | Unused |
We will try to make it boot using the following IP addresses:
IP Address | Ethernet Interface | Function |
---|---|---|
10.157.128.30/20 | eth0 | Node network |
10.157.208.30/20 | eth0 | Storage network |
10.157.192.30/20 | eth1 | PVIP network (upstream IP) |
You will note that `eth0` has two IP addresses on it. In the default `nodetemplate.xml` file, two functions share the same ethernet interface. This is perfectly acceptable.
We recommend you start numbering your nodes with the last octet of the IP address as 20 or more. We are using 30 here.
The IP address on the node network is the address the node uses to boot. Consequently, the interface attached to the node network must be configured to network boot in the node's BIOS.
Note the relationship between the interface name and the function. This is set out in `nodetemplate.xml`. You can change this (see Nodeconfig), but we recommend you try to get your first compute nodes up and running with the standard configuration.
Basic diagnostics
Checking the node obtains an IP address on boot
The first step in the node boot process is that the node obtains an IP address from the node DHCP server. For this to work correctly, the following conditions must be satisfied:
- The BIOS of the node is configured to boot over the network (variously called 'PXE boot', 'tftp boot' or 'network boot') using the correct network interface. That should always be the interface with the node function, so `eth0` in our table above.
- The network interface concerned is connected to a switch, which in turn is connected to the correct network interface of the system running the cluster controller (in a single machine install, that is the node network interface, normally `eth1`).
- The MAC address of the node was entered correctly when the node was created, and the node has been given an IP address (here `10.157.128.30`).
The following screenshot shows a node attempting to boot but failing (we pulled the ethernet cable out so we could capture the screen). The exact screen layout will depend on your BIOS.
Note the MAC address circled in red. Compare this with your list of ethernet addresses in the table above, and you will see that it is (correctly) using interface `eth0` to try to boot.
Common problems
If the node does not boot, this can be a result of:
- Configuring the machine's BIOS to boot from the wrong interface (be aware that BIOS numbering may not match the numbering of interfaces once the node is booted).
- Transcribing the MAC address of the boot interface incorrectly.
- Failing to add the node to Flexiant Cloud Orchestrator.
- Connecting the wrong network or no network at all to the ethernet interface concerned.
In the above screenshot, we showed what happened if the network cable was not connected at all. Here are some other failure modes:
Wrong network interface configured to boot
Here you can see that the node is trying to boot using MAC address `52:54:00:12:34:57`, not MAC address `52:54:00:12:34:56` as desired in the configuration chart. This might be because:
- The MAC address of interface `eth0` is in fact `52:54:00:12:34:57` (in which case the remedy is to use the right MAC address), or
- The wrong interface is configured for network boot in the BIOS (in which case the remedy is to reconfigure your BIOS).
It is important to identify which of these is the problem. If the wrong interface is configured for network boot in the BIOS, but you take the approach of changing the MAC address in use and swapping the network cables, you will be booting from your `eth1` interface and not your `eth0` interface. This will cause problems later in the boot process if you leave `nodetemplate.xml` stating that `eth0` has the node network, as the node will not configure correctly. You can identify this failure by checking that the network interfaces listed on the node monitor (see below) have the MAC addresses you think they should have. If the MAC address of `eth0` (or another interface if you have changed `nodetemplate.xml`) does not match the MAC address you used to boot, that is an indication that you are booting using the wrong interface.
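When comparing the MAC address shown on the node monitor with the one you recorded, differences of case and separator style (colons vs dashes vs none) are a common source of false mismatches. As a rough sketch of this comparison — `normalize_mac` and `macs_match` are illustrative helpers of ours, not FCO tools:

```shell
# Normalize a MAC address to lowercase with no separators, then compare.
# Illustrative helpers, not part of FCO.
normalize_mac() {
  printf '%s' "$1" | tr -d ':-' | tr 'A-F' 'a-f'
}

macs_match() {
  [ "$(normalize_mac "$1")" = "$(normalize_mac "$2")" ]
}

if macs_match "52:54:00:12:34:56" "52-54-00-12-34-56"; then
  echo "same interface"
fi
```

This also copes with the separator-free form (`525400123456`) that appears in some of the log excerpts later in this article.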
Checking the Cluster Manager's configuration
Once you have checked that Flexiant Cloud Orchestrator has the correct settings for the node, you can test whether the cluster manager has picked up the new node by logging into it at the command prompt and using the `node-tool` utility:
```
root@test:~# node-tool 10.157.128.30
52:54:00:12:34:56
```
It returns a matching MAC address, indicating that Flexiant Cloud Orchestrator thinks the node has been configured correctly.
Checking Connectivity
It is difficult to check connectivity to any machine until it has an IP address. One way is to inspect the CAM table (forwarding table) on your switch. Another is to see if the cluster management system is receiving DHCP packets and if so what it is sending back.
The following shows a successful request for a lease by a client.
```
root@test:~# fgrep dhcpd /var/log/syslog
Oct 21 13:01:27 test dhcpd: DHCPDISCOVER from 52:54:00:12:34:56 via eth1
Oct 21 13:01:27 test dhcpd: DHCPOFFER on 10.157.128.30 to 52:54:00:12:34:56 via eth1
Oct 21 13:01:29 test dhcpd: DHCPREQUEST for 10.157.128.30 (10.157.128.1) from 52:54:00:12:34:56 via eth1
Oct 21 13:01:29 test dhcpd: DHCPACK on 10.157.128.30 to 52:54:00:12:34:56 via eth1
```
The following shows an unsuccessful request for a lease (due to the MAC address of the interface requesting the lease not matching the node MAC address specified when the node was added):
```
root@test:~# fgrep dhcpd /var/log/syslog
Oct 21 13:35:36 test dhcpd: DHCPDISCOVER from 52:54:00:12:34:57 via eth1: network 0.0.0.0/0: no free leases
```
`no free leases` is dhcpd's somewhat cryptic way of telling you that no IP address has been associated with that MAC address.
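If several nodes are in this state, you can pull the refused MAC addresses out of the log in one pass. This is an illustrative sketch (the `unleased_macs` helper is ours), assuming the syslog format shown above:

```shell
# List the unique MAC addresses that dhcpd answered "no free leases",
# reading syslog-format lines on stdin. Illustrative helper, not an FCO tool.
unleased_macs() {
  grep 'no free leases' | grep -oE '([0-9a-f]{2}:){5}[0-9a-f]{2}' | sort -u
}

unleased_macs <<'EOF'
Oct 21 13:35:36 test dhcpd: DHCPDISCOVER from 52:54:00:12:34:57 via eth1: network 0.0.0.0/0: no free leases
EOF
```

On the cluster manager you would feed it the real log, e.g. `fgrep dhcpd /var/log/syslog | unleased_macs`, then check each reported MAC against the nodes you created.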
If there is no connectivity between the node and the cluster controller, you will not see any lines in the log referencing the node's MAC address.
The early boot process
If an IP address is obtained, then very quickly (probably too quickly for you to see) it will load via `tftp` a boot loader called `gpxelinux.0`, which should recognise the IP address and start loading the rest of the system over HTTP.
You should be able to follow the remainder of the early boot process in the log file of the cluster manager by looking at the file `/var/log/apache2/nodeconfig-access.log`. Here's what that file looked like when the boot reached the splash screen below:
```
root@test:~# tail -13 /var/log/apache2/nodeconfig-access.log
10.157.128.30 - - [21/Oct/2012:12:19:19 +0100] "GET /boot/KVM/pxelinux.cfg/00000000-0000-0000-0000-000000000000 HTTP/1.0" 404 443
10.157.128.30 - - [21/Oct/2012:12:19:19 +0100] "GET /boot/KVM/pxelinux.cfg/01-52-54-00-12-34-56 HTTP/1.0" 404 427
10.157.128.30 - - [21/Oct/2012:12:19:19 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D801E HTTP/1.0" 404 415
10.157.128.30 - - [21/Oct/2012:12:19:19 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D801 HTTP/1.0" 404 414
10.157.128.30 - - [21/Oct/2012:12:19:19 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D80 HTTP/1.0" 404 413
10.157.128.30 - - [21/Oct/2012:12:19:19 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D8 HTTP/1.0" 404 412
10.157.128.30 - - [21/Oct/2012:12:19:19 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D HTTP/1.0" 404 411
10.157.128.30 - - [21/Oct/2012:12:19:20 +0100] "GET /boot/KVM/pxelinux.cfg/0A9 HTTP/1.0" 404 410
10.157.128.30 - - [21/Oct/2012:12:19:20 +0100] "GET /boot/KVM/pxelinux.cfg/0A HTTP/1.0" 404 409
10.157.128.30 - - [21/Oct/2012:12:19:20 +0100] "GET /boot/KVM/pxelinux.cfg/0 HTTP/1.0" 404 408
10.157.128.30 - - [21/Oct/2012:12:19:21 +0100] "GET /boot/KVM/pxelinux.cfg/default HTTP/1.0" 200 439
10.157.128.30 - - [21/Oct/2012:12:19:21 +0100] "GET /boot/KVM/boot.msg HTTP/1.0" 200 480
10.157.128.30 - - [21/Oct/2012:12:19:22 +0100] "GET /boot/KVM/splash.lss16 HTTP/1.0" 200 6841
```
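The descending run of 404s is normal: pxelinux first requests a configuration file named after the node's IP address in uppercase hex (10.157.128.30 becomes 0A9D801E), dropping one character at a time until it falls back to `default`. The naming can be sketched as follows — `ip_to_pxe_hex` is an illustrative helper of ours:

```shell
# Convert a dotted-quad IP address to the uppercase hex filename that
# pxelinux requests first. Illustrative helper, not an FCO tool.
ip_to_pxe_hex() {
  local IFS=.
  set -- $1
  printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_pxe_hex 10.157.128.30   # prints 0A9D801E, matching the log above
```

Seeing the hex name for the wrong IP address in this log is another way to spot that a node has picked up an unexpected lease.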
Splash screen and operating system load
Soon after this point, you should see a splash screen load like this:
The `boot:` prompt allows our engineers to specify various debugging options. Wait ten seconds (or press 'enter' if you are impatient) and you should see the node image start to load:
Loading the node image may take a couple of minutes. You will soon be presented with the node monitor (see the next section).
It is unlikely that anything will go wrong during this stage. If it does, it is likely to indicate a hardware incompatibility between the node you are using and our software.
Understanding the node monitor
After a minute or two, the node monitor should display. You can find more information on the node monitor here.
Initially it will note the fact that configuration has not been completed:
And after ten seconds or so, it should show successful node networking configuration:
There are several things to note in the above display:
- First, note that the two ethernet interfaces which are used (`eth0` and `eth1`) are marked as `UP` in the display, and the names match the MAC addresses associated with them (under `link/ether` above).
- All the correct IP addresses are applied to the ethernet interfaces as per the table above, except for the PVIP function.
- The interface with the PVIP function (`eth1` here) deliberately does not have an IP address on the native ethernet interface. However, it is marked as a member of `pvip-bridge`, which indicates this interface is correctly configured too.
- At the bottom of the display the text `ERROR: remote default route` indicates that the router built into the node has not yet obtained a default route from its upstream router; this is normal, and may take a couple of minutes to resolve.
At this point the `nodeconfig-access.log` file looks like this:
```
root@test:~# tail -24 /var/log/apache2/nodeconfig-access.log
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/00000000-0000-0000-0000-000000000000 HTTP/1.0" 404 443
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/01-52-54-00-12-34-56 HTTP/1.0" 404 427
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D801E HTTP/1.0" 404 415
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D801 HTTP/1.0" 404 414
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D80 HTTP/1.0" 404 413
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D8 HTTP/1.0" 404 412
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0A9D HTTP/1.0" 404 411
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0A9 HTTP/1.0" 404 410
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0A HTTP/1.0" 404 409
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/0 HTTP/1.0" 404 408
10.157.128.30 - - [21/Oct/2012:13:11:25 +0100] "GET /boot/KVM/pxelinux.cfg/default HTTP/1.0" 200 439
10.157.128.30 - - [21/Oct/2012:13:11:26 +0100] "GET /boot/KVM/boot.msg HTTP/1.0" 200 480
10.157.128.30 - - [21/Oct/2012:13:11:26 +0100] "GET /boot/KVM/splash.lss16 HTTP/1.0" 200 6841
10.157.128.30 - - [21/Oct/2012:13:11:10 +0100] "GET /boot/KVM/images/xvp/vmlinuz-3.2.0-32-generic HTTP/1.0" 200 133950
10.157.128.30 - - [21/Oct/2012:13:11:43 +0100] "GET /boot/KVM/images/xvp/vmlinuz-3.2.0-32-generic HTTP/1.0" 200 4967015
10.157.128.30 - - [21/Oct/2012:13:11:58 +0100] "GET /boot/KVM/images/xvp/extility-node-2.1-19.img HTTP/1.0" 200 169487716
10.157.128.30 - - [21/Oct/2012:13:14:36 +0100] "GET /xvp/configs/525400123456 HTTP/1.1" 200 3899
10.157.128.30 - - [21/Oct/2012:13:14:44 +0100] "GET /xvp/configs/52:54:00:12:34:56?action=payload HTTP/1.1" 200 840132
10.157.128.30 - - [21/Oct/2012:13:14:46 +0100] "GET /xvp/configs/525400123456?action=evr.conf HTTP/1.1" 200 1310
10.157.128.30 - - [21/Oct/2012:13:14:46 +0100] "GET /xvp/configs/525400123456?action=bird.conf HTTP/1.1" 200 1787
10.157.128.30 - - [21/Oct/2012:13:14:46 +0100] "GET /xvp/configs/525400123456?action=bird6.conf HTTP/1.1" 200 1833
10.157.128.30 - - [21/Oct/2012:13:14:47 +0100] "GET /xvp/configs/52:54:00:12:34:56?action=bird.conf HTTP/1.1" 200 1787
10.157.128.30 - - [21/Oct/2012:13:14:48 +0100] "GET /xvp/configs/52:54:00:12:34:56?action=bird6.conf HTTP/1.1" 200 1833
10.157.128.30 - - [21/Oct/2012:13:14:47 +0100] "GET /xvp/configs/525400123456 HTTP/1.1" 200 3917
```
The following events are significant:
- At `[21/Oct/2012:13:11:26 +0100]` it loaded its boot message and splash screen.
- At `[21/Oct/2012:13:11:58 +0100]` it loaded the node image.
- At `[21/Oct/2012:13:14:36 +0100]` it loaded the payload (see Node Payload System).
- After that, it loaded various configuration files.
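To find the same milestones on your own system, you can filter the access log for the interesting fetches. A rough sketch — the `boot_milestones` helper is ours, and the patterns assume the apache log format shown above:

```shell
# Print timestamp and path for the splash/kernel/image/payload fetches,
# reading apache access-log lines on stdin. Illustrative helper.
boot_milestones() {
  grep -E 'boot\.msg|\.img |action=payload' |
    sed -E 's/^[^[]*\[([^]]*)\].*"GET ([^ ]*).*/\1  \2/'
}

boot_milestones <<'EOF'
10.157.128.30 - - [21/Oct/2012:13:11:26 +0100] "GET /boot/KVM/boot.msg HTTP/1.0" 200 480
10.157.128.30 - - [21/Oct/2012:13:11:58 +0100] "GET /boot/KVM/images/xvp/extility-node-2.1-19.img HTTP/1.0" 200 169487716
EOF
```

On the cluster manager you would feed it the real file: `boot_milestones < /var/log/apache2/nodeconfig-access.log`.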
Troubleshooting
The most common cause of problems at this stage is that FCO's interface naming does not match the user's expectations. See Network interfaces for nodes for more details.
This problem typically exhibits itself where a node has multiple NICs, for instance a 2-port onboard NIC and a 4-port PCI expansion card. The BIOS might consider the 2-port onboard card's interfaces 0 and 1, and the 4-port expansion card's interfaces 2, 3, 4 and 5. However, when the machine is booted, FCO considers the 4-port expansion card `eth0`, `eth1`, `eth2` and `eth3`, and the 2-port onboard NIC `eth4` and `eth5`. Typically the issue here is that the system administrator has corrected the wrong problem. The story typically goes like this:
- The administrator connects the node network (correctly) to the first port of the 4-port NIC, thinking this will be `eth0`.
- He goes to the BIOS and enables network booting from the first ethernet interface, and writes down the MAC address. This is in fact the MAC of the first onboard NIC.
- He uses that address when creating the node in Flexiant Cloud Orchestrator.
- He notes the boot process fails, and rather than correcting which ethernet interface he is using to boot and the MAC address used, he moves the ethernet cable to the onboard NIC.
- The machine begins to boot, but the node monitor shows the network does not fully configure itself.
- It shows that interface `eth0` has the MAC address of the first port of the 4-port NIC, which is different from the MAC address specified when the node was created in Flexiant Cloud Orchestrator.
The key to detecting problems here is to check that the interface with the node function as set out in `nodetemplate.xml` (which will be `eth0` unless you have changed it) has the MAC address that you used to configure the node. If this is not the case, then you are booting from an interface which is not destined to be the node interface. If the boot process has succeeded thus far, this means one of the following is true:
- more than one of your interfaces can reach the node control interface (probably bad, as it indicates your LANs are not segregated);
- (more likely) you are booting from the wrong interface and have the node network connected to that wrong interface; or
- (if you really want to boot from the interface you are currently booting from) you need to adjust `nodetemplate.xml`, as your boot interface is not what FCO calls `eth0`; or
- you need to disable your onboard NIC so interface naming is consistent.
Checking you have access to the node
At this point you should be able to gain direct access to the node using the following command from the command prompt of the cluster manager:
```
root@test:~# sshe 10.157.128.30
root@node-10-157-128-30:~#
```
You will then be logged into the node. If for any reason you cannot gain access this way, you should be able to gain access using the console. Press `CTRL-ALT-F1` to get to the login screen, and enter `root` as the user name. You can find the password as follows:
```
root@test:~# fgrep NODE_CONSOLE_PASSWORD /etc/extility/config/vars
export NODE_CONSOLE_PASSWORD="ZUurakUkZFeJsPK3"
```
Connectivity to the upstream router
If connectivity to the upstream router is correctly established, the node monitor should display a screen like this:
Note the text that says `OK: remote default route`, indicating everything is OK.
If you see the text `ERROR: remote default route`, this indicates that the router built into the node has not yet obtained a default route from its upstream router; this is normal, and may take a couple of minutes to resolve, particularly if you are using the OSPF routing protocol.
If you are using the STATIC routing protocol (the default) then Flexiant Cloud Orchestrator cannot reliably detect absence of upstream connectivity to the router. In other words, Flexiant Cloud Orchestrator may report that connectivity is present when it is not. If you are having problems with connectivity to VMs, follow carefully the sections below marked "Check the node's router is correctly configured" and "Checking connectivity to the upstream router" even if no error is reported.
To check network health when you are connected to the node using `sshe`, use the following command:
```
root@node-10-157-128-30:~# cat /var/run/extility/evr/nhc-remote
OK: remote default route
```
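Since the route can take a couple of minutes to appear, it can be convenient to poll that file rather than re-running `cat` by hand. A hedged sketch — `wait_for_route` is an illustrative helper of ours; the path and the `OK:` prefix are taken from the example above:

```shell
# Poll a node health-check file until it reports OK, or give up after
# a number of one-second tries. Illustrative helper, not an FCO tool.
wait_for_route() {
  file=${1:-/var/run/extility/evr/nhc-remote}
  tries=${2:-120}
  while [ "$tries" -gt 0 ]; do
    if grep -q '^OK:' "$file" 2>/dev/null; then
      echo "OK"
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "still no default route"
  return 1
}
```

On the node you would simply run `wait_for_route` and let it wait out the couple of minutes the route may take to arrive.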
In the event that you have an error, follow the procedure below:
Check the node's router is correctly configured
First log into the node:
```
root@test:~# sshe 10.157.128.30
root@node-10-157-128-30:~#
```
Next use the `evrs` command to log into the virtual router within the node:
```
root@node-10-157-128-30:~# evrs
```
You should see a display like this:
Note the text starting `evr` at the bottom. This indicates that you are connected to the virtual router. You can type `exit` or press `CTRL-D` to exit the virtual router shell at any time.
Next check that the virtual router's upstream IP address is correct by using the command:
```
root@node-10-157-128-30:/# ip addr show evrr-000000
```
You should see an output like this:
Note the IP address is correct as per the table set out at the start of this article. You can also check the IPv6 address is correct.
Checking connectivity to the upstream router
Your upstream router, which provides connectivity from the VMs on the node to the internet, should be configured as the `.1` IP address of the PVIP network (in this case `10.157.192.0/20`, so the router would be `10.157.192.1`). A secondary router can be configured at the `.2` IP address.
You can use different IP addressing schemes by editing the router configuration files - see Managing Routing Protocols.
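Under the default convention, the expected router addresses follow mechanically from the PVIP network. A minimal sketch, assuming an `a.b.c.0/nn`-style network (`pvip_routers` is our illustrative helper; other addressing schemes need the configuration edits mentioned above):

```shell
# Derive the conventional primary (.1) and secondary (.2) router
# addresses from a PVIP network written as a.b.c.0/nn.
# Illustrative helper; assumes the default addressing convention.
pvip_routers() {
  net=${1%/*}       # drop the /prefix-length
  base=${net%.*}    # drop the final octet
  echo "primary=${base}.1 secondary=${base}.2"
}

pvip_routers 10.157.192.0/20
```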
To provide connectivity, you will require IP connectivity to your router, and you must establish a routing protocol session between the virtual router and your upstream router.
To check you have connectivity to your upstream router, ping it from the virtual router window by typing the following (changing the network if appropriate):
```
root@node-10-157-128-30:/# ping 10.157.192.1
```
If you do not see reply packets, it is likely you do not have connectivity between the node and your router. Remedy this by examining your switch and router configuration.
The above ping command will only work within the `evrs` shell, connected to the virtual router. You will not see connectivity to the internet from the node outside the `evrs` shell. This is a deliberate security precaution.
Checking you have a default route
The virtual router learns its default route(s) from your upstream router. Checking that you have a default route is thus a good way of establishing that a routing protocol session has been established. From the `evrs` connection to the virtual router, type:
```
root@node-10-157-128-30:/# ip route show match 0/0
```
The above command will only work within the `evrs` shell, connected to the virtual router. You will not see connectivity to the internet from the node outside the `evrs` shell. This is a deliberate security precaution.
You should see output like this:
```
root@node-10-157-128-30:/# ip route show match 0/0
default via 10.157.128.1 dev evrr-000000 proto bird
default dev lo scope link metric 15
```
Note the text `default via 10.157.128.1 dev evrr-000000`, which indicates a default route via the virtual router's upstream interface to your router at `10.157.128.1`.
If you see output like this:
```
root@node-10-157-128-30:/# ip route show match 0/0
default dev lo scope link metric 15
```
that indicates that there is no default route: your routing protocol session is not established or is not receiving a default route, possibly because your upstream router is not originating it.
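The distinction between the two outputs can be checked mechanically: a learned default appears as a `default via … proto bird` line, while the `default dev lo` entry is a fallback that is always present. A sketch, with `has_learned_default` as our illustrative helper:

```shell
# Succeed if `ip route show match 0/0` output (passed as $1) contains
# a default route learned via bird. Illustrative helper.
has_learned_default() {
  printf '%s\n' "$1" | grep -q '^default via .* proto bird'
}

good='default via 10.157.128.1 dev evrr-000000 proto bird
default dev lo scope link metric 15'

if has_learned_default "$good"; then
  echo "default route present"
fi
```

From within the virtual router shell you could use it as `has_learned_default "$(ip route show match 0/0)"`.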
Troubleshooting routing protocols
Flexiant Cloud Orchestrator supports both OSPF and BGP as routing protocols. See Managing Routing Protocols for details. Please see the appropriate section below.
Troubleshooting OSPF
Within the virtual router shell, type:
```
root@node-10-157-128-30:/# birdc
```
This will connect you to `bird`, the routing daemon used by the virtual router.
The above command will only work within the `evrs` shell, connected to the virtual router. You will not see connectivity to the internet from the node outside the `evrs` shell. This is a deliberate security precaution.
The OSPF routing process is named `evrospf`. To troubleshoot it, at the bird prompt, type:
```
bird> show protocols all evrospf
```
You should see output like this:
Note the one imported route (the default).
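bird's `show protocols all` output includes a `Routes:` summary line (typically of the form `Routes: 1 imported, 0 exported, 0 preferred`). You can pull the imported count from that line; `imported_routes` is an illustrative helper of ours, and the exact summary format may vary between bird versions:

```shell
# Extract the imported-route count from a bird "show protocols all"
# summary line. Illustrative helper; format may vary by bird version.
imported_routes() {
  printf '%s\n' "$1" | sed -n 's/.*Routes:[[:space:]]*\([0-9][0-9]*\) imported.*/\1/p'
}

imported_routes "  Routes:          1 imported, 0 exported, 0 preferred"
```

A count of at least 1 here corresponds to the default route having been received.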
If you do not see a routing process called `evrospf` at all, check you have OSPF configured in `/etc/extility/local.cfg`. To check the built config, run the following command on the cluster controller, and note the top line:
```
root@test:~# fgrep NODE_ROUTINGPROTOCOL /etc/extility/config/vars
export NODE_ROUTINGPROTOCOL="OSPF"
export NODE_ROUTINGPROTOCOL_AUTHCLAUSE="authentication none;"
export NODE_ROUTINGPROTOCOL_AUTHCLAUSE_0="authentication none;"
export NODE_ROUTINGPROTOCOL_AUTHCLAUSE_1="authentication cryptographic;"
export NODE_ROUTINGPROTOCOL_PASSWORD="6gKabyzGwrug2wgm"
export NODE_ROUTINGPROTOCOL_SECURITY="0"
export NODE_ROUTINGPROTOCOL_SECURITYCLAUSE=""
export NODE_ROUTINGPROTOCOL_SECURITYCLAUSE_0=""
export NODE_ROUTINGPROTOCOL_SECURITYCLAUSE_1="password "6gKabyzGwrug2wgm";"
```
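The same check can be scripted, for instance when auditing several clusters. A sketch — `routing_protocol` is our illustrative helper, and it assumes the `export NAME="value"` format shown above:

```shell
# Print the value of NODE_ROUTINGPROTOCOL from a vars file written in
# export NAME="value" format. Illustrative helper, not an FCO tool.
routing_protocol() {
  sed -n 's/^export NODE_ROUTINGPROTOCOL="\([^"]*\)"$/\1/p' "$1"
}
```

On the cluster controller, `routing_protocol /etc/extility/config/vars` would print `OSPF`, `BGP` or `STATIC` depending on the built configuration; the anchored pattern deliberately ignores the `NODE_ROUTINGPROTOCOL_*` variants.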
If you do see a process called `evrospf` but you do not see a received route, your OSPF session has not established. Follow the procedure below:
- Check OSPF is enabled on your upstream router
- Check OSPF is enabled on the interface concerned of your upstream router
- Check OSPF is advertising a default route on your upstream router (by default origination or otherwise)
- Check OSPF authentication is enabled on your upstream router if `NODE_ROUTINGPROTOCOL_SECURITY` is set to 1 above, and check the password is correct.
Troubleshooting BGP
Within the virtual router shell, type:
```
root@node-10-157-128-30:/# birdc
```
This will connect you to `bird`, the routing daemon used by the virtual router.
The above command will only work within the `evrs` shell, connected to the virtual router. You will not see connectivity to the internet from the node outside the `evrs` shell. This is a deliberate security precaution.
There are two BGP processes, named `evrbgp1` and `evrbgp2`, to connect to up to two configured BGP routers, normally on the `.1` and `.2` IP addresses. To troubleshoot the BGP processes, at the bird prompt, type:
```
bird> show protocols all evrbgp1
```
You should see output like this:
Note that the session is up and established, and there is one imported route (the default route).
If you do not see a routing process called `evrbgp1` at all, check you have BGP configured in `/etc/extility/local.cfg`. To check the built config, run the following command on the cluster controller, and note the top line:
```
root@test:~# fgrep NODE_ROUTINGPROTOCOL /etc/extility/config/vars
export NODE_ROUTINGPROTOCOL="BGP"
export NODE_ROUTINGPROTOCOL_AUTHCLAUSE="authentication none;"
export NODE_ROUTINGPROTOCOL_AUTHCLAUSE_0="authentication none;"
export NODE_ROUTINGPROTOCOL_AUTHCLAUSE_1="authentication cryptographic;"
export NODE_ROUTINGPROTOCOL_PASSWORD="6gKabyzGwrug2wgm"
export NODE_ROUTINGPROTOCOL_SECURITY="0"
export NODE_ROUTINGPROTOCOL_SECURITYCLAUSE=""
export NODE_ROUTINGPROTOCOL_SECURITYCLAUSE_0=""
export NODE_ROUTINGPROTOCOL_SECURITYCLAUSE_1="password "6gKabyzGwrug2wgm";"
```
If you do see a process called `evrbgp1` but you do not see an up/established connection, then your upstream router configuration is probably incorrect:
- Check BGP is enabled on your upstream router
- Check the AS number configured on your upstream router is as per the bird configuration (65000 by default)
- Check that an iBGP neighbour is configured with the IP address of the PVIP interface of the node (this is the IP address of `evrr-000000` as set out above).
- Check that the neighbour connection is not shut down.
If you see an up/established connection, but do not see an imported route, follow the procedure below:
- Check your upstream router either carries a default route or is set to originate one
- Check your upstream router does not filter out an outbound default route
- If your default route on your upstream router is received via iBGP from another BGP speaker (this configuration is not recommended unless you are a routing expert), then ensure the neighbour is configured as a route reflector client.