Workaround for instances with networking bug on Ubuntu 18.04

We recently discovered a bug on instances launched with the Ubuntu 18.04 image. This bug is caused by the new networking configuration abstraction renderer called Netplan that is used by default in Bionic.

TL;DR

The bug will either cause intermittent SSH access to VMs after a reboot or completely prevent all network access after a reboot. ~~The only workaround we have so far for existing VMs that have already been launched is to disable DHCP for the private only network interface. This will unfortunately disable true private networking on the instance altogether. The medium term solution is to remove Netplan for new 18.04 images. If you have not already rebooted, we strongly urge you to follow this guide carefully before doing so~~.

Update: there is a newer guide on how to resolve this issue by removing Netplan altogether: Remove Netplan from Ubuntu Bionic Beaver (18.04)

If you have set up your instances with an SSH key instead of a password, you will need to temporarily create a username and password so that you can access the VM via the console should the network be configured incorrectly.

If you have rebooted your instance and are no longer able to connect, please get in contact with us and we will recover the instance and / or the networking to allow you to gain access to your machine(s) and apply the fixes detailed below.

If you would prefer us to apply the fixes to your instance, we are more than happy to, but we will need access to your VMs temporarily in order to do so.

Detailed Description

What happens on other Civo cloud images is that a single default gateway is set to be the interface on the 172.31.0.0/16 network which is used specifically for NAT via 172.31.255.254 (see here for more detail). On the 18.04 Netplan images however, because instances have two virtual NICs, the default configuration created by Netplan and systemd-networkd creates two default gateways. When bringing up the private interface from the 10.X.X.X range, it assigns a second default gateway from DHCP which we don't want. An example can be seen below:

ip r

Which will output something similar to:

default via 172.31.255.254 dev ens4 proto dhcp metric 100
default via 10.0.0.1 dev ens3 proto dhcp metric 100        # <- This one we don't want
...

You will also notice that both routes have a metric of 100, meaning they have equal priority (lower metric numbers mean higher priority). Priority is therefore determined by list order. Setting a lower metric in the Netplan configuration does not seem to have any effect for static routes. Furthermore, setting UseRoutes=false or a lower RouteMetric directly in the systemd-networkd config gets overwritten at boot, which is very helpful.

What this means is that when you boot the instance, there is a race condition as to which default route comes up first and takes priority. If the 172.31.0.0/16 route is first in the list, networking works as expected. If the 10.X.X.X/24 route comes up first, networking does not work. This is particularly painful as it often works the first time and then never works again after the first reboot.
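To see at a glance whether an instance has more than one default route, the output of ip r can be filtered. The following is a minimal sketch rather than an official tool: the helper name list_default_routes is hypothetical, and it only assumes the ip r line format shown in the example above.

```shell
# Hypothetical helper: reads `ip r` style text on stdin and prints one
# "gateway device metric" line per default route it finds.
list_default_routes() {
    awk '$1 == "default" {
        gw = dev = metric = "-"
        # Walk the key/value pairs that follow the word "default".
        for (i = 2; i < NF; i++) {
            if ($i == "via") gw = $(i + 1)
            if ($i == "dev") dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1)
        }
        print gw, dev, metric
    }'
}
```

On an instance this could be used as `ip r | list_default_routes`. Two lines of output with the same metric would suggest the instance is exposed to the race described above.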

To see the configuration for Netplan, you can view the following two files, which will look similar to these:

/etc/netplan/50-cloud-init.yaml

network:
  version: 2
  ethernets:
    ens3:
      dhcp4: true
      match:
        macaddress: 00:00:00:00:00:00
      set-name: ens3

/etc/netplan/60-cloud-init.yaml

network:
  version: 2
  ethernets:
    ens4:
      dhcp4: true
      match:
        name: ens4
      set-name: ens4
      routes:
        - to: 0.0.0.0/0
          via: 172.31.255.254
          metric: 50

Workaround

Update: while the below steps will work, there is a newer guide on how to resolve this issue by removing Netplan altogether: Remove Netplan from Ubuntu Bionic Beaver (18.04)

IMPORTANT: IF YOU HAVE CONFIGURED THE VM TO BE ACCESSIBLE VIA SSH KEY ONLY (NO PASSWORD ACCESS), IF YOU MAKE A MISTAKE IN THE CONFIGURATION THAT CAUSES AN ISSUE WITH THE NETWORKING, WE WILL BE UNABLE TO ACCESS THE MACHINE... PERMANENTLY. WHILST SETTING PASSWORDS FOR SSH IS OFTEN INADVISABLE, IN THIS CASE IT IS HIGHLY RECOMMENDED IN ORDER TO ENSURE ACCESS TO THE VM VIA THE CONSOLE, SHOULD A MISCONFIGURATION OCCUR. PLEASE ENSURE YOU FOLLOW THE STEPS CAREFULLY AND CHECK / TEST WHERE INDICATED TO MAKE SURE YOU DO NOT LOSE YOUR DATA.

If you would like us to do this for you, please get in touch and we will run through these steps. We will however need access to the machine in order to do so.

If at all possible, we would recommend one or more of the following before applying the workaround:

Backup your data.
Migrate data onto another Civo instance (not using 18.04).
Mount a volume to the instance following this guide and move your data onto it.
Snapshot the VM.

Step 1 - Recovering your machine

If you have not rebooted your machine and / or still have SSH access, you can skip this step and move onto Step 2.

If you have already rebooted and are no longer able to access the VM continue reading Step 1 (this step).

If you set up your machine with a password, you should be good to move onto Step 4. If you only have SSH key access to the machine, get in touch with us and we will try to update your NAT address to point to the other interface, which should restore access. Once we have done this, you can move onto Step 3.

Step 2 - Find whether you created your machine using a key or a password

There are two paths you can take, depending on whether your instance was configured at creation to use an SSH key or a password. If you created your instance with an SSH key, move onto Step 3. If you created your instance with a password, move onto Step 4.

Step 3 - Create a temporary user and password

YOU DO NOT HAVE TO PERFORM THIS STEP. HOWEVER, IF YOU MAKE A MISTAKE IN THE NETWORK CONFIGURATION THAT PREVENTS SSH ACCESS, YOU WILL NOT BE ABLE TO RECOVER THE MACHINE. EVER.

Log into your instance via SSH. Then create a new user:

sudo adduser civorescue

When prompted, set a password for the user. Ensure you set a strong password:

Use a combination of letters (upper and lower case), numbers and symbols.
It should be at least 8 characters long (if not longer).
Avoid using a single dictionary word like "workaround", even if you replace letters with numbers or symbols such as "W0rk4r0und".
Ensure that you keep it somewhere safe. A password manager like 1Password, LastPass or KeePassX is ideal.
Ensure you use a password that you have not used elsewhere in case another site is compromised and malicious users gain access to your password.

Even if you do not wish to keep the password after setting it in this guide, you must still ALWAYS use strong passwords.
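If you would rather not invent a password by hand, one option is to generate a random one. This is only a sketch: the function name generate_password is hypothetical, and it assumes the head and base64 utilities and /dev/urandom are available on the instance.

```shell
# Hypothetical helper: prints a random 24-character password.
# 18 random bytes encode to exactly 24 base64 characters (no padding).
generate_password() {
    head -c 18 /dev/urandom | base64
}
```

Running `generate_password` prints a candidate that can be pasted at the password prompt and stored in a password manager.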

Enter new UNIX password: 
Retype new UNIX password:

For each of the following, hit return:

Enter the new value, or press ENTER for the default
    Full Name []: 
    Room Number []: 
    Work Phone []: 
    Home Phone []: 
    Other []:

Finally select Y when asked if the information is correct:

Is the information correct? [Y/n] Y

Add the user to the sudo group:

sudo usermod -aG sudo civorescue

Now move onto Step 4.

Step 4 - Ensure you have console access

This step is important. Do not assume console access works. It is well worth the effort to confirm you have console access now, rather than be sorry later.

Navigate to the instance in the Civo UI and locate the console button (computer screen icon) among the action buttons at the top right of the instance. This will launch a SPICE terminal that will prompt you for a username and password.

If successful, also confirm that your user has sudo access by running:

sudo su

This should prompt you for your password and then switch to the root user. Exit the root user with:

exit

If this all worked, then we are good to move onto Step 5.

Step 5 - Find the address of the private network interface

Find the IP address of the ens3 interface:

ifconfig ens3

This should output something similar to:

ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
        ether 00:00:00:00:00:00  txqueuelen 1000  (Ethernet)
        RX packets 5  bytes 1224 (1.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 1564 (1.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The part you are interested in is the inet value, which is the address you need for the next step. In this case it is 10.0.0.2. Yours will likely be different but will be an address from the 10.0.0.0/8 range.
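If ifconfig is not available on your image, the ip tool can report the same address. The sketch below assumes the one-line format printed by `ip -o -4 addr show dev ens3`; the helper name ipv4_of is hypothetical.

```shell
# Hypothetical helper: reads `ip -o -4 addr show dev <iface>` output on
# stdin and prints the IPv4 address without its prefix length.
ipv4_of() {
    awk '$3 == "inet" { sub(/\/.*/, "", $4); print $4 }'
}
```

On an instance this could be run as `ip -o -4 addr show dev ens3 | ipv4_of`.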

Move onto Step 6.

Step 6 - Remove the incorrect gateway and restore SSH access

Remove the incorrect gateway, replacing <GATEWAY_IP> with the address from the previous step:

sudo route del -net 0.0.0.0 gw <GATEWAY_IP> netmask 0.0.0.0 dev ens3

Using the address from our example, this should look something like this:

sudo route del -net 0.0.0.0 gw 10.0.0.2 netmask 0.0.0.0 dev ens3
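The same removal can also be expressed with the ip tool. Note that the sample routing table in the Detailed Description lists the unwanted route as via 10.0.0.1, which is the gateway rather than the interface's own address. If the route command reports that no such route exists, the gateway shown by ip r is worth trying instead. The sketch below only prints a candidate command and does not run it; the helper name print_default_route_removal is hypothetical.

```shell
# Hypothetical helper: reads `ip r` style text on stdin and prints (but
# does not execute) an `ip route del` command for each default route
# that leaves via the device named in $1.
print_default_route_removal() {
    awk -v want="$1" '
        $1 == "default" {
            gw = ""; dev = ""
            for (i = 2; i < NF; i++) {
                if ($i == "via") gw = $(i + 1)
                if ($i == "dev") dev = $(i + 1)
            }
            if (dev == want && gw != "")
                print "ip route del default via " gw " dev " dev
        }
    '
}
```

For example, `ip r | print_default_route_removal ens3` prints the command to review. Run it with sudo only once you are happy that it targets the private interface.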

If you are accessing the VM via the console, this should have restored SSH connectivity, which will make things easier as you can now use normal terminal functionality such as copy and paste. Close the console and SSH into the VM if you haven't already. Then move onto Step 7.

Step 7 - Update network configuration

Log into your instance via SSH or the console if you aren't already logged in. Open the following file on your VM using your editor of choice: /etc/netplan/50-cloud-init.yaml.
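Before changing the file, it may be worth keeping a copy so the original can be restored from the console. This is a minimal sketch with a hypothetical helper name, backup_file. It assumes that Netplan only reads files ending in .yaml, so a copy with a .bak suffix should be ignored.

```shell
# Hypothetical helper: copies the file given as $1 to a timestamped
# sibling and prints the path of the copy.
backup_file() {
    src="$1"
    dest="$src.bak-$(date +%Y%m%d%H%M%S)"
    # -p keeps the original mode and timestamps on the copy.
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

Because /etc/netplan is normally writable only by root, on an instance the underlying copy would need sudo, for example `sudo cp -p /etc/netplan/50-cloud-init.yaml /etc/netplan/50-cloud-init.yaml.bak`.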

Next, either copy and paste the following or update it in the file.

network:
  version: 2
  ethernets:
    ens3:
      dhcp4: false # <- set to false
      match:
        name: ens3 # <- remove MAC address key and value and replace with name: interface
      set-name: ens3

THIS IS WHERE THE NETWORK CONFIG CAN GO WRONG. PLEASE MAKE SURE YOU ARE VERY CAREFUL AND DOUBLE CHECK YOUR CHANGES.

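Save the file. Netplan reads the configuration from disk, so the check below only sees saved changes. You can then check that the config is valid with:

sudo netplan generate

This will not ensure that the config is correct, only that it is valid.

Make sure you are confident the changes are correct before continuing.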
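As an extra sanity check on the content, the saved file can be inspected for the setting that matters. The sketch below is a naive text match, not a YAML parser. It assumes the file contains a single ens3 block laid out as in the example above, and the helper name ens3_dhcp4_value is hypothetical.

```shell
# Hypothetical helper: prints the dhcp4 value that follows the "ens3:"
# key in the netplan file given as $1.
ens3_dhcp4_value() {
    awk '
        /^ *ens3: *$/ { in_ens3 = 1; next }
        # Default field splitting means a trailing comment is ignored.
        in_ens3 && /^ *dhcp4:/ { print $2; exit }
    ' "$1"
}
```

On an instance, `ens3_dhcp4_value /etc/netplan/50-cloud-init.yaml` should print false after the edit.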

Now move onto Step 8.

Step 8 - Reboot the instance

You can now reboot the instance.

sudo reboot

Now move onto Step 9.

Step 9 - Confirm configuration

After the reboot you should now have (or still have) access to the machine. You can confirm this by SSHing into the instance. Once in the instance, run:

ip r

This should now show only one default route:

default via 172.31.255.254 dev ens4 proto dhcp metric 100
...
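This check can also be scripted so that it gives a clear pass or fail. The following is a sketch under the assumption that ens4 is the interface that should carry the only default route, as in the output above; the helper name only_default_is_via is hypothetical.

```shell
# Hypothetical helper: reads `ip r` style text on stdin and succeeds only
# when there is exactly one default route and it uses the device in $1.
only_default_is_via() {
    awk -v want="$1" '
        $1 == "default" {
            count++
            for (i = 2; i < NF; i++) if ($i == "dev") dev = $(i + 1)
        }
        END { exit !(count == 1 && dev == want) }
    '
}
```

For example: `ip r | only_default_is_via ens4 && echo "routing looks correct"`.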

Move onto Step 10 only if you normally access the VM with an SSH key and you created a username and password in Step 3 that you wish to disable.

Step 10 - Remove user password access

IF YOU REMOVE THE ABILITY TO LOG IN WITH A PASSWORD AND DO NOT HAVE AN SSH KEY, OR OTHERWISE RELY ON PASSWORD ACCESS VIA SSH, YOU WILL NOT BE ABLE TO ACCESS THE VM. ALL ACCESS (SSH, CONSOLE OR OTHERWISE) WILL BE LOST AND UNRECOVERABLE, AS WILL ANY DATA ON THE VM. DO NOT RUN THIS STEP ON ANY USER THAT YOU NEED TO LOG IN WITH USING A PASSWORD.

From the VM run:

sudo passwd civorescue --lock

This will prepend a ! to the encrypted password in /etc/shadow, producing a value that no password can match.
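To confirm the lock took effect, the account status can be inspected with passwd --status, whose second field is L for a locked password. The sketch below only interprets that output; the helper name account_is_locked is hypothetical.

```shell
# Hypothetical helper: takes one line of `passwd --status <user>` output
# as $1 and succeeds when the second field is "L" (password locked).
account_is_locked() {
    status_field="$(printf '%s\n' "$1" | awk '{ print $2 }')"
    [ "$status_field" = "L" ]
}
```

For example: `account_is_locked "$(sudo passwd --status civorescue)" && echo "civorescue is locked"`.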

Dan Weinberg
