Upgrading a Cloud Infrastructure With Terraform and Ansible
Upgrading and expanding instances and volumes with Terraform and Ansible in an OpenStack Cloud environment
Upgrading a running cloud infrastructure is a critical task that has to be planned carefully in advance. Before choosing which strategy to follow for the upgrade, we have to ask ourselves some questions:
- Is it okay for us to have some downtime in our service? Can we stop our services, and for how long?
- Is there any data involved that needs to be saved and restored?
- Do we really need to upgrade our infrastructure?
- Is there any security risk in doing so?
- Do we have the necessary resources?
In this article we will deal with the scenario where we can afford some downtime and there is data that needs to be saved and restored.
In our scenario we start with 11 instances running in an OpenStack cloud provider, with the following configuration:
- 3 x m1.medium instances (1 vCPU + 4 GB RAM)
- 6 x m1.large instances (2 vCPUs + 8 GB RAM)
- 2 x m1.xlarge instances (4 vCPUs + 16 GB RAM)
- 6 x 80 GB disks
We have a Docker Swarm platform running on all instances, with services that run real-time processes, 3 Apache Cassandra nodes, 3 Blazegraph nodes, 3 Zookeeper nodes, 3 Kafka nodes, a MongoDB instance and other services.
We want to upgrade our infrastructure to:
- 4 x m1.medium instances (1 vCPU + 4 GB RAM)
- 9 x m1.large instances (2 vCPUs + 8 GB RAM)
- 4 x m1.xlarge instances (4 vCPUs + 16 GB RAM)
- 3 x 3 TB disks
- 1 x 11 TB disk
The Apache Cassandra and Blazegraph data need to be migrated to the new disks, and all services will have to start running again with the minimum downtime possible.
To do so, we will use Terraform to modify the previous configuration with the new requirements and create volume snapshots for migrating our data. Then, with Ansible, we will mount the new volumes with the migrated data and install the Docker Swarm platform again.
Creating volume snapshots
First, we need to back up our data. To do so, we can either use an external service such as Google Cloud Storage or use OpenStack's volume snapshots. In any case, my advice is to stop the services running on the instances that hold the data, to avoid any data corruption during the process.
In our case, we are going to create volume snapshots with OpenStack and note down the snapshot ID, which will be needed later in the Terraform configuration.
As our platform uses Swarm to orchestrate Docker containers, we don't have an option to stop a running container as we do with docker-compose or stand-alone containers. Thus, we will have to remove the running service to stop it:
docker service rm <service-name>
Removing a service may take some time, so it is important to verify that the manager has removed it; once the service is gone, the inspect command reports that it no longer exists:
docker service inspect <service-name>
We should also verify that the service's containers are no longer running on the node where they were deployed:
docker ps
If the service does not appear, then we can be sure that we removed it successfully and we can proceed to create a snapshot.
Once the service has been removed, we can proceed to detach the volume from the instance and create the snapshot. This process can be done manually from the OpenStack Dashboard or CLI. When detaching the volume from the instance, the status of the volume should appear as Available instead of In-use. The snapshot can be created with attached volumes too; however, taking a snapshot of a detached volume is safer and recommended for our purpose.
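With the OpenStack CLI, those steps look roughly as follows; the server, volume and snapshot names are placeholders for your own resources, and the resulting snapshot ID is the value we will reference later in Terraform:
# check the volume name and its current status (In-use / Available)
openstack volume list
# detach the volume from the instance it is attached to
openstack server remove volume <server-name> <volume-name>
# create the snapshot from the detached volume
openstack volume snapshot create --volume <volume-name> <snapshot-name>
# note the ID of the new snapshot
openstack volume snapshot list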
Updating the Terraform files
The Terraform configuration files should be updated to reflect our desired new configuration.
The flavor and the number of instances of each node type can be updated like this, where we map each type of instance to its characteristics and count:
flavor_name = {
  "manager"    = "m1.medium",
  "cassandra"  = "m1.large",
  "blazegraph" = "m1.xlarge",
  "worker"     = "m1.large",
  "hpc-worker" = "m1.xlarge",
  "mongo"      = "m1.medium",
  "zookafka"   = "m1.large"
}
instance_count = {
  "manager"    = 3,
  "cassandra"  = 3,
  "blazegraph" = 1,
  "worker"     = 3,
  "hpc-worker" = 3,
  "mongo"      = 1,
  "zookafka"   = 3
}
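Note that these maps are input values for Terraform variables, referenced later as var.flavor_name and var.instance_count, rather than resources by themselves. As a rough sketch, the variable declarations could look like the following (the descriptions are illustrative, and the values above can then be kept in a terraform.tfvars file):
variable "flavor_name" {
  description = "OpenStack flavor to use for each node type"
  type        = map(string)
}

variable "instance_count" {
  description = "Number of instances to create for each node type"
  type        = map(number)
}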
The image used by the instances can be updated to a newer version. It is important to check the image, since images can become outdated or unavailable. To look for updated and available images:
openstack image list --status active
To update the Terraform file with the desired image, use the image ID instead of the image name to ensure that you always use the same image:
image_id = {
  "Ubuntu20.04LTS" = "7085d64d-f591-4a23-bdfe-dbbd1288afcf"
}
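If you already know the image name, its ID can also be retrieved directly; for example, assuming the image is registered as Ubuntu20.04LTS in your project:
# print only the ID of the given image
openstack image show Ubuntu20.04LTS -c id -f value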
To create new instances, we have to define new resources. In this case, we are creating new hpc-worker instances using the previously mapped values. The count argument is used to create multiple instances according to our previously declared variable instance_count, and the same applies to the other variable mappings (var.flavor_name, var.node_name, and so on). count.index is used to extract the index of each instance [0, 1, 2 … n) and generate a name like hpc-worker-0 for the first instance.
resource "openstack_compute_instance_v2" "hpc-worker" {
count = var.instance_count["hpc-worker"]
name = "${var.node_name["hpc-worker"]}-${count.index}"
image_id = var.image_id[var.image_name["image"]]
flavor_name = var.flavor_name["hpc-worker"]
key_pair = var.key_pub
security_groups = var.security_group
network {
name = var.network
}
metadata = {
ssh_user = var.role_ssh_user["hpc-worker"],
prefer_ipv6 = false,
my_server_role = var.node_name["hpc-worker"],
python_bin = "/usr/bin/python3"
}
}
We then create a 3 TB volume from the previous volume snapshot, so that the data is migrated to the new volume. To do this, we need to indicate the snapshot_id we want to use. If the snapshot is smaller than the new volume, we will need to expand the filesystem later; otherwise our instance will not use the full volume capacity. To attach the volume to the new instance, we need to provide the instance_id of the instance we want to attach the volume to and the volume_id of the volume to attach.
resource "openstack_blockstorage_volume_v3" "volume_cassandra_1" {
name = "${var.volume_name}-cassandra-1"
size = 3000
snapshot_id = "f2714817-f9f8-42f3-aa7c-363d7b887983"
}
resource "openstack_compute_volume_attach_v2" "attach_cassandra_volume_0_to_db_instances" {
instance_id = openstack_compute_instance_v2.cassandra[1].id
volume_id = openstack_blockstorage_volume_v3.volume_cassandra_1.id
}
Upgrading our infrastructure with Terraform
Once we have defined our desired infrastructure, we can start with the deployment:
terraform plan
terraform apply
Mounting and expanding disks and deploying Docker Swarm with Ansible
Finally, on each instance that received a migrated volume, we mount the new disk and grow its filesystem to the full volume size, since the volume was created from a smaller snapshot. The basic steps on each node are:
sudo mkdir -p /mnt/data
sudo mount /dev/sdb /mnt/data
sudo xfs_growfs /mnt/data
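These steps can be automated with Ansible on every affected node. The following playbook is a minimal sketch, assuming the volume is attached as /dev/sdb, is formatted with XFS and should be mounted at /mnt/data; the host group names, device path and mount point are illustrative and should be adapted to your inventory:
# requires the ansible.posix collection for the mount module
- name: Mount and expand the migrated data volumes
  hosts: cassandra:blazegraph
  become: true
  tasks:
    - name: Create the mount point
      ansible.builtin.file:
        path: /mnt/data
        state: directory

    - name: Mount the volume and persist it in /etc/fstab
      ansible.posix.mount:
        path: /mnt/data
        src: /dev/sdb
        fstype: xfs
        state: mounted

    - name: Grow the XFS filesystem to the full size of the new volume
      ansible.builtin.command: xfs_growfs /mnt/data
Once the volumes are mounted and expanded, the Docker Swarm platform and its services can be deployed again on top of the restored data.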