This is a Knowledge Base article on changing the topology of the Aerospike cluster our team was using.
Timeline of the Problem
- We had a 7 node aerospike cluster using i3.xlarge nodes.
- The cluster was designed to keep cookie data, hence i3 instances were used: they provide ample instance store but less processing power (4 cores, 32 GB RAM, 850 GB instance store)
- The cookie-related project was handed over to a sister team
- The sister team created a new cluster, as they wanted more compute power
- To keep concerns separated, we kept using the old cluster
- Now the cluster was barely using its over-provisioned disk, and was close to bottlenecking on CPU for the other purposes the cluster was being used for
Goals
- Change the cluster from 7x i3.xlarge boxes (4 cores, 32 GB RAM, 850 GB instance store) to 3x c5d.2xlarge boxes (8 cores, 16 GB RAM, 200 GB instance store)
- Have minimal impact while doing this (we knew of clients using the first set of the cluster's IPs in their configs, so those IPs had to be retained; that way the changes to the Aerospike cluster would be mostly seamless)
- Have minimal loss of data / interruption to services
Constraints
- The replication factor for the namespace in Aerospike was 1. It was set to 1 to conserve disk space: we were OK with a node going down, since that would only mean data for a part of requests being absent (until the node was made healthy again) without adversely affecting the application
- This meant that removing a node from the cluster would lose the data on that node
Method 1: Create a new parallel cluster
- This would involve timestamp-based backup/restore between the old and new clusters; the amount of data moved in each incremental backup would keep shrinking
- Point the applications to the new cluster
- Do a final sync so that all remaining old data is moved to the new cluster
- This required all applications using the cluster to move to the new cluster at the same time, which seemed impossible to orchestrate
- We would need to make sure every client was accounted for, as any missed client would mean more operational hassle
- It would also mean managing two clusters until the point of the switch, adding instance costs as well as operational overhead
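Method 1 would have relied on Aerospike's asbackup/asrestore tools for the incremental syncs. A minimal sketch, with hostnames, namespace, timestamp, and paths as placeholders (not from the source):

```shell
# Incremental backup from the old cluster: only records modified
# after the given timestamp are exported
asbackup --host old-cluster-node:3000 --namespace cookies \
         --modified-after 2019-07-01_00:00:00 \
         --output-file /mnt/backup/cookies-incr.asb

# Restore the increment into the new cluster
asrestore --host new-cluster-node:3000 --namespace cookies \
          --input-file /mnt/backup/cookies-incr.asb
```

Each subsequent run would move the `--modified-after` timestamp forward, shrinking the increment until the final sync at switchover.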
Method 2: Stop one node at a time and move its data to the cluster
- This would involve forcefully creating a split brain so that one node is isolated from the cluster (that node then acts as a single-node cluster of its own)
- Clients would not be able to connect to the isolated node
- We could then back up the data on the isolated node and restore it into the cluster
- This was the most complicated option, as it requires iptables rules to block network communication, and a mistake in those rules would certainly lead to data loss
- It also does not solve the problem of the replication factor being 1: every future change would require a similar migration
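The isolation step would look something like the following iptables sketch, assuming Aerospike's default ports (3001 fabric, 3002 heartbeat); these rules are illustrative, not a tested isolation procedure:

```shell
# Run on the node to be isolated. Dropping fabric and heartbeat
# traffic cuts it off from the rest of the cluster, so it forms a
# single-node cluster of its own.
iptables -A INPUT  -p tcp --dport 3001 -j DROP   # fabric (data migration)
iptables -A INPUT  -p tcp --dport 3002 -j DROP   # heartbeat (clustering)
iptables -A OUTPUT -p tcp --dport 3001 -j DROP
iptables -A OUTPUT -p tcp --dport 3002 -j DROP
```

A mistake here (e.g. blocking the wrong port, or applying the rules on the wrong node) is exactly the kind of error that makes this method risky.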
Method 3: Rolling restart of the present cluster, changing the replication factor
- A restart is required because Aerospike does not allow changing the replication factor on the fly
- After changing the config, Aerospike is restarted. While starting up, Aerospike walks through all the data on disk and rebuilds the index
- While Aerospike is starting up, the node's data cannot be accessed, so this method involves a window of data unavailability
- Once replication factor is set to 2, it is all unicorns and rainbows from there on.
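For reference, the replication factor is a static namespace parameter in aerospike.conf; a sketch of the change, with the namespace name, sizes, and device path as placeholders:

```
# aerospike.conf fragment (names and sizes are illustrative)
namespace cookies {
    replication-factor 2    # changed from 1; takes effect only after restart
    memory-size 8G
    storage-engine device {
        device /dev/xvdf
        write-block-size 128K
    }
}
```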
Final process followed:
- We chose Method 3 for the migration
- The first node restart took 2.5 hours, and we were fine doing the restarts over a period of a few days (the first two nodes were restarted on Monday, 8th July)
- Around the same time, we found that one of the major sources of queries to the cluster was temporarily suspended, so we started restarting more than one node at a time (higher data unavailability, but less time to get the replication factor to 2 across the cluster)
- On Tuesday, 9th July, all the remaining nodes were restarted. The replication factor on the namespace was now 2, which meant nodes could be added or removed without data loss, as long as we changed only one node at a time
- The first three nodes of the cluster were upgraded to c5d.2xlarge boxes. For each node:
- Aerospike on the box was stopped
- We let migrations complete
- Aerospike version on the box was upgraded to make sure that we did not have a mixed version cluster
- Box was stopped
- The instance type was changed using the AWS interface
- The 850GB st1 EBS volume attached on the box was detached and deleted
- A new 200GB gp2 EBS volume was attached to the box to match the instance-store size of c5d.2xlarge boxes (gp2 was chosen because the minimum size of an st1 volume is 500GB; we would not use that much capacity, and it would cost more for no added benefit)
- The box was started
- Aerospike config changes were made to account for:
- New disk locations
- The lower amount of RAM on the box
- Aerospike was started
- After every change, we made sure migrations had completed
- On 10th July, the 4 remaining i3.xlarge boxes were shut down one after another, completing the migration
- No changes were required to the aerospike clients
- Data unavailability, where it affected the applications, was kept to a minimum
- Future changes will be easy, as the cluster now runs with replication enabled
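The per-node upgrade steps above can be sketched as an AWS CLI runbook. All IDs, the availability zone, and the device name are placeholders; this is an illustration of the sequence, not the exact commands we ran:

```shell
# Placeholder identifiers
INSTANCE=i-0123456789abcdef0
OLD_VOL=vol-0aaaaaaaaaaaaaaaa

# Stop the box (instance type can only be changed while stopped)
aws ec2 stop-instances --instance-ids "$INSTANCE"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE"

# Change the instance type via the AWS API
aws ec2 modify-instance-attribute --instance-id "$INSTANCE" \
    --instance-type '{"Value": "c5d.2xlarge"}'

# Swap the 850 GB st1 volume for a 200 GB gp2 volume
aws ec2 detach-volume --volume-id "$OLD_VOL"
aws ec2 delete-volume --volume-id "$OLD_VOL"
NEW_VOL=$(aws ec2 create-volume --volume-type gp2 --size 200 \
    --availability-zone us-east-1a \
    --query VolumeId --output text)
aws ec2 attach-volume --volume-id "$NEW_VOL" \
    --instance-id "$INSTANCE" --device /dev/sdf

# Start the box again; Aerospike config changes follow on the host
aws ec2 start-instances --instance-ids "$INSTANCE"
```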
Important things observed
- Nodes for which the replication factor was already set to 2 tried to create replicas on other nodes, which was rejected by the other nodes
- This shows that limited mixing of configs is possible in Aerospike (only during migrations)
- Migrations needed to be made faster, as the default migration speed is very slow (see the Aerospike Migrations documentation)
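Migration progress and speed can be inspected and tuned at runtime with asadm/asinfo. A sketch, noting that statistic and parameter names vary across Aerospike versions:

```shell
# Watch migration progress: the migrate partition counters
# should drop to 0 when migrations have completed
asadm -e "show statistics like migrate"

# Speed up migrations dynamically (no restart needed);
# 8 threads is an illustrative value, not a recommendation
asinfo -v "set-config:context=service;migrate-threads=8"
```

We waited for these counters to reach zero after every node change before touching the next node.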