Are you running out of disk space, or are you completely lost on your computer and don’t know what to do to fix your problem? Well, let me tell you: I have yet to find a problem or task too complicated for me to overcome, given enough time. Below is the ongoing story of a complete redesign of a web/database structure being implemented for a very large client of the company I work for, all while the client is using those computers to generate revenue and serve their own customers. If you think upgrading the single hard drive in your computer is difficult, stay tuned to this post to see how this complex and unusual task turns out.
And on a side note, even people who work in IT and network infrastructure cannot comprehend HOW to make this migration/upgrade.
The task:
Upgrade the hard drives on 2 separate servers from 250GB to 2TB. Estimated time to completion: 3 hours.
The Structure:
Server 1 (xen1) – the primary server
Runs CentOS as the Virtual Host (VH or VS) and has 3 Virtual Machines (VMs). Each VM has 2 logical volumes (LVs), which act as its separate drives. Two of the LVs are connected using DRBD.
Web server is running and is a live environment, meaning that the client is using this VM to generate revenue.
SQL server is running and is a live environment; this VM is cloned onto Server 2 using DRBD.
Server 2 (xen2) – the backup server
Runs CentOS as the Virtual Host (VH or VS) and has 3 Virtual Machines (VMs). Each VM has 2 logical volumes (LVs), which act as its separate drives. Two of the LVs are connected using DRBD.
Web server is running and is used for testing.
SQL server is a DRBD clone and is not running; it is used for manual fail-over.
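Before any manual fail-over of a DRBD-cloned server like this one, it is worth confirming that replication is actually healthy. Below is a minimal shell sketch of such a check, assuming DRBD 8.x, where drbdadm role, drbdadm cstate and drbdadm dstate print strings such as Primary/Secondary, Connected and UpToDate/UpToDate. The function name and the resource name sql are hypothetical, not part of the original setup.

```shell
#!/bin/sh
# Sketch: decide whether a DRBD resource is in a safe state for a manual
# fail-over. The inputs are the strings printed (DRBD 8.x) by:
#   drbdadm role   <resource>   e.g. "Primary/Secondary"
#   drbdadm cstate <resource>   e.g. "Connected"
#   drbdadm dstate <resource>   e.g. "UpToDate/UpToDate"
# The resource name used in the usage note ("sql") is hypothetical.

# Usage: drbd_ready_for_failover ROLE CSTATE DSTATE
# Returns 0 only when the local node is Primary, the peer is Secondary,
# the replication link is Connected, and both disks are UpToDate.
drbd_ready_for_failover() {
  role=$1 cstate=$2 dstate=$3
  if [ "$role" != "Primary/Secondary" ]; then
    echo "unexpected role: $role" >&2
    return 1
  fi
  if [ "$cstate" != "Connected" ]; then
    echo "replication link not connected: $cstate" >&2
    return 1
  fi
  if [ "$dstate" != "UpToDate/UpToDate" ]; then
    echo "disks not in sync: $dstate" >&2
    return 1
  fi
  return 0
}
```

On the primary host this would be called as drbd_ready_for_failover "$(drbdadm role sql)" "$(drbdadm cstate sql)" "$(drbdadm dstate sql)". Keeping the decision in a function that only takes strings makes it easy to test without a live DRBD pair.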
My assessment:
The overall structure and layout described above is fundamentally flawed and offers little protection when something actually fails. Because fail-over is manual, any failure means downtime until someone steps in, which is not good practice in a real-time, redundant environment. So a complete redesign was in order just to perform the task of upgrading the hard drives.
New Task:
Design a fully redundant, load-balanced structure with complete fail-over, to allow for optimal uptime and make future hardware upgrades easy. The new design is outlined in the following image. Estimated time to completion: 5 weeks.
And here are the steps I performed to make this design change happen in real time, with minimal downtime.
1. Verify that there is no traffic to the test server at all, using tools such as ntop, iftop and tcpdump.
2. Turn down the test web server and wait a few days. (Let’s see if any calls come in about that machine.)
3. Turn down the physical machine.
4. Add the 2TB drive to the second bay and boot using Gentoo.
5. dd if=/dev/sda of=/dev/sdb bs=1024, and wait.
6. Turn up physical machine.
7. Create new volume groups (VGs) and logical volumes (LVs) for the new device structure.
8. dd if=/dev/mapper/VG_test_os of=/dev/mapper/VG_new_test_os
9. dd if=/dev/mapper/VG_test_data of=/dev/mapper/VG_new_test_data
10. Modify the test VM’s config files to use the new devices.
11. Boot up the new test VM… modify… reboot…
12. Inform the client of the new test machine’s IP, and wait for their sign-off before continuing the migration/upgrade.
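Steps 5, 8 and 9 all boil down to a raw block copy from a smaller device onto a larger one. The shell sketch below wraps that copy with a size check so the target can never be too small; it is an illustration, not the exact commands I ran. It assumes GNU dd (for bs=1M and conv=fsync) and util-linux blockdev, and the source must not be in use (the VM is shut down first, as in steps 2 and 3). The function names dev_size and clone_image are hypothetical.

```shell
#!/bin/sh
# Sketch: block-copy SRC onto a target DST that is at least as large.
# Assumes GNU coreutils dd and util-linux blockdev; works on block
# devices (disks, LVs) and, for testing, on regular files.
# dev_size and clone_image are hypothetical helper names.

# Print the size of a block device or regular file in bytes.
dev_size() {
  if [ -b "$1" ]; then
    blockdev --getsize64 "$1"
  else
    wc -c < "$1" | tr -d ' '
  fi
}

# Usage: clone_image SRC DST
clone_image() {
  src=$1 dst=$2
  src_size=$(dev_size "$src") || return 1
  dst_size=$(dev_size "$dst") || return 1
  if [ "$dst_size" -lt "$src_size" ]; then
    echo "target $dst ($dst_size bytes) is smaller than $src ($src_size bytes)" >&2
    return 1
  fi
  # bs=1M:       larger blocks than bs=1024 usually copy much faster
  # conv=notrunc: never truncate DST (only matters for regular files)
  # conv=fsync:   flush DST to disk before dd exits
  dd if="$src" of="$dst" bs=1M conv=notrunc,fsync
}
```

For example, once the larger LV has been created (e.g. with lvcreate -L <size> -n <name> <vg>), step 8 becomes clone_image /dev/mapper/VG_test_os /dev/mapper/VG_new_test_os. The extra space on the target stays unused until the partitions inside the VM are expanded.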
This completes the hard drive upgrade on the backup server. One point worth noting: since I cloned the original drive onto the 2TB drive, everything works exactly as before. The manually failed-over SQL server still uses DRBD for full cloning; however, this will be replaced with a clustered DB structure on a larger LV. (This one is going to be tricky… no more DRBD.) The test server was moved onto a larger LV, and once it booted, its drives were expanded from within the Windows OS to use the complete volume.
Once the client confirms that the temp server is ready to go, we move on to the fun stuff, which kinda moves pretty quickly into the new structure.
FAQ:
Some people will of course have questions, so I’ll go ahead and answer some of the major ones here beforehand. If your question is not listed/answered below, shoot me an email at dscott at ucann2 dot org.
Q. How does the change in structure affect the developers and how code is developed and rolled into production?
A. Nothing changes on the end-user side. The IP that developers use to roll code out into production, and the process itself, will remain the same.
TO DO:
1. Adjust the proxy to point at the temp machine.
2. Turn down the prod machine.
3. Clone prod (xen1) to prod (xen2) over the network.
4. Turn up prod (xen2) and expand its drives.
5. Rotate the SQL server from xen1 to xen2 (shut down SQL on xen1; make sql-xen1 the DRBD secondary; make sql-xen2 the DRBD primary; turn up sql-xen2) [WOW!]
6. TEST LIKE HELL! (Remember, this is a production environment!)
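Step 5 is the delicate one, because the order matters: the old side has to release the DRBD device before the new side can take it over. Below is a hedged shell sketch of that order, not a tested runbook. Its assumptions are all hypothetical: the DRBD resource and the Xen domain are both named sql, the domain config lives at /etc/xen/sql on both hosts, the hosts use Xen's classic xm toolstack, and root SSH access to xen1 and xen2 is available. By default it only prints the commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch: move the SQL VM and the DRBD primary role from OLD to NEW host.
# Hypothetical assumptions: DRBD resource and Xen domain both named "sql",
# domain config at /etc/xen/sql on both hosts, xm toolstack, root SSH.
# With DRY_RUN unset or 1 (the default) commands are only printed.

# Usage: run_on HOST COMMAND [ARGS...]
run_on() {
  host=$1; shift
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "[$host] $*"
  else
    ssh "root@$host" "$@"
  fi
}

# Usage: rotate_sql OLD_HOST NEW_HOST [RESOURCE] [DOMAIN]
rotate_sql() {
  old=$1 new=$2 res=${3:-sql} dom=${4:-sql}
  # 1. Stop the VM cleanly on the old host; -w waits for the shutdown.
  run_on "$old" xm shutdown -w "$dom" || return 1
  # 2. Demote the old host so that only one side is ever Primary.
  run_on "$old" drbdadm secondary "$res" || return 1
  # 3. Promote the new host; the replicated disk becomes writable there.
  run_on "$new" drbdadm primary "$res" || return 1
  # 4. Start the VM on the new host from its config file.
  run_on "$new" xm create "/etc/xen/$dom" || return 1
}
```

Running rotate_sql xen1 xen2 first in dry-run mode shows the exact sequence; only after checking replication health and the printed plan would one rerun it with DRY_RUN=0.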
COMMANDS to remember
`nc` changed: it no longer needs -p when using -l. Use “nc -l <port>”, NOT “nc -l -p <port>”.
On the receiving server:
root@xen2 $ nc -l 9901 | dd of=/dev/device_name
On the sending server:
root@xen1 $ dd if=/dev/device_name | nc xen2 9901
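The nc | dd pipe above gives no confirmation that the copy arrived intact. One way to check it, sketched below under the assumption of Linux hosts with md5sum (coreutils) and blockdev (util-linux), is to hash the same number of bytes on both ends. Because the target device is usually larger than the source, a whole-device checksum would always differ, so only the first SIZE bytes (the source's size) are hashed. The helper names dev_size and image_digest are hypothetical.

```shell
#!/bin/sh
# Sketch: verify a network clone by hashing the first SIZE bytes on each
# host. Assumes md5sum (coreutils) and blockdev (util-linux).
# dev_size and image_digest are hypothetical helper names.

# Print the size of a block device or regular file in bytes.
dev_size() {
  if [ -b "$1" ]; then
    blockdev --getsize64 "$1"
  else
    wc -c < "$1" | tr -d ' '
  fi
}

# Usage: image_digest DEVICE SIZE
# Prints the MD5 of the first SIZE bytes of DEVICE (hash only, no filename).
image_digest() {
  head -c "$2" "$1" | md5sum | awk '{print $1}'
}
```

In practice you would run SIZE=$(dev_size /dev/device_name) and image_digest /dev/device_name "$SIZE" on xen1, then image_digest /dev/device_name with that same SIZE on xen2, and compare the two hashes. A larger block size on both dd ends (for example bs=1M with GNU dd) usually speeds up the transfer itself.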

