Scaling out Computing Clusters to EC2
The aim of this use case is to create and configure everything needed to deploy a mini-cluster in a private network with NIS and NFS, and to let external EC2 nodes connect to the server via VPN, join this network, and then join the cluster with SGE.
For all of this, we took the following technical considerations into account; with these requirements in mind, we have to create three images:
The server:

| private ip | eth0 10.1.1.99 |
|---|---|
| public ip | eth1 147.96.1.100 |
| hostname | oneserver |
The Xen workernodes (local):

| private ip | eth0 10.1.1.55 |
|---|---|
| hostname | local01 |
The EC2 workernodes (IP range from 10.1.1.100 to 10.1.1.254):

| vpn ip | tap0 10.1.1.100 (assigned by the VPN server) |
|---|---|
| public ip | eth0 automatically assigned by Amazon |
| hostname | workernode0 |
The procedure to create and configure the images can be quite long, so I recommend that you first create a clean installation of the Linux distribution of your choice (we used Ubuntu). This installation will later be used to create all the other VM images.
Once you have this image, copy it, mount it, and edit the files /etc/hostname and /etc/network/interfaces; this will be the server (we call it oneserver). Copy the original image again and put the proper values in the same files; we called these nodes workernodeX. Unmount both images and start them with Xen. We recommend the following HOWTOs for NIS and NFS, since they are simple and straightforward. For the server image you will have to follow the "server steps" in the HOWTOs, and likewise the "client steps" for the workernodes.
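For oneserver, the edited /etc/network/interfaces would look roughly like this (the addresses are the ones from the tables above; the netmasks are assumptions about our network layout):

```
auto lo eth0 eth1
iface lo inet loopback

# private cluster network
iface eth0 inet static
    address 10.1.1.99
    netmask 255.255.255.0

# public interface
iface eth1 inet static
    address 147.96.1.100
    netmask 255.255.255.0
```

For the workernode images, only the hostname and the eth0 address change (e.g. 10.1.1.55 for local01).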
Now, when configuring the VPN server, it is important to allow the clients to use duplicate certificates (all clients share the same certificate). This is because the machines on EC2 are identical and we don't want to create separate AMIs. To enable this, include the line "duplicate-cn" in the OpenVPN configuration file at /etc/openvpn/server.conf. We used the following HOWTO for the VPN. Note that the VPN was configured ONLY on the server, not in the client images; this is because the VPN client will be installed on the EC2 images only.
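A minimal server.conf illustrating these settings might look as follows. This is a sketch, not our exact file: the certificate paths and the server-bridge line are assumptions, while duplicate-cn and the 10.1.1.100–10.1.1.254 pool match the setup described here:

```
# /etc/openvpn/server.conf (sketch; certificate paths are assumptions)
port 1194
proto udp
dev tap0
ca   /etc/openvpn/ca.crt
cert /etc/openvpn/server.crt
key  /etc/openvpn/server.key
dh   /etc/openvpn/dh1024.pem

# Bridge clients onto the private cluster network; hand out
# addresses from the range reserved for EC2 nodes.
server-bridge 10.1.1.99 255.255.255.0 10.1.1.100 10.1.1.254

# Let all EC2 instances connect with the same client certificate,
# so one AMI serves any number of workers.
duplicate-cn
```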
To create an AMI, use the bundle command; in our experience you should get the latest version of the tools, since otherwise you may have problems bundling the /etc/fstab or /etc/hosts files. Remember that you can use a local workernode to bundle the initial EC2 image. Also install the VPN client and SGE and configure them; the easiest way to do this is to start a copy of a local workernode, configure it, and then bundle it with the command ec2-bundle-image, and you are finished. There is one minor tweak we had to make: since SGE works mainly with the hostnames of the machines, and Amazon automatically assigns names to new EC2 instances, we created a script that is executed when the machine boots and generates a name for it depending on its private VPN IP address. Since OpenVPN supports defining address ranges for machines joining the VPN, we reserved the range from 10.1.1.100 to 10.1.1.254 for EC2 instances. All of these names must be configured in the /etc/hosts file of the server (oneserver).
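Our boot script was specific to our images, but the idea can be sketched like this. The mapping from the last octet of the VPN address to workernodeN is an assumption about the naming scheme; the tap0 device name matches the VPN setup:

```shell
#!/bin/sh
# Sketch of the boot-time naming script: derive the node's hostname from
# its VPN address, so SGE sees a stable name regardless of what Amazon
# assigned. 10.1.1.100 -> workernode0, 10.1.1.101 -> workernode1, ...
vpn_hostname() {
    last=${1##*.}                     # last octet of the VPN IP
    echo "workernode$((last - 100))"
}

# Only rename if the VPN interface is actually up (i.e. on an EC2 node).
if out=$(ifconfig tap0 2>/dev/null); then
    ip=$(echo "$out" | sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p')
    name=$(vpn_hostname "$ip")
    hostname "$name"
    echo "$name" > /etc/hostname
fi
```

Each name in the reserved range then gets a matching line in /etc/hosts on oneserver.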
Once all of the configuration is finished, we need to create the ONE templates to launch the machines. You must create one template for each local machine (oneserver, workernode0, workernode1, ...), but only one template for any number of machines you want to launch via EC2. Check out the documentation on EC2 configuration and template creation here. For example, if you have these images available:
```
lgonzalez@machine:bin$ ec2-describe-images
IMAGE  ami-e4a94d8d  one-w2/image.manifest.xml        587384515363  available  private  i386  machine
IMAGE  ami-cdb054a4  sge-dolphin/image.manifest.xml   587384515363  available  private  i386  machine
IMAGE  ami-d8b753b1  sge-parrot/image.manifest.xml    587384515363  available  private  i386  machine
IMAGE  ami-dcb054b5  sge-squirrel/image.manifest.xml  587384515363  available  private  i386  machine
```
If we choose the last image, ami-dcb054b5, the ONE EC2 template can be configured as follows:
```
CPU    = 1
MEMORY = 1700
EC2 = [
    AMI              = "ami-dcb054b5",
    KEYPAIR          = "gsg-keypair",
    ELASTICIP        = "75.101.155.97",
    INSTANCETYPE     = "m1.small",
    AUTHORIZED_PORTS = "22-25"
]
REQUIREMENTS = 'HOSTNAME = "ec2"'
```
The ELASTICIP, INSTANCETYPE and AUTHORIZED_PORTS are optional.
To start testing all of this, start OpenNebula and add the EC2 host with:
```
lgonzalez@machine:one$ one start
oned and scheduler started
lgonzalez@machine:one$ onehost create ec2 im_ec2 vmm_ec2
lgonzalez@machine:one$ onehost list
 HID NAME       RVM  TCPU  FCPU  ACPU  TMEM  FMEM  STAT
   0 ec2          0     0   100                      on
```
Submit the EC2 template you created to launch an EC2 instance:
```
lgonzalez@machine:one$ onevm create ec2.template
ID: 0
```
The scheduler will then deploy the machine on EC2:
```
lgonzalez@machine:one$ onevm list
  ID     NAME STAT CPU     MEM        HOSTNAME        TIME
   0    one-0 pend   0       0                 00 00:00:05
lgonzalez@machine:one$ onevm list
  ID     NAME STAT CPU     MEM        HOSTNAME        TIME
   0    one-0 boot   0       0             ec2 00 00:00:15
```
And then you can see more detailed information (like IP address of this machine):
```
lgonzalez@machine:one$ onevm show 0
VID            : 0
AID            : -1
TID            : -1
UID            : 0
STATE          : ACTIVE
LCM STATE      : RUNNING
DEPLOY ID      : i-1d04d674
MEMORY         : 0
CPU            : 0
PRIORITY       : -2147483648
RESCHEDULE     : 0
LAST RESCHEDULE: 0
LAST POLL      : 1216647834
START TIME     : 07/21 15:42:47
STOP TIME      : 01/01 01:00:00
NET TX         : 0
NET RX         : 0
....: Template :....
CPU          : 1
EC2          : AMI=ami-dcb054b5,AUTHORIZED_PORTS=22-25,ELASTICIP=75.101.155.97,INSTANCETYPE=m1.small,KEYPAIR=gsg-keypair
IP           : ec2-75-101-155-97.compute-1.amazonaws.com
MEMORY       : 1700
NAME         : one-0
REQUIREMENTS : HOSTNAME = "ec2"
```
In this case the assigned address is ec2-75-101-155-97.compute-1.amazonaws.com.
Now we check that the machines are running in our cluster: we have one machine running locally on our Xen resources (local01) and one machine running on EC2 (workernode0).
```
oneserver:~# qstat -f
queuename            qtype used/tot. load_avg arch     states
----------------------------------------------------------------------------
all.q@local01        BIP   0/1       0.05     lx24-x86
----------------------------------------------------------------------------
all.q@workernode0    BIP   0/1       0.04     lx24-x86
----------------------------------------------------------------------------
```
To test the cluster, submit some jobs to SGE via qsub <script.sh>. Before that, we need to switch to the nistest account, since that is the user we configured for NIS and SGE.
```
oneserver:~# su - nistest
oneserver:~# qsub test_1.sh; qsub test_2.sh;
```
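The contents of test_1.sh and test_2.sh are not shown above; any SGE job script will do. A minimal sketch (the echo/sleep body is an assumption, not our actual test script) could be:

```shell
#!/bin/sh
# Minimal SGE test job: report which node the job landed on, so you can
# see whether it ran locally or on EC2, then idle briefly so the job
# stays visible in qstat.
echo "Job running on $(hostname)"
sleep 1
```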
Now we can see how the jobs are scheduled and launched into our hybrid cluster:
```
nistest@oneserver:~$ qstat -f
queuename            qtype used/tot. load_avg arch     states
----------------------------------------------------------------------------
all.q@local01        BIP   0/1       0.02     lx24-x86
----------------------------------------------------------------------------
all.q@workernode0    BIP   0/1       0.01     lx24-x86
----------------------------------------------------------------------------

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
   1180 0.00000 test_1.sh  nistest      qw    07/21/2008 15:26:09     1
   1181 0.00000 test_2.sh  nistest      qw    07/21/2008 15:26:09     1

nistest@oneserver:~$ qstat -f
queuename            qtype used/tot. load_avg arch     states
----------------------------------------------------------------------------
all.q@local01        BIP   1/1       0.02     lx24-x86
   1181 0.55500 test_2.sh  nistest      r     07/21/2008 15:26:20     1
----------------------------------------------------------------------------
all.q@workernode0    BIP   1/1       0.07     lx24-x86
   1180 0.55500 test_1.sh  nistest      r     07/21/2008 15:26:20     1
----------------------------------------------------------------------------
```
The interesting point here is the scalability provided by EC2: you can launch any number of instances on EC2 and add them as worker nodes to your SGE virtual private cluster, all managed by OpenNebula.