10.26.2010

I/O (part 2)

In part 2 of I/O we will consider how to observe some pain points in your overall storage design. These concepts can be applied to any technology once you understand how they work. The areas of concern include every connection between the application running in the operating system and the spinning platters inside the disk drive. In this post I will speak specifically to iSCSI, as it is becoming increasingly common in storage networks.

Servers
Let's begin right at the server where the application or files are presented from. There are some things to tune here, but nothing that will make a significant difference. If using a physical server, ensure the NIC(s) you are connecting to the storage with are 1Gb server-class network cards. Most popular ones these days support some type of TCP offloading, and the associated drivers are of better quality in the supported OSs. If this machine is virtual, the VM itself will not be performing the iSCSI translation; rather, VMware will be handling this piece. If you find yourself needing to use an iSCSI initiator from within a VM, use a dedicated vmxnet 3 virtual NIC if supported. One method to check whether I/O is the issue is to open PerfMon or iostat (depending on the OS) and look at queue depth, queue length, or hold time. These measurements can indicate whether the OS is holding SCSI requests waiting to be processed. One potential solution, depending on the root cause, is to enable MPIO, as this can assist with performance issues and also provide iSCSI redundancy.

Virtual Host
The next link in the chain is normally VMware, XenServer or some other virtualization technology. In a physical environment this can obviously be skipped. In a virtual host environment some of the same rules apply, but keep in mind you now have many servers using the same iSCSI connections. In a local storage environment you had a direct path between the controller and the disk drive using a 68pin or SAS cable, which was typically capable of more than 1Gb/sec. Now you have many servers using perhaps a single 1Gb connection to their respective disks, as well as the latencies introduced by the other components. Evaluating the performance here can be done in a similar way by checking disk latency and queue. Make sure latency is less than 50ms and queue is less than 50. If you are running an application such as a SQL database, some vendors have a much stricter limit of between 2ms and 10ms for latency. Technologies such as MPIO, better network cards, updated drivers, and fully patched hosts can help provide the desired performance. Providing dedicated iSCSI interfaces should also be one of the first things considered in a properly designed host.
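The thresholds above can be turned into a quick check. This is only a minimal sketch: the DiskSample structure and the function name are hypothetical, the numbers are the ones quoted in this post (using the upper end of the 2ms-10ms database range), and you would feed it values read from PerfMon, iostat or your hypervisor's own counters.

```python
from dataclasses import dataclass

# Thresholds quoted in the post; the database limit is the upper end
# of the 2-10 ms range some vendors ask for (an assumption for this sketch).
GENERAL_LATENCY_MS = 50.0
GENERAL_QUEUE = 50.0
DATABASE_LATENCY_MS = 10.0


@dataclass
class DiskSample:
    """One observation of a disk or datastore (hypothetical structure)."""
    name: str
    latency_ms: float
    queue_depth: float


def check_sample(sample, database_workload=False):
    """Return a list of warnings for one sample; an empty list means OK."""
    limit = DATABASE_LATENCY_MS if database_workload else GENERAL_LATENCY_MS
    warnings = []
    if sample.latency_ms >= limit:
        warnings.append(
            f"{sample.name}: latency {sample.latency_ms} ms is not below {limit} ms"
        )
    if sample.queue_depth >= GENERAL_QUEUE:
        warnings.append(
            f"{sample.name}: queue {sample.queue_depth} is not below {GENERAL_QUEUE}"
        )
    return warnings


# A datastore that is fine for general use but too slow for a strict database limit.
sample = DiskSample("datastore1", latency_ms=12.0, queue_depth=4.0)
print(check_sample(sample))
print(check_sample(sample, database_workload=True))
```

The same sample passes the general rule and fails the database rule, which is the point of keeping the two limits separate.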

Infrastructure
The switch infrastructure can also play a significant role in overall performance and is often overlooked. The basic rule is to use a good quality switch with plenty of port buffering. This will ensure the packets flow through without becoming blocked due to the buffers filling. A buffering problem can show up as the VM and the host reporting high levels of latency while the SAN shows low overall utilization and no signs of stress. The switch itself may not show a high CPU level or any other stress, as it may not have a lot of traffic on all ports or the configuration may not include CPU intensive tasks. To ensure the switch will not be asked to perform other functions or pass non-iSCSI traffic, dedicated switches are recommended. In some designs or budgets this may not be possible, so ensure the switch you are using is a good quality switch. Some examples include the HP 2910al or the Cisco 3750. There are many full Gb switches on the market, even in the sub $200 range, and they may be fine for lab/test situations, but I would caution against using them in a production network as they may not have enough buffering to maintain a non-blocking state.

Storage
Storage is the one area that is not as clear. Due to the amount and diversity of technology these vendors use, one must understand the architecture and hardware involved. Most vendors will have some method to measure CPU, memory utilization (often local cache), disk queue depth and latency. Virtualized systems (like most systems) will generally perform better when RAID 10 or RAID 50 sets are chosen over RAID 5. SAS, SCSI or FC disks at 10Krpm or 15Krpm will generally perform better than 7.2Krpm SATA or SAS disks. Increasing the number of spindles can also prove beneficial; however, as SAN vendors use different technology, this may or may not help as much as it used to. One consideration here is whether the disk controller can handle many disks in a large RAID set. Recently Intel and others have shown processors are becoming so fast that software based RAID can outperform hardware based RAID sets. Also, as you are designing your disk system, do not count parity disks (or the equivalent of a disk) in your write I/O calculations, because writing the parity portion of the stripe actually increases write time. Read times will lessen; however, keep in mind that especially in virtualized environments the platters are housing blocks that simply contain more blocks of data. Each time the virtual OS writes a file it changes a block (VMware example) in the .vmdk file, which changes a block on the VMFS partition, which in turn changes a block on whatever filesystem the SAN uses to store data. In the world of virtualization that filesystem can itself be virtualized and not sit directly on platters either. ;-)

Enjoy!

10.14.2010

Personal Virtualization

Here are some tips to make virtual workstation technologies perform better. Some of these are specific to VMware but could be applied to other virtualization platforms like VirtualBox.

For a new VM that you are creating select:

> Store Virtual disk as a single file.

If you have an existing VM make sure all of the snapshots are deleted (if you have taken any) and do this:

> vmware-vdiskmanager -r sourceDisk.vmdk -t 2 destinationDisk.vmdk

In this case the source disk will be the large VMDK file. After you convert, you will need to edit the vmx (text based) file to reference the new vmdk file unless you used the same file name. Obviously, to keep the same name you'd have to convert the disk into a new directory; otherwise change the name. Once it's converted you will actually see 2 new files: one is the very small text file that defines the raw virtual disk file and the other is the raw virtual disk file itself. DO NOT LOSE THE TEXT FILE! It is essentially impossible to remake as there is a special code in there that references the large raw file.
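If you would rather not hand-edit the vmx file, the change can be sketched in a few lines. This assumes the disk is referenced by a line of the form scsi0:0.fileName = "sourceDisk.vmdk", which is how the vmx files I have seen look; the function name is hypothetical, and you should back up the vmx file and keep the VM powered off before trying anything like this.

```python
import re


def repoint_vmdk(vmx_text, old_name, new_name):
    """Return vmx_text with any '<device>.fileName = "old_name"' line
    changed to reference new_name. Other lines are left untouched."""
    pattern = re.compile(
        r'^([ \t]*\S+\.fileName[ \t]*=[ \t]*")' + re.escape(old_name) + r'("[ \t]*)$',
        re.MULTILINE,
    )
    # A function is used as the replacement so backslashes in new_name
    # (for example Windows paths) are not treated as regex escapes.
    return pattern.sub(lambda m: m.group(1) + new_name + m.group(2), vmx_text)


# Hypothetical vmx fragment used only to demonstrate the function.
sample_vmx = (
    'scsi0:0.present = "TRUE"\n'
    'scsi0:0.fileName = "sourceDisk.vmdk"\n'
    'ide1:0.fileName = "auto detect"\n'
)
print(repoint_vmdk(sample_vmx, "sourceDisk.vmdk", "destinationDisk.vmdk"))
```

To apply it to a real file you would read the vmx text, pass it through repoint_vmdk, and write the result back out.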

If you run 'vmware-vdiskmanager' by itself you can see all the available options.

Another tip is to use multiple partitions to reduce the level of fragmentation. If you are using Linux, format the partition with XFS or ext4. I normally give each partition 3-5 VMs and have partitions of 25-50GB.

Another tip, if you can, is to use RAID 0 or RAID 1 with very fast hard drives. I am using 2 WD Raptor 150GB drives at home. I can run 4 VMs at once on a RAID 0 with 4GB of physical RAM. The key here is not necessarily MB/sec but I/O per second. This is where the 10Krpm drives beat any other SATA drive on the market by far. These disks are 50% faster. However, if you use RAID 1 you will not lose too much if you use a quality drive like the WD RE3 1TB drive. This is one of the faster ones on the market. Do not worry about hardware vs. software RAID, as current processors have enough performance to lessen the need for hardware RAID (unless you have the money to burn).

I've also done a little research on whether or not to use Enterprise or 'RAID' type drives. There can be a slight advantage beyond the (in some cases) longer warranty and build quality. RAID-capable drives are designed to fail intentionally and can even send commands back to the RAID controller (software or hardware) reporting the state of the drive. A standard disk will attempt retries for a number of minutes (typically 2) before it announces a failure, ultimately confusing the RAID software, which may have already declared the disk FAILED even if the disk recovered. RAID type SATA drives, by contrast, will declare themselves failed within a short period of time (7-10 seconds) if they cannot recover, and send that message to the RAID software. This behavior is specifically evident in the Western Digital line but is similar with other manufacturers, and it may not be a critical reason to choose these disks for home/test.

Enjoy!

Resource Management

Saw something very interesting today... In setting up a little demo environment with some colleagues we only had a server with very limited resources, in particular 8GB of RAM, and we needed to check out the latest version of VMware View. Once everything was finally booted up I found the balloon driver, memory sharing, memory compression, and memory swap all taking effect on every VM. Things were a bit slow but perhaps we'll chalk this up to a test of ESXi 4.1 resource management and even better - nothing crashed :-)!

10.05.2010

I/O

Welcome!

For a first post I figure I'd talk about one of the main issues I've found while virtualizing machines with any of the technologies (VMware ESX, Citrix XenServer, ...): the I/O capabilities of the storage where these virtual files or partitions reside, whether connected by IP networks (NFS, iSCSI), Fibre Channel, or local storage. The storage medium usually consists of SATA, SAS, SCSI, FC, or SSD. I'm going to make an attempt to speak about these different technologies.

First let me dispel the idea that FC is faster than iSCSI and NFS is the worst. This all depends on how it's implemented. When iSCSI is configured to use 10Gb/s networks it can easily surpass 4Gb or even 8Gb FC, just as NFS can easily be as fast as the other technologies. The real difference becomes whether or not multi-path is enabled and whether the storage is capable of these performance levels in the first place. Multi-path brings a couple of benefits, the first being the aggregate bandwidth of all the connections added together. If the 2 ends are appropriately matched, each I/O request made can be channeled through a separate path. The other benefit is not having a single point of failure. Typically, if the technology supports this configuration, it will have the ability to fail over to another path, or to re-issue I/O requests if a request never comes back with an acknowledgement. If configured properly your virtual machines will not crash but simply hang for a short period of time, then regain activity once the requests have timed out. I will speak to how to configure some of these technologies in greater detail in a later post.
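To put rough numbers on the multi-path point, here is a small sketch. It only adds up nominal link speeds, ignoring protocol overhead and whatever the storage behind the links can actually deliver, and the path lists are hypothetical examples rather than a recommended design.

```python
def aggregate_gbps(path_speeds_gbps):
    """Best-case aggregate bandwidth: the sum of all active path speeds."""
    return sum(path_speeds_gbps)


def after_path_failure_gbps(path_speeds_gbps, failed_index):
    """Bandwidth left when one path fails and the others keep carrying I/O."""
    remaining = [s for i, s in enumerate(path_speeds_gbps) if i != failed_index]
    return sum(remaining)


iscsi_paths = [10, 10]  # hypothetical: two 10Gb iSCSI paths
fc_paths = [8, 8]       # hypothetical: two 8Gb FC paths

print(aggregate_gbps(iscsi_paths))              # 20
print(aggregate_gbps(fc_paths))                 # 16
print(after_path_failure_gbps(iscsi_paths, 0))  # 10
```

With a single path the last call would return 0, which is the single point of failure that multi-path is meant to remove.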

Another issue relating to storage is how the storage itself is configured. I've noticed a huge difference depending on whether it is SAS or SATA and whether it is configured as a RAID 5 or a RAID 10 or 50. I'll let Wikipedia ( http://en.wikipedia.org/wiki/RAID ) define RAID for me :-) however understanding the differences and trade-offs of each type can lead you to disaster or complete success even within the same disk technology. What I mean by this - it is possible to see a RAID 50 or RAID 10 SATA storage array achieve close to the same performance level as a RAID 5 SAS array. Let's say for example a SATA disk is capable of 100 iops (input/output operations per second) and a SAS disk is capable of 175 iops. Keep in mind there are other contributing factors but these are averages.

If we take 10x SATA 1TB disks in a RAID 10, for every spindle of usable capacity we actually have 200 iops
RAID 10 = double the spindles for 5TB capacity = 2 x iops = 1000iops for 5TB of storage

Next we take 9x SAS 600GB disks in a RAID 5, where for every spindle we actually have about 155iops
RAID 5 = 1.125 spindles for 4.8TB capacity = .889 x iops = 1400iops for 4.8TB of storage

In this example we can see SAS still leads by 40% (1400iops vs. 1000iops), however the cost difference could be an interesting story. From this example we can also see that if we configure 6 SATA disks in a RAID 5 for 5TB of capacity our performance is substantially less.
RAID 5 = 1.2 spindles for 5TB capacity = .833 x iops = 500iops for the same amount of storage.

This is substantially less than even our RAID 10 configuration. There are some other considerations due to the relative read and write performance of each technology. Write performance alone, comparing a RAID 5 and a RAID 10, could even out the numbers in the above equations between SAS and SATA. In a RAID 5 the data must be written across the disks and the parity calculated and written as well. This must happen for each I/O request and can only happen in succession. When the process is applied to multi array type RAID levels such as RAID 10 or 50, these operations can happen simultaneously with most current generation controllers. This can also improve the performance of the overall system. Considering this with the above equations we could potentially realize an additional 25%-75% penalty depending on the amount of writes.
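The arithmetic for the three configurations above can be reproduced with a short sketch. It follows the simplified method used in this post (RAID 10 counts every spindle, RAID 5 scales by data spindles over total spindles) and deliberately leaves out the extra write penalty just described; the function names are hypothetical.

```python
def raid10_estimate(disks, disk_tb, disk_iops):
    """RAID 10: half the raw capacity is usable, every spindle serves I/O."""
    usable_tb = disks * disk_tb / 2
    iops = disks * disk_iops
    return usable_tb, iops


def raid5_estimate(disks, disk_tb, disk_iops):
    """RAID 5: one disk's worth of capacity goes to parity.

    The post scales raw iops by (disks - 1) / disks, which works out to
    (disks - 1) * disk_iops.
    """
    usable_tb = (disks - 1) * disk_tb
    iops = (disks - 1) * disk_iops
    return usable_tb, iops


# Per-disk figures from the post: SATA = 100 iops, SAS = 175 iops.
for label, result in [
    ("10x SATA 1TB, RAID 10", raid10_estimate(10, 1.0, 100)),
    ("9x SAS 600GB, RAID 5", raid5_estimate(9, 0.6, 175)),
    ("6x SATA 1TB, RAID 5", raid5_estimate(6, 1.0, 100)),
]:
    usable_tb, iops = result
    print(f"{label}: {usable_tb:.1f}TB usable, about {iops:.0f} iops")
```

Changing the per-disk iops or the disk counts makes it easy to see how quickly a RAID 10 SATA set closes the gap on a RAID 5 SAS set.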

In a virtualized environment we are not simply dealing with documents and SQL databases; we are dealing with virtual disks, and every read and write corresponds to virtual disk blocks the virtual operating system is changing. The point here is that these disk blocks can be larger than the standard I/O chunks we have traditionally been used to dealing with.

Bottom line - when planning a storage system for virtualization consider the number of virtual machines and each machine's I/O requirement in the physical world or the performance level desired and add about 10%-20% for the extra virtualized layer in between.
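As a rough illustration of that bottom line, the sizing can be sketched as below. The per-VM iops figures are made-up examples, the function name is hypothetical, and the 10%-20% overhead is simply the range suggested in this post.

```python
def plan_iops(vm_iops, low_overhead=0.10, high_overhead=0.20):
    """Return (low, high) iops targets: the sum of each VM's requirement
    plus 10%-20% for the extra virtualized layer."""
    base = sum(vm_iops)
    return base * (1 + low_overhead), base * (1 + high_overhead)


# Hypothetical requirements measured on the physical machines, in iops.
vms = [150, 300, 50]
low, high = plan_iops(vms)
print(f"Plan for roughly {low:.0f} to {high:.0f} iops")
```

The resulting range can then be compared against what a candidate RAID set is estimated to deliver.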