OpenStack IO performance unreliable due to DiskScrubbing
Posted by: Robin Webster in Infrastructure on Aug 10, 2011
Following a bit of reading (not enough, as it turned out), including an article on cloud performance on thebitsource, we concluded that our small-scale development app, which relies on a single MySQL server, would be more than catered for by a Rackspace cloud server. We had a small (but consistent) IO requirement and modest memory/CPU needs.
The first month of service was a complete success and we began to consider migrating live systems to cloud servers. Then we were suddenly hit by a dramatic drop in IO performance lasting 6 hours, which made the instance unusable during the online day.

The average wait time for IO increased to 60ms, compared to less than 1ms during normal service. The Rackspace support team responded quickly to our support ticket to let us know it was due to DiskScrubbing.
The OpenStack system initiates a DiskScrubbing procedure each time an OS image is deleted, to ensure your data is not still lurking on the disk when you leave. Writing lots of zeros across an area of the disk kills IO on that disk for other users.
So I guess we were either on our own on a server for the first month, or just had quiet neighbours. But we soon guessed that this problem would recur, given that there are APIs available to quickly create and destroy instances. And we were not wrong. The problem came back over and over; the worst period was 18 hours of dreadful IO. We asked to be moved to a new host, which was seamless, but our new neighbours were just as noisy and our service was often unusable.
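For anyone wanting to catch episodes like these on their own instance, here is a minimal sketch of how an average IO wait figure can be derived on Linux from two snapshots of /proc/diskstats. It is an illustration rather than the tooling we actually used; the function names and the device name "sda" are assumptions made for the example.

```python
import time


def parse_diskstats(text):
    """Return {device: (ios_completed, ms_spent_on_io)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        name = fields[2]
        reads, ms_reading = int(fields[3]), int(fields[6])
        writes, ms_writing = int(fields[7]), int(fields[10])
        stats[name] = (reads + writes, ms_reading + ms_writing)
    return stats


def average_wait_ms(before, after, device):
    """Average milliseconds per completed IO between two snapshots."""
    ios_before, ms_before = parse_diskstats(before)[device]
    ios_after, ms_after = parse_diskstats(after)[device]
    ios = ios_after - ios_before
    if ios == 0:
        return 0.0  # no IO completed in the interval
    return (ms_after - ms_before) / ios


def sample(device="sda", interval_seconds=10):
    """Take two live snapshots (Linux only) and return the average wait in ms."""
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(interval_seconds)
    with open("/proc/diskstats") as f:
        after = f.read()
    return average_wait_ms(before, after, device)
```

Calling sample() every minute or so and logging the result would show a jump from around 1ms to tens of milliseconds when a noisy neighbour appears.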
Rackspace's only suggested resolution was to design our app better for the cloud, which would be fine if we were ready to scale beyond a single cloud server instance; for this particular app we are not. I had wrongly assumed that storage could be provisioned from outside the server you are on to remove this IO bottleneck. But at the time of writing only the Rackspace Cloud Files service was available, which they clearly state is not suitable for database environments. Feeling more than a little burnt by the whole experience, we swapped to a dedicated host from UK2, which meets our needs for cost and performance. (The Rackspace entry point for dedicated servers was quite a significant jump from the cloud offering.)
I'd like to try the experiment again. Having re-read the comments on the bitsource article above, which recommend using Amazon's Elastic Block Store (EBS), I think we would have a very different experience. But with billing per IO and per size of disk used, I think we could quickly get up to the cost of a dedicated host. If I manage to convince the developers to risk the pain again I will give it a go. Watch this space!
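To get a feel for how per-IO and per-size billing adds up, here is a rough sketch of the arithmetic. The prices passed in are placeholders, not quoted rates, and the function name is made up for the example; the real figures would need to come from the provider's current price list, and the cost of the instance itself is not included.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def monthly_volume_cost(size_gb, avg_io_per_second,
                        price_per_gb_month, price_per_million_io):
    """Estimate the monthly cost of a volume billed by size and by IO request."""
    storage_cost = size_gb * price_per_gb_month
    io_requests = avg_io_per_second * SECONDS_PER_MONTH
    io_cost = io_requests / 1_000_000 * price_per_million_io
    return storage_cost + io_cost


# Placeholder inputs: a 100GB volume averaging 100 IO requests per second,
# with hypothetical prices of 0.10 per GB-month and 0.10 per million requests.
estimate = monthly_volume_cost(100, 100, 0.10, 0.10)
print(f"Estimated volume cost per month: {estimate:.2f}")
```

Because the IO charge scales with sustained request rate, a steady database workload is where this kind of billing creeps towards dedicated-host pricing.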

http://victortrac.com/EC2_Ephe...BS_Volumes
http://blog.rightscale.com/200...explained/
It seems that for a reliable, consistent average disk wait time of less than 5ms, the only cost-effective option is an entry-level dedicated server. We currently pay 100GBP per month for a suitable server with RAID1 internal disks to meet this objective. Financially it makes perfect sense; from an environmental perspective it makes no sense at all. We barely register any CPU utilisation, so there are a pair of PSUs in that machine that are going to be wasting power. Hopefully in time there will be more guarantees available on minimum IO performance that will make cloud servers an option for this kind of workload.