The Perfect DB Storage Array

April 27, 2007 Don MacAskill

I’ve long known that YouTube had a secret weapon in their datacenter codenamed ‘Colin‘, but yesterday at the MySQL Conference, I met three more secret weapons – codenamed ‘Paul‘ and his team (sorry, guys, I’ve forgotten your names!).

Paul and his team are incredible. Paul’s keynote was easily the most interesting thing for me at the entire conference because of how technical and authoritative it was. It certainly helped that he spoke our language – he got down and dirty with his hardware, not just MySQL tuning variables, and discussed real world fixes. Plenty of other MySQL sessions were interesting, but most of them focused at a high level rather than down near the bare metal. We’ve long since left most of the high level stuff behind and are, ourselves, focused on bare metal.

Best of all, the MySQL team at YouTube sees eye-to-eye with us when it comes to DB storage arrays. There are a few differences, I think, but we’re essentially very similar. Hopefully my description of our ideal, perfect, high-performance DB storage array can help out any other startups out there looking for solutions. Certainly having our internal assumptions validated by YouTube helps.

I hate the “queries per second” or “queries per day” metrics, because they tell you absolutely nothing about how complicated or long the queries are, but we do many billions of queries per day, if you’re into those metrics. So we care a great deal about getting good, fast hardware.

The list:

We like DAS for our DB boxes, with RAID controllers in the external enclosure, rather than internal disks. This is one area I’m not sure YouTube agrees with us (they might, we just didn’t discuss it). Let me explain:
- When a server has some fatal hardware problem, we like to just yank it out of the rack, slide another identical server in place, hook it up to storage, and turn it on. No mess, no fuss.
- Using LVM, we can add more storage and/or more spindles easily.
- We’ve had problems in the past with RAID controllers failing and new ones not correctly picking up the RAID tags on the drives. External enclosures have two controllers, making single card failures less problematic.
We love spindles. The more the merrier. Our typical RAID 1+0 array has 14 of them, making 7 effective spindles. At best, that means we can do 7 concurrent operations.
We love fast spindles. Give us 15K drives any day of the week.
We love enclosures with odd-numbered drives. 15 drives, 13 drives, something odd. Why? Because we want *1* hot spare, not 2, and want the rest of the spindles for reads & writes.
We love big battery-backed write caches. We stick them in write-back mode for super-fast writes (easily the hardest thing in a DB to scale).
We hate read caching. We disable it entirely. The cache on RAID controllers is relatively puny (128-512M) compared to the RAM in our DB boxes (32GB), so any reads that aren’t in our DB’s main memory certainly aren’t going to be in the RAID controller cache. We want every byte in the cache for writes. Plus, we don’t want read cache misses to get serialized behind the pending writes.
We hate prefetching. We disable it entirely. The DB is smart enough to request data it needs without the RAID controllers trying to be smart and tying up disks and the entire I/O path with extra data we don’t need.
We want very configurable stripe/chunk sizes. Some controllers just have presets, like “DB”, which often have tiny (16K) sizes. Ugh. We want 1MB+ stripes.
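
The odd-drive-count preference in the list above comes down to simple arithmetic, sketched here in Python. The function name and the returned fields are made up for illustration; the only inputs taken from the post are the drive counts and the single hot spare.

```python
def raid10_layout(total_drives: int, hot_spares: int = 1) -> dict:
    """Split an enclosure into hot spares and RAID 1+0 mirrored pairs.

    Illustrative helper only: real controllers expose this through their
    own configuration tools, not through anything like this function.
    """
    if hot_spares < 0 or hot_spares > total_drives:
        raise ValueError("hot_spares must be between 0 and total_drives")
    usable = total_drives - hot_spares
    pairs = usable // 2      # each mirrored pair consumes two drives
    unpaired = usable % 2    # a leftover drive that cannot join a mirror
    return {"pairs": pairs, "hot_spares": hot_spares, "unpaired": unpaired}


for drives in (13, 14, 15, 16):
    print(drives, raid10_layout(drives))
```

With one hot spare, 15 drives give 7 mirrored pairs and nothing left over, while 16 drives give the same 7 pairs plus one drive that can only sit idle or become a second spare.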

Now, unfortunately, finding arrays that do all of this stuff is tough. We end up haggling with vendors, or wrestling with configurations, etc. And usually we have to compromise on one or two of the items. 😦 I think we’re close to finding an ideal one, but we’re not quite there yet. You’ll hear it here first when we do.

If you’re willing to lose DAS, LSI and (according to YouTube) Adaptec both let you get at most of the settings I mention above. I haven’t used 3Ware for a while, but I understand that they do not. If I’m wrong, someone please correct me.

Finally, our typical DB class machine is a Sun X2200 M2 with two dual-core CPUs and 32GB of RAM. The RAM and the disk stuff I talk about above are far far more important for our workload than the # or speed of CPUs, and it sounds like the same holds true at YouTube. We’re popping SAS cards in them and attaching to DAS units.

Anyway, hope that helps any of you out there wondering what to buy. I will still blog about Sun’s storage shortly (and why it didn’t match up to what we needed), I’ve just been busy. This should help add some context, though.

Give me a few more days. 🙂

UPDATE: Found one that does nearly everything we want – the Dell MD3000.

Categories: datacenter, smugmug

Comments (70)

Loren

April 27, 2007 at 3:16 pm

Hi,
This is a timely post for me, as we are looking for an appropriate storage array. Just today, we had a rep from NetApp in the office to talk about their solutions. Have you worked with their equipment before? He certainly made the competitors sound incredibly inefficient.
Casey

April 27, 2007 at 9:13 pm

Timely for us too. We are doing a huge server farm upgrade and datacenter move at the same time w/ completely new hardware from end-to-end (what a pain)! I’ve been eying up the Dell MD3000 DAS array for our master DB. I too find the memory more important than the CPU, too bad 16 DIMM slots + redundant power is almost impossible to find from a major vendor. Speaking of that, Don, do you have any issues using the single PSU X2200 for a master DB server? How do you address this single point of failure? Are we just too paranoid?!?

@ Loren, I have heard only good things about NetApp, although I think they are more into the SANs than DAS (and IMHO SAN’s are wayyyyyyy overpriced for things like a DB array that usually doesn’t need connectivity to more than two hosts). You are paying for all those fancy features like replication, thin provisioning, fibre channel, etc.
Don MacAskill

April 27, 2007 at 9:27 pm

@Loren:

I don’t have any firsthand experience with NetApps. From what I usually hear, they’re A) expensive with a huge premium, and B) more suited to SAN than DAS or NAS.

I know some people who have used them a lot, though. I’ll see if I can get some info out of them or have them post here.
Don MacAskill

April 27, 2007 at 9:28 pm

@Casey:

We have four Dell MD3000s at SmugMug now, they’re actually the exact storage I’m talking about in this post. The only thing we haven’t (yet?) figured out is how to disable read caching. The diagnostics and documentation on the thing refer to two states (Enabled / Disabled), but we can’t find the setting to change it.

If we can find it, we’re probably golden. And if we can find it, you can bet I’ll be blogging about it. They’re sweet arrays.

As for DB master on an X2200, we’re not using them there yet. We use them for slaves. Our master DB box had 3 PSUs in it.

I’m pushing Sun to release an X4200 class box with 16 DIMMs. (X4200 M2, maybe?) That’d be perfect for our things that are less fault-tolerant. If that’s of interest to you, too, let me know – I’ll make sure they see your comment. 🙂
Bernd Eckenfels

April 27, 2007 at 11:43 pm

@Don: I can agree with your “tuning tips”. Have you had a look at FC-to-Something in addition to SAS-to-Something? The “turn off prefetch” is a bit risky depending on the RAID implementation, especially if you work with large chunks. Normally you will get at least 1-chunk read ahead, but well, I guess that’s expected. The Sun SE 3510FC would meet your requirements, but the management should be done via console (instead of the management software, which is somewhat buggy, especially in turning on WB cache).

@Loren: you can consider NetApp as Network Filesystems (NFS). I would not use them as FC or iSCSI hosts because you lose all the advantages of their file systems. And iSCSI especially shows a 5-15% performance loss over FC (in my tests).

Personally I am not sure how well DB-over-NFS works, but I know some installations with very large databases and a medium commit load that the NetApp handles very well.
Don MacAskill

April 28, 2007 at 12:00 am

@Bernd Eckenfels:

Yes, we have lots and lots of FC-to-something in our datacenter now. I’m actually ok with FC to the enclosure (or maybe even iSCSI, though I haven’t tried it), but I want SAS drives rather than FC or SCSI. Still, given a choice, I’d prefer 100% SAS.

InnoDB uses 16K pages, so doing prefetches on large stripes is a waste. But doing a read on a large stripe is great because the odds of the 16K page being on a single disk, rather than spanning 2, is high.

DB-over-NFS sounds extremely slow and scary to me. *shudder*
tyler

April 28, 2007 at 9:00 am

Don, I’m a little confused. I remember reading that you were using Amazon’s storage service – they were promoting you as their poster child for that service. Do you now use a different system? Excuse me if this is a silly question.
Jacques Marneweck

April 28, 2007 at 9:16 am

@tyler they use Amazon S3 for storing images. The databases are another matter! 😉
Don MacAskill

April 28, 2007 at 9:17 am

@tyler:

No questions are silly. 🙂

We use S3 to store our photos. Lots and lots of them (200TB+ worth). But we don’t use S3 for our databases because it’s just not designed for that. For our databases, we need things to be as fast as they possibly can.

Even super-fast disks attached directly to our database boxes aren’t fast enough for us – we wish we could buy something even faster. We can’t, so we have to make do, but S3 would be much slower. Not acceptable for a DB app that needs to do as many queries as we do.
PENIX

April 28, 2007 at 9:35 am

I have never thought of disabling some of the read prefetching and caching on the controller. I’ll be making these changes ASAP!
Duncan

April 28, 2007 at 9:39 am

How do RAMSANs stack up in your view of high-speed storage for a DB server? CCP (EVE Online) claim to love theirs, along with MS SQL.
Don MacAskill

April 28, 2007 at 9:42 am

@PENIX:

Glad I can help give you some ideas. If possible, though, I would try it both ways simultaneously (say, using MySQL replication and switching one of your slaves but leaving the other slave the same) and make sure. Your workload and mine may differ.

But I’ll bet you’ll get a little boost. 🙂
Don MacAskill

April 28, 2007 at 9:44 am

@Duncan:

I’ve never gotten to play with a RAMSAN (or any other large sized SSD installation) because they’re extremely expensive. If/when they come down in price, I’m sure we’d take a look, but we can get fifteen 15K SAS drives in an enclosure with redundant RAID controllers for like $8K. RAMSAN doesn’t even come close.

For applications other than ours, where price is less of an object, I’m sure they could work wonders. It makes sense that they’d be screaming fast.
Hugh

April 28, 2007 at 9:47 am

Interesting, do you have any figures on average query time?
Alexis

April 28, 2007 at 9:55 am

Don, have you looked at storage like 3Par and the like? I was wondering about the choice of DAS vs. SAN, then I priced one on Dell’s web site and it came down to a much cheaper price tag (1/10th), which I can imagine is a strong driver.

I’m in a similar but not quite identical situation (s/MySQL/Oracle/ + CPU-bound DB by virtue of using a lot of pl/sql) and I’ve been using Apple Xserve Raid for development (nice and cheap) and xiotech for production. Now it’s time to grow storage again and I’m looking at 3Par. If anyone has experience with their ‘ware I’m interested.
Don MacAskill

April 28, 2007 at 10:06 am

@Alexis:

Price is the killer for us with things like 3Par, yeah. Far too expensive. The Dell MD3000 arrays do almost everything exactly how we’d want them to and they’re insanely inexpensive.

I know fotolog uses 3Par though: http://www.google.com/search?q=fotolog+3Par

We have lots (70+) of Apple Xserve RAIDs, they’re awesome. I know Oracle uses them in JBOD mode *a lot* for a ton of their customers, they have a whitepaper out and plenty of other details about what they’re able to get out of it: http://www.google.com/search?q=Oracle+Xserve+RAID

I’ve begged Apple to offer 15K drives in their gear. If they did, we’d be using them.

If you’re CPU-bound, though, you might not benefit nearly as much as we do. We’re I/O bound, not CPU bound, so every little bit helps a lot.

Finally, Paul at YouTube has 15 years of experience as an Oracle DBA (at PayPal, no less), and only 8 months of experience with MySQL. Since their HDD requirements are similar to ours, I imagine the same settings should work great in your Oracle installations.
Dan

April 28, 2007 at 10:07 am

Don – any thoughts on the external HW RAID vs Linux’s SW RAID?

Linux SW RAID is incredibly fast, and has the added benefit that one isn’t tied to a storage vendor.
Don MacAskill

April 28, 2007 at 10:07 am

@Hugh:

Sorry, I don’t. I’d gladly share them if I did, but I don’t really track that. I watch out for slow queries all the time, but the really fast ones, I just don’t profile.

I care more about how many aggregate queries I can pull off, in general, than anything else anyway. We do many tens of thousands of queries per second at any given time during the day, and that’s the metric I struggle to keep up with. 🙂
Don MacAskill

April 28, 2007 at 10:10 am

@Dan:

Yes, lots of thought. The really big problem for us isn’t hardware RAID vs software RAID – we use a ton of software RAID in our servers. The big problem is a battery backed write cache.

We’re dreaming of a day when motherboards keep battery backed caches onboard for all of RAM or even just a subset for storage. But until then, writes are just way too slow when you need to guarantee they get to something durable. Software RAID is fine except for that one point.

It’s important to note that when I say “battery backed write cache” what that really means to us is a “write buffer” so we can get instant confirmation that the data is safe and move on to the next write. Then the buffer can efficiently combine writes and optimize disk access without our app stalling waiting for the disks to get the data.
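
A toy model may make the “write buffer” idea concrete. This is only a sketch in Python: the class, its method names, and the block-range representation are all invented for illustration and do not correspond to any real controller’s interface. It assumes writes are expressed as (start block, block count) ranges.

```python
class WriteBackBuffer:
    """Toy model of a battery-backed write buffer (not a real controller API)."""

    def __init__(self):
        # Writes that have been acknowledged but not yet sent to disk.
        self.pending = []

    def write(self, start_block: int, block_count: int) -> str:
        # Acknowledge immediately; in real hardware the battery is what
        # makes it safe to confirm before the data reaches the platters.
        self.pending.append((start_block, block_count))
        return "ack"

    def flush(self):
        """Merge adjacent or overlapping ranges; return the disk writes issued."""
        merged = []
        for start, count in sorted(self.pending):
            end = start + count
            if merged and start <= merged[-1][1]:
                # Touches or overlaps the previous range: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        self.pending = []
        return [(start, end - start) for start, end in merged]


buf = WriteBackBuffer()
for start in (100, 8, 104, 0, 108):
    buf.write(start, 4)   # each call returns "ack" without waiting on a disk
print(buf.flush())        # [(0, 4), (8, 4), (100, 12)]
```

Five acknowledged writes turn into three disk operations here, because the three ranges starting at blocks 100, 104 and 108 are contiguous and get combined.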

Clear as mud? 🙂
Dan

April 28, 2007 at 10:13 am

Don – right, forgot about all the benefits of the write buffer via battery. What file system are you using for your DB partitions? Are you doing filesystem level tuning as well?

You mentioned LVM – does that have any performance implications with the extra layer of indirection in the VFS?
Matt

April 28, 2007 at 10:16 am

“too bad 16 DIMM slots + redundant power is almost impossible to find from a major vendor.”

Look at IBM.
Don MacAskill

April 28, 2007 at 10:21 am

@Matt:

I did. They couldn’t deliver. I need 2 CPU sockets with 8 DIMMs apiece.

Sun is the only major vendor who has a box like that. Alas, it doesn’t offer dual power supplies.
Chad

April 28, 2007 at 10:28 am

SAS arrays are the bomb. I suggest either the msa60 or the new SUN 2500 series. The Sun is really interesting, with its choice of SAS, FC and upcoming iSCSI (DAS and/or SAN configs). AND the prices are very decent for entry level enterprise (really workgroup) storage.

As for the x2200 systems mentioned… get the x4100!! They do have 3 new systems in the works that will have more internal drives (8-16, to be exact), and Intel based servers from SUN are on their way…

Don MacAskill

April 28, 2007 at 10:33 am

@Chad:

We have a Sun 2540 in our datacenter right now, and it’s great. Alas, I don’t think they’re available in quantity just yet, but when they are, I think Sun may have a hit on their hands.

We’re using the Dell MD3000 ourselves, but haven’t yet figured out how to turn the read cache off. I’m hopeful, though…

As for the X4100s, we need 8 DIMMs/CPU. The X4xxx series only offer 4 DIMMs/CPU. 😦
Loren

April 28, 2007 at 10:35 am

I don’t know much about storage solutions… At this point, I can only regurgitate what the NetApp tech rep was saying. For instance, using dual parity RAID instead of RAID1+0 for more efficient protection from dual disk failure. i.e. Rather than duplicating the entire array, they use two disks for parity. In turn, the extra spindles from the duplicate array become free for actual storage and can be included in the “aggregate” for greater disk IO, according to NetApp :).

I suppose we’re tired of DAS and are looking more for a NAS/SAN, so perhaps I’m looking for info in the wrong place. Thanks for the replies!

Cheers,
Johnny

April 28, 2007 at 11:19 am

Don,

I am glad to see and read about another geeky person dealing with scaling their hardware. I do this kind of stuff for a living (mostly SMBs) and I feel for you. Good luck and keep sharing and keep us posted. I wish more techie admins would share their stories so we can all learn from them.

As someone else said here, I too, have never thought about turning off the prefetch cache. I will look into doing that.

BTW, I love SmugMug. It is the best!
Don MacAskill

April 28, 2007 at 11:30 am

@Loren:

Using dual-parity RAID (RAID-6) is great for data redundancy, but terrible for performance compared to RAID1+0. And let’s not forget that RAID1+0 arrays can survive multiple drive failures – if they’re the right drive.

It basically becomes a speed vs size tradeoff. For us, speed is king, so RAID1+0 is the only way to fly.
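
To put a rough number on “the right drive”, here is a small Python sketch. It assumes a 14-drive array laid out as 7 mirrored pairs (matching the array described in the post), that any two drives are equally likely to fail together, and that no hot spare has finished rebuilding in between. The pairing scheme and the function name are illustrative, not taken from any controller.

```python
from fractions import Fraction
from itertools import combinations


def survives(failed, pairs):
    """True if no mirrored pair has lost both of its drives."""
    return not any(a in failed and b in failed for a, b in pairs)


# Assumed layout: drives 0..13 paired as (0,1), (2,3), ..., (12,13).
pairs = [(d, d + 1) for d in range(0, 14, 2)]
drives = range(14)

# Enumerate every possible simultaneous two-drive failure.
two_failures = list(combinations(drives, 2))
ok = sum(survives(set(f), pairs) for f in two_failures)

print(Fraction(ok, len(two_failures)))  # fraction of double failures survived
```

Under those assumptions it prints 12/13: of the 91 possible double failures, only the 7 that take out both halves of one mirror lose data.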
Dan

April 28, 2007 at 11:42 am

Don – comments on filesystems and LVM from my post above? Guessing you might have missed it.

Thanks
Don MacAskill

April 28, 2007 at 11:47 am

@Dan:

Oops, I did. Thanks! 🙂

So we use ext3 on LVM, and other than the “noatime” mount option, we don’t really do any filesystem tuning. I played around with ext3’s different journaling modes and didn’t notice a huge difference one way or the other (plus I’m not well versed enough with what they might do to my data integrity to say for sure).

LVM doesn’t seem to slow things down noticeably, which is nice, and even if it did a small amount, it would be worth it for us to be able to do snapshots for backups. LVM is easily the best way to take consistent backups of your InnoDB databases.
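
For anyone who has not done a snapshot backup before, the sequence looks roughly like the sketch below. It is written in Python purely to lay the steps out; it only builds the command lines and does not run them. The volume group, volume, snapshot name, size and mount point are hypothetical placeholders, and the sketch assumes the data files and logs live on the same logical volume, so that the copy looks to InnoDB like the state after a power cut and can be recovered on startup.

```python
def snapshot_backup_commands(vg, lv, snap_name="dbsnap", snap_size="10G",
                             mount_point="/mnt/dbsnap"):
    """Return the shell commands for a snapshot-based backup, in order.

    All names and sizes are placeholders; nothing is executed here.
    """
    origin = f"/dev/{vg}/{lv}"
    snap = f"/dev/{vg}/{snap_name}"
    return [
        # 1. Create a point-in-time snapshot of the database volume.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin],
        # 2. Mount the snapshot read-only so it can be copied.
        ["mount", "-o", "ro", snap, mount_point],
        # (copy the files off mount_point with whatever tool you prefer)
        # 3. Unmount and drop the snapshot once the copy is done.
        ["umount", mount_point],
        ["lvremove", "-f", snap],
    ]


for cmd in snapshot_backup_commands("vg0", "mysql"):
    print(" ".join(cmd))
```

The snapshot size only has to be large enough to hold the blocks that change on the origin volume while the copy is running, so the right value depends on the write load.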

The filesystem I’m really interested in is ZFS. I’m looking forward to doing some heavy DB load analysis on it when I get a chance. Assuming it performs well, it’s a no brainer to use ZFS wherever possible. Since we’re a Linux shop, and ZFS is still not on Linux yet, that’ll mean keeping my fingers crossed… but dreams do come true. 🙂
Dan

April 28, 2007 at 11:54 am

Thanks – yes ZFS looks very interesting, and with Sun’s investing in Linux, I would expect to see it soon.

How is InnoDB performing for you? At my $newJob, they are mostly using MyISAM (*gasp*), because of the performance issues.

However, they are also using 3ware cards with internal storage, and are nowhere near your scale right now.
Don MacAskill

April 28, 2007 at 11:56 am

@Dan:

InnoDB is awesome. It’s massively more scalable than MyISAM. There’s absolutely no way we’d even be alive today if we were using MyISAM.

I think that’s pretty common in the industry.

The only real downsides to InnoDB are the size footprint on disk and its thirst for RAM. But the upsides far outweigh those minor downsides.
Ivan

April 28, 2007 at 12:14 pm

Why do you need big stripe/chunk sizes? Unless you have sequential traffic (a lot of INSERTs, not many UPDATEs), this seems inefficient to me.
— Ivan
Dan

April 28, 2007 at 12:21 pm

I agree with you on InnoDB. $newJob seems to think that they would have a write performance issue if they switched to InnoDB. They also have a massive replication chain, so using ‘LOAD DATA FROM MASTER’ won’t work anymore when setting up a new slave.

Are you splitting out your InnoDB databases into namespaced tables? IE: where the data resides in individual files somewhat like MyISAM?

Also just for my curiosity – what is the site written in? I took a look at your slides, and saw a reference to ‘cURL’, but I wasn’t sure if you were referring to the command line tool / libcurl or something else.
live tv

April 28, 2007 at 12:44 pm

I can’t even begin to imagine the amount of the db power necessary to run some of those large sites.
tinkertim

April 28, 2007 at 1:12 pm

I am *really* liking Sun and the freedom to use Solaris or Ubuntu on my clusters. This is a neat recap for those of us who don’t get out much, thanks for posting it! I’m working now with Sun’s blade solution and loving it (it also helps solve some of those redundant power problems).

We do stuff (pretty much) the same, only we use a dual 10G fiber ring spinning 15k sas drives which works out pretty well, and CLVM vs just LVM2. Combined with a PXE boot system, AoE and OCFS2 we sail pretty well. Failures are really, really easy to deal with going PXE.

Another ingredient we bring into it is Xen, because the 10G fiber and SAS backed LV’s spin so well, swapping for PV guests over the network isn’t at all sluggish (like it was, anyway) and live migration is pretty painless.

I’m going to go apply some of these tweaks and take another look in general. I really, really need to get a little more out of some of the farm, this stuff is after all rather expensive 🙂 Thanks for the blog!
Don MacAskill

April 28, 2007 at 1:58 pm

@Ivan:

We want big stripe/chunk sizes because when we go to do a read (InnoDB uses 16KB pages), we want to minimize the chance that that 16KB chunk spans two disks. The goal is to keep every single 16KB read on a single spindle, leaving the other spindles open to do other reads or writes.

Make sense?
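
A back-of-the-envelope sketch of that reasoning, in Python. It assumes the worst case where pages are not lined up with chunk boundaries, so a 16KB read can start at any 512-byte sector offset within a chunk with equal likelihood; the function name and the 512-byte alignment are assumptions for illustration.

```python
def crossing_fraction(chunk_bytes, page_bytes=16 * 1024, align_bytes=512):
    """Fraction of possible start offsets at which a page-sized read
    spills over from one chunk (one disk) into the next."""
    if page_bytes > chunk_bytes:
        return 1.0  # a read larger than the chunk always spans two disks
    starts = chunk_bytes // align_bytes        # possible start offsets
    crossing = page_bytes // align_bytes - 1   # starts too close to the end
    return crossing / starts


for label, chunk in (("16KB", 16 * 1024), ("64KB", 64 * 1024), ("1MB", 1024 * 1024)):
    print(label, f"{crossing_fraction(chunk):.1%}")
```

Under that assumption roughly 97% of reads touch two spindles with 16KB chunks, against about 1.5% with 1MB chunks. If the file system and partition happen to line pages up with chunk boundaries, nothing crosses at either size; the sketch deliberately assumes no such luck.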
Don MacAskill

April 28, 2007 at 2:00 pm

@Dan:

InnoDB has been fairly good when it comes to writes for us. But the real win is row-level locking. That’s been huge for us.

We do use innodb_file_per_table, yes. Thank goodness. 🙂

I was referring to libcurl, since lots of languages provide an interface to it. We use PHP for lots of our stuff.
Don MacAskill

April 28, 2007 at 2:03 pm

@tinkertim:

The vast majority of our systems are diskless and netboot using PXE/DHCP/TFTP, too. It makes managing large farms such a breeze, doesn’t it? 🙂
Peter

April 28, 2007 at 4:01 pm

We use a NetApp for a lot of our InnoDB based databases. It’s fast, stable, and allows us to do some really nice stuff like snapshotting our master db to build slaves, or flexclone a db for testing.

It is an expensive SAN solution, but I think their support is really great, and I have had mostly good things to say about their stability.

My company does mostly table scans or completely random IO, so we end up with a lot of hot disks, or we end up i/o bound on a single spindle often. I would expect any storage solution to have these problems with these kinds of usage patterns.

One other great thing about the NetApp is that you can do mixed NAS and SAN, so we end up serving databases from a few shelves and small mailstores from others.

All of that said, it’s great to see that other people are looking at RAID arrays and that 15K seems to be the way of the future.

Best,

-Peter
Tom

April 28, 2007 at 4:18 pm

Many billions of queries per day? The problem isn’t your storage – it’s your code! With the traffic you get, there’s absolutely no logical reason why you’re doing this many transactions. Let’s say the average page view does 10 queries – at that rate, you’d need 100 million daily page views to reach just one billion queries, let alone “many” billions. And the last time I checked, smugmug wasn’t in the top 100 most trafficked websites, so it’s not possible you’re doing that kind of load.

If you’re running into problems, the issue isn’t the hardware, it’s the software.
Don MacAskill

April 28, 2007 at 5:15 pm

@Peter:

Thanks for the insight. Having zero knowledge of NetApp, it’s nice to see someone with experience pipe up. Thanks!
Don MacAskill

April 28, 2007 at 5:19 pm

@Tom:

We do way way more than 100M requests per day. I’d have to check, but I’ll bet we serve that many photos a day, let alone web pages, AJAX calls, and miscellaneous other HTTP calls. That’s not enough volume to crack into the Top 100, though, you’re right.

According to Alexa, we’re #350 in the US, but Alexa discounts a huge amount of our traffic because they only look at the smugmug.com hits, and we have tens of thousands of accounts using their own custom domains. Well over 10% of our traffic (It may very well be over 20%, I just don’t have the data handy) goes to non-smugmug.com domains, making our relative size larger, just difficult to track. Not to mention that Alexa is hardly accurate at all.

But again, I don’t like ‘queries per day’ as a metric. Our queries are all super-simple, super-fast queries. There are certainly people out there who do far less query volume than we do but stress their hardware harder. Bad metric, but people seem fixated on it, so I offer it up. 🙂
Cock

April 28, 2007 at 6:06 pm

>Because we want *1* hot spare, not 2, and want the rest of the spindles for reads & writes.

You obviously never saw or heard of two disks failing at the same time (or in relatively fast succession).

> Ugh. We want 1MB+ stripes.
> We want big stripe/chunk sizes because when we go to do a read (InnoDB uses 16KB pages), we want to minimize the chance that that 16KB chunk spans two disks. The goal is to keep every single 16KB read on a single spindle, leaving the other spindles open to do other reads or writes.
> Make sense?

If you have 1MB stripe size and issue, say, a read query that requires 50 consecutive DB pages of 16KB each, they’ll all land on two (mirrored) disks while all the other spindles will … wait?
If your workload is completely random and all queries are unique (don’t exist in cache), I don’t see how you could possibly serve that many queries in a day (see below) unless your DB is only few GB in size and can be cached in RAM in which case it really doesn’t matter what storage you use and how you “optimize” it.

>but we do many billions of queries per day, if you’re into those metrics

Fascinating…
2 billion / 86,400s = 57,000 queries per second. As one can get only about 2,000 IOPS from a 14 spindle disk array, I don’t see how you’re possibly doing 4,000+ IOPS (57,000 queries per second / 15 spindles, assuming one query generates one IO, which is probably underrating it by a factor of 4x or more) on that number of spindles.
Either your DB is very small, mostly read-only database (few GB that can be entirely cached in server RAM and/or application layer) or you’ve messed up your math.

>For us, speed is king, so RAID1+0 is the only way to fly.

Nice… And it doesn’t look like you bother with backups/snapshots either.

BTW, I don’t remember when was the last time I saw this many ridiculous and uninformed comments on a single Web page.
Don MacAskill

April 28, 2007 at 7:32 pm

Ok, I’ll take the bait.

Re: drive failure

We’ve had two disks fail at once, just like anyone else. But we have redundant entire DB machines to cover those sorts of failures. And, let’s not forget, RAID1+0 can survive multiple disk failures as long as they’re the right disks. That’s what cold spares are for.

Re: 1MB stripe

There are never idle disks in our boxes, so the other disks will either handle other read requests or writes.

Re: workload

We serve that many queries in a day by using lots of machines. Replication is awesome.

Re: billions of queries per day

Our typical DB size is a little north of 300GB, so we’re able to keep close to 10% (32GB of RAM in each box) in RAM. And we have lots of boxes. We’re not doing 57,000 queries per second on a single machine, and I never said we were. Slaves rule.

Oh, and you might want to hit gradeschool math one more time. 57000 queries * 86400 seconds = nearly 5 billion queries, not 2 billion.
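
For anyone who wants to check the numbers being thrown around, here is the arithmetic as a few lines of Python. The helper names are made up; the inputs (2 billion queries, 57,000 queries per second, 32GB of RAM, a roughly 300GB database) are the figures quoted in this thread.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(queries_per_day):
    """Average queries per second for a given daily total."""
    return queries_per_day / SECONDS_PER_DAY


def per_day(queries_per_second):
    """Daily total for a sustained per-second rate."""
    return queries_per_second * SECONDS_PER_DAY


print(round(per_second(2_000_000_000)))  # 23148 queries/s for 2 billion a day
print(per_day(57_000))                   # 4924800000, i.e. nearly 5 billion a day
print(round(32 / 300 * 100, 1))          # 10.7 percent of a 300GB DB fits in 32GB
```

These are averages across the whole day and across every machine combined, so they say nothing about the load on any single box.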

Re: RAID1+0

You might want to hit gradeschool reading, too, since you clearly missed where we use LVM. LVM makes it a breeze to take regular snapshots.

Once you’ve graduated from the 5th grade, come back and see how uninformed this page is. Maybe then you can work on building your own site that can handle a query or two per second.
Grammar

April 29, 2007 at 2:47 am

“I haven’t used 3Ware in awhile, but I understand that they do not. If I’m wrong, someone please correct me.”

“When “awhile” is spelled as a single word, it is an adverb meaning “for a time” (“stay awhile”); but when “while” is the object of a prepositional phrase, like “Lend me your monkey wrench for a while” the “while” must be separated from the “a.” (But if the preposition “for” were lacking in this sentence, “awhile” could be used in this way: “Lend me your monkey wrench awhile.”)”
Don MacAskill

April 29, 2007 at 9:03 am

@Grammar:

Wow, I only made one mistake? Sweet! I rip through these blog posts once, without proof reading or re-reading, so they’re usually riddled with them.

Thanks for the pointer – changed!
Loren

April 29, 2007 at 2:24 pm

I said dual parity RAID, not RAID-6, for a reason. Again, I can only quote NetApp, and I don’t have first hand experience yet, but supposedly their RAID-DP does not suffer the performance hit of RAID-6. And, as someone else mentioned, the dual use SAN/NAS is a huge plus.

And I don’t really buy the cost issue. When scale, maintenance, and management costs are factored in, centralizing storage is very cost effective.
Chris Li

April 30, 2007 at 2:42 am

A very inspiring read!

I have been working with another flavor from MySQL, i.e. MaxDB (previously SAPDB), and am in the process of planning a migration to MySQL for MUCH MUCH better support…

Just a foolish question… I have been researching online how to disable the Linux disk read cache, and googling around with no luck. Any quick fix here?

We love RAID 10 and 15K rpm too, but it seems the vendors are all going for 2.5″ hard disks now. If you want 15K rpm you end up paying a premium for their enclosure, and it takes up rack space as well. How do you strike a balance between the two? Any recommendations?

For RAID controllers, I really love the 3ware and HP ones. I like to use 3Ware as the boot controller for Linux, as they are the only ones who provide 2-port hardware RAID (I mean real HW RAID, not Adaptec’s, which is pretty much half HW, half SW). Our shop is not large enough to adopt PXE: we don’t reinstall the OS that often, and we are close to the data center (only 30 minutes from downtown here in Hong Kong).

I love HP RAID controllers because you can easily switch between 0% read / 100% write cache and vice versa with their boot CD without a single problem. It’s especially worth mentioning that their Linux support is very stable, and you can use their SIM to monitor every spindle and swap drives before they fail. I think that is why they don’t offer odd numbers of disks: the disks are supposed to be replaced well before they fail. Space at the data center is at a premium and costs much more than space in a warehouse, right? 😉
Casey

April 30, 2007 at 4:21 pm

Don,

We are interested in the X4200 class box with 16 DIMMs– please let Sun know! We don’t really have contact with anyone at sun, so if you want to point us to a good rep/sales person that would be VERY helpful! As you probably know, the workload of our website seems to be similar to yours. Likewise, we are bootstrapped, no VC money, profitable, and always looking for a good value 🙂

If we knew each other in real life, I think we might even become friends 🙂
Arni

May 1, 2007 at 8:16 pm

>Our typical RAID 1+0 array has 14 of them, making 7 effective spindles

Just a quick question regarding the spindles. During read operations, wouldn’t an effective RAID controller give you 14 effective spindles?
Pit

May 4, 2007 at 6:26 pm

Did you ever take a look at the jackrabbit array? There’s a benchmark of this 24TB storage system ($23,830) available at http://scalableinformatics.com/public/JR-benchmark-v1.pdf. They ship with NCQ HDs, but perhaps they can deliver with 15K rpm drives too, which could enhance the numbers a bit. I spent a couple of days looking at the report and found the numbers quite promising.
Marc Farley

May 9, 2007 at 8:12 am

I work for an iSCSI storage vendor – EqualLogic. We have an even number of drives in our system – that’s not what you want, but we do have SAS arrays (16 disks) and can load balance I/Os from a single system to give you more spindles. I know people think iSCSI is slow, but I can assure you these arrays are very fast, and you can ask a customer of ours who is doing a gazillion database lookups per whatever unit of time you choose. I’m the blogger there and I know you don’t want a lot of vendor propaganda here, but if you want to contact me, you should have my email with this comment.
Marc Farley

May 9, 2007 at 8:15 am

Stupid typo in my previous post. I meant to say we can load balance I/O’s coming from a single system across multiple arrays. That way you get to access 28 spindles with two arrays (each array having 2 hot spares – I know you only want a single hot spare…..)
Trevor

May 17, 2007 at 11:38 am

Are you putting transaction logs and tables on the same array? It seems like you’d want to devote all of that battery backed cache to the transaction log. Have you done any testing in this area?
Don MacAskill

May 17, 2007 at 11:44 am

@Trevor:

Good point. We’ve played around with it both ways: transaction logs on one disk, binary logs on another, and table space on a 3rd. Performance-wise, it works great.

Where it doesn’t work so great, though, is when using LVM to do live snapshots for backups. We like the ability to simply snapshot a partition without having to lock the tables or go read-only or anything, but if the logs and tables are split up, that’s basically impossible.

So currently we’re back to “everything on one disk” and just making sure that our data is partitioned in such a way that we maintain a reasonable write load.

But we’ll continue to investigate and play around. 🙂
mfc

May 26, 2007 at 3:31 pm

Hi,

http://bugs.mysql.com/bug.php?id=26662

Coming in 5.0.42 of MySQL. Solaris support
for innodb_flush_method = O_DIRECT.
Looks like a good way to bypass the OS file system
cache for InnoDB.

Thoughts?
Don MacAskill

May 26, 2007 at 5:09 pm

@mfc:

Thanks for the link and note! Yes, that is good news – we use O_DIRECT on Linux already.

One outstanding question, though, is whether ZFS has a directio mode or not. I can’t seem to find any mention of it – but I’m talking with two of the co-inventors of ZFS, so hopefully they can shed some light on the situation.
Joe

May 26, 2007 at 7:01 pm

To Pit et al.:

>Did you ever take a look at the JackRabbit array? There’s a benchmark of this 24TB storage system ($23,830) available at http://scalableinformatics.com/public/JR-benchmark-v1.pdf. They ship with NCQ HDs, but perhaps they can deliver with 15K RPM drives too, which could enhance the numbers a bit. I spent a couple of days looking at the report and found the numbers quite promising.

Thanks for the pointer; we have been getting more traffic as a result of it.

First: (full disclosure) my company designed/built and is selling the JackRabbit (JR); head over to the site if you want more info.

Second: There is a SAS variant using a different chassis and mixed internal/external arrays. The 5U JR chassis uses SATA 300 NCQ drives; that is the most price/performance-effective route for the market space we are looking at with this unit. If there is enough demand for the unit with SAS drives, it could be done.

This said, there is a fundamental reason why you want SAS disks, and it isn’t bandwidth, it is seek performance. Lots of folks might argue the bandwidth point, but at the end of the day, with a great RAID controller such as we use in JR, you are not going to get 2x better throughput on your SAS drives. The difference will be in the streaming data rate: if you can get the data off disk without having it fill the read cache on the drive or on the controller, you see around 125 MB/s best case for 15k RPM drives versus 78 MB/s best case for 7.2k RPM drives.

From a design point of view, these 125 MB/s units can sit 3 to a SAS channel before over-committing the channel. Which means for our 5U unit, we would need 16 SAS channels. We can do this using a variety of RAID units (prefer hardware RAID), though most of the SAS RAID units we have looked at have pathetically small caches.
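
The channel count above can be checked with a little arithmetic. This is only a sketch: the 48-bay figure is an assumption inferred from 16 channels at 3 drives each (it is not stated in the comment), and the function names are made up for illustration.

```python
import math

DRIVE_STREAM_MB_S = 125.0  # best-case 15k RPM streaming rate quoted above
DRIVES_PER_CHANNEL = 3     # drives per SAS channel, as stated above
ASSUMED_DRIVE_BAYS = 48    # assumption: inferred from 16 channels x 3 drives


def sas_channels_needed(drive_bays: int,
                        drives_per_channel: int = DRIVES_PER_CHANNEL) -> int:
    """Channels required if each channel carries at most N drives."""
    return math.ceil(drive_bays / drives_per_channel)


def channel_streaming_load_mb_s(drives_per_channel: int = DRIVES_PER_CHANNEL,
                                drive_mb_s: float = DRIVE_STREAM_MB_S) -> float:
    """Best-case streaming load that the drives place on one channel."""
    return drives_per_channel * drive_mb_s


print(sas_channels_needed(ASSUMED_DRIVE_BAYS))  # 16
print(channel_streaming_load_mb_s())            # 375.0
```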

If we want to “enhance the numbers a bit” as you indicated, we could double the number of RAID cards in the unit. Also note that we benchmarked with RAID6, and not RAID10, as a number of other vendors do. We used xfs under Linux. RAID6 kills small random write and seek performance. RAID 10 is much better for that, with the 0 portion (the stripe) as wide as possible. It’s a tradeoff between risk and raw mega-tonnage of performance. For a RAID 10 unit, I haven’t done the benchmarking yet. Wasn’t sure if there was interest.

That said, are the numbers not good? Please educate me on this, I had thought they weren’t bad at all.
Don MacAskill

May 26, 2007 at 7:11 pm

@Joe:

The Jackrabbit looks very interesting, but the numbers you’re quoting aren’t particularly applicable to the use case I’m describing.

We don’t care, for the most part, about how many MB/sec we get because we’re doing 4KB writes. So we care about how many of them we can get to the array and how fast the array reports back to the OS that each write is complete (speed of the write-back cache). We need lots of IOPS but not lots of MB/s.

Clear as mud?
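
To put rough numbers on the IOPS versus MB/s point, here is a small sketch. The 10,000 IOPS figure is a made-up example, not a measurement from any array discussed here, and the function names are illustrative.

```python
IO_SIZE_BYTES = 4096  # the 4KB write size mentioned above


def throughput_mb_s(iops: float, io_size_bytes: int = IO_SIZE_BYTES) -> float:
    """Bandwidth consumed by a given rate of fixed-size I/Os (decimal MB)."""
    return iops * io_size_bytes / 1_000_000


def iops_for_throughput(mb_s: float, io_size_bytes: int = IO_SIZE_BYTES) -> float:
    """I/O rate needed to fill a given bandwidth with fixed-size I/Os."""
    return mb_s * 1_000_000 / io_size_bytes


# Hypothetical load: 10,000 small writes per second is only about 41 MB/s.
print(throughput_mb_s(10_000))

# Conversely, filling a 125 MB/s stream with 4KB I/Os would take roughly
# 30,500 I/Os per second, so bandwidth is rarely the limit for this workload.
print(iops_for_throughput(125))
```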
Joe

May 26, 2007 at 7:25 pm

Hi Don:

Actually it is quite clear :). The target market for JR is more for high performance storage for computational systems where MB/s rules, and more to the point, how many bits per second you can push out the network pipes. I did measure IOPs with an eye towards the DB market, but no matter how many 7.2k RPM drives you throw at a database, they still have half the seek performance of the 15k RPM drives, and that seek performance is going to kill you on 4kB writes (read-modify-write for a journaled fs).
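
A back-of-the-envelope sketch of why spindle speed dominates small random I/O: per-drive random IOPS is roughly one over (average seek time plus average rotational latency). The seek times used below are assumed, datasheet-style values chosen for illustration, not figures from this thread.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough per-drive random IOPS, ignoring caching, queuing and transfer time."""
    return 1000 / (avg_seek_ms + rotational_latency_ms(rpm))


# Assumed average seek times (illustrative only).
fast = random_iops(15_000, avg_seek_ms=3.5)
slow = random_iops(7_200, avg_seek_ms=8.5)

print(f"15k RPM:  {fast:.0f} IOPS per drive")
print(f"7.2k RPM: {slow:.0f} IOPS per drive")
print(f"ratio:    {slow / fast:.2f}")
```

Under these assumptions the 7.2k RPM drive lands at a bit under half the random IOPS of the 15k RPM drive, in line with the estimate above.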

One thing you might want to revisit is the choice of file system. ext3 has some bottlenecks in its journaling code. Under heavy load it has some performance issues.

All this said, if you have a particular favorite set of benchmarks that are worth looking at (SPC-2?, etc), please advise. We want to expand where we play, and to do that requires understanding whether or not the units will perform adequately. That starts with a good measuring device/benchmark …
mfc

May 27, 2007 at 8:51 am

Hi,

Maybe you could post some thoughts on ZFS and its potential for
database use. I’m hopeful about its volume management and
data integrity capabilities but less optimistic about its actual
raw performance.

http://blogs.sun.com/roch/entry/zfs_and_oltp

It seems like adding layers to databases just slows things down, and
your thought about an O_DIRECT bypass for ZFS would be a necessity.
Don MacAskill

May 27, 2007 at 11:26 am

We’re very early in investigating ZFS, but there seem to be a few obvious wins, assuming ZFS works the way I understand it to (and again, this is very early).

We care about write latency the very most. Which means we care about being able to access lots of disk spindles. When adding more spindles to a ZFS pool, writes should automatically get striped across the new disks.

This is helped by COW (copy-on-write), since the write doesn’t have to occur on the disk in which the original data resides. Instead, ZFS should be able to intelligently pick a disk from among the pool members that has the least amount of IO and write to it. It’s a little more complicated than that, but you get the point, I hope.

Big open issues are definitely directio (although you don’t want directio for MyISAM tables, only InnoDB, since MyISAM doesn’t cache any record in RAM itself and relies on the disk cache). It sounds like ZFS has a read pre-fetch, too, which I’m 99% sure we’d want to disable entirely or tune very low. The odds of an extra few KB actually being useful are fairly low given our data size and the amount of RAM we have in these boxes. Finally, I’m worried about ZFS preferring reads to writes. For most applications, this is awesome and makes a lot of sense. For our architecture, though, our job is to get writes to disk as fast as we can. We have extensive caching mechanisms and replication to make reads easily scalable – it’s writes we can’t scale as well.

In a perfect world, we’d like to use ZFS attached to just a bunch of raw disks, rather than hardware RAID, so that ZFS can do all of the mirroring and striping in the pool and so we can easily extend the pool (add more spindles) as we run out of write capacity. That means, though, that having a battery-backed write cache might not be easily doable – which is a big loss. I have some ideas, and I’m talking to Sun about some other ideas, on how to solve this problem.

I’ll definitely blog about ZFS and what we’re thinking and looking for soon – clearly there’s interest.
Jimmy B

June 5, 2007 at 6:03 pm

Hi Don, good stuff. It seems the question of disk size comes up quite a bit. With disk sizes getting upwards of 300 GB now, what are your thoughts on the performance of these larger drives? Vendors will tell you that they are just as fast, but I don’t entirely trust the larger spindles…

What size and speed of drives do you guys lean towards?
Timmy J

June 13, 2007 at 10:40 am

I’ve read through this 3 times; great stuff. I’m curious: What is your approximate ratio of writes to reads?

(I have a situation where we have tons of writes compared to much fewer reads. But it’s critical the reads have the lowest possible latency. We’re considering flash drives. Any thoughts?)
mfc

June 13, 2007 at 4:25 pm

Our write to read ratio is 2:1.
We are also considering flash drives for redo logs.
Sandisk (M-Systems) FFD35 U3S 8 P80 which is
around 2k a pop!
Tao Shen

June 18, 2007 at 2:25 am

Hi, Don:

Nice blog, and nice to see someone with experience posting their quest for nice database IO storage subsystems.

I have been thinking about this same problem for a while now, what do you think about the following architecture?

Using an OpenSSI cluster for your compute nodes and running mysqld on the SSI over, say, 24 nodes of cheap commodity 1U dual-core CPUs (I don’t know if MySQL or PostgreSQL would scale horizontally that way)

and then have another cluster of cheap 1U dual-core systems with 4 commodity hot-swappable hard drives (SATA even) and run either Lustre or the Hadoop file system. You can choose to use an InfiniBand switch if you want, but dual GigE over, say, 24 nodes will give you 250MB/s*24=6GB/s theoretical IO bandwidth. In the case of TSUBAME (Japan’s supercomputer), they use Lustre, and the Hadoop file system is simply an open-source take on the Google File System. I don’t know how much IO SmugMug currently requires, but linearly scaling IO bandwidth over GigE sounds like a good solution.
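
The 6GB/s figure follows from line rate alone, as this sketch shows. It assumes 1 Gbit/s equals 125 MB/s per link with no protocol or file system overhead, so real numbers would be lower; the function name is illustrative.

```python
def aggregate_bandwidth_mb_s(nodes: int,
                             links_per_node: int = 2,
                             link_gbit_s: float = 1.0) -> float:
    """Theoretical aggregate bandwidth in MB/s, counting raw line rate only."""
    per_link_mb_s = link_gbit_s * 1000 / 8  # 1 Gbit/s -> 125 MB/s
    return nodes * links_per_node * per_link_mb_s


total = aggregate_bandwidth_mb_s(24)
print(total)         # 6000.0 MB/s
print(total / 1000)  # 6.0 GB/s
```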

Oh, if only I had 1000 nodes to test this idea LOL
Tao Shen

June 18, 2007 at 2:30 am

Oh, just to add to my post above: no RAID DAS boxes are needed in that setup, just cheap drives. You are basically running an architecture similar to Google’s FS and Bigtable over 1000s of cheap 1U nodes. In terms of cost and scalability, which way is better?

Since Google bought YouTube, I was wondering if Google will convert YouTube’s DB architecture to use GoogleFS and Bigtable instead of master-N slave MySQL clusters.
Tao Shen

June 18, 2007 at 7:17 pm

Don:

One last thing: you should look into the Iwill H4103 barebone… plugging 4 dual-core Opterons in there gives you 8 cores and 16 memory slots, and you can use the PCI-X expansion for your Fibre Channel connection to DAS. I think it’s cheaper than buying Sun’s boxes, and it has a smaller 1U profile.
Glen Shok

December 26, 2007 at 10:30 am

Hey, although I am a marketing puke (with a highly technical background) I came by this blog post and found it interesting. We have an array that has been used by very large MySQL database shops and we have implemented features specifically for DB (1MB stripes) among others. I know this is an obvious plug, but you may want to take a look at it. http://www.pillardata.com

If you want to yell at me for posting this: gshok@pillardata.com. 🙂

And to add to Tao’s post – I worked with a defense contractor to implement something similar, using cheap 1U servers and a GFS like Lustre…this solution also works well and makes the storage and servers disposable when implemented correctly. Although I work for an array vendor, this is a very elegant solution that is resilient, but needs a lot of care and feeding when using open systems stuff like Lustre. There are some “private” GFS vendors out there that have better uptime.

Good luck. And Happy New Year!