He was last spotted in China on top of the Great Wall:
Have you seen Smuggy? Snap a shot, let us know!
Ok, so I guess I’m a total n00b. In hindsight, SLAs make a lot of sense after all. The whole point isn’t to compensate SmugMug for our loss, it’s to make it unprofitable for the service provider to keep making the same mistakes.
In other words, let’s say Amazon’s margins on S3 are 15%. (I have no data, I’m just picking that number out of the air.) If Amazon has a serious problem during a month, they have to cough up 25% to all their customers. So instead of making 15%, they lose 10%.
That’s a pretty major incentive – and it now totally makes sense why SLAs are so highly valued.
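To make that concrete, here’s a minimal sketch of the same back-of-the-envelope math (the 15% margin and 25% credit are the made-up numbers from above, not real Amazon figures):

```python
def net_margin(margin_pct: float, credit_pct: float) -> float:
    """Margin left over after paying an SLA credit to every customer."""
    return margin_pct - credit_pct

# A clean month earns the full margin; a bad month turns it into a loss.
print(net_margin(15.0, 0.0))   # → 15.0  (a +15% profit)
print(net_margin(15.0, 25.0))  # → -10.0 (a 10% loss)
```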
Amazon has finally released and put into effect their SLA for S3. I know a lot of my readers will be thrilled about this. 🙂
I’ve gotten a few questions about Nirvanix in the past month or so, especially about the fact that they offer an SLA (and that S3 didn’t). I think this probably puts the final nail in Nirvanix’s coffin because:
- Why would you trust Nirvanix, a no-name company, with your precious data?
- Worse, they’re affiliated with MediaMax/Streamload in some way, who have a reputation for poor service. (I’ve even seen reports of data loss at Streamload, though I haven’t bothered to verify them.)
- Just how much is an SLA worth when there’s nothing behind it to back it up?
- They’re more expensive than Amazon. Um, duh.
SLAs don’t mean a lot to us anyway, as I’ve said before, because:
- Everything fails sometimes.
- The SLA payment is rarely comparable to the pain and suffering your customers had to deal with.
But I know it’s very important to lots of people, so I expect there’s cheering and dancing in the streets. 🙂
UPDATE: I get SLAs now. Sorry for being dumb.
Any storage experts out there? Can you forward this to any you may know?
An interesting thread developed in the comments on my post about Dell’s MD3000 storage array regarding theoretical maximum random IOPS to a single HDD. I’m hoping by bringing it up to the blog level, we can get some smart people who know what they’re talking about (i.e., not me) to weigh in.
I’ve always believed that for a small random write workload, the revolutions per minute (rpm) of the drive was the biggest limiting factor. I think I’ve believed this for a few reasons:
- It seems logical that the biggest “time waster” in a random I/O is probably rotational latency, which is set by rpm. Even if the drive arm has found the right track on the platter, it likely has to wait some amount of time, up to a full revolution, before it can write.
- rpm is a “fixed” number, and thus easier to calculate with, than seek time, which is more variable. So taking the easy way out, one of my favorite hobbies, seemed appropriate.
Using this theory, a 7200rpm drive can do a theoretical maximum of 120 IOPS (7200 revolutions / 60 seconds), and a 15K drive can do 250. Note that these are fully-flushed non-cached writes to the spinning metal, with no buffering or write combining. Over the years, my own tests seem to have validated this theory, and so I’ve just always believed it to be gospel.
Tao Shen, though, commented that my assumption is wrong, and that seek time is the limiting factor that matters, not rpm, and that faster drives can deliver more IOPS than my rpm math. He posits that a 15K drive with a 2ms seek time can do 500 IOPS. Now, he may have access to better drives than I do, since I think our fastest are 3.5ms (best case scenario), not 2ms. That’s what the latest-and-greatest Seagate Cheetah 15K.6 drives seem to do, too.
So which is it? Am I totally smoking crack? Is he? Or is the truth that seek time and rpm are so intimately tied together that separating them is impossible?
How does one calculate theoretical maximum IOPS?
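For reference, the rule-of-thumb formula I’ve seen charges each random I/O an average seek plus an average rotational latency of half a revolution (30000/rpm milliseconds); my original math instead charges a full revolution and ignores seek entirely. If that rule of thumb is right, a 2ms seek yields 500 IOPS only when rotational latency is ignored completely. A quick sketch of both models, using the numbers from this post (the formula choice is my assumption, not settled fact):

```python
def rpm_only_iops(rpm: float) -> float:
    """My original model: charge one full revolution per I/O."""
    return rpm / 60.0

def seek_plus_rotation_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rule-of-thumb model: average seek plus average rotational
    latency, where the latter is half a revolution (30000/rpm ms)."""
    avg_rot_ms = 30000.0 / rpm
    return 1000.0 / (avg_seek_ms + avg_rot_ms)

print(rpm_only_iops(7200))                  # → 120.0
print(rpm_only_iops(15000))                 # → 250.0
print(seek_plus_rotation_iops(15000, 3.5))  # ≈ 182 with our 3.5ms seeks
print(seek_plus_rotation_iops(15000, 2.0))  # → 250.0 even with Tao's 2ms seek
```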
Got a funny bone and a camera? Think you’d enjoy a fabulous weekend for two at a luxury resort in California’s Napa Valley? Have I got the contest for you!
The contest is simple: Grab your camera and take a photo of Smuggy, our mascot, in an outrageous, exotic, or surprising locale and you could score big. No purchase necessary, open to anyone in the US. (Sorry to our friends around the world – dang lawyers! 😦 )
Create a Smuggy out of whatever you’ve got handy or drop us an email and we’ll send you stickers, camera straps, and more. Here’s our very first entry:
I’m happy to announce a few API-related things:
- v1.2.0 is now marked as stable. Everyone should be safe to use this.
- v1.2.1 is now the new beta release and introduces lots of new stuff. The docs aren’t fully updated yet, but you can read about the new methods and variables on this dgrin thread.
- A SmugMug API contest! Win an iPhone (or something of equivalent value if you’re not in an iPhone territory or just don’t want one)
I’m hoping we can make contests like these a regular occurrence, but we’ll see how this one goes first. 🙂
You can find all the details over on this dgrin thread announcing the contest.
Oh, and don’t forget that SmugMug API developers get lifetime free Pro accounts. So there’s no cost to enter and play.
So I’ve written about storage before, specifically our quest for The Perfect DB Storage Array and how Sun’s storage didn’t stack up with their excellent servers. As you can probably tell, I spend a lot of my time thinking about and investigating storage – both small-and-fast for our DBs and huge-and-slower (like S3) for our photos.
I believe we’ve finally found our best bang-for-the-buck storage arrays: Dell MD3000. Here’s a quick rundown of why we like them so much, how to configure yours to do the same, and where we’re headed next:
- The price is right. I have no idea why storage companies continue to show expensive prices on their websites and then quote you much, much cheaper prices (everyone does it), and Dell is no exception. Get a quote – you’ll be shocked at how affordable they really are.
- DAS via SAS. If you’re scaling out, rather than up, DAS makes the most sense and SAS is the fastest, cheapest interconnect.
- 15 spindles at 15K rpm each. Yum. Both fast and odd. Why odd? Because you can make a 14-drive RAID 1+0 and have a nice hot spare standing by.
- 512MB of mirrored battery-backed write cache. Use write-back mode to have nice fast writes that survive almost all failure scenarios.
- You can disable read caching. This is a big one. Given we have relatively massive amounts of RAM (32GB on server vs 512MB on controller) *and* that the DB is intelligent at reading and pre-fetching precisely the stuff it wants, read caching is basically useless. Not only that, but it harms performance by getting in the way of writes – we want super-fast non-blocking writes. That’s the whole point.
- You can disable read-ahead prefetching. Again, our DB does its own pre-fetching already, so why would we want the controller trying to second guess our software? We don’t.
- The stripe sizes are configurable up to 512KB. This is important because if you’re going to read, say, a 16KB page for a DB, you want to involve only a single disk as often as you can. The bigger the stripes, the better the odds are of only using a single disk for each read.
- The controller ignores host-based flush commands by default. Thank goodness. The whole point of a battery-backed write-back cache is to get really fast writes, so ignoring those commands from the host is key.
- They support an ‘Enhanced JBOD’ mode whereby you can get access to the “raw” disks as their own LUNs (in this case, 15), but writes still flow through the write-cache. Why is this cool? Because you can move to 100% server-controlled software storage systems, whether they’re RAID or LVM or whatever. More on this below…
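To put a rough number on the stripe-size point above: assuming reads start at uniformly random byte offsets (a simplification of mine – aligned DB pages do even better), the chance a 16KB read straddles a stripe boundary, and therefore touches two disks, shrinks as stripes grow:

```python
def crossing_probability(read_kb: float, stripe_kb: float) -> float:
    """Approximate chance a read starting at a uniformly random byte
    offset straddles a stripe boundary (continuous approximation):
    the read crosses iff it starts in the last read_kb of a stripe."""
    if read_kb >= stripe_kb:
        return 1.0
    return read_kb / stripe_kb

# 16KB DB page reads against various stripe sizes:
for stripe in (64, 128, 256, 512):
    print(stripe, crossing_probability(16, stripe))
# 64KB stripes:  25% of reads touch two disks
# 512KB stripes: ~3% do
```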
Ok, sounds good, you’re thinking, but how do I get at all these goodies? Unfortunately, you have to use a lame command-line client to handle most of this stuff and it’s a PITA. However, you asked, so here you go (commands can be combined):
- disable read cache: set virtualDisk["1"] readCacheEnabled=FALSE
- disable read pre-fetching: set virtualDisk["1"] cacheReadPrefetch=FALSE
- change stripe size: read the docs for how to do this on new virtualDisks, but to change existing ones online – set virtualDisk["1"] segmentSize=512
- Enhanced JBOD: Just create 15 RAID 0 virtual disks! 🙂
- BONUS! modify write cache flush timings: set virtualDisk["1"] cacheFlushModifier=60 – This is an undocumented command that changes the cache flush timing to 60 seconds from the default of 10 seconds. You can also use words like ‘Infinite’ if you’d like. I haven’t played with this much, but 10 seconds seems awfully short, so we will.
Wishlist? Of course I have a wishlist. Don’t I always? 🙂
- This stuff should be exposed in the GUI. The stripe size setting, especially, should be easily selectable when you’re first setting up your disks. It’s just dumb that it’s not.
- Better documentation. After a handy-dandy Google search, it appears as if the Dell MD3000 is a rebranded LSI/Engenio array, which lots of other companies also appear to have rebranded, like the IBM DS4000. But the Engenio docs are more thorough, which is how I found the cacheFlushModifier setting. (On a side note, why do these companies hide who’s building their arrays? They don’t hide that Intel makes the CPUs… Personally, I’d rather know)
- Faster communication. I asked Dell quite a while ago for information on settings like these and had to wait a while for a response. I imagine this might be related to the Engenio connection – Dell may have just not known the answers and had to ask.
- Bigger stripe sizes. I’d love to benchmark 1MB or bigger stripes with our workload.
- Better command-line interface. Come on, can’t we just SSH into the box and type in our commands already?
Ok, so where are we going next?
- ZFS. I believe the ‘Enhanced JBOD’ mode (15 x RAID-0) would be perfect for ZFS, in a variety of modes (striped + mirrored, RAID-Z, etc). So we’re gonna get with Sun and do an apples-to-apples comparison and see what shakes out. Our plan is to take two Sun X2200 M2 servers, hook them up to a Dell MD3000 apiece, run LVM/software RAID on one and ZFS on the other, then put them under a live workload and see which is faster. My hope is that ZFS will win or be close enough that it doesn’t matter. Why? Because I love ZFS’s data integrity and I believe COW will let us more easily add spindles and see a near-linear speed increase.
- Flash. We’ve been playing around with the idea of flash storage (SSD) on our own for a while, and have been talking to a number of vendors about their approaches. It’s looking like the best bet may be to move from a two-tier storage system (system RAM + RAID disks) to a three-tier system (system RAM + flash storage + RAID disks) to dramatically improve I/O. If we come across anything that works in practice, rather than theory, I’ll definitely let you know.
- MySQL. We’ve now got boxes which appear to be neither CPU-bound *nor* I/O-bound, but are instead limited by something in software on the boxes, either in MySQL or Linux. Tracking this down is going to be a pain, especially since it’s out of my depth, but we’ve gotta get there. If anyone has any insight or ideas on where to start looking, I’m all ears. We have MySQL Enterprise Platinum licenses so I can probably get MySQL involved fairly easily – I just haven’t had time to start investigating yet.
You might want to check out this review of the MD3000 as well; he’s gone more in-depth on some of the details than I have.
Finally, I’m hoping other storage vendors perk up and pay attention to the real message here: Let us configure our storage. Provide lots of options, because ‘one size fits all’ is the exception, not the rule.
Sun’s announcement today that they’re unifying Storage and Servers under Systems is a good move, I think, but they’ve still got work to do. I believe (and everyone at Sun has heard this from me before) that their storage has been failing because it’s not very good. I hope this change does make a difference – because Jonathan’s right that storage is getting to be more important, not less.
UPDATE: One of the Dell guys who works with us (and helped us get all the nitty gritty details to configure these low-level settings) just posted his contact info in the comments. Feel free to call or email him if you have any questions.