Dell MD3000 – Great DAS DB Storage | SmugMug's Don MacAskill

October 1, 2007 Don MacAskill

So I’ve written about storage before, specifically our quest for The Perfect DB Storage Array and how Sun’s storage didn’t stack up with their excellent servers. As you can probably tell, I spend a lot of my time thinking about and investigating storage – both small-and-fast for our DBs and huge-and-slower (like S3) for our photos.

I believe we’ve finally found our best bang-for-the-buck storage arrays: Dell MD3000. Here’s a quick rundown of why we like them so much, how to configure yours to do the same, and where we’re headed next:

The price is right. I have no idea why these companies (everyone does it) continue to show expensive prices on their websites and then quote you much much cheaper prices, but Dell is no exception. Get a quote, you’ll be shocked at how affordable they really are.
DAS via SAS. If you’re scaling out, rather than up, DAS makes the most sense and SAS is the fastest, cheapest interconnect.
15 spindles at 15K rpm each. Yum. Both fast and odd. Why odd? Because you can make a 14 drive RAID 1+0 and have a nice hot spare standing by.
512MB of mirrored battery-backed write cache. Use write-back mode to have nice fast writes that survive almost all failure scenarios.
You can disable read caching. This is a big one. Given we have relatively massive amounts of RAM (32GB on server vs 512MB on controller) *and* that the DB is intelligent at reading and pre-fetching precisely the stuff it wants, read caching is basically useless. Not only that, but it harms performance by getting in the way of writes – we want super-fast non-blocking writes. That’s the whole point.
You can disable read-ahead prefetching. Again, our DB does its own pre-fetching already, so why would we want the controller trying to second guess our software? We don’t.
The stripe sizes are configurable up to 512KB. This is important because if you’re going to read, say, a 16KB page for a DB, you want to involve only a single disk as often as you can. The bigger the stripes, the better the odds are of only using a single disk for each read.
The controller ignores host-based flush commands by default. Thank goodness. The whole point of a battery-backed write-back cache is to get really fast writes, so ignoring those commands from the host is key.
They support an ‘Enhanced JBOD’ mode whereby you can get access to the “raw” disks as their own LUNs (in this case, 15), but writes still flow through the write-cache. Why is this cool? Because you can move to 100% server-controlled software storage systems, whether they’re RAID or LVM or whatever. More on this below…
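
To put rough numbers on the stripe-size point above, here is a minimal sketch of the odds that a single 16KB read stays on one disk. It assumes a read can start at any 512-byte sector inside a per-disk segment (what the post calls the stripe size) with equal likelihood; the function name and the sector-alignment assumption are illustrative, not anything from the MD3000 documentation.

```python
SECTOR = 512  # bytes; assumed granularity at which a read may start


def single_disk_probability(read_bytes: int, segment_bytes: int,
                            sector: int = SECTOR) -> float:
    """Chance that one read falls entirely inside one segment (one disk).

    Assumes every sector-aligned start position inside a segment is
    equally likely, which is a pessimistic model for page-aligned files.
    """
    if read_bytes > segment_bytes:
        return 0.0
    starts = segment_bytes // sector                        # possible start positions
    fitting = (segment_bytes - read_bytes) // sector + 1    # starts that do not cross the boundary
    return fitting / starts


if __name__ == "__main__":
    page = 16 * 1024  # the 16KB DB page from the post
    for seg_kb in (16, 64, 128, 256, 512):
        p = single_disk_probability(page, seg_kb * 1024)
        print(f"{seg_kb:>3}KB segment: {p:.1%} of 16KB reads touch a single disk")
```

Under this model a 64KB segment keeps about three quarters of 16KB reads on one disk, while 512KB keeps about 97% of them there. If the filesystem and partition are aligned so that every page starts on a 16KB boundary inside the segment, no read crosses a boundary at all, so treat these figures as a worst-case-leaning estimate.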

Ok, sounds good, you’re thinking, but how do I get at all these goodies? Unfortunately, you have to use a lame command-line client to handle most of this stuff and it’s a PITA. However, you asked, so here you go (commands can be combined):

disable read cache: set virtualDisk["1"] readCacheEnabled=FALSE
disable read pre-fetching: set virtualDisk["1"] cacheReadPrefetch=FALSE
change stripe size: read the docs for how to do this on new virtualDisks, but to do online changing of existing ones – set virtualDisk["1"] segmentSize=512
Enhanced JBOD: Just create 15 RAID 0 virtual disks! 🙂
BONUS! modify write cache flush timings: set virtualDisk["1"] cacheFlushModifier=60 – This is an undocumented command that changes the cache flush timing to 60 seconds from the default of 10 seconds. You can also use words like ‘Infinite’ if you’d like. I haven’t played with this much, but 10 seconds seems awfully short, so we will.
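
If you have more than one virtual disk (say, the 15 single-disk ones from the Enhanced JBOD trick), typing these by hand gets old fast. Below is a minimal sketch that only builds the command text quoted above, with plain ASCII double quotes around each disk name. The helper name and the disk names "1" through "15" are assumptions for illustration; how the lines get submitted to the array's CLI, and any terminator or wrapper syntax it needs, is deliberately left out, so check the CLI guide before running anything.

```python
def md3000_tuning_commands(disk_names, segment_size_kb=512, flush_modifier=None):
    """Build the per-virtual-disk `set virtualDisk` lines quoted in the post.

    Only produces text; it does not talk to the array. `flush_modifier`
    is the undocumented cacheFlushModifier value and is skipped when None.
    """
    commands = []
    for name in disk_names:
        target = f'set virtualDisk["{name}"]'
        commands.append(f"{target} readCacheEnabled=FALSE")
        commands.append(f"{target} cacheReadPrefetch=FALSE")
        commands.append(f"{target} segmentSize={segment_size_kb}")
        if flush_modifier is not None:
            commands.append(f"{target} cacheFlushModifier={flush_modifier}")
    return commands


if __name__ == "__main__":
    # Hypothetical disk names: one virtual disk per physical drive.
    names = [str(n) for n in range(1, 16)]
    for line in md3000_tuning_commands(names, flush_modifier=60):
        print(line)
```

Printing to a file first and reviewing it before applying the settings keeps a typo from landing on a production array.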

Wishlist? Of course I have a wishlist. Don’t I always? 🙂

This stuff should be exposed in the GUI. Especially the stripe size setting should be easily selectable when you’re first setting up your disks. It’s just dumb that it’s not.
Better documentation. After a handy-dandy Google search, it appears as if the Dell MD3000 is a rebranded LSI/Engenio array, which lots of other companies also appear to have rebranded, like the IBM DS4000. But the Engenio docs are more thorough, which is how I found the cacheFlushModifier setting. (On a side note, why do these companies hide who’s building their arrays? They don’t hide that Intel makes the CPUs… Personally, I’d rather know)
Faster communication. I asked Dell quite a while ago for information on settings like these and I had to wait a while for a response. I imagine this might be related to the Engenio connection – Dell may have just not known the answers and had to ask.
Bigger stripe sizes. I’d love to benchmark 1MB or bigger stripes with our workload.
Better command-line interface. Come on, can’t we just SSH into the box and type in our commands already?

Ok, so where are we going next?

ZFS. I believe the ‘Enhanced JBOD’ mode (15 x RAID-0) would be perfect for ZFS, in a variety of modes (striped + mirrored, RAID-Z, etc). So we’re gonna get with Sun and do an apples-to-apples comparison and see what shakes out. Our plan is to take two Sun X2200 M2 servers, hook them up to a Dell MD3000 apiece, run LVM/software RAID on one and ZFS on the other, then put them under a live workload and see which is faster. My hope is that ZFS will win or be close enough that it doesn’t matter. Why? Because I love ZFS’s data integrity and I believe COW will let us more easily add spindles and see a near-linear speed increase.
Flash. We’ve been playing around with the idea of flash storage (SSD) on our own for a while, and have been talking to a number of vendors about their approaches. It’s looking like the best bet may be to move from a two-tier storage system (system RAM + RAID disks) to a three-tier system (system RAM + flash storage + RAID disks) to dramatically improve I/O. If we come across anything that works in practice, rather than theory, I’ll definitely let you know.
MySQL. We’ve now got boxes which appear to not be CPU-bound *or* I/O bound but are instead bounded by something in software on the boxes, either in MySQL or Linux. Tracking this down is going to be a pain, especially since it’s out of my depth, but we’ve gotta get there. If anyone has any insight or ideas on where to start looking, I’m all ears. We have MySQL Enterprise Platinum licenses so I can probably get MySQL involved fairly easily – I just haven’t had time to start investigating yet.

You might also want to check out this review of the MD3000; he’s gone more in-depth on some of the details than I have.

Finally, I’m hoping other storage vendors perk up and pay attention to the real message here: Let us configure our storage. Provide lots of options, because ‘one size fits all’ is the exception, not the rule.

Sun’s announcement today that they’re unifying Storage and Servers under Systems is a good move, I think, but they’ve still got work to do. I believe (and everyone at Sun has heard this from me before) that their storage has been failing because it’s not very good. I hope this change does make a difference – because Jonathan’s right that storage is getting to be more important, not less.

UPDATE: One of the Dell guys who works with us (and helped us get all the nitty gritty details to configure these low-level settings) just posted his contact info in the comments. Feel free to call or email him if you have any questions.

Categories: datacenter

Comments (66)

Luntastic

October 1, 2007 at 9:16 pm

How are you planning on handling redundancy and failover?
Don MacAskill

October 1, 2007 at 10:06 pm

@Luntastic: Good question. We handle redundancy and failover at the software level, using replication, rather than at the hardware level with multipathing and the like.

When a server fails, we have other copies standing by, ready to take the load. And when they do fail, thanks to DAS (rather than, say, internal storage), we just slide the server out of the rack, slide an identical cold spare in, power it on, and we’re good to go.

We prefer to buy lots of cheap redundant hardware rather than small amounts of expensive hardware as a general rule. Our DB layer is no different.
Tao Shen

October 1, 2007 at 11:23 pm

Hey Don:

Yes, the Dell MD3000 is an excellent array.

The MySQL issues you speak of are due to mutex contention and InnoDB locking bugs. Supposedly, MySQL 5.0.37 or above helps a little, but scaling still isn’t linear past 2 cores, as opposed to Postgres’s linear scaling to 16 cores and DB2’s linear scaling to 96 cores.

mysql’s bug:
http://bugs.mysql.com/bug.php?id=15815

Postgres beats mysql in 8 to 16 core db boxes hands down
http://tweakers.net/reviews/674/6/database-test-8-way-opteron-pagina-6.html

Too bad, postgreSQL’s replication is not as robust as mysql’s yet.
Tao Shen

October 1, 2007 at 11:47 pm

One more question, Don:

what SAS controller do you use in the servers to connect to the DAS? Perc 5E? It’s funny how you are wasting the “RAID” functions on the Perc 5E to connect to another RAID controller basically.
KwangErn

October 2, 2007 at 4:56 am

How would this weigh up against building your own system? I’m sure you’d have more control when building your own machines, not to mention specs matched to what you really need.
Don MacAskill

October 2, 2007 at 9:12 am

@Tao Shen: No, we don’t use the RAID SAS card(s) from Dell, we’re using the non-RAID one. It’s just an HBA.

We massively prefer external raid to internal raid for a few reasons. The biggies are: Mirrored RAID on two controllers in the MD3000 means even a controller failure won’t lose writes, plus cold swapping a server is a much easier and faster operation when you don’t have to switch RAID controllers.
Don MacAskill

October 2, 2007 at 9:13 am

@KwangErn: I don’t know about you, but I don’t have the resources to custom build my own RAID enclosure complete with low-level firmware, mirrored controller cards, etc etc. Even companies like Dell and IBM prefer to put their name on someone else’s array. If they can’t build it, I certainly can’t.

Maybe I’m not understanding the question, though… ?
Don MacAskill

October 2, 2007 at 12:34 pm

@TaoShen: Yeah, I’m aware of that MySQL bug. In fact, you can see that I’m a participant in the discussion thread there. 🙂
Wes Felter

October 2, 2007 at 1:41 pm

When you’re testing ZFS it would be nice to compare a SAS JBOD against the MD3000 in “JBOD” mode – this would show exactly how much performance the write cache delivers.
Casey

October 2, 2007 at 5:13 pm

Don, I remember asking you about the MD3000 earlier this year and I am happy to report we just ordered two of them for more than 50% off list price (you are right, Dell’s web price is way high). We are also going to try out the MD3000i (iSCSI version) for some internal/corporate/backup stuff with a tiered storage system (15 SAS drives in the 3000 chassis & 15 SATA drives in the MD1000 chassis). It’s much easier pulling the trigger on new technology when you know other self-funded web companies are already happy with their purchase(s) of the same gear! Keep up the great work.
dpk

October 2, 2007 at 5:38 pm

“Too bad, postgreSQL’s replication is not as robust as mysql’s yet.”

And MySQL’s replication is about as non-robust as it can get, so that’s really saying something. 😉 On my wishlist: multi-master slaving, instant failover, and actual *data* replication rather than merely *query* replication.
Adam Jacob Muller

October 2, 2007 at 6:19 pm

Where did you guys actually find the CLI utility for download? I scoured Dell’s website and can’t find it.
Justin

October 2, 2007 at 6:31 pm

The CLI comes with the Dell resource CD that you can get if you drill down for “support” on the dell website for the MD 3000. It also comes with two or three chunky Java based daemons that you really don’t need. It is called SMcli. The easiest thing to do is to dump the config then look at the text file, then you can modify and reload it.

When I was doing testing I didn’t find turning off the read cache was very helpful. It is possible that the controllers use intelligent write caching where they use more of the memory for write caching, as required. Read caching really helped with read performance. I think disabling read cache means that a striped RAID unit can no longer get much more than a single SAS disk’s read performance, which is pretty sad when you consider how many disks are available!

For me, the negative on this array is that it has an entry-level controller with minimum cache memory and minimum bandwidth capability. The PERC 5/e cards are capable of stonking bandwidth and the MD3000 has not one, but 4 SAS ports, but the whole 3000 controller can’t manage better than 400MB/sec of net throughput, which it only approaches with difficulty. If you have two hosts and four cards and an MD1000 expansion array connected to the MD3000 and a bunch of LUNs (30 drives), what is wrong with expecting more than 400MB/sec total out of it?
Matt

October 2, 2007 at 7:23 pm

@Justin: I would just like to remind you what Don mentioned.

“We’ve now got boxes which appear to not be CPU-bound *or* I/O bound but are instead bounded by something in software on the boxes, either in MySQL or Linux.”

I don’t know how much I/O throughput a system is capable of handling, and it may vary based on application, but in this instance, he is claiming that it is not the I/O throughput that is holding anything up. So if that is the case it wouldn’t matter if it was able to shovel data to the system any faster.

Though I would find it fascinating to know under what circumstances a 400MB/s pipe would become a bottleneck.
Don MacAskill

October 2, 2007 at 8:39 pm

@Justin:

Honestly, I couldn’t care less about MB/s. I care far more about I/O per second, and most of our I/O is very small (InnoDB uses 16KB pages, so let’s say we were doing 6000 IOPS: that’d be 93.75MB/s @ 16KB, well under 400MB/s).

Everyone’s workloads are different, so if >400MB/s is your target, you’ll definitely need to look elsewhere. But for lots of relatively cheap IOPS, this enclosure is pretty dang good.
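
The arithmetic behind that 93.75MB/s figure is easy to check. A minimal sketch, assuming 1MB = 1024KB (which is what makes the number come out to 93.75); the function name is just for illustration:

```python
def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Bandwidth implied by an IOPS rate at a fixed I/O size (1MB = 1024KB)."""
    return iops * io_size_kb / 1024.0


if __name__ == "__main__":
    mb_s = throughput_mb_per_s(6000, 16)   # 6000 IOPS of 16KB InnoDB pages
    print(mb_s)                            # 93.75
    print(f"{mb_s / 400:.0%} of a 400MB/s controller limit")
```

So a small-block workload would need several times 6000 IOPS before the controller's sequential ceiling came into play.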
Don MacAskill

October 2, 2007 at 8:41 pm

@dpk: Amen to multi-master replication and decent failover. As for row-based replication, rather than query-based, I think that’s in 5.1 already.
Tao Shen

October 2, 2007 at 10:08 pm

it’s amazing that mySQL closed that bug, and mysql 5.0.45 still can’t scale past 4 cores(16 concurrent connections).

Given barcelonas and penryn, 8 core boxes are dirt cheap now, when is mysql support going to suck it up and start scaling like DB2 and Postgres?
Justin

October 3, 2007 at 9:54 am

On 4 core machine with mysql 5.1

1 parallel run – 2 seconds 1 core busy
4 parallel runs – 4 seconds 4 cores 100% user busy
16 parallel runs – 16 seconds 4 cores 100% user busy

Conclusion: after cpu saturation, linear scaling. Parallel work is actually more efficient, not less.

So why are you surprised they closed the bug? The original issue it uncovered is fixed. It is a terrible test case anyway – far from real world usage. Select * from an innodb table does a table scan, and locks every single row while it works.

I am skeptical that there is an easy-to-hit dead-spot for mysql/innodb with 8 cores where disk isn’t busy, cpu isn’t busy, but adding work does not get more work done. I watch mysql performance blog, they are all over these kinds of problems (for instance http://www.mysqlperformanceblog.com/2007/09/26/innodb-auto-inc-scalability-fixed/ ) and don’t seem to be tracking any such _general_ scaling issues. With all respect to Don, if he is observing scaling issues but doesn’t have a test case that shows it, then they could all be down to a single small issue with his pattern of use, just like the auto-inc scaling bug I linked to.

As for 400mb/sec vs random IOs, yeah the 3000 scales with small random I/O ops. One 15k SAS drive can do about 100 tiny IOP/s, the array isn’t any faster at one single random request than that, but it degrades more slowly as you add work, topping out at something over 1000 IOP/second. Obviously how far you can go down that route depends on whether you picked RAID5 or RAID1 and how many drives are in the LUN, and whether your random small requests are really random or have hot-spots.

Is 400MB+/sec of sequential read or write useful? Of course it is! (Actually, on a single LUN on the MD3000 you don’t even hit 200MB/sec.) It makes for much faster sysadmin work for one (copying around a mysql data directory for instance) and makes table scans or optimize-table work faster. If you stuff 8 SAS drives into a bog standard 3u box connected to a single PCI-X SAS card, you can read at over 1GB/sec using linux, so what’s wrong with wanting that performance in a much more expensive enclosure?

They reserve the faster controllers not for cost reasons (the controller uses a puny processor) but for marketing. Oh well. Whatever.
Tao Shen

October 4, 2007 at 12:39 am

Justin: I am referring to the 5.0 community server. 5.1 is not released to the community as a production ready release.

from the mysql performance blog, which i monitor, “Good news is the bug is fixed. Bad news is it is fixed only 5.1.22, which is not released yet.
I wonder if the fix is going to be ported to 5.0, as I mentioned it affected many production systems and not all of them are ready to upgrade to 5.1.”

i am excited for 5.1, since a lot of great work is merged into it.

“One 15k SAS drive can do about 100 tiny IOP/s”
no. A 7200 rpm SATA drive with 10ms seek can do 100 IOPS theoretically. 15K SAS with 2ms seek usually do 400-500 max IOPS per drive.

That’s why 15x15K SAS drive arrays can theoretically do about 6K IOPS …
Don MacAskill

October 4, 2007 at 8:54 am

@Tao Shen: I hate to disagree with you, but…

The bug is fixed in 4.1.23, 5.0.37, and 5.1.something. Check the bug report – the MySQL performance blog is behind or out-of-date.

Also, if we’re talking about small random I/Os, a 15K drive can’t come close to 400-500 IOPS.

A 15K drive rotates at 15,000 rpm, or 250 rps, resulting in a theoretical max of 250 IOPS.

Write-back caches like the MD3000 make this number much higher by pooling and combining writes (and/or caching and prefetching reads, depending on your settings), thank goodness, but the drive itself can’t do more than 250.
mfc

October 6, 2007 at 12:14 pm

Hi Don,

Where do you see evidence that the auto-inc problem is fixed in 4.1.23?
I’m looking at http://dev.mysql.com/doc/refman/4.1/en/news-4-1-23.html
and http://bugs.mysql.com/bug.php?id=16979
but I don’t see anything about it.

This bug is a really big deal for me

Thanks,

Mike
Don MacAskill

October 6, 2007 at 7:54 pm

@Mike:

Re-read that bug again – it’s not about auto-increment, it’s about linear scalability as you add CPUs/cores. Thankfully, this is fixed (or greatly remedied) in 4.1.23, 5.0.37, and any recent 5.1 builds.

The auto-increment bug issue isn’t one we personally face, yet, but I’m glad to see it’s been fixed in 5.1.22: http://www.mysqlperformanceblog.com/2007/09/26/innodb-auto-inc-scalability-fixed/

I’m not sure if they’ll backport to 5.0 (I’d expect so since it affects paying customers and 5.1 isn’t released yet), but I’m guessing 4.1 is unlikely.
mfc

October 7, 2007 at 1:50 pm

Hi,

I see you were talking about bug 15815. Someone sort
of blended the discussion over to bug 16979.

BTW, we are testing an SSD from SanDisk (formerly M-Systems)
http://www.sandisk.com/OEM/ProductCatalog(1331)-SSD_Ultra320_Wide_SCSI_35.aspx for use as ib_log* replacements.

So far it isn’t working with our MegaRaid card (getting i/o errors).
Trying it direct, where it works, but it is about 2x *slower* than a single
disk. We are trying to get some support on this, so we will see how it
turns out. Have you found a cheap SSD that works in this type of an
application?
Tao Shen

October 7, 2007 at 11:18 pm

@Don: I respect you a lot, but your math is wrong:

“A 15K drive rotates at 15,000 rpm, or 250 rps, resulting in a theoretical max of 250 IOPS.”

First, you are assuming that the drive has one platter. Some drives have 2-4 platters.

Second, you are assuming that only 1 IOP can occur per rps, which is false.

Third, if one 15K drive can only do 250 IOPS, how can 15 15K drives do 6000 IOPS? hehehe

The correct assumption is that for truly random access to occur, the drive would try to seek first. So the theoretical maximum IOPS is the inverse of the seek time. For example, 7200rpm drives have 10ms seek. Therefore, the max IOPS for those drives = 100 IOPS. 15K Rpm drives have 2ms seek. Therefore, their max IOPS = 1/2ms = 500.

Of course, they will never reach max: usually 7200rpm drives do 80-90 IOPS, and 15K drives do 300-400, depending on the System on Chip controller that’s used on the drive.

http://www.storagereview.com/Testbed4Compare.sr

As you can see, one of the best drives right now, the Seagate Cheetah 15K.5, does 417 IOPS at an I/O queue depth of 128.
Don MacAskill

October 7, 2007 at 11:30 pm

@Tao Shen:

Interesting. I don’t see how my math was wrong – just possibly my assumption. 🙂

Assuming a purely random read/write ratio, it’s fairly rare for more than 1 IOP to occur per revolution, but yes, it can happen. I’d call it the exception rather than the rule, though. Which means we are revolution limited, possibly, and not just seek limited. One of the big limiters for seek time is revolutions, after all.

Note that for a single drive’s theoretical max, I’m talking about fully flushed writes, so caching isn’t involved at any level – OS, controller, disk.

6000 IOPS can be achieved over 15 drives by the write controller queuing up writes and combining them. 6000 IOPs to the OS -> less than 6000 IOPS by the time they hit spinning metal.
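
The write-combining idea can be illustrated with a toy model: writes that touch or overlap on disk get merged into one larger write before they are flushed. This is only a sketch of the general technique under that simple assumption; it is not the MD3000 firmware's actual algorithm, and the function name and the (offset, length) representation are made up for the example.

```python
def coalesce_writes(writes):
    """Merge writes that touch or overlap into single larger writes.

    `writes` is an iterable of (offset, length) pairs in bytes, as a
    cache might hold them before flushing. Returns the merged list,
    sorted by offset.
    """
    merged = []
    for offset, length in sorted(writes):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Adjacent to or overlapping the previous run: extend it.
            start, prev_len = merged[-1]
            end = max(start + prev_len, offset + length)
            merged[-1] = (start, end - start)
        else:
            merged.append((offset, length))
    return merged


if __name__ == "__main__":
    page = 16 * 1024
    # Five host writes: three neighbouring pages plus one page written twice.
    host_writes = [(0, page), (page, page), (10 * page, page),
                   (2 * page, page), (10 * page, page)]
    disk_writes = coalesce_writes(host_writes)
    print(len(host_writes), "host writes ->", len(disk_writes), "disk writes")
    print(disk_writes)
```

How much a real cache gains from this depends entirely on how much locality and rewriting the workload has; a truly random write stream with no repeats would coalesce very little.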

I’ve never seen a 15K drive do 2ms (the best ones we have do 3.5ms *best case*, not even average, I believe), but maybe you have better drives than I do. Seagate doesn’t make any, for example. In any event, I’ve never been able to drive a 15K drive anywhere near 500 IOPS in a random read/write pattern. Not anywhere close. I’m lucky if I can get near 250.

I’ll be the first to admit that maybe I’m giving rpm too much weight over seek, but in my experiments over the years, that seems to hold true. I suspect the real truth is that rpm is a huge factor in seek time, so they’re so closely correlated that for someone like me, who’s not at the drive firmware level measuring microseconds, they’re indistinguishable.

In any event, I think the question is an interesting one, so I’ll see if I can write up a question on my blog and we’ll see if some experts wanna have a mud wrestling match in the comments. I’m by no means an expert, so I’ll watch from the sidelines and take notes. 🙂
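
For anyone who wants to compare the estimates in this thread side by side, here is a minimal sketch of three simple per-drive models. The first two are the ones argued above (one I/O per revolution, and one I/O per seek). The third is an assumption added here, not either commenter's formula: a common back-of-the-envelope rule that charges each random I/O an average seek plus half a revolution of rotational latency. All three ignore transfer time, command queuing, and caching, and the function names are illustrative.

```python
def revolution_bound_iops(rpm: int) -> float:
    """One I/O per platter revolution."""
    return rpm / 60.0


def seek_bound_iops(seek_ms: float) -> float:
    """One I/O per seek, ignoring rotation entirely."""
    return 1000.0 / seek_ms


def seek_plus_rotation_iops(seek_ms: float, rpm: int) -> float:
    """One I/O per (seek + half a revolution of rotational latency)."""
    half_revolution_ms = 0.5 * 60_000.0 / rpm
    return 1000.0 / (seek_ms + half_revolution_ms)


if __name__ == "__main__":
    print(revolution_bound_iops(15_000))                    # 250.0
    print(seek_bound_iops(2.0))                             # 500.0, using the 2ms seek quoted above
    print(round(seek_plus_rotation_iops(3.5, 15_000), 1))   # 181.8, using the 3.5ms best-case seek
```

With the 3.5ms seek figure, the combined model lands under 250 IOPS, which is at least in the neighbourhood of the "lucky if I can get near 250" experience described above; deep command queues can push a real drive higher because the drive reorders requests to shorten seeks.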
Don MacAskill

October 8, 2007 at 8:19 pm

@Tao Shen: The discussion continues over at http://blogs.smugmug.com/don/2007/10/08/hdd-iops-limiting-factor-seek-or-rpm/

Jeff Bonwick, one of the inventors of ZFS, has already weighed in. Be interesting to see who else shows up.
Rich

October 9, 2007 at 2:40 pm

Do you have a Dell premier account, or was the price different than that posted on their public website?
robert towne

October 25, 2007 at 10:16 am

have you added more than 1 md3000+2md1000 combinations?

I had the Dell sales rep tell me that they’d support as many MD3000s as I could attach HBAs for in my server. I started having problems 2 months after going live, and nearly 1 month into the support call (yes, it really has taken that long) I am now being told you can only have 1 MD3000 PER HOST. I specifically asked our sales engineer to verify how many MD3000s we could have on 1 host and was told there was no problem having multiples. Now I basically have unsupported equipment and Dell engineers that will not help me. Not a place I enjoy being.
Brent Condreay

October 26, 2007 at 9:26 am

Hello everyone! My name is Brent Condreay and I am the Server/Storage Specialist from the inside sales team that works with Don and Chris. I’ve got to say that it’s good to see our product making its way into these kinds of discussions. The technical knowledge you all casually throw around on this thread is exceptional to say the least!

Anyways, if any of you have any installation, service, pricing, or purchase questions regarding MD3000 or any other server/storage hardware feel free to shoot me an email or give me a call and use Don as your reference. My contact information is below:

@ Don: Thanks for the free pub buddy, I’m trying to get a halloween picture with your mascot to send your way!

Contact Info:
Brent Condreay
brent_condreay@dell.com
1-800-456-3355 x713-4031
Fax: 1-512-283-9790
Ken Robertson

October 31, 2007 at 3:50 pm

Did you experience the same 300MB/sec limit on read throughput? I am contemplating using an MD3000 clustered with two servers to provide an HA iSCSI server, though I’m curious if it really does have a 300MB/sec limit. Even though that wouldn’t really max out over iSCSI, even if I bond several connections together, it seems rather weak considering the box is connected over a multi-lane connection, so theoretically, it should be getting about 4x that.
Jimmy Chiu

November 3, 2007 at 6:17 pm

I have just finished setting up 2 nodes NX1950 cluster w/MD3000 + MD1000, (30) 15k 300GB SAS drives.

Created 2 virtual disks

Disk 1. RAID 10 (8 spindles)
Disk 2. RAID 5 (18 spindles)

I used IOmeter from http://www.iometer.org to run some benchmark on one of the clustered node.

On both virtual disks:
With 16 outstanding I/O, 1MB, 75%Read25%write, and 100% Random settings, I was getting roughly 415MB~430MB/s transfer rate.
With 16 outstanding I/O, 16k, 75%Read25%Write, and 100% Random settings, I was getting roughly 110MB~120MB/s transfer rate.

I am running into problems, however. I created a test LUN on the RAID10 virtual disk and connected to it from one of my test servers via a private network segment with dedicated Dell PowerConnect 5324 switches.

The IOMeter benchmark on the test LUN is really disappointing.
With 16 outstanding I/O, 16k, 75%Read25%Write, and 100% Random settings, I was only getting around 9-10MB/s transfer rate.

Any advice?
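
Since the 16k results above are quoted in MB/s, it can help to restate them as IOPS before comparing the two setups. A minimal sketch, assuming 16k means 16KB and 1MB = 1024KB; the function name is illustrative:

```python
def iops_from_throughput(mb_per_s: float, io_size_kb: float) -> float:
    """I/O operations per second implied by a transfer rate at a fixed I/O size."""
    return mb_per_s * 1024.0 / io_size_kb


if __name__ == "__main__":
    # 16KB random results quoted above: (label, low MB/s, high MB/s)
    for label, low, high in (("cluster node", 110, 120), ("test LUN", 9, 10)):
        print(f"{label}: {iops_from_throughput(low, 16):.0f}"
              f" to {iops_from_throughput(high, 16):.0f} IOPS")
```

Under those assumptions the cluster node result works out to roughly 7,000 to 7,700 IOPS and the test LUN to roughly 580 to 640, so the gap is more than tenfold at the same I/O size and queue depth.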
j

November 6, 2007 at 1:04 pm

“This stuff should be exposed in the GUI. Especially the stripe size setting should be easily selectable when you’re first setting up your disks. It’s just dumb that it’s not.”

Actually, stripe sizes are quasi-selectable in the GUI when you create your virtual disk. The “file system, database, multimedia” selections will adjust the stripe size (and cache policy) for you.
Don MacAskill

November 6, 2007 at 2:19 pm

@j: Yeah, but not the settings we want. Which one lets us do a 512K stripe, anyway? 🙂

I just don’t understand why these important features wouldn’t be configurable in an ‘advanced options’ panel or something. Some number of your customers would welcome it, and I seriously doubt any would get upset / not buy / etc if you added it.
Paul Theodoropoulos

November 21, 2007 at 6:30 pm

i’m managing a new installation of an MD3000, i’ve read the CLI docs carefully – and read the commentary here, but i’m naturally ultra cautious at the moment. above, Don said “change stripe size: read the docs for how to do this on new virtualDisks, but to do online changing of existing ones – set virtualDisk["1"] segmentSize=512”. Just for my peace of mind – this is truly a non-destructive change, right? This can be performed on an array that’s in production with no loss of existing production data? Everything I’ve read suggests that’s the case, but…well, as above. caution…
Paul Theodoropoulos

November 24, 2007 at 11:01 am

answering my own question here – did a lot of additional reading and that gave me confidence to go for it. no problems. segment resize is non-destructive, took about half a day for it to complete.
Paul Theodoropoulos

November 29, 2007 at 11:24 am

boy, we’re finding this array to be a complete dog. when we reach a combined read/write/s of about 350 in production, the array tops out – 100% utilization – we start getting innodb lock waits, and everything goes to hell. i’ve tweaked the living daylights out of this thing, with only the barest shred of improvement. dell enterprise storage support says it’s working properly, so they can’t help. very frustrating. Don, are you really getting 6k iops as you suggest above (or am i misreading)?
Paul Theodoropoulos

November 29, 2007 at 12:21 pm

again, following up on myself, simply with additional details: dell 1950 as the head w/16G ram, mysql 5.0.45, all disks in the array in a raid 10 configuration for about 1TB, innodb database is about 50G, using reiserfs on the array, mysql tuned to the specs of what we have (e.g. 13G innodb buffer pool etc). we started at 512k segment size, reduced to 256k segment size with some improvement, contemplating going to 128k. by my thinking, with innodb having a 16k page size, and our load almost entirely in small, fully random requests, a segsize that’s eight times the size of a single request, on a 14 disk striped mirror, should be a good middle ground.
Don MacAskill

November 30, 2007 at 3:44 pm

@Paul:

I was referring to 6000 IOPS because Tao Shen was, not because we’re seeing that sort of performance. We’re not pushing the drives hard enough at this point to see anything like that.

But we’re certainly able to do more than 350. We could probably get greater than 350 with 2 drives, certainly with 3 or 4, let alone 15. We are doing thousands of IOPS – just not six thousand.

It sounds like you may have a different InnoDB workload than we do, or different configuration, or both, because this sounds suspiciously like one of InnoDB’s many problems with scaling and concurrency, not the disk.

Try benchmarking the array using traditional disk benchmarking tools, with an eye to IOPS rather than MB/s, and see how it does.

Good luck!
Paul Theodoropoulos

December 1, 2007 at 11:16 am

thanks for the reply, Don. i ran iozone last night, and we’re seeing decent performance in that – which suggests that maybe something other than the array is the choke-point. fun with databases.
Fred

December 4, 2007 at 12:23 pm

All, just heard from Dell that my MD3000SAS will only see a total drive array of 2 TB but they sold me 3 TB worth of drives. The fix, a firmware patch sometime next year. The SAS controller will only see 2 TB in a storage unit.
Ed Beheler

December 4, 2007 at 2:00 pm

Have you looked at the Promise VTrak M610p? It’s a 16-drive SATA drive SCSI enclosure. I am using a M610i, which is the iSCSI equivalent, and they both look the same and are supposed to use the same software. The following is from the iSCSI version:

Cheap – $4500 for a 16 disk enclosure. Uses SATA drives, which cost under $400 a pop for a 1TB. So, you’re looking at $10k for a fully populated enclosure.
DAS – yeah, dual ultra 320’s. Or use iSCSI, or fibre channel.
16 spindles. Available SATA drives that i’m finding top out at 7200 RPM though.
Battery backed cache – comes with 256, upgradable to a gig.
Disable read caching? yep.
Disable prefetching? dunno. I don’t see that option.
Configurable stripe sizes — looks like 64KB – 1MB
sam

December 6, 2007 at 6:08 pm

Has anyone used this storage with an Oracle RAC Cluster? We have a problem when both nodes are attached to the storage: the performance in some parts goes from 1 second to 60 seconds. After a huge investigation we are left with the conclusion that the storage is the issue. We currently only have one HBA per node connected to the MD3000. I have just seen a Dell document stating that the minimum for RAC should be 2. I’m not sure at this stage if that is just for redundancy or actually needed due to internal workings. Storage is not my area. Any comments or tips greatly appreciated. You can contact me offline on sam112233@gmail.com if off topic.
Mark

December 10, 2007 at 7:52 am

Hey, we also suffer from the mysql auto-inc issue. A good temporary fix until 5.1: if you have an hourly, non-dependent cron job doing a batch insert, increase its frequency. Bigger inserts seem to multiply the issue, as they cause deadlocks from the timeouts on locks etc. Dropping our hourly job to every five minutes has helped us a great deal!
manpageman

February 13, 2008 at 5:23 am

Re: tracing your mysterious problem.

It’s a shame you’re on Linux in this instance. On a Sun box, DTrace is your friend. It’ll get ported sometime soon, I’m sure.
Guanghui

February 14, 2008 at 9:28 pm

Does the MD3000 work with a Dell 2950 running Solaris 10u4 (ZFS)?
I want to purchase an MD3000 plus a 2950 and install Solaris 10 update 4 on it as a file server.
But I am not sure it will work well with Solaris 10u4; the drivers may be a problem.
The MD3000 would run in JBOD mode, so Solaris would see 15 hard disks, which I would make into a raidz2 with one hot spare. The system should then have good performance and better data protection.
Kam

February 24, 2008 at 9:39 pm

We tested Solaris pre-u4 on 2950s vs 2970s at my previous company. My advice is to stick with the Opteron processors [2970], as the Xeons don’t quite hold up under Solaris due to the integrated memory controller on AMD. FYI, Solaris installed fine on both boxes even though it’s not “technically” a supported OS according to Dell’s site; however, if you read enough, there is a partnership between Dell and Sun, and they advertise that Solaris is a supported OS on the PowerEdge series.
Toby

March 5, 2008 at 8:46 am

Have a look at the Google malloc code. I believe that someone did some benchmarking (wrt MySQL) that showed things markedly improving on the scalability front by using Google malloc.
Don MacAskill

March 5, 2008 at 12:20 pm

Yeah, we’ve run MySQL using Google’s malloc, but for us it was unstable. Much as I want performance, I don’t want it at the risk of stability. 🙂
Joe Hartley

April 9, 2008 at 10:19 am

Hi folks. I’m currently working with a company that provides custom reporting on aggregated sales data that we store for a period of time, typically 2 years. We’ve been using EMC storage for years, but our new corporate owners have balked at the expense of expansion and maintenance of the DMX4 platform, not without some justification.

We are being pushed very hard to consider an MD3000 based approach to replacing the DMX4, and in the course of research, I found this thread. Some of it is not necessarily relevant to us, as we’re using Oracle and not MySQL, but some of the issues here seem to be database-independent.

We are using RHEL4 on Sun V40z servers, and have roughly 7TB of data. I’m very concerned about Robert Towne’s report that Dell support says you can only use one MD3000 per host. If this is true, then the maximum useable capacity of the MD3000 based solution would be ~11.5TB in a RAID5 configuration. If we use virtual disk snapshots and copies, then the usable space would seem to be in the neighborhood of 5TB, once space for the vdisk snapshot and copy were taken into consideration.

Robert, I’d be very interested in hearing whether you resolved this issue, and any other experiences people may have had with the MD3000, especially in terms of redundancy and expandability under Linux.

Joe Hartley
jh at brainiac dot com
Jon Strabala

April 15, 2008 at 12:35 pm

I read both this article and your other one about SunFire X2200 M2 servers. Do you use the Dell MD3000 with the SunFire X2200 M2 servers? If so, what HBA card did you use, and what is your impression of the combination? Can anyone conclusively say that an MD3000 will work well with Solaris?
Dave

April 30, 2008 at 10:44 am

The MD3000 can be expanded using MD1000s for a total of 45 disks in the storage array.

Depending on the drive capacity, this can provide up to 45TB of raw capacity (using 1TB SATA disks).

Dave
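
Raw capacity is only the starting point; usable space depends on the RAID level and hot spares. A rough sketch of that arithmetic follows. The function names are hypothetical, and it assumes a single RAID group, decimal terabytes, and no formatting overhead or per-group disk limits, none of which the comment above specifies:

```python
def data_disks(disks: int, level: str, hot_spares: int = 0) -> int:
    """Disks' worth of capacity left after hot spares and redundancy."""
    active = disks - hot_spares
    if active <= 0:
        raise ValueError("no active disks left after hot spares")
    if level == "raid0":
        return active
    if level == "raid10":
        if active % 2:
            raise ValueError("RAID 10 needs an even number of active disks")
        return active // 2          # every disk is mirrored
    if level == "raid5":
        if active < 3:
            raise ValueError("RAID 5 needs at least 3 active disks")
        return active - 1           # one disk's worth of parity
    if level == "raid6":
        if active < 4:
            raise ValueError("RAID 6 needs at least 4 active disks")
        return active - 2           # two disks' worth of parity
    raise ValueError(f"unknown RAID level: {level}")


def usable_tb(disks: int, disk_tb: float, level: str,
              hot_spares: int = 0) -> float:
    """Usable capacity in TB for one RAID group of identical disks."""
    return data_disks(disks, level, hot_spares) * disk_tb


print(usable_tb(45, 1.0, "raid0"))                  # 45.0 (raw)
print(usable_tb(15, 1.0, "raid10", hot_spares=1))   # 7.0
```

The second call mirrors the layout from the post: a 14-drive RAID 1+0 plus one hot spare in a single 15-slot enclosure.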
Justin Huff

June 9, 2008 at 9:54 am

Don,

Did you get around to the JBOD ZFS/SWRaid benchmarks? We’re considering some changes in our DB arch, so this is of interest:)

–Justin
BonZ

June 27, 2008 at 9:13 pm

Did Dell ever fix the firmware for the MD3000 to support arrays larger than 2 terabytes?

What do people think about the MD3000 now that it has been out for a while?

Is it still considered to be a dog?
Marvin

July 4, 2008 at 8:40 am

We just bought an MD3000 with a pair of Dell PowerEdge 1950 servers to run in a cluster based on Microsoft Cluster Services.

We seem to have a problem with the MD3000. Clustering works perfectly when using the software capabilities, but when we pull the plugs on one of the servers, the MD3000 doesn’t recognize the other server, although the warning light stays blue. We double-checked the cluster settings and had them checked by two external companies; the software cluster setup seems OK. But I question the firmware of the MD3000. It does not seem willing to switch over to the other node when the main server’s plugs are pulled.

Have you got any idea if the firmware of the MD3000 has anything to do with it?

Thanks in advance,

Marvin Wigmore
Laurent

July 4, 2008 at 10:13 am

Linux users, what kind of software setup do you have?

Are you using Dell-supported distributions?
Are you using 2.6.18 + Dell drivers/tools or 2.6.24 + dm_rdac + multipath?
What host type have you chosen on the MD3000?
Acclaim PC Services

October 11, 2008 at 11:02 am

This unit is terrible. They shipped me one with the incorrect drives in it and now won’t replace them with the correct ones due to their ordering system messing up.

Dell customer service is by far the WORST of any company.

Oh, and don’t even think about getting cheap drives for the unit. Dell sells the hard drives at over 500% retail markup, and they have microcoded the MD3000 to work ONLY, and I mean ONLY, with Dell-purchased drives (this means that even if you buy the same model from the HCL direct from the HDD manufacturer, it won’t work).

I don’t know why everyone on review sites gives this a thumbs up. The only thing this thing is good for is a nice large paperweight with flashing lights.
tech

October 14, 2008 at 3:42 pm

Acclaim PC services:

You are correct about the Dell markup on drives/caddies; however, I’ve found you CAN use SOME ‘non-supported’ SATA and SAS disks. An off-the-shelf WD5000YS works with the proper SATA interposer board.

Gripes: the ‘CLI’ should be available via SSH. Also, important commands from the CLI are not included in the GUI… why?
Craig Smith

November 17, 2008 at 4:55 pm

All,

I am wondering if anyone has successfully got this to play with (OPEN)Solaris?

I am struggling to get more than one LUN to be usable in OpenSolaris. I am able to see the target, and after I run “devfsadm -i iscsi” and then “format”, I can see all the disks (they even show as coming from an MD3000i). When I try to select any drive after the initial LUN, I get a “disk can not be opened” error and am unable to label or format the iSCSI drive.

Am I banging my head on the wall for no reason with this?
Scott Ehrlich

November 18, 2008 at 3:49 am

I purchased an MD3000i and the snapshot premium feature. Using the MD Storage Manager, I partitioned space in a RAID set and assigned it for snapshots. Now, in all of the Dell documentation I have, nothing tells me how to actually configure the allocated snapshot space to take snapshots of any of the RAID sets I created (there happen to be two RAID 5 sets), schedule snapshots, or anything else.

How do I fully manage snapshots beyond allocating space for them? I happen to be using CentOS 5.2 (same as RHEL 5.2).

Thanks.
Dave

December 22, 2008 at 3:40 am

Significant Upgrade Now Available for the MD3000 and MD3000i – Generation 2

Please see:

http://en.community.dell.com/forums/p/19247968/19…
Krypto77

December 30, 2008 at 8:33 am

We recently got a new MD3000 with three 1TB HDs. Dell said we can purchase non-Dell HDs. Has anyone done this? If so, did it work? How did you get it to work? Any help would be great.
Weiyi Yang

July 7, 2009 at 8:40 pm

I have a Dell PowerEdge 1850 with a Perc 5/e + MD 1000 and a SAS 5/e + MD 3000. It runs Solaris 10 (8/07) with the latest patch cluster (Solaris 10 x86 Recommended Patch Cluster DATE: Jun/26/09). I was using hardware raid-5 on the MD1000 (14x500G SATA + 1 hotspare). With 70% read, 70% random, 8k block, 16 threads, I was able to get 3384KB/s, 423 IOP/s. With hardware raid6 on the MD3000 (14x1T SATA + 1 hotspare), I could get 1243KB/s, 155IOP/s. With raidz2 (14x1T drives exposed as raid0 virtual disk + 1 hotspare), I could get 1590KB/s, 199IOP/s.

Awfully slow it appears. Could it be a kernel problem?
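
As a sanity check on the figures above, throughput for a fixed block size should equal IOPS times block size. The sketch below uses only the reported numbers and the 8k block size; the labels and helper names are illustrative:

```python
BLOCK_KB = 8  # block size used in the benchmark above

# (configuration, reported IOPS, reported KB/s), copied from the comment
REPORTED = [
    ("MD1000 hardware RAID 5", 423, 3384),
    ("MD3000 hardware RAID 6", 155, 1243),
    ("MD3000 raidz2", 199, 1590),
]


def throughput_kb_s(iops: int, block_kb: int = BLOCK_KB) -> int:
    """Throughput implied by an IOPS figure at a fixed block size."""
    return iops * block_kb


def relative_iops(iops: int, baseline: int) -> float:
    """IOPS as a fraction of a baseline configuration."""
    return iops / baseline


baseline = REPORTED[0][1]
for label, iops, reported_kb in REPORTED:
    print(f"{label}: {throughput_kb_s(iops)} KB/s implied, "
          f"{reported_kb} KB/s reported, "
          f"{relative_iops(iops, baseline):.0%} of MD1000 IOPS")
```

The implied and reported throughputs agree to within a few KB/s, so the three results are internally consistent. On these numbers the MD3000 configurations deliver roughly 37% and 47% of the MD1000's IOPS.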
- edison
  
  August 1, 2009 at 1:11 am
  
  i've been trying to track down the problem with these md3000 units. i can't seem to get any amount of decent iop/s using the same hardware you listed…
  
  hopefully this blog gets read again eventually and i can turn to somebody for some assistance, since dell is completely and utterly useless.