Audio Asylum Thread Printer (Get a view of an entire thread on one page)
In Reply to: RAID Array ? posted by AbeCollins on May 11, 2007 at 07:40:24:
This server has a five-disk RAID 5 array of 73GB drives. What concerns me is why one drive failed. The drive rebuilt fine and the array is OK now, but will that gremlin pop up again? Unfortunately, I don't have a real spare, so I should probably get one and take some time during the migration to run some disk checks.

Fault tolerance is a misnomer of sorts; without completely redundant systems, it isn't possible. RAID 6 does sound interesting.
-Rod
Follow Ups:
RAID 6 is a must. Don't be too concerned that you had a drive failure; quite frankly, you expected it because you designed your system to recover from it. What should worry you is, as Abe alluded to, hitting a read failure while trying to recover from a single drive failure; then you'd be SOL. Remember:
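As a rough illustration of that rebuild risk (not from the thread): a back-of-the-envelope sketch assuming the common drive-datasheet unrecoverable-read-error (URE) rate of one error per 1e14 bits read. The rate and the Poisson approximation are assumptions; real drives vary.

```python
import math

# Assumed datasheet figure: one unrecoverable read error (URE) per
# 1e14 bits read. Illustrative only; real drives vary.
URE_PER_BIT = 1e-14

def rebuild_read_failure_prob(surviving_drives, drive_gb):
    """Chance of hitting at least one URE while reading every
    surviving drive in full to rebuild a degraded RAID 5 array."""
    bits_read = surviving_drives * drive_gb * 1e9 * 8
    # Poisson approximation: P(at least one error in bits_read reads)
    return 1 - math.exp(-bits_read * URE_PER_BIT)

# Five-drive array of 73GB disks: four survivors must be read in full.
print(f"{rebuild_read_failure_prob(4, 73):.1%}")  # about 2.3%
```

Small for 73GB drives, but the probability grows with every drive and every gigabyte you add, which is the argument for RAID 6's second parity.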
Yeah, I know. I just hate that it takes so long to recover, but at least it works like it's supposed to work. The next thing we should do when we replace the db server, probably next year, is to slave the db. That's a real tough one. Sure, we've got a backup and backups of the backups, so in theory we could recover from nearly any disaster and not lose more than a day, but I surely wouldn't want to try it!
-Rod
An even more fault-tolerant flavor is RAID 10 (really 1+0). Here, multiple RAID 1 mirrored pairs are combined. In this case, however, it would require eight drives: 4 data + 4 mirrors. You can lose half the array (so long as you don't lose both a primary and its mirror) and it still runs.

Another advantage to RAID 10 is that the drives don't have to be striped. You can run the OS on one drive pair and the data on the others. That way, even if there is a total failure, you don't necessarily lose both the OS and the data.
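To make the "lose half the array" claim concrete, here is a small sketch (drive names are made up) enumerating which two-drive failures an eight-drive, four-pair RAID 10 survives:

```python
from itertools import combinations

# Hypothetical 4 data + 4 mirror layout: four independent mirrored pairs.
pairs = [("p0", "m0"), ("p1", "m1"), ("p2", "m2"), ("p3", "m3")]
drives = [d for pair in pairs for d in pair]

def survives(failed):
    # The array runs as long as no pair has lost both of its drives.
    return all(not (a in failed and b in failed) for a, b in pairs)

two_drive_failures = list(combinations(drives, 2))
ok = sum(survives(set(f)) for f in two_drive_failures)
print(f"survives {ok} of {len(two_drive_failures)} two-drive failures")
# survives 24 of 28 two-drive failures

# It can even survive four failures, as long as they hit different pairs:
assert survives({"p0", "p1", "p2", "p3"})
```

The four fatal combinations are exactly the four pairs themselves; any RAID 5 array, by contrast, is lost on any second failure.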
"Another advantage to RAID 10 is that the drives don't have to be striped."

I think you lost me. Wouldn't that make it RAID 1? RAID 0 means a stripe is involved, right?
"You can run the OS on one drive pair and the data on the others. In that way, even if there is a total failure, you don't necessarily lose both the OS and the data."
True.
So basically, you're running RAID 1 for the two boot disks and RAID 1 for the two data disks.
You are thinking of RAID 0+1, which, like RAID 5, appears to the OS as a single large volume. The objectives with RAID 10 are twofold: independent, simultaneous access to multiple drives, and fault tolerance. It is a stripe of multiple independent mirrors.

The basic text diagram gets mangled in a post. See the diagrams in this link.
rw
Our new server uses basically that kind of scheme, with the OS being mirrored and the data on a separate RAID 5. Now I'm wondering if I should bite the bullet and reconfigure the data drives. Unfortunately, I don't have the luxury of 8 drives. I've got 4 300GB drives and need to decide what the optimal use would be. Obviously, security is number one; we're only using 100GB or so now and likely won't need more than another 150GB of growth over the life of the server.
-Rod
My eight-drive comment relates to the equivalent capacity of your current five-drive array on the older server. Regardless of the number of drives in a RAID 5 array, one drive's capacity is "lost" to parity. With 5 drives, you get the capacity of four. To maintain that capacity in a RAID 10 array, that number would be doubled.

Having read your latest response to Abe, however, the new array has larger drives, and you don't need all that capacity. You could afford to mirror everything, which is what RAID 10 is all about. Using four 300GB drives would provide 600GB in a RAID 10 configuration and 900GB in RAID 5.
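The capacity arithmetic above, written out as a trivial sketch:

```python
def raid5_capacity(n_drives, drive_gb):
    # RAID 5 gives up one drive's worth of space to parity.
    return (n_drives - 1) * drive_gb

def raid10_capacity(n_drives, drive_gb):
    # RAID 10 mirrors everything, so usable space is half the total.
    return (n_drives // 2) * drive_gb

print(raid5_capacity(5, 73))    # 292 GB: the old five-drive array
print(raid5_capacity(4, 300))   # 900 GB
print(raid10_capacity(4, 300))  # 600 GB
```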
Sounds like you could splurge and get a more robust array using 10.
Let me also take a moment to thank you for making this incredible nut house possible!
Thanks, and thanks for the advice. Unfortunately, Dell doesn't support a real RAID 10. Basically, you get 1 or 5. Though 1 shows a stripe, so I'm not sure how it really works. LSI controller with a Dell front end.

I went ahead and tested it as RAID 5 with a hot spare and pulled a drive, hot no less. It now shows the hot spare as rebuilding and the reinserted drive as ready. Hmmm. Once it rebuilds onto the hot spare, I wonder if the ready drive will become the new hot spare. Of course, I won't find out until tomorrow morning... and the logical drive is empty!
-Rod
At least two of my customers have Dell boxes using RAID 10. Strictly speaking, what I'm referring to is RAID 1+0. Admittedly, one system was originally configured as 0+1 and I had to explain the difference to the guy, but it is simply running multiple separate drives that are mirrored.
73GB disks aren't that big by today's standards. It's a little scary that the one disk failed and then rebuilt OK. It could be marginal. And when one disk starts to fail, you have to wonder about the others if they're the same age.

If it took a while to rebuild that 73GB disk, you can imagine how long the RAID controller might take to rebuild a 500GB or 1TB disk. 1TB disks became available this year (although they're not widely used in "systems" just yet).
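Rebuild time scales roughly linearly with disk size. A sketch assuming a hypothetical sustained rebuild rate of 50 MB/s (an assumed figure; real controllers often throttle rebuilds to keep serving I/O, so actual times can be much longer):

```python
def rebuild_hours(disk_gb, mb_per_sec=50):
    # Time to read/write one disk end-to-end at a sustained rate.
    # mb_per_sec=50 is an assumption for illustration, not a spec.
    return disk_gb * 1000 / mb_per_sec / 3600

for size_gb in (73, 500, 1000):
    print(f"{size_gb} GB: ~{rebuild_hours(size_gb):.1f} hours")
```

At that assumed rate, a 1TB disk takes well over a dozen times longer than a 73GB one, which is a long window to be exposed to a second failure.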
Redundancy to minimize single points of failure can be very expensive. You need at least two of everything! Two servers, each with two power supplies, with separate NIC cards, with separate SCSI / FC controller cards, etc. Plus redundant paths to external RAID arrays and redundant switches and network paths. Even the redundant power supplies in each server are fed from separate AC circuits if possible. Having the equipment in different rooms is even better. Or better yet, in different parts of the country / world. ;-)
Yeah, I know. I'm wondering if it's going to fail again soon, and then should I replace all the drives? The server is about 3 years old, so I'd expect it to be fine for another year or two before we need to consider replacing it to avoid failures.

Now I'm wondering if I should reconfigure the new server. It's a Dell 2850 with a split backplane and 6 drives. Two 73GB drives are configured for the system in a mirrored configuration. The data is on a four-disk RAID 5 of 300GB drives. I could change that to use 3 drives and set one up as a hot spare. We'd lose 200GB or so of storage, but that's not really an issue as I've got 700GB free now. The downside is that it'd take a few days to rebuild, but I've got time now before we go live on it.
I recommend upgrading the firmware on the hard drives. I have a couple of 2850's in my shop and have experienced the same problem.

Assuming you have the Maxtor 73GB 10K drives (there is a different update for the 15K drives):
"Under certain circumstances, hard disk drives (HDD) may be reported offline by the controller due to a timeout condition.

Higher than expected failure rates have been reported on the Maxtor Atlas 10K V ULD (unleaded, or lead-free) SCSI hard disk drives.

If the HDD is unable to complete commands, the controller may report it offline due to the timeout condition. The primary failure modes have been the HDD failing to successfully rebuild, and also failing after a rebuild has completed.
For all ULD part numbers/model numbers listed below, Dell considers this as an "Urgent Update" and recommends proactively updating the firmware to avoid interruptions.
YC951 – Model ATLAS10K5_73WLS - 73GB 10K 68 pin
CD807 – Model ATLAS10K5_146WLS - 146GB 10K 68 pin
FD456 – Model ATLAS10K5_300WLS - 300GB 10K 68 pin
GD084 – Model ATLAS10K5_73SCA - 73GB 10K 80 pin
YC952 – Model ATLAS10K5_146SCA - 146GB 10K 80 pin
CD808 – Model ATLAS10K5_300SCA - 300GB 10K 80 pin

The following leaded part numbers/model numbers can also benefit from having the JT00 revision of the firmware installed, but this is considered a "Recommended Update" rather than "Urgent" as listed for the unleaded models.
T4350 – Model ATLAS10K5_73WLS - 73GB 10K 68 pin
U4006 – Model ATLAS10K5_146WLS - 146GB 10K 68 pin
R4784 – Model ATLAS10K5_300WLS - 300GB 10K 68 pin
CC315 – Model ATLAS10K5_73SCA - 73GB 10K 80 pin
FC271 – Model ATLAS10K5_146SCA - 146GB 10K 80 pin
CC317 – Model ATLAS10K5_300SCA - 300GB 10K 80 pin"

As always, make sure you have a current backup first.
I use the Dell 2850's and 2950's (plus a pile of older PowerEdge 4600s and a few other models) a lot in the systems that I administer at the college, Rod. They're very good machines.

You've got what Dell terms a "RAID 1 + 5" configuration (OS on RAID 1, data on RAID 5), which I've been using very successfully on several dozen servers for some five years now. If you stay with that, the only change I would make would be to add a hot spare (an extra HD, online and configured to be available for automatic failover) to your array, so that one additional HD is always there. If you're talking real fault tolerance, I assume that you've got Dell's redundant power supplies, ECC RAM, etc., etc. You know the score...
300GB SAS HD's are the minimum that I would use for your application.
RAID 6 is interesting, but still young... and it does have some costs as well as some benefits. (See link below for more.) I'd wait until Dell integrates it; I don't remember seeing it as an option in any of my recent server builds on the Dell Premier site.
Best wishes to you with the upgrade project. Believe me, I know what's involved in a decision like this....
david
Yeah. I looked at RAID 6, and while it's interesting, the controller doesn't know anything about it, just 1 or 5.

We've also been quite happy with the Dell servers. Our first was a 1300 when the Asylum first started; then we moved to a 2450, added a 2550, then some 2650s, and finally the 2850. I decided to bite the bullet and reconfigure the new one. That was fun! Remind me how much I love UNIX.
So now I've got three of the 300GB drives in RAID 5 with the fourth as a hot spare. I haven't messed with RAID controllers much; this one is an LSI controller, a PERC 4 something. Anyway, I selected the hot spare drive when I defined the logical drive, so will the controller know to use it automatically? I'm running FreeBSD, so the OS doesn't know anything about the RAID configuration.
-Rod
You could also keep a cold spare on hand. If you disconnect the system from the 'net while rebuilding the array, then having a hot spare will make little difference other than relieving you of having to plug in the new drive.

But if your system starts dropping drives that are actually good, it may be a faulty RAID controller. I'm guessing you don't have a spare controller on hand. If that's the case and you lose the controller, all you can do is wait for FedEx to show up with a replacement.