Audio Asylum Thread Printer: Get a view of an entire thread on one page
Many years ago I built a music server running linux with a bunch of drives in a RAID 5 array (using linux software RAID). The main reason for this was that LARGE disks were still very expensive and my local computer store had a screaming deal on 256G drives. I bought a whole bunch including spares in case one of the drives died.
A couple years later the BIG drives became cheap and I started backing up the RAID array.
Well last week I was in the process of setting up a server on a FitPC2 with one little laptop drive for the whole thing and was going to copy over the library from the old server. I noticed one of the RAID drives was not included in the array!! It said the array was still clean but no longer had any redundancy.
So I took out the bad drive, put in a new one, set it to be a spare and the system automatically rebuilt the array, hurray RAID did what it was supposed to do! Then I started copying the files over and started getting file errors in the copying! I rebooted the server, it said everything was fine. I did a manual check and it said it was fine. I tried the copy and again file errors!
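(For anyone curious, with Linux software RAID the whole swap boils down to a few mdadm commands. A rough sketch only - the array and partition names below are examples, not my actual ones; check your own /proc/mdstat for the real names:)
mdadm --manage /dev/md0 --fail /dev/sdb1      # mark the dying member as failed
mdadm --manage /dev/md0 --remove /dev/sdb1    # take it out of the array
# ...physically swap the disk and partition it like the others, then:
mdadm --manage /dev/md0 --add /dev/sdc1       # the new disk joins as a spare and the rebuild starts
cat /proc/mdstat                              # watch the rebuild progress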
At this point I was VERY glad I had a backup. I unhooked all the RAID drives, put the backup drive and the new drive in the server, and did a copy between them; it worked perfectly. Put the new drive in the FitPC and everything worked like a charm. I've played a bunch of the music and it works great.
So moral of the story: DON'T TRUST RAID. This is the first time I've had a RAID array say it was fine when it really wasn't. I'm very glad I had a backup. Of course I really should have another backup offsite. That's coming very soon.
John S.
Follow Ups:
Audio is not a mission critical operation.
--eNjoY YouRseLf!.....
"Audio is not a mission critical operation."
Depends on your mission. :-)
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
I have an old server with a U160 RAID-10 array. It has an upper mid-range controller with 128MB of RAM, i960 processor, etc. Every year or two since it was brand new it drops some drive offline. At first I believed it, but never once has there been anything wrong with any of the drives or cables.
Thumbs up on the availability, live rebuilding capability, and the durability of the drives themselves. The whole disk setup has been running 365 days a year for nearly 10 years without any real downtime. But RAID arrays always seem a bit high-spirited and need tending.
Disk is cheap, forget Incrementals......Too many Things to go wrong.
Use simple Copy and Paste Software.....
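For what it's worth, plain rsync does the same straight-copy job and can be re-run whenever you like. A minimal sketch - the paths here are just placeholders:
# mirror the library onto the backup drive (--delete removes files that no longer exist in the source)
rsync -av --delete /srv/music/ /mnt/backup/music/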
Cut-Throat
The problem is in figuring out who should be shot, the RAID software authors or the disk drive firmware authors.
There are data recovery firms that specialize in extracting data from failed RAID arrays. They charge more for this service than recovering data from failed individual disk drives. RAID is not really that great when it comes to data integrity, it is more appropriate where availability is the prime requirement.
Did you run periodic surface scans of each drive in the RAID array? If you do this you are more likely to catch a failing drive before double errors happen and data is lost. Or at least this is what I have been led to expect. One of the problems is that drive firmware hides as many errors as possible, which delays visibility of a failing drive until it is too late. Another problem is that drive firmware and its error recovery procedures confuse the RAID software, which attempts to configure around a problem, thereby creating a double fault through a race condition.
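On Linux those periodic checks can be as simple as the following; the drive and array names are examples, and smartctl comes from the smartmontools package:
smartctl -t long /dev/sda                     # start a SMART extended (surface) self-test on one drive
smartctl -a /dev/sda                          # later: check the test result and the reallocated/pending sector counts
echo check > /sys/block/md0/md/sync_action    # have md read and compare every block in the array
cat /sys/block/md0/md/mismatch_cnt            # non-zero means the members disagree somewhere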
Glad you had backups and thanks for reminding us that RAID isn't as safe as one might hope.
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
As soon as striping is involved, it is almost impossible even for a disaster recovery company to reconstruct the files.
I think RAID 1 (mirroring) is a better option, as disaster recovery might still be doable in that case.
The Well Tempered Computer
Even with RAID1 mirroring and data corruption, all you'll end up doing is simply mirroring the corrupt data! And as someone suggested, it's not necessarily the RAID that is at fault. It could be file system issues.
At least he had a good backup so his 'disaster recovery' plan worked.
BWWWWAAAAAAAA!
Cut-Throat
The programmers are still responsible. They should have stopped the CEOs from doing this, or quit and publicized the problems. (Assuming they even knew. Most likely they were clueless about how to write software that functions correctly in the presence of the inevitable hardware glitches and failures that were one of the reasons for RAID technology in the first place.)
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
They should have stopped the CEOs from doing this, or quit and publicized the problems.
~~~~~~~~~~~~~~~~~~~~~~
Not practical or realistic.
If I work for company X and the higher-ups are rushing a product built on my programming to market, expecting me to quit and lose my income, home, family, and potentially everything I own, including any hope of a career in programming, is crazy. Who would want to hire someone after they turned their former employer in, and what kind of reference could they expect?
Fact is, this type of thing happens all the time in every product segment imaginable.
Every company has a set of standards they use to ensure the robustness of a product. Are they perfect? Of course not. It all comes down to ROI: how much of a return will I get from investing more time and money into product Y? The bean-counters add up the money, give it the thumbs up, and out it goes.
Dynobots Audio
Music is the Bridge between Heaven and Earth - 音楽は天国と地球のかけ橋
Yes, the ones that tried to stop him are now in the unemployment line.
You don't seem to have a clue how Corporate America works.
Cut-Throat
I know full well how it works. That's why I left Corporate America in 1994, never to go back. "GQ Bob" and his minions gave me that lesson.
There are lots of little people who don't have the courage to stand up for their convictions. That includes most Mastering Engineers, who participate in the loudness wars. They are true professionals, i.e. whores.
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
Thanks to a semi-deaf Alex Lifeson controlling the audio and the group's choice of suckass mastering engineers, I don't think we'll ever have another release with the dynamic range of the Mobile Fidelity version of Moving Pictures from RUSH.
Here's the work of a master
.
To infinity and beyond!!!
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
Well, IMO the moral is: configure your system to notify you of RAID degradation immediately (email, beeping, blinking LED, ...). Running RAID 5 in a degraded state for a long time will lead to data loss. Unless the filesystem itself implements checksums (ZFS, btrfs), the system has no way to learn that data corruption has occurred, since no redundant data are available to compare against. That is why a manual check after recovery did not show any errors - the array could have been correctly rebuilt using incorrect data.
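With Linux software RAID the notification piece is already built in; a minimal sketch, assuming local mail delivery works (the address is just a placeholder):
# /etc/mdadm/mdadm.conf
MAILADDR admin@example.com                      # who receives DegradedArray / Fail events
# or run the monitor by hand:
mdadm --monitor --scan --daemonise --mail=admin@example.com
# send a test alert for every array to confirm the mail actually arrives:
mdadm --monitor --scan --oneshot --test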
BTW did you try to repair your filesystem? As a matter of fact, the more likely cause of your problem was not data corrupted by the RAID but a corrupted filesystem (one layer above the RAID). In our company we have been using Linux RAID for over 10 years (both workstations and non-critical servers) and never had any RAID crash. We have had a filesystem crash several times, though - especially XFS does not like hard reboots without caches being synced. I admit the outcome is the same - lost files, to be recovered from backups :)
Yes, I did run fsck on the filesystem (ext3) after rebuilding the array; it said everything was fine, but when copying files I still got file errors. At that point, since I had backups and was going to a new server, I didn't spend any more time trying to debug it.
John S.
There's also something called silent data corruption and bit rot which no traditional RAID or filesystem would detect, and fsck would be unable to fix.
ZFS: "For ZFS, data integrity is achieved by using a (Fletcher-based) checksum or a (SHA-256) hash throughout the file system tree.[28] Each block of data is checksummed and the checksum value is then saved in the pointer to that block—rather than at the actual block itself. Next, the block pointer is checksummed, with the value being saved at its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree.[28] When a block is accessed, regardless of whether it is data or meta-data, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data is passed up the programming stack to the process that asked for it. If the values do not match, then ZFS can heal the data if the storage pool has redundancy via ZFS mirroring or RAID.[29] If the storage pool consists of a single disk it is possible to provide such redundancy by specifying "copies=2" (or "copies=3") which means that data will be stored twice (thrice) on the disk, effectively halving (1/3) the storage capacity of the disk.[30] If redundancy exists, then ZFS fetches the second copy of the data (or recreates it via a RAID recovery mechanism), and recalculates the checksum—hopefully reproducing the original value this time. If the data passes the integrity check, the system can then update the first copy with known-good data so that redundancy can be restored."
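In day-to-day use that machinery is driven by a handful of commands; a rough sketch, with the pool and disk names invented for the example:
zpool create tank mirror /dev/sdb /dev/sdc    # redundancy so ZFS can heal a block that fails its checksum
zfs set copies=2 tank                         # alternative on a single-disk pool: keep two copies of newly written data
zpool scrub tank                              # walk the whole tree and verify every checksum against the data
zpool status -v tank                          # lists repaired blocks and any files with unrecoverable errors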
Thanks for your posts. They confirm what I had suspected, which is that three years ago I made the wrong choice with my Thecus NAS server to use the XFS file system. It looks like I should have used ZFS.
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
You couldn't (easily) do ZFS on Linux 3 years ago but it was available on OpenSolaris. Times are a changing... This is from the ZFS Wiki, toward the bottom of the page where it lists various OS's and support for ZFS:
Native ZFS on Linux
A native port of ZFS for Linux is in development. This ZFS on Linux port was produced at the Lawrence Livermore National Laboratory (LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44) between the U.S. Department of Energy (DOE) and Lawrence Livermore National Security, LLC (LLNS) for the operation of LLNL. It has been approved for release under LLNL-CODE-403049. As of June 2012, the port is in release candidate status for version 0.6.0, which supports mounting filesystems.
I am not familiar with these O/S details. However, at the time I got my Thecus NAS the ZFS file system was one of choices in the menu when configuring the file system. I acquired this system in October 2009.
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
ZFS was probably running in user space through FUSE which takes a performance hit and has other limitations vs a native in-kernel posix layer port of ZFS.
IMO the NAS used some variant of BSD, perhaps FreeBSD with native ZFS. Discussions on the internet suggest so too.
The famous FreeNAS platform, based on FreeBSD, is attractive mostly due to its native ZFS support.
Ah, could be. I assumed Linux which may not be the case.
Unfortunately OpenSolaris' future turned rather precarious around then too.
...the viability of the company that brought us OpenSolaris. I dodged that bullet through the acquisition. ;-)
You might find the post linked below interesting. For full disclosure, I worked for the company that invented XFS. I presently work for the company that 'owns' ZFS and developed btrfs.
The analytics alone in the ZFS arrays that we have are pretty amazing and very useful for end-users. It's based on the instrumentation probes that are built into the OS that the ZFS arrays are built on, and Dtrace. The real time analytics (charts / graphs / triggers, etc.) are displayed to the user in a web interface.
Old video, about 3 years ago. For best results switch the video to HD 720p and expand to full-screen.
You worked on ZFS?
I'm impressed! I'm not even a computer nerd, but when I read about ZFS and how Apple *almost* used it in Snow Leopard I was disgusted. Basically my understanding was that even NTFS was about 20 years old when it was released, while ZFS incorporates many new technologies that make it far superior to any other file system.
I hate all file systems -- they all have limitations and are incompatible. If everybody standardized on ZFS, the world would be a better place. Let's hope that some day soon we won't have to use FAT 32 (with a file size limit that is too small to handle DVDs) to share across platforms, and that everyone switches to ZFS.
NTFS was part of Microsoft's Windows New Technology (WNT), which was developed under the leadership of Dave Cutler, who had been the chief technologist behind the VMS operating system for Digital Equipment Corporation's VAX family of 32-bit computers. Much of the technology inside Dave's head had been developed while he was at DEC. As for other features of ZFS, they can be traced back to the computer science research community and to the disk I/O community, which included many people at Digital Equipment working on I/O architecture and early intelligent disk controllers, which interacted in a smart fashion with operating system software to provide high-performance, high-availability computer systems and clusters of computer systems.
Reliable file systems date back at least to the 1960s. Digital Equipment had one for its PDP-10 family of computers in this time period. These systems were designed in such a way as to prevent the file system from becoming corrupt due to system crashes (which were much more common in those days than at present). Later, clever people figured out how to retain this robustness while increasing performance, since early methods of gaining robustness tended to slow the system down in return for reliability. Some of these features appear in ZFS. Prior to the WNT technology, Microsoft file systems were pathetic, and those of us familiar with DEC's products looked on those Microsoft systems with disdain.
The error correcting codes on a modern disk system are sufficient to ensure data integrity (no undetected data corruption) but this only applies if the hardware detects data corruption. Unfortunately, this depends on redundant design in logic coupled with very careful signal integrity design and construction. Most computers marketed to home users aren't done to this standard and are subject to single event upset, e.g. a bit change in a non-ECC RAM due to an alpha particle or data corruption caused by noise and cross-talk on signal wires.
Looking alphabetically, note the following: WNT : VMS :: HAL : IBM.
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
1) I first investigated file systems when I got a Mac Mini in order to write the setup instructions for our QB-9 on a Mac. Then I wanted to be able to share files between my Windows machine and the Mac. I did some research and was relatively underwhelmed by both HFS+ and NTFS.
FAT 32 of course truly sucks from a performance standpoint, but since it is something of an "open standard", at least one can share files across different platforms with it. But the first time I tried to save a DVD file on a FAT 32 drive, I was bitterly disappointed.
On the other hand, ZFS seemed to solve all of the problems of both HFS+ and NTFS in one fell swoop. The only problem was that nobody actually ever used it in a mainstream OS for PCs. It apparently made it into a developer's release of some Apple server OS and then it got ported to various standalone versions of OSX by enthusiastic computer nerds. I am not smart enough of a computer nerd to figure out how to install it, and then I would have an even bigger compatibility issue with Windows!
I was extremely disappointed that Apple chose not to use ZFS as their mainstream drive format. That alone probably would have switched me to become a "MacHead".
2) Your alphabet trick was very slick. However, no matter how neurotically attached to detail Stanley Kubrick was, I find it hard to believe that this is more than just a coincidence. Feel free to correct me, as I am something of a film buff and a Kubrick fan.
According to Wikipedia, Kubrick and Clarke denied that the HAL - IBM shift was more than a coincidence. We can call it artistic genius coming from the depths of their subconscious, if we like. Be that as it may, I do not believe that the WNT - VMS shift was a coincidence. I didn't hear it directly from Dave Cutler, but I did hear it from people he had worked with while he was at DEC.
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
Edits: 08/19/12
I don't know much about DEC or Dave Cutler, but it is hard to believe that people that are supposed to be so smart can be so stupid.
Who cares if WNT and VMS are shifted by just one letter? Did they sell one extra copy even if this were the case?
I think it is much funnier that WNT stands for Windows New Technology. By the time they released Win 2000, they said that it "incorporated NT Technology". Which meant that it incorporated New Technology Technology. That is just plain dumb. Even worse than going to the "CES Show".
Certainly covers most all the nerds that I've known. Here are two common examples of acronymic stuttering:
IP Protocol (Internet Protocol)
TCP Protocol (Transmission Control Protocol)
Tony Lauck
"Diversity is the law of nature; no two entities in this universe are uniform." - P.R. Sarkar
No, I did not work on ZFS. I just work for the company that invented it.
I am afraid ZFS is not a panacea. It is a complicated beast, not simple to configure correctly to achieve good performance.
ZFS vdevs and zpools are actually quite simple to configure compared to using traditional volume managers and file systems.
You would have to define what you mean by "good performance". It would obviously depend on the platform that you are using ZFS on.
In the case of enterprise storage arrays for the data center using industry standard client protocols (NFS, SMB, FTP, iSCSI, HTTP, WebDAV, FC, iSER, etc) performance can be stellar, even with less expensive lower rpm SATA drives. ZFS can intelligently leverage RAM and hybrid storage pools using flash memory combined with spinning disk to achieve very cost effective 'high performance'.
Snapshots and rebuild times are also significantly faster than traditional RAID arrays.
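The hybrid pool setup itself is only a couple of commands; a sketch assuming a RAIDZ2 pool of SATA spindles with one SSD for the intent log and one for read cache (all device names invented for the example):
zpool create tank raidz2 sda sdb sdc sdd sde sdf   # capacity on inexpensive SATA spindles
zpool add tank log /dev/sdg                        # separate intent log (SLOG) on flash for synchronous writes
zpool add tank cache /dev/sdh                      # L2ARC read cache in front of the spindles
zpool iostat -v tank 5                             # watch how the log and cache devices are actually used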
I had no input into the design of ZFS. I just happen to work for the company that created it. My role, among others, is to help customers architect solutions around ZFS based products (and other products) that we sell.
You can check out the SPC website for comparative storage array benchmarks.
What did you do at SGI?
Posted from my O2 RM5200
I was a pre-sales SE at SGI, supporting and training our channel reseller partners and distributor.
XFS (and IRIX) was actually an excellent high-performance combo back then. Its bandwidth and IOPS scaled significantly better than Solaris/UFS when striping disks... but that was more than a decade ago. ;-)
The SGI O2 was a pretty robust system in its day. I was around for the birth of the original SGI Indigo workstation.
Why do you have an O2, might I ask?
I am still a big fan of XFS. It is still under very active development, especially in recent Linux kernels.
Our backup server has hundreds of millions of files (mostly hardlinks), collected over almost 10 years of operation. That filesystem has grown with the hardware and over the years has resided on many Linux RAIDs of different configurations.
As it happens, this morning one of the drives in that server's array died, giving an input/output error even for a simple "cat /dev/sdd". A new one is already synchronizing (those missing slots are for two rotating sets of offline backup drives, synchronized once a week in external eSATA bays):
orfeus:~# cat /proc/mdstat
Personalities : [raid1] [raid0]
md6 : active raid1 md4[0]
      2178180728 blocks super 1.0 [2/1] [U_]
md4 : active raid0 sdd3[1] sdb1[0]
      2178180864 blocks 64k chunks
md7 : active raid1 md6[2] md5[1]
      2178180592 blocks super 1.0 [2/1] [_U]
      [========> ............] recovery = 40.0% (871529608/2178180592) finish=316.5min speed=68788K/sec
md5 : active raid1 md3[0]
      2178180728 blocks super 1.0 [2/1] [U_]
      bitmap: 9/9 pages [36KB], 131072KB chunk
md3 : active raid0 sda1[0] sdc3[1]
      2178180864 blocks 64k chunks
md2 : active raid1 sdd2[1] sdc2[0]
      8787456 blocks [2/2] [UU]
md1 : active raid1 sdd1[1] sdc1[0]
      10739328 blocks [5/2] [UU___]
Nostalgia. :)
Back when they were new (ca. '95) I wanted one. I finally picked one up a few years ago.
The UMA architecture always intrigued me, as did the ICE engine. The O2 + R5000 architecture is very elegant. The R10000 versions of the O2 are faster by brute force, but that processor family was always a better fit to the larger boxes. In the O2, I like the R5000.
Mine is a surprisingly usable machine considering its age and processor speed.
You "noticed" one of the drives was bad? How long was the array running in this degraded state?
Moral of the story: RAID based storage is not easily managed by even very experienced computer users. An alarming percentage of users will more likely than not lose the whole array in their attempts to restore it from a failure, rather than the opposite. I think another member posted such a story in this forum recently.
RAID's main purpose is high availability. If that's not the main reason for using it, then you're making a mistake. High availability for a home media library is something that I can't fathom, but I know people get funny ideas about how to do things.
The primary reason for doing the RAID array was to have a large filesystem made out of smaller disks, multiple smaller disks were way cheaper than one large one.
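For reference, building that kind of big-filesystem-from-small-disks array is essentially a one-liner with mdadm; a sketch of roughly what such a setup looks like (device names are examples, not the actual ones):
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --spare-devices=1 /dev/sd[bcdef]1    # four active members plus one hot spare
mkfs.ext3 /dev/md0                         # one big ext3 filesystem on top of the array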
The server runs 24/7. I had done a shutdown and fsck on the filesystem about a month ago and all was well. So the maximum time the array could have been degraded was one month.
John S.
> > RAID based storage is not easily managed by even very
> > experienced computer users. An alarming percentage of
> > users will more likely than not lose the whole array
> > in their attempts to restore it from a failure, rather
> > than the opposite. I think another member posted such a
> > story in this forum recently.
> >
> > RAID's main purpose is high availability. If that's not
> > the main reason for using it, then you're making a mistake.
Wise advice, but it is interesting how often one sees posts in this forum where people look at a RAID system as some kind of invincible magic backup that will protect them from all evil. That's just not the case. In reality, for the home user, RAID makes things more complicated and increases the chance of user error.
(One does see the occasional poster using RAID 0 expressly to increase storage capacity, but that makes external backups more critical, not less.)