Wikipedia:Reference desk/Archives/Computing/2023 August 31

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


August 31

SATA cards on PCI

I've a fairly old mini-tower PC that has 4 on-board SATA ports, two 6 Gb/s and two 3 Gb/s. In order to increase the resilience and capacity I purchased a USB-connected RAID system, but that led to problems with synchronisation at shutdown. I therefore bought a SATA card to sit in the PCI expansion slots. It didn't help that I got caught out by shingled drives (Barracuda Compute), but eventually I replaced them with Ironwolf drives. However, I still get drives going offline and/or generating excessive errors. When I tried to upgrade from RAID5 (5 + 1 hot spare) to RAID6 the upgrade failed at a critical point and I had to go back to backups. Sometimes it seemed as though one failing drive knocked out an adjacent drive. I've also upped the PSU from 350 W to 500 W.

Are there any known problems with using a SATA expansion card on a PCI bus for RAID? I've always used Seagate drives, and apart from the not-fit-for-purpose shingling they have a good reputation; should I go elsewhere? I'm now faced with rebuilding the home PC/server from bare metal again and am even considering replacing the PC altogether. Advice please! Martin of Sheffield (talk) 18:33, 31 August 2023 (UTC)[reply]

By PCI you mean old PCI slot or PCI Express? Ruslik_Zero 20:31, 31 August 2023 (UTC)[reply]
Oops, PCIe. PCI is so old that I didn't think anyone used it at all these days! Martin of Sheffield (talk) 20:49, 31 August 2023 (UTC)[reply]
RAID5 (assuming you are using SATA 3.0) will generate a 0.6 × 5 = 3 GB/s data stream. If your card is installed in an x1 PCIe slot, you will need at least PCIe v.5.0 to sustain such a data rate! Or you need a PCIe v.3.0 card installed in an x4 slot. Ruslik_Zero 20:23, 1 September 2023 (UTC)[reply]
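For what it's worth, that arithmetic can be tabulated as a quick sketch (theoretical figures only: the per-lane numbers are the usual approximations after encoding overhead, and real spinning disks sustain far less than 0.6 GB/s each):

SATA3_GBPS = 0.6                      # SATA III ~6 Gb/s = ~0.6 GB/s per port (theoretical)
DRIVES = 5                            # data-bearing drives reading at once

# Approximate usable PCIe bandwidth per lane in GB/s, after encoding overhead
PCIE_LANE_GBPS = {"2.0": 0.5, "3.0": 0.985, "4.0": 1.97, "5.0": 3.94}

needed = SATA3_GBPS * DRIVES          # 3.0 GB/s if every link ran flat out
for gen, per_lane in PCIE_LANE_GBPS.items():
    for lanes in (1, 4):
        have = per_lane * lanes
        verdict = "OK" if have >= needed else "bottleneck"
        print(f"PCIe {gen} x{lanes}: {have:5.2f} GB/s available vs {needed:.1f} GB/s -> {verdict}")

In practice spinning disks sustain well under 0.6 GB/s each, so an undersized link tends to show up as reduced throughput rather than anything more dramatic.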
Again, to clarify: are you using a dedicated SATA RAID card (if so, which?) or are you using software RAID through the motherboard controller? A full spec of all the kit you are using, and the OS, might be very helpful. I used to be an instructor for Compaq and IBM PC servers, which had their own dedicated RAID cards (old PCI), so I have a certain amount of slightly out-of-date experience. If you are using new HDDs you really shouldn't be getting excessive drive failures. PCIe RAID cards have been around for ages, and I would really doubt that that is the source of your problem. I searched for "ironwolf drives problems failure" and came across a number of issues with Seagate Ironwolf drives, especially in a NAS box. Some 'failures' appear not to be proper failures as such, more like a firmware-generated error which gets reported as a fail. There seem to have been various faulty batches of Ironwolf drives. Are yours particularly large capacity? I have used Western Digital drives for many years and have had no problems. What sort of usage do the drives get? Do they mostly just sit there as storage, or do they get heavy reads or writes? There used to be manufacturer-specific software for diagnosing HDD problems; again, what card are you using? MinorProphet (talk) 20:55, 1 September 2023 (UTC)[reply]
Just a simple SATA card. I use Linux MD to build RAID sets, so as far as the hardware is concerned it is just disk accesses. I've tried a couple of x1 4-port cards and am currently trying an x4 6-port card. After the RAID5->RAID6 transition failure I rebuilt using two mirror sets instead. Whilst adding a third drive to one set, it took out my root disk, both the /boot and / partitions. Annoyingly, the root disk wasn't involved in any of this. Attempting to mount the root disk from a rescue system reveals that the superblocks (XFS) have been lost, although the partitioning is visible. Formatting fails with a Quark 0 (eh?) error. I've currently removed all the disks and the PCIe card except for a single 1 TB Barracuda Compute using the on-board SATA, which I'm using to rebuild the root of the system from backups. As regards kit details: the mobo will have to wait until I've got at least some life back in it. Ditto the card. Disks range from 1-4 TB, so not that massive. OS is AlmaLinux 8.8. Error message details were in /var/log, which is on /, which is inaccessible! The Ironwolf disks were obtained over the last year as replacements for the shingled Barracuda Computes. It's possibly worth noting that in the past disks from Amazon arrived in a nice shock-resistant cardboard box. The last one was just in its anti-static bag inside a brown paper bag. Martin of Sheffield (talk) 08:45, 2 September 2023 (UTC)[reply]
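A minimal sketch of the kind of read-only post-mortem described there, run from the rescue system (device names are hypothetical placeholders, and xfs_repair is invoked in no-modify mode so nothing is written to the disk):

import subprocess

DISK = "/dev/sdX"       # hypothetical: substitute the real root-disk device node

checks = [
    ["lsblk", "-o", "NAME,SIZE,TYPE,FSTYPE", DISK],  # is the partition table still intact?
    ["blkid", DISK + "1"],                           # does partition 1 (/boot) still identify itself?
    ["xfs_repair", "-n", DISK + "1"],                # XFS consistency check, no-modify mode only
]
for cmd in checks:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=False)                 # run as root from the rescue environment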
If I understand it correctly, you have a boot/root disk (a magnetic-media SATA disk) and a bunch of other disks in software RAID, and creating the RAID caused the uninvolved boot/root disk to be seriously corrupted. It sounds like you're confident you didn't just mistype the mdadm command and unintentionally zap the boot disk - disk misbehaviour doesn't just zap its neighbours. Drives go offline occasionally, and generate errors very occasionally - this shouldn't be a common occurrence. SATA host adapters (AHCI) are already (logically) PCI devices (including those in the motherboard chipset), so having them on a physical PCIe bus isn't anything special or risky. "Overloading" the bus (which I really doubt you are doing, especially with spinning rust disks) would just result in submaximal throughput. Drives with physical damage will (hopefully) show SMART errors, rapidly failing performance, and errors in the syslog (if brought up on a clean machine). They really won't work okay for a bit, then go bad, then work okay again. So all of this sounds like magic-weird-nonsense behaviour, which makes me suspect it's a hardware issue. Some ideas:
  • overtaxing the SATA rails on the PSU, causing the voltage to sag. I know you added a bigger PSU, so that's probably not it, but PSUs assign only a proportion of their output to the SATA connectors, and a lot of disks might overtax that. The documentation (and usually a sticker on the side of the PSU) will show the limits, and you can tot up the usage of each disk against them (a rough tot-up sketch follows below). I have in the past made a little probe cable and used a multimeter to check this. Unfortunately consumer-grade PSUs don't have useful voltage reporting.
  • a memory error - transitory, recurrent memory errors are a real bugbear to find, and they can make effectively anything break, particularly if you're not using ECC DRAM (which, for purely stupid reasons, mostly isn't available on desktop/soho stuff). Reseat the DIMMs and run Memtest86+ for a long time.
  • a motherboard error (a cracked trace, a dry solderball under the chipset BGA) can cause all kinds of wackiness. Hard to diagnose, effectively unfixable.
I doubt it's the PCIe SATA card (unless it is some very cheap and nasty one); mostly the cheap ones' deficiencies are in performance. -- Finlay McWalter··–·Talk 15:22, 2 September 2023 (UTC)[reply]
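A rough tot-up of the kind suggested in the first bullet might look like this (every figure below is an illustrative placeholder; the real numbers come from each drive's datasheet and the rail ratings printed on the PSU label):

drives = [
    # (label, 12 V spin-up amps, 12 V running amps, 5 V amps) - placeholder example values
    ("IronWolf 4TB #1", 1.8, 0.45, 0.35),
    ("IronWolf 4TB #2", 1.8, 0.45, 0.35),
    ("IronWolf 2TB",    1.7, 0.40, 0.30),
    ("Barracuda 1TB",   2.0, 0.35, 0.25),
]

PSU_12V_AMPS = 34.0   # placeholder: read the 12 V rail rating off the PSU sticker
PSU_5V_AMPS = 15.0    # placeholder: ditto for the 5 V rail

spinup_12v = sum(d[1] for d in drives)
running_12v = sum(d[2] for d in drives)
running_5v = sum(d[3] for d in drives)

print(f"12 V at spin-up : {spinup_12v:.1f} A of {PSU_12V_AMPS} A (plus CPU/GPU/fan load)")
print(f"12 V running    : {running_12v:.1f} A of {PSU_12V_AMPS} A")
print(f" 5 V running    : {running_5v:.1f} A of {PSU_5V_AMPS} A")

The spin-up column matters most: if all the drives spin up simultaneously at power-on, the momentary 12 V draw can be several times the running figure.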
All 6 spindles are spinning rust. I don't think I mistyped the mdadm command, apart from anything else it took out the /boot superblock on partition 1 and the root superblock on an LVM built on partition 5, both at the same time, whilst retaining the partitioning. It would be a pretty weird command that would do that! Your assessment of disk behaviour aligns with my experience before I retired, which is why it is so puzzling. When disks have been taken offline and given an extended SMART, they look perfect.
  • Mobo: ASUS H81M-PLUS, firmware date 01/06/2014 (not sure if that is 1 June or 6 Jan, but it really doesn't matter).
  • Memory 2x DDR3 synchronous 1600 MHz 8 GiB, so no ECC.
  • CPU: Intel Core i3-4150 3.50 GHz, 2 cores, 4 threads
Typical error messages:
kernel: ata7.00: failed command: READ FPDMA QUEUED
kernel: ata7.00: cmd 60/00:88:38:8e:d8/06:00:04:00:00/40 tag 17 ncq dma 786432 in#012 res 40/00:88:38:8e:d8/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
kernel: ata7.00: status: { DRDY }
repeated many times per second. Usually they stop after a minute or two, maybe 5. Sometimes this leads to MD declaring a disk faulty. Sometimes it leads to the next disk on the bus throwing a few errors. Occasionally I get a:
kernel: ata7.00: exception Emask 0x10 SAct 0x3f0000 SErr 0x0 action 0x6 frozen
kernel: ata7.00: irq_stat 0x08000000, interface fatal error
This is why I'm suspicious that the disks are being messed about by the system, not vice versa. At least I now have a boot/root disk running, rebuilt from the backups, so I can look at error logs and specifications. Before trying to add back the user-level disks (/home, /cloud, /virt, /photos etc.) I'll give the memory a good thrashing as you suggest. Failing that, the only component I haven't changed is the mobo/CPU, so maybe it's time to change that. Many thanks for the support/suggestions, Martin of Sheffield (talk) 21:40, 2 September 2023 (UTC)[reply]
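Those "ATA bus error" / "interface fatal error" lines generally point at the link (cable, connector, controller, power) rather than the platters, and one quick cross-check is the SATA interface CRC counter in SMART, which only increments on link-level corruption. A small sketch, assuming smartmontools is installed and the member disks are /dev/sda to /dev/sdf (placeholder names):

import subprocess

DISKS = [f"/dev/sd{c}" for c in "abcdef"]   # placeholder device list; run as root

for dev in DISKS:
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "CRC_Error_Count" in line:        # attribute 199, UDMA_CRC_Error_Count on most drives
            print(dev, line.strip())

If those counts climb while the drives' media attributes stay clean, that would be consistent with the system messing the disks about rather than the other way round.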
Update: memtest86+ fails and reboots the system after a few seconds, about the end of test1. Hmmm! Martin of Sheffield (talk) 22:46, 2 September 2023 (UTC)[reply]
Wow, bang! Can you reproduce it every time? I've been running memtest86+ since approx. the year 2000 and never came across that. Something appears to be very wrong. I would suggest that the mobo is playing up: rather than replace it straight away, it might be cheaper to get a genuine matched pair of well-known DDR3 DIMMs to check. NB If you are being sold HDDs which just arrive in a static bag, you are definitely being ripped off.
In my personal (biased, as advertised above) opinion, software RAID is a load of old wank and always has been: anyone trying to use it except for fun certainly needs their head examining. If you really want RAID in any form whatsoever (and only someone who REALLY wants genuine data redundancy beyond daily backups does), I would go for a decent hardware 6- or 8-port SATA RAID PCIe card, perhaps something costing around £300. But you seem to have almost insuperable hardware problems which, although they might be intellectually interesting to investigate, will almost certainly be time-consuming and perhaps unsolvable. Maybe others have more low-down expertise.
If you have the cash, I would tend to suggest starting over again: new mobo, new RAM unless you can check it out on another mobo (most diags can identify genuine matched pairs), perhaps keep the CPU(?) and most certainly a proper hardware RAID card from a well-known manufacturer if you really think you need it. Of course, you may discover the source of the problem and how to fix it. Best of luck, MinorProphet (talk) 14:56, 3 September 2023 (UTC)[reply]
Very nearly the same every time - reboot at the end of test 1, just once it got as far as the end of pass 1 and then rebooted. I'm surprised at your comments about software RAID. I've never had any problems with it, except for the SATA/USB timing issue. It is certainly recommended over fake-RAID. Let me sum up: 9-year-old mobo & CPU, faulty memory, which is maxed out at 16 GiB. This sounds like a funeral service and a new system to me. Grrr. Thanks for the help all. Martin of Sheffield (talk) 17:27, 3 September 2023 (UTC)[reply]
It may simply be a bad contact on a DIMM. I'd take them out, clean the socket and the DIMM contacts with contact cleaner, and firmly reseat (giving each a Paddington stare). -- Finlay McWalter··–·Talk 19:18, 3 September 2023 (UTC)[reply]
I've already reseated. memtest fails at once. Hence I'm looking at spec for a new machine. Martin of Sheffield (talk) 20:17, 3 September 2023 (UTC)[reply]
Re software RAID: call me old-fashioned, but I would go for hardware every time: it's still really quite difficult to hack a physical cable vs. a wireless signal, although that's obviously not the point: but why ask your OS to look after your data redundancy? If Compaq had sold underpants, I would be still wearing them now - still have several T-shirts... Their business ethic towards customers was faultless, essentially: if it goes wrong, we acknowledge it in a Service Bulletin and fix it for free, either immediately or fix-as-fail. The merger with HP was possibly one of the most horrendous crimes against computing ever perpetrated...
...Dearly beloved, we are here to commit Martin of Sheffield's ancient and previously trusty server to the dust. May its true silicon heart continue to pulse through its imminent re-incarnation, and may the inextinguishable spirit of hardware be reborn with purpose anew. Blue skies, MinorProphet (talk) 23:34, 3 September 2023 (UTC)[reply]
:-) Well I've never had a prophet taking a funeral service before, just lowly vicars. BTW and FYI I was with DEC when Compaq took them over, and thence to HP. Martin of Sheffield (talk) 08:43, 4 September 2023 (UTC)[reply]
I chose my WP moniker when I only had a passing acquaintance with the OT. I only later discovered that I was named after one of the MajorProphets (only so called because more of their scribblings have been preserved), and it's not Daniel, Ezekiel or Isaiah. My homily should have been more along the lines of "Woe unto the servants of Fujitsu and their false priests! Weep, O ye middle managers of the tribe of BigBlue! Besmirch your golden goodbyes and guaranteed pensions with sackcloth and ashes! The glory of HP will cover the whole earth, and there shall be no end to its dominion! Make way for the fatted calf of slaughter! Hurl your burnt offerings into the refining fire of corporate merger! Humble yourselves before the Chief Priestess Carly! All hail!" MinorProphet (talk) 21:10, 4 September 2023 (UTC)[reply]
You could try a PCI reconfiguration, or, alternatively you could reset the SATA drive. I usually go with the former. Ассхол МакДелукс (talk) 03:44, 5 September 2023 (UTC)[reply]