We hit a new (and disturbing!) failure mode recently when a production rack that had been up for several months saw every (!) compute sled's service processor become simultaneously unresponsive. Bryan and Adam were joined by the members of the Oxide team who debugged the vexing issue -- and reached its surprising root cause.
In addition to Bryan Cantrill and Adam Leventhal, we were joined by Oxide colleagues Cliff Biffle, Matt Keeter, and Will Chandler.
Previously, on Oxide and Friends:
OxF s05e03 – Holistic Engineering with Robert Mustacchi
OxF s04e14 – Rebooting a datacenter: A decade later
OxF s01e26 – The Pragmatism of Hubris
OxF s05e20 – Debugger-Driven Development • (omdb)
OxF s05e07 – Transparency in Hardware/Software Interfaces
OxF s05e31 – Futurelock
OxF s05e33 – A Grown-up ZFS Data Corruption Bug

Some of the topics we hit on, in the order that we hit them:
hubris #2304: STM32H7 Ethernet driver stops yielding CPU after many packets
gist: Summarizing the Hubris side of investigations
Matt's blog: Hunting a spooky ethernet driver bug

If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!
