Failover Cluster Validation Hiccups (Or: How to Be a Network Cable Detective)

I don’t get to spend a lot of time at my current company’s offsite DR location. I’m currently building the cluster stack from the ground up, mimicking our production environment almost 1:1. Other than a shittier SAN and one less server in the cluster, the specs are just about the same. The problem is getting it all up and running with what little time I have on location.

Yesterday, I spent a good portion of my time on-site cleaning up some wiring and getting my Hyper-V servers ready to be clustered remotely. The switches were on the correct management VLAN, iDRAC access was working smoothly, and everything else could be handled from my desk or laptop in a different ZIP code. Cool.

So today, I plop myself down at my desk, RDP into both Hyper-V nodes, and kick off the Failover Cluster validation wizard. And that’s when my troubles began:

NODE1 - iSCSI1 cannot communicate with NODE2 - iSCSI1
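
Side note: if you’d rather not click through the GUI every time, the same network checks can be re-run from PowerShell. A minimal sketch, assuming the node names from the error above and that the Failover Clustering tools are installed wherever you run it (the report path is arbitrary):

    # Re-run only the network portion of cluster validation from PowerShell.
    # Node names come from the error above; pick any report path you like.
    Import-Module FailoverClusters
    Test-Cluster -Node NODE1, NODE2 -Include "Network" -ReportName "C:\Temp\DR-ClusterValidation"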

That’s weird. Why would only one set of my iSCSI NICs not see each other? (I have two quad-port NICs in each server: one Intel, one Broadcom.) After some pinging out of specific interfaces, the only conclusion I could come to was that I’d crossed a patch cable somewhere (or worse yet, hadn’t clicked one in all the way!). That would absolutely suck – would I really have to drive back out there just to plug in a stupid network cable? A complete waste of a day. Unless…
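
For the curious, the per-interface testing looked roughly like this: pin the ping to a specific source NIC and send full-size, don’t-fragment packets so an MTU or VLAN problem can’t hide behind fragmentation. The addresses below are made up for illustration; substitute your own iSCSI subnets:

    # Ping NODE2's iSCSI1 address *from* NODE1's iSCSI1 interface.
    # -S pins the source address (which NIC the ping leaves from)
    # -f sets Don't Fragment; -l 8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers
    # The 10.10.x.x addresses are hypothetical -- use your own iSCSI subnets.
    ping -f -l 8972 -S 10.10.10.11 10.10.10.12    # NODE1 iSCSI1 -> NODE2 iSCSI1
    ping -f -l 8972 -S 10.10.20.11 10.10.20.12    # NODE1 iSCSI2 -> NODE2 iSCSI2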

I popped open the switch’s config and dumped out which ports had jumbo packets detected on their connection. Spaced out correctly were four ports, two per server, all showing 9216-byte frame sizes. Perfect! The cables are plugged in, and now I know exactly which ports they’re on. Could I really have been so foolish as to forget to untag those ports for my iSCSI VLAN? No way…
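
Switch CLIs vary too much by vendor for me to paste mine here, but the Windows half of that detective work is easy to repeat: confirm the iSCSI adapters really are set for jumbo frames before you blame the switch. A rough sketch (Intel and Broadcom drivers name the advanced property slightly differently, hence the wildcard match):

    # List every adapter advanced property mentioning "Jumbo" so driver naming
    # differences (Jumbo Packet vs. Jumbo Frame vs. Jumbo MTU) don't matter.
    Get-NetAdapterAdvancedProperty |
        Where-Object { $_.DisplayName -like "*Jumbo*" } |
        Select-Object Name, DisplayName, DisplayValue |
        Format-Table -AutoSize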

Yup. That was it. Something so simple. Validation passed and I moved on with my life.

So, lessons learned:

  1. Always make sure your switches are manageable from offsite.
  2. Always document your switchport usage somewhere (wiki, JIRA ticket, whatever). My environment is small enough that I could visualize what was what; coupled with the jumbo-frame detective work, that was enough to figure this one out.
  3. Bonus Lesson: Always make sure your iDRAC / out-of-band management is working before you leave. You cannot rely on RDP, especially when you’re fixing wonky network settings on switches and clusters – things will absolutely drop! (A quick pre-flight check is sketched after this list.)
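
To put some teeth on that bonus lesson, here is the kind of pre-flight check I run before leaving the site now. The hostnames are placeholders, and the ports assume the iDRAC web interface on HTTPS (443) and RDP (3389) on the nodes:

    # Hypothetical management addresses -- substitute your own iDRAC and node names.
    $idracs = "idrac-node1.dr.example.com", "idrac-node2.dr.example.com"
    $nodes  = "node1.dr.example.com", "node2.dr.example.com"

    # iDRAC answers on HTTPS (443); the Hyper-V nodes should answer RDP (3389).
    $idracs | ForEach-Object { Test-NetConnection $_ -Port 443  | Select-Object ComputerName, TcpTestSucceeded }
    $nodes  | ForEach-Object { Test-NetConnection $_ -Port 3389 | Select-Object ComputerName, TcpTestSucceeded }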
Ernie Costa
Sysadmin who enjoys golfing and double-unders.
