Tuesday, 19 August 2014

IBM x3500 SAS Backplane Pinout

You can get these backplane boards fairly cheap on eBay these days.  Even if you don't have the server they belong to, they could still be useful for DIY storage projects. The IBM FRU is 44E8783. There are likely other similar IBM parts that use the same pinout, but no guarantees.
Unfortunately as far as I can tell, nobody bothered to document the cabling for them!  As luck would have it, I still have a working one, so I was able to map out the power cable with a multimeter.

This is just a regular backplane, not a SAS expander - one SFF-8087 connector for 4 drives.  The data connector is standard, it's just the power cable that is strange.  It looks similar to an ATX 24 pin power connector, but smaller.  If you are buying one to use in a different device, make sure it comes with the power cable - you may not be able to easily find the right connector.

I measured this by removing the cable from the backplane and measuring the voltage on each pin, with the black probe on the metal of the case (ground).  The picture shows how I numbered the pins.


Pin   Voltage   Pin   Voltage
1     5v        13    5v
2     Ground    14    Ground
3     5v        15    5v
4     Ground    16    Ground
5     5v        17    3.3v
6     Ground    18    Ground
7     5v        19    5v
8     5v        20    5v
9     Ground    21    Ground
10    12v       22    12v
11    Ground    23    Ground
12    12v       24    12v
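
If you want the pinout in a script or a notes file, the table above can be written down as a small lookup.  This is only a transcription of my measurements into Python; the names PINOUT and pins_for are made up for the example.

```python
# Pin -> rail, transcribed from the multimeter measurements above.
PINOUT = {
    1: "5v", 2: "Ground", 3: "5v", 4: "Ground", 5: "5v", 6: "Ground",
    7: "5v", 8: "5v", 9: "Ground", 10: "12v", 11: "Ground", 12: "12v",
    13: "5v", 14: "Ground", 15: "5v", 16: "Ground", 17: "3.3v", 18: "Ground",
    19: "5v", 20: "5v", 21: "Ground", 22: "12v", 23: "Ground", 24: "12v",
}


def pins_for(rail):
    """Return the sorted pin numbers that carry the given rail."""
    return sorted(pin for pin, value in PINOUT.items() if value == rail)


for rail in ("3.3v", "5v", "12v", "Ground"):
    print(rail, pins_for(rail))
```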

Monday, 23 June 2014

KB2919355 and LSI Raid Controllers Part 3

Microsoft released a hotfix a few weeks ago, and has finally started pushing it out via Windows Update.  KB2966870 should now be installed automatically before KB2919355, so servers shouldn't break anymore.  Yay!

https://support.microsoft.com/kb/2966870

I'm now in the process of updating all my affected machines.  Other people have reported that the update does solve the problem, so hopefully this goes smoothly.

Wednesday, 14 May 2014

KB2919355 and LSI Raid Controllers Part 2

Microsoft has finally released a knowledge base article about this issue - http://support.microsoft.com/kb/2967012

The problem affects a lot of LSI based cards, including the Dell H200 my servers have, as well as some from HP, IBM and Supermicro.

Unfortunately, while there is a temporary workaround, there is no fix yet.  They did however figure out what is causing it:

This problem occurs if the storage controller receives a memory allocation that starts on a 4 gigabyte (GB) boundary. In this situation, the storage driver does not load. Therefore, the system does not detect the boot disk and returns the Stop error message that is mentioned in the "Symptoms" section.

Note This problem may not always occur in this situation. The problem is affected by the computer’s startup process, the driver load sequence, and the memory allocation of the storage controller driver at startup.

Microsoft's workaround is to limit the system to 4GB of RAM long enough to remove the update.

This is actually an interesting bug.  4GB is the limit for 32 bit memory addresses.  But this is a 64 bit OS.  Those limits should be well behind us.  The driver itself didn't change with KB2919355, so something else in the update triggers this bug.
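
To make the numbers concrete, here is a small Python sketch of what "starts on a 4 GB boundary" means, and of one way such an address could confuse code that only keeps 32 bits of it.  This is purely an illustration: Microsoft's article doesn't say what actually goes wrong in the driver, and the example addresses are made up.

```python
FOUR_GB = 2 ** 32  # 4,294,967,296 bytes, one past the largest 32-bit address


def on_4gb_boundary(address):
    """True if the address is an exact multiple of 4 GB."""
    return address % FOUR_GB == 0


def low_32_bits(address):
    """What is left if code keeps only the lower 32 bits of an address."""
    return address & 0xFFFFFFFF


# Hypothetical addresses, chosen only for the example.
for address in (0x0_FFFF_F000, 0x1_0000_0000, 0x2_0000_0000, 0x2_0000_1000):
    print(hex(address), on_4gb_boundary(address), hex(low_32_bits(address)))
```

An address that sits exactly on the boundary has all-zero lower 32 bits, so anything that truncated it would see 0.  Whether that is what happens here is a guess on my part.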

I'm quite curious to see what they changed that caused this issue, and why the Server 2008 drivers aren't affected by it.  Hopefully Microsoft will be releasing a fix soon.

On the upside, they seem to have extended the cutoff for future updates to June 10th, according to http://blogs.windows.com/windows/b/windowsexperience/archive/2014/05/12/windows-8-1-update-requirement-extended.aspx.  That post doesn't say if it applies to Server 2012R2, or just Windows 8.1.  I'm hoping it applies to both.  People who use WSUS to update their systems have until August before KB2919355 becomes required to receive further updates.  Unfortunately, while I am working on rolling out WSUS, it isn't quite ready yet.

For now all I can do is sit back and hope Microsoft, Dell and/or LSI come out with a fix soon.

Documenting patch panels the easy way

As I discussed previously, network documentation is important. One part of that is documenting all the physical cabling of a building.

I had started drawing out Visio diagrams for each of our patch panels, indicating where each cable went, as well as floor plans marking all the Ethernet ports.  The floor plans aren't too bad, especially if you can get CAD drawings of the building.  But the patch panels turned out to be a pain - positioning labels over each port on a patch panel is very time consuming.

So I decided to write a plugin for DokuWiki that will do the hard work for me.  I give it a simple description like this:

<patchpanel groups="6" name="Top_of_Rack" ports="24" rows="2">
# Port Label (#COLOR) Comment
1 Uplink #ff0000 Connects to firewall
2 2 Office 101
3 3 Office 102
4 4 Office 103
5 5 Office 104
24 24 Reception desk
</patchpanel>
And I get a drawing like this:

The plugin allows you to specify the name, number of ports, number of rows, and port grouping of the patch panel, so it looks pretty close to the real thing.  For each port, you can specify a label, optional color, and a comment.  The comments are shown as tooltips when you hover over a port.
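
To show how little there is to the port lines, here is a rough Python sketch of parsing them into structured data.  It is not the plugin's actual parser (the plugin runs inside DokuWiki, not in Python), and it assumes that labels are single words and colors are always six hex digits.

```python
import re

# number, label, optional #rrggbb color, then free-text comment
PORT_LINE = re.compile(r"^(\d+)\s+(\S+)(?:\s+(#[0-9a-fA-F]{6}))?\s*(.*)$")


def parse_ports(text):
    """Parse port lines into dicts; skip blank lines and '#' comment lines."""
    ports = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = PORT_LINE.match(line)
        if match is None:
            continue  # ignore anything that doesn't look like a port line
        number, label, color, comment = match.groups()
        ports.append({
            "port": int(number),
            "label": label,
            "color": color,  # None when no color was given
            "comment": comment,
        })
    return ports


example = """\
# Port Label (#COLOR) Comment
1 Uplink #ff0000 Connects to firewall
2 2 Office 101
24 24 Reception desk
"""
for port in parse_ports(example):
    print(port)
```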

This saves me a ton of time over updating Visio diagrams, and means I don't have to edit the Visio diagram, export it as an image, and upload it to our wiki.  Just edit a few lines of text when a port changes.  I'm also planning on writing one to handle switches.

The image is actually an embedded SVG.  The amazing thing about scalable vector graphics is that they are text-based descriptions of what to draw - so with a little bit of effort, you can dynamically create an image in any programming language, without any special image manipulation libraries.
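
As a rough illustration of that point, the Python sketch below builds a very plain panel drawing using nothing but string formatting.  It is not the plugin's code, and the layout (12 ports per row, 30 pixel squares) is an arbitrary choice for the example.

```python
from xml.sax.saxutils import escape


def panel_svg(ports, columns=12, size=30):
    """Return an SVG string with one labelled square per port.

    ports maps port number -> (label, comment).
    """
    rows = (max(ports) + columns - 1) // columns
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{columns * size}" height="{rows * size}">'
    ]
    for number, (label, comment) in sorted(ports.items()):
        x = ((number - 1) % columns) * size
        y = ((number - 1) // columns) * size
        parts.append(
            f'<g><title>{escape(comment)}</title>'  # browsers show <title> as a tooltip
            f'<rect x="{x}" y="{y}" width="{size}" height="{size}" '
            'fill="white" stroke="black"/>'
            f'<text x="{x + size / 2}" y="{y + size / 2}" font-size="8" '
            f'text-anchor="middle">{escape(label)}</text></g>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = panel_svg({
    1: ("Uplink", "Connects to firewall"),
    2: ("2", "Office 101"),
    24: ("24", "Reception desk"),
})
print(svg)
```

Save the output as a .svg file and any modern browser will draw it.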

If you are interested in using this plugin, you'll find it at https://www.dokuwiki.org/plugin:patchpanel

Bug reports and feature requests should be added to the GitHub repo.

Sunday, 27 April 2014

Documentation

Documentation is one of the most overlooked things in IT.  Maybe it's because writing documentation is boring.  Maybe because you don't have time to write it all down before moving on to put out the next fire.

Why do I need any documentation?  I know how it all works...


Documentation is important!  Without it, nobody knows what other people on their team have changed.   Nobody knows how things are set up, and more importantly why they are set up that way.  Bringing on a new person to the IT staff is much easier if they have a document they can reference, rather than having to ask about every little detail.

In a crisis having good documentation of your backup and restore procedures is essential.  You need to restore things quickly, people will be yelling at you, and it becomes very stressful.  Write yourself a nice step by step guide while you test the backups, so when you need to do it for real you just follow the instructions.

From the IT staff's perspective, having good documentation has direct benefits. You can't take a real vacation if you are the only person who knows how parts of the network work.  Those parts will break, and you will get a phone call.  And if you have a terrible memory like me, you'll come across things that are set up strangely - that you set up yourself, several years and many projects ago.  You'll remember there's a reason you did it that way, but not what the reason was.  That information is just lost.

From a business standpoint, having no documentation is terrible.  IT staff can quit, get sick, or even die.  When someone comes in to take over for them, it will take them weeks to figure out where everything is and how everything runs.  I have personally run into situations where the only thing we could do was rebuild a service from scratch, because nobody knew the passwords and they couldn't be reset.

Ok, I'm sold.  I'll write things down.  Now where to put them?


I have gone through a lot of different systems for keeping documentation.  At one point it was a physical binder in my office.  At another a bunch of text files on my laptop.  Then a Sharepoint site for the IT staff.  All of them had limitations.

The binder never got updated properly.  The text files were hard to navigate, and couldn't easily be shared with the team.  Sharepoint worked OK for a while.  But what about the notes I need to fix Sharepoint when it breaks?  They had to be stored separately, or I wouldn't be able to reach them when I needed them.

Criteria for a good documentation system

Your needs may be different, but this is what I was looking for to store our documentation:
  • Easy to add to and update.  If it's hard or time consuming, nobody will do it.  Documentation is useless if nobody ever updates it.
  • Able to link to other parts of the documentation.  Specific information should only be added once, but referenced from anywhere that it might be relevant.  You don't want to repeat the same information over and over, since you'll have to find every instance of it when it changes.
  • Basic formatting and pictures.
  • Hosted ourselves.  There are cloud based platforms that do this kind of thing.  But I don't want my data inaccessible if they fail.
  • Accessible to and editable by the whole IT staff, without any extra effort to share our changes.
  • Accessible from anywhere and from any device - sometimes I need to fix things via remote desktop from my phone.
  • Accessible even in the event of a catastrophic network failure.
Those last few points are tricky.  Making it accessible to all the staff from any device is easy - put it on a web server.  But what if the web server dies?  Or I don't have a working internet connection when I really need it?  My solution is to use DokuWiki with Dropbox.

DokuWiki with Dropbox

DokuWiki is a simple open source PHP based wiki.  Its syntax is relatively easy to read and use.  It can store pictures and other things like saved config files.  It has decent access control to make sure only authorized staff can access it.  It takes care of everything except that last bullet point - accessible in a catastrophic network failure.

There are lots of similar wiki packages, but DokuWiki has one very important distinction for my purposes - it does not use a database.  Everything is stored in plain text files in its data folder.

Dropbox is a great service.  You run a program on each of your computers, and it syncs the files in your Dropbox folder between them.  Your files are also accessible from their website.  It also allows you to share folders with other people, and all the files in that folder will show up in their Dropbox folder on their computer too.  I've been using it for years, and it's dead simple to use.

The best thing about Dropbox is that it keeps a local copy of all your files.  Unlike other cloud storage services, if Dropbox goes down, I still have access to all of my files.  This is important, because it means using Dropbox isn't adding a single point of failure.  If Dropbox stops working, changes stop getting synced, but all the existing data is still stored on every computer.

Since everything in DokuWiki is a regular text file, with no databases, Dropbox can sync it to everyone's computer.  Add in a tiny web server like MicroApache, and you can load it directly from your Dropbox.  DokuWiki On a Stick is a prepackaged version that comes with the web server portion ready to go.  That covers loading it from our work and even home computers.  Any file someone edits automatically gets synced to everyone else within a few seconds.

What it doesn't cover is accessing it from mobile devices or computers that don't have Dropbox installed.  That's covered by loading Dropbox onto a web server running the full Apache, and pointing the DocumentRoot at our shared Dropbox folder.  The biggest trick here is that all the files must be writable by both Dropbox and Apache - either they have to run as the same user, or the two users have to be in each other's groups, with their umask set so that the group gets write permission.
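
The umask part is the bit that is easy to get wrong, so here is a small Python sketch of what it does.  It only demonstrates the effect on a newly created file on a POSIX system; how you actually set the umask for the Dropbox and Apache processes depends on how your distribution starts them, and isn't shown here.

```python
import os
import stat
import tempfile


def create_group_writable(path):
    """Create a file with umask 002 so the group keeps write permission."""
    old_umask = os.umask(0o002)  # mask out only the "other" write bit
    try:
        with open(path, "w") as handle:
            handle.write("test\n")
    finally:
        os.umask(old_umask)  # restore whatever the process had before
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as folder:
    mode = create_group_writable(os.path.join(folder, "page.txt"))
    print(oct(mode))
```

With the more common default umask of 022, the same file would be created without group write, and whichever of the two programs didn't create it would be unable to change it.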

Now, in normal operation, I can access the wiki from its web server.  If the web server is unreachable, I can fire it up on my laptop.  If I don't even have my laptop, at the very worst I can read the text files directly with the Dropbox client on my phone - I don't get the pretty formatting, but all the content is there.
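
Because the pages are just .txt files under the wiki's data/pages folder, even searching them without the wiki running is easy.  Here is a minimal Python sketch; the WIKI_PAGES path is a made-up example and will be different on your machine.

```python
from pathlib import Path


def search_pages(pages_dir, term):
    """Yield (page id, line number, line) for lines containing term."""
    root = Path(pages_dir)
    needle = term.lower()
    for path in sorted(root.rglob("*.txt")):
        # DokuWiki page ids use ':' between namespaces, e.g. network:switches
        page_id = ":".join(path.relative_to(root).with_suffix("").parts)
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                yield page_id, number, line.strip()


# Hypothetical location of the synced wiki; adjust to your own Dropbox folder.
WIKI_PAGES = Path.home() / "Dropbox" / "wiki" / "data" / "pages"

if WIKI_PAGES.is_dir():
    for page_id, number, line in search_pages(WIKI_PAGES, "restore"):
        print(f"{page_id}:{number}: {line}")
```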

I think it works brilliantly.

What about passwords?

I'm a very security conscious person.  Storing passwords in plain text in a third party service like Dropbox isn't a great idea.  So we do not record passwords in the wiki.

That's what KeePass is for.  It stores passwords for all our services, nicely organized and encrypted.  There is a portable version that runs directly from Dropbox, so all the staff have access to it and it is automatically kept up to date.

KB2919355 and LSI Raid Controllers (Dell H200)

I ran into an interesting problem last week.

I have 3 Dell R715 servers set up as a Server 2012R2 Hyper-V cluster.  I did my normal maintenance of pausing a node and installing Windows updates.  One of those updates was KB2919355 - the massive Update 1, that should have been a service pack but isn't for political reasons.  Oh, and if you don't install this update, you won't be getting any more updates starting in May.

After installing KB2919355, I rebooted again, and got a BSOD - INACCESSIBLE_BOOT_DEVICE.

Uh oh.

I messed around with recovery mode, googled for other people having this problem, etc.  Didn't turn up any solutions.  What I did find was some people with LSI RAID cards on SuperMicro machines having the same issue.  I posted on ServerFault to see if anyone had the same problem or knew of a fix.  Nobody did.

Two of the servers have H200 RAID controllers.  The third was for some reason ordered with an H700.  Both of the servers with H200 controllers have this issue.  The one with the H700 took the update without any issues.  Time to open a support case with Dell.  I rarely have to use our support contracts, but they are nice to have for times like this - it's pretty clear this isn't something I'm going to be able to fix on my own.  It's either a driver or firmware issue.

Since the server was dead anyways, I tried to reproduce the problem on a fresh install, to see if it was a conflict with something else I had installed.  It wasn't, and reproducing it is simple:
  • Install Server 2012R2
  • Install windows updates.  Keep installing them until it offers KB2919355
  • After installing the update, the server will boot properly.  Everything seems fine.
  • Reboot again, and you get a BSOD on every boot.
Thankfully, the 3rd server had just been installed so we would have room for more VMs.  We haven't actually done that yet, so the cluster can still run on a single machine, though without any redundancy.

Dell support has been very helpful.  But they tell me they built an identical machine in their lab and still can't reproduce the issue.  I can't imagine why - I can trivially reproduce it on both of my machines.

In any case, Dell is exchanging one of the systems and taking it back for testing.  Hopefully they are able to reproduce the problem this time and find a solution.  In the meantime, I have to live without Windows updates on these servers.

If you have a Dell server with an H200 raid controller, I'd highly recommend waiting until Dell has this fixed before installing KB2919355.  It may not affect all servers, but the ones it does affect it kills.

Network Monitoring

I have spent the last 2 years or so using Zabbix for monitoring the servers I'm responsible for.  Overall it works pretty well, but lately I've run into some shortcomings that made me investigate alternative solutions.

Particularly, Zabbix stores its data in MySQL.  The performance data and history keep growing until the tables have millions of rows and start slowing down the web interface, to the point that some pages take several minutes to load.  It's supposed to have housekeeping processes that clean up old data, but they often don't work properly.  Even after scaling back the frequency of checks and how long their results are kept, I still run into this problem every few months and have to wipe all my historical data.  I could throw more hardware at the problem, but between that and some other limitations (SNMP traps are hard to set up, flapping detection is hard, etc.) I wanted to see what else was out there.

It's been a while since I've looked at Nagios or any Nagios based solutions.  Open Monitoring Distribution seems like it may work well for me.  It installs on Ubuntu in just a few minutes, and comes with Nagios plus a bunch of addons and plugins that make it easier to configure, faster, and eliminate a lot of the shortcomings of the old Nagios setups I used to work with.  So I intend to set it up alongside my existing Zabbix install and see how well it does.

I already have a lot of time and effort invested in Zabbix.  So switching to something that works just as well isn't going to happen.  If I switch, it will be because the new setup works better and offers features Zabbix doesn't provide.

There are quite a few criteria I've decided upon for evaluating a new monitoring system:

Ease of Configuration

How hard is it to get the software set up to start with?  A monitoring system has to be customized to the environment it's monitoring.  Once everything is running, how hard is it to add new hosts and start monitoring them?  Are there templates or other tools that help decide what to monitor on each host?

How much can be done from a web interface, and what needs to be done from the command line or editing config files?  I'm comfortable with both, but some of my peers would rather have a GUI.

Monitoring Custom Services and Data

I have a lot of hardware and software that doesn't come with monitoring templates.  One key feature is how easy it is to write scripts to monitor things that aren't included in a typical monitoring solution.  We have a few appliance type devices that don't give the stats I want over SNMP, so they require using a web service or other tricks to get at the data.  I also want detailed statistics on the spam filters on my Exchange server, which can only be accessed through PowerShell.
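
For Nagios based systems, a custom check is just a program that prints one status line and exits with 0 (OK), 1 (warning), 2 (critical) or 3 (unknown).  Here is a minimal sketch of that convention in Python.  The mail queue check, its thresholds and the hard-coded reading are all made up for the example; the same pattern works in any scripting language.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def check_queue_length(length, warn=50, crit=200):
    """Return (exit code, status line) for a hypothetical mail queue length."""
    if length is None:
        return UNKNOWN, "QUEUE UNKNOWN - could not read queue length"
    if length >= crit:
        code, word = CRITICAL, "CRITICAL"
    elif length >= warn:
        code, word = WARNING, "WARNING"
    else:
        code, word = OK, "OK"
    # Text after '|' is performance data: label=value;warn;crit
    return code, f"QUEUE {word} - {length} messages | queue={length};{warn};{crit}"


def main():
    length = 12  # placeholder; a real check would query the device here
    code, line = check_queue_length(length)
    print(line)
    return code  # a real plugin would finish with sys.exit(code)


exit_code = main()
```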

Windows and Linux Support

I prefer to use the best tool for each job.  As such, we have a mixture of Windows and Linux servers.  A good monitoring system needs to be able to easily monitor both, with custom scripts written in PowerShell or VBS for Windows.

Particularly, I don't want to have to install PHP, Python, or other software on every Windows server.  I want a native self contained agent.  For custom monitoring, I want to be able to write my checks in whatever programming or scripting language is most convenient for that device, from that operating system.

Good Support for SNMP

Every decent monitoring system supports SNMP.  But few do it well.  Thanks mostly to copyright issues, MIBs are a nightmare no matter what system you use.  But if I have all the MIBs and know what data I want to monitor, I expect it to be easy to set up.

SNMP Traps are a whole separate mess.  The biggest problem is that devices will send a trap when they have a problem, but not send anything when they recover.  So it's difficult to tell if the problem still exists.  It also takes a lot of trial and error to determine which traps you actually care about, and which are considered normal.  I don't expect setting up traps to be very easy, but I do need them to work well once they are set up.

Graphs and Data Storage

I want pretty graphs.  My boss wants pretty graphs.  My boss's boss wants pretty graphs.  I expect graphs to be easy to set up, quick to load, and customizable to show what I need them to.

Storing the data in something like RRD is preferable - it automatically averages out older data so the database never actually grows, you just get fewer individual data points for older data.  It might keep data for every minute for the first few weeks, then average it out to every 5 minutes for a month, then every hour for a year.  This also makes displaying graphs for long time spans much faster.  A system that checks every minute would have over half a million data points (60 x 24 x 365 = 525,600) for a 1 year graph if it didn't do any averaging.
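
Here is a quick Python sketch of the averaging idea and the arithmetic behind it.  It is not how RRD is actually implemented (that uses fixed-size round robin archives); it only shows how averaging shrinks the number of points.  The readings are made up.

```python
def consolidate(samples, factor):
    """Average each run of `factor` consecutive samples into one point."""
    averaged = []
    for start in range(0, len(samples), factor):
        chunk = samples[start:start + factor]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


MINUTES_PER_YEAR = 60 * 24 * 365        # 525,600 one-minute samples
print(MINUTES_PER_YEAR)                 # raw points in a 1 year graph
print(MINUTES_PER_YEAR // 60)           # 8,760 points once averaged hourly

one_hour = [10, 12, 11, 13] * 15        # 60 made-up one-minute readings
five_minute = consolidate(one_hour, 5)  # 12 points
hourly = consolidate(one_hour, 60)      # 1 point
print(len(five_minute), len(hourly), hourly[0])
```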

Good Notifications - Tell Me What's Really Wrong

Another limitation I've run into is dumb notifications.  I have several separate physical sites to monitor.  If the internet goes down at one of them, I'll receive notifications that every service on every server at that site is down.  When that happens, it's easy to guess that it's the internet that's really down, but it does cause my phone to ring non stop.  The ability to tell the monitoring system "If firewall A goes down, you won't be able to reach X, Y and Z." helps prevent that.
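
The logic I'm after is simple enough to sketch in a few lines of Python.  The host names and topology below are invented, and a real monitoring system would do this internally (Nagios does it with its "parents" setting); this just shows the idea.

```python
# Hypothetical topology: each host lists the device it is reached through.
PARENTS = {
    "firewall-a": None,
    "switch-a": "firewall-a",
    "server-x": "switch-a",
    "server-y": "switch-a",
    "server-z": "firewall-a",
}


def hosts_to_alert(down, parents=PARENTS):
    """Return the down hosts whose entire path to the monitor is still up.

    A host behind another down host is unreachable rather than known to be
    broken, so it is left out of the notifications.
    """
    alerts = []
    for host in sorted(down):
        parent = parents.get(host)
        blocked = False
        while parent is not None:
            if parent in down:
                blocked = True
                break
            parent = parents.get(parent)
        if not blocked:
            alerts.append(host)
    return alerts


print(hosts_to_alert({"firewall-a", "switch-a", "server-x", "server-y", "server-z"}))
print(hosts_to_alert({"server-y"}))
```

When the whole site drops off, only firewall-a is reported.  When a single server dies on its own, it still gets reported as usual.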

Likewise, if something sits at the threshold of its alerting value, I get sent dozens of alerts.  "It's ok." "It's broken." "It's ok again." "It's broken again." over and over.  It should be easy to set up flapping detection or separate thresholds for problem and recovery to avoid that.
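
Separate problem and recovery thresholds look like this as a Python sketch.  The thresholds (alert at 90, recover at 80) and the readings are made up for the example.

```python
def alert_states(values, problem_at=90, recover_at=80):
    """Yield True/False (in alarm or not) for each reading, with hysteresis."""
    in_alarm = False
    for value in values:
        if not in_alarm and value >= problem_at:
            in_alarm = True   # crossed the problem threshold
        elif in_alarm and value <= recover_at:
            in_alarm = False  # only recover once well below it
        yield in_alarm


readings = [88, 91, 89, 91, 89, 92, 79, 85]  # made-up disk usage percentages
states = list(alert_states(readings))
notifications = sum(
    1 for before, after in zip([False] + states, states) if before != after
)
print(states)
print(notifications)
```

These readings produce two notifications, one problem and one recovery.  With a single threshold of 90 for both, the same readings would change state six times.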

I also need intelligent selection of notification methods.  For example, if the mail server is down, sending me an email probably isn't the best way to inform me of that.  Instead, send me a text message using my phone provider's email gateway.  I also want to feed the notifications into our ticketing system, so we have a single place to see all issues.  With Zabbix, I have some custom scripts for the ticketing system that close the ticket when they receive a recovery notification.  I want to keep doing that.

Business impact also falls into this category.  I would like to define certain services, like Email, Website, Sharepoint, etc. and tell the monitoring system "If checks X, Y or Z fail, Email is down.  If checks A, B or C fail, Email is up but will be running slow."  This makes it easier for me and my boss to see what end user services are affected, and help prioritize what gets fixed.  In the end, I already know all this information, but it saves a conversation explaining why something isn't working and what is affected - that's time that could be better spent fixing the problem.
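
As a sketch of what I mean, here is that rule written out in Python.  The service and check names are invented for the example; a real monitoring system would have its own way of configuring this.

```python
# Hypothetical mapping from end-user service to the checks behind it.
SERVICES = {
    "Email": {
        "down_if": {"smtp", "exchange-store", "mail-dns"},
        "slow_if": {"spam-filter", "mail-disk-latency", "mail-cpu"},
    },
    "Website": {
        "down_if": {"http", "web-db"},
        "slow_if": {"web-cpu"},
    },
}


def service_status(failed_checks, services=SERVICES):
    """Map each service to 'down', 'degraded' or 'ok' from failed check names."""
    failed = set(failed_checks)
    status = {}
    for name, rules in services.items():
        if failed & rules["down_if"]:
            status[name] = "down"
        elif failed & rules["slow_if"]:
            status[name] = "degraded"
        else:
            status[name] = "ok"
    return status


print(service_status(["spam-filter", "http"]))
```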

API For Getting Data Out of the System

Email alerts are great.  Irritating, but great.  But sometimes there are minor problems I don't want alerts about - software updates waiting, short bursts of high CPU, or temporarily high disk usage.  I want events like that monitored and recorded so I find out about them, but I don't need to be actively interrupted for them.  To help with that, I built a heads up display that sits over my desk.  It's a TV connected to a Raspberry Pi, that flips through screens showing graphs and alerts all day.

One thing I like about Zabbix is that it has an API for getting alerts and graphs.  It's not a very good API for some things, but it works.  If I switch monitoring systems, the new one needs to make it easy for me to get data out of it and onto the screen.
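
For reference, the Zabbix API is JSON-RPC over HTTP.  The Python sketch below only builds the request body for listing triggers that are currently in a problem state; it doesn't send anything.  The token is a placeholder, and the exact parameters should be checked against the API documentation for your Zabbix version, since I'm assuming the 2.x style where the auth token goes in the request body.

```python
import json


def trigger_request(auth_token, request_id=1):
    """Build a JSON-RPC 2.0 body asking for triggers in the problem state."""
    return {
        "jsonrpc": "2.0",
        "method": "trigger.get",
        "params": {
            "output": ["triggerid", "description", "priority"],
            "filter": {"value": 1},  # 1 = problem
            "sortfield": "priority",
            "sortorder": "DESC",
        },
        "auth": auth_token,  # token returned by an earlier user.login call
        "id": request_id,
    }


body = json.dumps(trigger_request("PLACEHOLDER_TOKEN"))
print(body)  # POST this to the frontend's api_jsonrpc.php
```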

Open Source

It's not that I don't want to pay for things.  If there were a paid solution that did everything I want easily, I'd get a cheque cut today.  But it's rare that any software does exactly what you want, how you want it.  The amount of money you spend is rarely a good indicator of how good of a job it will do.

The brilliant thing about open source software is that I can change it.  I've had to make a few minor changes to Zabbix to work with my display system and to monitor everything I want it to.  With open source I can do that.  I generally see it as a last resort - if I make a bunch of custom changes, I have to maintain those changes every time a new version is released.  But sometimes it's the only way to get the job done.  If they are changes that would benefit others, I can even eventually get them included in future releases.  Other people do the same, and contribute changes that are useful to me, and everyone ends up having to do less work to get things going.

Conclusion

I will be testing OMD to see how well it fits these criteria.  Out of all the monitoring systems I've taken a look at, it looks like it will come closest to covering all my criteria.  I'll have to make another post with how well it does.