So a really weird thing happened this weekend. Our infrastructure group did the Windows updates like they normally do and suddenly on Monday one of our active/active/active clusters (which means they are named instances) jobs started failing only for SSIS Packages in the SSIS Catalog with error 0x80131904 and “A network-related or instance specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible.”. Obviously the server is up and working fine, so it’s constrained to just this scenario. Through the normal course of troubleshooting I started to think about anything and everything that could be a root cause.
- First, you go look at the KB’s to see what was in the updates and see if there is anything obvious…nope.
- Service account? No that couldn’t be it since job steps that don’t have SSIS packages work fine.
- Some research said it could be a timeout issue, so we adjust the connection and remote login timeouts and try agin. Dang! Still no luck.
- Maybe it’s the SSIS Catalog itself…let’s recreate it and see. Nope, no good.
- What about corrupted jobs?? Let’s recreate the job from scratch. Nope, same thing.
- Okay, let’s try to outsmart this problem (so we can get our production jobs running!)…let’s recreate this job in our 3rd party scheduling system. Go to do just that and interestingly, the scheduling system suddenly can’t see the projects in the catalog! What the?!
This is is starting to definitely point to a security related update issue. Since we can’t roll back these security updates, we decide to get Microsoft involved to see what the deal is. After several hours of looking through logs (which showed absolutely nothing, no errors, no warnings, nothing), looking at settings and poking around…we finally run a connection test against the “shared memory protocol” and BINGO!
Looks like one of the .net updates started blocking the shared memory protocol, which is what the SQL Agent uses to connect to the SSIS catalog to run the packages. (Actually it’s used for anything running locally) So…how do we fix it?
I wasn’t aware of this, but if we create a SQL Alias from the named instance back to itself using TCP, then the agent uses that to connect to the SSIS Catalog. The ultimate fix is a hot fix for the .net update, but this is a great workaround and this whole experience has been a good practice in troubleshooting.