When Half is Not Enough

Back in the day, when the TCP/IP protocol suite was the new kid on the block, one of the classic issues was how to deal with half-open connections:

A TCP connection is considered “half-open” when one party thinks the connection has been closed and the other party thinks the connection is still open.

One solution to the half-open connection is the old "When in Doubt, Time Out" rule.  That is, if no packets have been sent/received on an existing connection for some preset period of time, unilaterally close the connection.  Not elegant, but works great. 

The Internet has grown exponentially over time and TCP/IP has become one of those things that are just there and works the way it is supposed to 999.999999 percent of the time.  Today’s average PC user really doesn’t think or care about half-open connections as defined above anymore, or even how TCP/IP does what it does!

Why am I writing about this?  Well, over the past week or so connections to two of my web sites has been bouncing, i.e., the monitoring service I use has been reporting intermittent loss of connectivity:

Alert Type: Site Not Available

Result: Failed

Time: August 11, 2009 10:39:23

HostName/URL: portal.hrpr.com

Monitor Name: hrpr.com

Service: http
Alert Type: Site is Available

Result: Ok

Time: August 11, 2009 11:09:24

HostName/URL: portal.hrpr.com

Monitor Name: hrpr.com

Service: http

I opened up a trouble ticket with my host provider and after a few back and forth conversions with them, ended up with this more or less "final" answer to the root cause of my problem.  Although not mentioned below, the implication was that the PHP scripts used by my sites were exceeding the server’s PHP memory usage and/or processing time limit.  And I knew this wasn’t the case:

Well, unfortunately, I have to maintain that the problem lies elsewhere at this time. Your server is rock solid stable, your apache instance is unwavering, and the ONLY reason I’ve found, other than an external connection issue, somewhere between yourself (or your monitoring service) and our servers, is the process watcher entries. If this were our issue, I’d be very quick to point it out. We’re not really about shifting blame here…check out http:// for shining examples of our "we screwed up, we’ll fix it" mentality.

However, in this case, the two possible issues are a bad connection between yourself and our servers, and the process watcher/resource issues. Again, if you want a useful diagnosis, we’ll need to see a traceroute taken at the time that the problem is occurring. I never suggested you had networking issues anywhere else…but incorrect routing, or network issues along the specific path that your connection takes can cause a perceived interruption in service. You may also try checking your site via proxy such as http:// the next time it appears to be down.

Let me know if you have any other questions!

After reading this response, please consider visiting the URL below to comment on its quality.

Thanks!

http://…

To which I replied:

OK, I give up, you win. I’ll continue trying to isolate it on my own.  And a few other choice words I will not repeat here. 

As suggested, I did visit the "URL below" and sent a comment on the quality of the response.  As you might suspect, my comment was somewhat negative.

A few hours later, I got this in an email from the host provider’s support team:

I think I found your issue. I apologize that this was over-looked but this actually isn’t supposed to happen as on our newer servers we have implemented FTP connection limits to stop them from piling up. Basically I saw your user had a huge number of stale FTP connections.

There was about 100 of them. Our process watcher will kill your processes if you have too many processes running and that’s why your processes were being killed (99% of the time they are killed for memory usage not process count). Our process watcher won’t kill FTP or shell connections as its hard to tell if the connections are legit or not but they are still counted towards your total process limit.

I killed all your stale FTP connections so I think you won’t be getting killed by our process watcher anymore (or not nearly as much). It looks like all your processes that were killed were due to the process count limit (not memory).

I apologize that this was over-looked by … but I can understand why as this isn’t even possible on our newer servers because we limit the FTP connections to 7 total by IP or account and again 99% of the time processes are killed for memory usage not process count. Please feel free to email me at … if you are still running into these issues and think this issue might have popped back up.

In 25 words or less, the intermittent connectivity problems were being caused by half-open (stale) connections. 

The host’s FTP server thought the connections were still open and as far as my FTP client was concerned, they had long since been closed.  The nasty side effect of this was that other processes such as those spawned by the HTTP server were getting killed off because of this.

What’s the moral to this story?  Well, there really isn’t one.  I just wanted to write this down somewhere as it brought back memories to me of days gone by.