We had an interesting problem of late.
'Interesting' for its sheer technical nature, but not very 'interesting' in its effects and aftermath.
Our network had been developing high latency over a period of time, for unknown reasons.
And one fine day it just blew up - the latency was so high that everything started to time out.
We started out doing all the wrong things, which didn't give us a clue about what was happening!
#1. Checked for faulty cabling.
#2. Checked with our ISP for any issues; they threw up their hands and made it clear the problem was our own brainchild.
#3. Tried pulling our desktops and laptops off the network, but to no avail.
Sigh! No clues yet.
And when I was called into the picture, I changed the approach.
We did a logical backtrack from our router into the internal LAN.
And to my nightmarish delight, we had this Linux box sitting in front of the router/firewall as a single-point gateway.
Neat, isn't it?! Well, the story continues....
We had all our voice + data traffic being dumped through this poor machine, which had started to buckle under the
sheer load passing through it. Not sure what the design decision behind this was, though!!! It's all an Inherited Mess and there's not much to be done about it!
And this poor machine was an age-old Mandriva which none of our SAs dared to tamper with, given how fragile it was.
Hence no sufficient monitoring tools. SIGH!
So with plain netstat and /proc information, we found the major problem - there was an unusual flood of UDP packets,
and almost all of those UDP packets were being dropped at the gateway!
Congestion???!! Probably ... but we could never find the root cause.
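
For the curious, the /proc side of that check can be sketched roughly as below. This is only an illustration: I am assuming the counters come from /proc/net/snmp (the same numbers `netstat -su` prints), and the sample values, helper names and the idea of diffing two snapshots are made up for the example, not a record of what we actually ran. On a box that only forwards traffic, the drops may well show up in the per-interface counters instead.

```python
from typing import Dict

# Two made-up snapshots in the layout of /proc/net/snmp (values are illustrative only).
SAMPLE_BEFORE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 1000 5 100 800
"""
SAMPLE_AFTER = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 1050 5 1000 820
"""


def parse_udp_counters(snmp_text: str) -> Dict[str, int]:
    """Pair the 'Udp:' header line with the 'Udp:' value line that follows it."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = udp_lines[0], udp_lines[1]
    # Skip the leading 'Udp:' label on both lines.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def udp_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """How much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


def read_udp_counters(path: str = "/proc/net/snmp") -> Dict[str, int]:
    """Read the live counters on a Linux host (path is the assumed source)."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())


delta = udp_delta(parse_udp_counters(SAMPLE_BEFORE),
                  parse_udp_counters(SAMPLE_AFTER))
print(delta)
```

If InErrors grows far faster than InDatagrams between two snapshots taken a few seconds apart, most of the incoming UDP is being thrown away, which is the kind of picture we were staring at.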
Then started the exercise of tracking down the machine which was flooding. Thanks to one of my responsive SAs, who overcame his dread
of tampering with this system, we got an IPTraf instance installed, up and running.
Thanks to IPTraf, we found the culprit machine which was flooding and brought it down. It happened to be a
spare / lesser-used server which was not supposed to be doing anything significant!!!
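
The idea behind the hunt is simple enough to sketch: count UDP packets per source address and see who tops the list. The snippet below is not how IPTraf works internally and not something we ran; the record shape, the function name and the addresses (taken from the documentation range 192.0.2.0/24) are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record shape: (source IP, protocol, packet size in bytes).
Packet = Tuple[str, str, int]

SAMPLE_PACKETS: List[Packet] = [
    ("192.0.2.50", "udp", 512),
    ("192.0.2.10", "udp", 120),
    ("192.0.2.50", "udp", 512),
    ("192.0.2.11", "tcp", 1400),
    ("192.0.2.50", "udp", 512),
    ("192.0.2.11", "tcp", 1400),
    ("192.0.2.50", "udp", 512),
]


def top_udp_senders(packets: Iterable[Packet], limit: int = 3) -> List[Tuple[str, int]]:
    """Return the busiest UDP sources as (source IP, packet count), busiest first."""
    counts = Counter(src for src, proto, _size in packets
                     if proto.lower() == "udp")
    return counts.most_common(limit)


for source, count in top_udp_senders(SAMPLE_PACKETS):
    print(source, count)
```

A host that has no business sending much and still sits at the top of that list by a wide margin is the one to go and look at.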
And the next day ... dawned with another sudden issue - mails were failing! For no reason! We tracked back again and
came to know that the server we had brought down had a stupid, undocumented service which had been configured to
JUST FORWARD MAILS! By God!
I HATE THE JACKASS, FIX IT NOW, RECKLESS NETWORK DESIGNERS! AND UNDOCUMENTED SERVICES ON TOP OF THAT!
We put things back in place, only to find the latency issue had returned ...
Mails failed - the mail server timed out ...
All external browsing had come to a grinding halt...
What to do now!? Where to Turn?! Whom to blame?! How to get this thing fixed!?
It is beyond words to describe those few days when complaints kept pouring in...
with a mix of people trying to use the systems failure as a vehicle to cover up their own non-performance....
and mounting the pressure on the SAs by nagging them every 5-10 mins for an ETA on an unknown problem!!!
That was when it just blew out of proportion....
(...to be continued)
Linux :: Networking :: Tools
Sivaramasundar