<body>

Know your network... before you ask someone

  1. Start with your routers/firewalls
  2. Understand your subnet(s).
  3. Understand your VPNs.
  4. Take stock of the IP addresses assigned to each server and every other box
  5. Install cfg2html (www.cfg2html.com)
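  6. Run cfg2html; it collects the configuration details of each server (hardware, OS, filesystems, network, installed software, etc.) into an HTML and a plain-text report
  7. Take the time to read through that report in detail. You'll then have more relevant questions to ask.
  8. Know the resource limits of each of your servers (CPU, memory, disk, network)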
  9. Know what services &amp; applications are running on which servers
  10. Locate the init/startup scripts and configuration files for the running services and applications
  11. Draw the dependency tree between servers, LANs and VPNs
  12. Decipher and visualize the gateways and routing tables on each server
  13. Obtain the access credentials for all the routers, switches, servers and other boxes
  14. Understand the physical network (if you have direct access) and take a firsthand physical inventory, regardless of what documentation you may be given
  15. Study how the servers connect to switches and other devices
  16. Study the power supply / UPS setup and any data connections it has to other devices
  17. Trace the physical cabling from each box to the switches (data ports)
  18. Have a set of basic troubleshooting tools handy - system &amp; network tools
  19. Make sure syslog is enabled and all system-related information is logged appropriately
  20. Know where each service/app running on your servers writes its logs; this is the first place you'd look for anomalies
  21. Use log analyzers, and make checking your logs part of your daily routine
  22. Perform weekly health checkups across all the boxes
  23. Most importantly, study the backup/restore policies and know first-hand what is being backed up and what is not
  24. Check the free storage space across all boxes as part of your daily routine
  25. Know your cron jobs: what's running where, and when
  26. Make sure your servers run uniform versions of packages/software
  27. Keep your servers clean: sweep them at least once a fortnight to remove unwanted files
  28. Know your filesystem, i.e. what is where, for quick access; keep the structure simple and uncluttered
  29. Maintain daily activity logs
  30. Configure your servers to save shell command history to a file; it will be a great help in a disaster
  31. Secure your boxes with proper user management
  32. Get into the habit of using sudo rather than working as root
  33. Know the alternate remote-access methods for getting onto a server when ssh fails or the network has issues (e.g. IP KVM, iLO 2)
  34. Golden rule: never panic in a high-pressure situation. It will only make things worse.
  35. Learn to use common sense. :) This is the most difficult part! Unfortunately, it isn't taught in any school, only at the university of life.
  36. Research and put in place monitoring tools you are comfortable with. You should have a dashboard / master console that gives you a periodic picture of what is happening across your tech ecosystem.
  37. Employ watchdog tools such as monit to automate recovery (e.g. automatic restarts) of critical services when they fail
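Item 2 in the list above (understanding your subnets) mostly comes down to a bit of address arithmetic. As a minimal sketch, Python's standard `ipaddress` module can work out the network and broadcast addresses, netmask and usable host range of a subnet. The `describe_subnet` helper and the `10.1.20.0/23` example are illustrative choices, not taken from any real network.

```python
import ipaddress


def describe_subnet(cidr: str) -> dict:
    """Summarise an IPv4 subnet given in CIDR notation, e.g. "10.1.20.0/23"."""
    net = ipaddress.IPv4Network(cidr)  # raises ValueError if host bits are set
    if net.num_addresses > 2:
        # The network and broadcast addresses cannot be assigned to hosts
        first = net.network_address + 1
        last = net.broadcast_address - 1
        usable = net.num_addresses - 2
    else:
        # /31 point-to-point links (RFC 3021) and /32 single hosts
        first, last, usable = net.network_address, net.broadcast_address, net.num_addresses
    return {
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "netmask": str(net.netmask),
        "prefix": net.prefixlen,
        "usable_hosts": usable,
        "first_host": str(first),
        "last_host": str(last),
    }


if __name__ == "__main__":
    # 10.1.20.0/23 is an arbitrary example subnet
    for key, value in describe_subnet("10.1.20.0/23").items():
        print(f"{key:>12}: {value}")
    # Membership test: is this address inside the subnet?
    print(ipaddress.ip_address("10.1.21.7") in ipaddress.ip_network("10.1.20.0/23"))
```

A /23 such as this one spans two "class C sized" blocks, which is exactly the kind of detail that is easy to misread from a netmask alone.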
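For item 4, once the addresses are collected (by hand or from the cfg2html reports), a short script can sanity-check the inventory. The sketch below assumes a simple hand-maintained mapping of host names to IPv4 addresses; it flags any address claimed by more than one box and any address outside the subnets you have documented. All host names, addresses and subnets are hypothetical placeholders.

```python
import ipaddress
from collections import defaultdict

# Hypothetical inventory: host name -> IPv4 addresses configured on it
INVENTORY = {
    "web01": ["192.168.10.11"],
    "web02": ["192.168.10.12"],
    "db01": ["192.168.20.5", "192.168.10.12"],  # clashes with web02
    "backup01": ["172.16.0.9"],                 # not in any documented subnet
}
KNOWN_SUBNETS = ["192.168.10.0/24", "192.168.20.0/24"]


def check_inventory(inventory: dict[str, list[str]], subnets: list[str]) -> dict:
    """Report addresses used by more than one host and addresses outside all known subnets."""
    nets = [ipaddress.ip_network(s) for s in subnets]
    owners = defaultdict(list)  # address -> hosts that use it
    for host, addrs in inventory.items():
        for a in addrs:
            owners[ipaddress.ip_address(a)].append(host)

    duplicates = {str(ip): hosts for ip, hosts in owners.items() if len(hosts) > 1}
    outside = [str(ip) for ip in sorted(owners) if not any(ip in n for n in nets)]
    return {"duplicates": duplicates, "outside_known_subnets": outside}


if __name__ == "__main__":
    print(check_inventory(INVENTORY, KNOWN_SUBNETS))
```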
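For item 8 on Linux, the kernel reports memory figures in `/proc/meminfo` (mostly in kB), and Python's `os.cpu_count()` gives the number of logical CPUs. Here is a small sketch that turns those into a resource summary; the helper names are made up for illustration, and the `MemAvailable` field only exists on kernels 3.14 and later.

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; most values are in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key.strip()] = int(rest.split()[0])  # drop the trailing "kB" unit
    return info


def resource_summary(meminfo_text: str) -> dict:
    """Hypothetical helper: CPU count plus headline memory figures in MiB."""
    mem = parse_meminfo(meminfo_text)
    return {
        "cpus": os.cpu_count(),
        "mem_total_mib": mem["MemTotal"] // 1024,
        # MemAvailable estimates memory available to new programs without swapping
        "mem_available_mib": mem.get("MemAvailable", 0) // 1024,
        "swap_total_mib": mem.get("SwapTotal", 0) // 1024,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):  # Linux only
    with open("/proc/meminfo") as f:
        print(resource_summary(f.read()))
```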
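The dependency tree from item 11 pays off most when something breaks: walking it backwards tells you every box or service affected by a failure. A minimal sketch, assuming you record dependencies as a mapping from each node to the nodes it needs; the node names are hypothetical.

```python
from collections import deque


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every node that directly or indirectly depends on `failed`."""
    # Invert the graph: for each node, which nodes depend on it?
    dependents: dict[str, list[str]] = {}
    for node, deps in depends_on.items():
        for d in deps:
            dependents.setdefault(d, []).append(node)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:  # breadth-first walk over the reversed edges
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


# Hypothetical dependencies: web01 needs app01, app01 needs db01 and a branch VPN, ...
DEPS = {
    "web01": ["app01"],
    "app01": ["db01", "vpn-branch"],
    "reports": ["db01"],
    "db01": ["san01"],
}

if __name__ == "__main__":
    print(sorted(impacted_by("db01", DEPS)))
```

Keeping the map as plain data also makes it easy to render with a graphing tool later, if you want the picture as well as the answers.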
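For item 12 on a Linux box, `ip route` or `netstat -rn` show the routing table directly; the kernel also exposes it in `/proc/net/route`, with addresses printed as hex in host byte order. The sketch below decodes that format and does a longest-prefix-match lookup to show which interface and gateway a destination would use. It assumes a little-endian host (e.g. x86), the sample table is made up, and the metric tie-break only approximates what the kernel does.

```python
import ipaddress

# Made-up /proc/net/route contents: a default route via 192.168.0.1,
# the local 192.168.0.0/24 LAN, and a VPN subnet 10.10.0.0/16 on tun0
SAMPLE = """\
Iface\tDestination\tGateway \tFlags\tRefCnt\tUse\tMetric\tMask\t\tMTU\tWindow\tIRTT
eth0\t00000000\t0100A8C0\t0003\t0\t0\t100\t00000000\t0\t0\t0
eth0\t0000A8C0\t00000000\t0001\t0\t0\t100\t00FFFFFF\t0\t0\t0
tun0\t00000A0A\t00000000\t0001\t0\t0\t0\t0000FFFF\t0\t0\t0
"""


def hex_to_ip(h: str) -> ipaddress.IPv4Address:
    # Each address is a 32-bit value printed in host byte order, so on a
    # little-endian machine the four bytes appear reversed.
    return ipaddress.IPv4Address(int.from_bytes(bytes.fromhex(h), "little"))


def parse_routes(text: str) -> list[dict]:
    routes = []
    for line in text.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        f = line.split()
        dest, gateway, mask = hex_to_ip(f[1]), hex_to_ip(f[2]), hex_to_ip(f[7])
        routes.append({
            "iface": f[0],
            "network": ipaddress.IPv4Network(f"{dest}/{mask}"),
            "gateway": None if int(gateway) == 0 else gateway,  # None = directly connected
            "metric": int(f[6]),
        })
    return routes


def next_hop(routes: list[dict], destination: str):
    """Pick the most specific (longest-prefix) route containing `destination`;
    ties are broken by the lowest metric."""
    addr = ipaddress.IPv4Address(destination)
    matches = [r for r in routes if addr in r["network"]]
    if not matches:
        return None
    best = max(matches, key=lambda r: (r["network"].prefixlen, -r["metric"]))
    return best["iface"], best["gateway"]


if __name__ == "__main__":
    for r in parse_routes(SAMPLE):
        print(f'{r["iface"]:6} {str(r["network"]):18} via {r["gateway"] or "on-link"}')
    print(next_hop(parse_routes(SAMPLE), "8.8.8.8"))
```

On a real box, `parse_routes(open("/proc/net/route").read())` would give the live table instead of the sample.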
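Dedicated log analyzers do far more, but as an illustration of item 21, here is a tiny one that counts failed SSH password attempts per source address. It assumes OpenSSH's usual `Failed password for ... from ADDR port N` message in a Debian-style `/var/log/auth.log` (Red Hat-style systems log to `/var/log/secure`); the sample lines and addresses are invented.

```python
import re
from collections import Counter

# Matches both "Failed password for root from ..." and
# "Failed password for invalid user admin from ..."
FAILED_RE = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")

# Invented sample lines in the usual syslog/sshd format
SAMPLE_LOG = [
    "Mar  3 10:01:12 web01 sshd[2211]: Failed password for root from 198.51.100.7 port 4242 ssh2",
    "Mar  3 10:01:15 web01 sshd[2211]: Failed password for root from 198.51.100.7 port 4242 ssh2",
    "Mar  3 10:02:40 web01 sshd[2230]: Failed password for invalid user admin from 203.0.113.5 port 51000 ssh2",
    "Mar  3 10:05:02 web01 sshd[2301]: Accepted publickey for deploy from 192.0.2.10 port 50122 ssh2",
]


def failed_logins(lines) -> Counter:
    """Count failed SSH password attempts per source address."""
    counts = Counter()
    for line in lines:
        m = FAILED_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    for ip, n in failed_logins(SAMPLE_LOG).most_common():
        print(f"{ip}: {n} failed attempts")
```

Because `failed_logins` accepts any iterable of lines, an open log file can be passed to it directly.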
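Item 24's daily storage check is easy to script. This sketch uses Python's `shutil.disk_usage` to report mount points at or above a usage threshold; the mount list and the 85% threshold are assumptions to adjust. `df` computes its Use% slightly differently (it leaves blocks reserved for root out of the total), so the numbers may not match it exactly.

```python
import shutil


def usage_percent(total: int, used: int) -> float:
    """Percentage of `total` that is `used`, rounded to one decimal place."""
    return round(100.0 * used / total, 1) if total else 0.0


def check_disks(mounts, threshold: float = 85.0) -> dict[str, float]:
    """Return {mount: percent_used} for every mount at or above `threshold`."""
    over = {}
    for mount in mounts:
        du = shutil.disk_usage(mount)  # named tuple: total, used, free (bytes)
        pct = usage_percent(du.total, du.used)
        if pct >= threshold:
            over[mount] = pct
    return over


if __name__ == "__main__":
    # "/" always exists; add /var, /home, /data etc. as they exist on your boxes
    for mount, pct in check_disks(["/"], threshold=85.0).items():
        print(f"WARNING: {mount} is {pct}% full")
```

Run from cron on each box (or over ssh from one admin host), this turns the daily check into a short list of warnings to read.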
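For item 25, a first step is simply listing what each crontab schedules. The parser below is a simplified sketch for the user-crontab format (five time fields or an `@` shortcut such as `@daily`, followed by the command). It skips comments, blank lines and `VAR=value` assignments, but does not handle the extra user field found in `/etc/crontab` and `/etc/cron.d`. You could feed it the output of `crontab -l` from each server; the sample crontab and script paths are hypothetical.

```python
import re

# Hypothetical crontab, e.g. the output of `crontab -l` on one server
SAMPLE_CRONTAB = """\
MAILTO=ops@example.com
# nightly database dump
30 2 * * * /usr/local/bin/db_dump.sh
*/15 * * * * /usr/local/bin/check_queue.sh
@reboot /usr/local/bin/start_agent.sh
"""

ENV_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*=")  # VAR=value lines


def parse_crontab(text: str) -> list[tuple[str, str]]:
    """Return (schedule, command) pairs from a user crontab."""
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ENV_RE.match(line):
            continue
        if line.startswith("@"):
            # Shortcut schedules such as @reboot, @daily, @hourly
            parts = line.split(None, 1)
            if len(parts) == 2:
                jobs.append((parts[0], parts[1]))
        else:
            # Five time fields (minute hour day-of-month month day-of-week), then the command
            parts = line.split(None, 5)
            if len(parts) == 6:
                jobs.append((" ".join(parts[:5]), parts[5]))
            # anything else is malformed and silently skipped in this sketch
    return jobs


if __name__ == "__main__":
    for schedule, command in parse_crontab(SAMPLE_CRONTAB):
        print(f"{schedule:15} {command}")
```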
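For item 26, once you have a package list from each server (for example the tab-separated package and version lines that `dpkg-query -W` prints on Debian/Ubuntu), comparing them is a small dictionary exercise. This sketch reports every package whose version differs between servers or which is missing on some of them; the server names and version strings are made up.

```python
def parse_dpkg_query(text: str) -> dict[str, str]:
    """Parse `dpkg-query -W` output: one 'package TAB version' pair per line."""
    pkgs = {}
    for line in text.splitlines():
        name, sep, version = line.partition("\t")
        if sep:
            pkgs[name] = version
    return pkgs


def version_drift(servers: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return {package: {server: version}} for every package whose version differs
    between servers or which is missing on some of them."""
    all_pkgs = set().union(*(pkgs.keys() for pkgs in servers.values()))
    drift = {}
    for pkg in sorted(all_pkgs):
        seen = {srv: pkgs.get(pkg, "(missing)") for srv, pkgs in servers.items()}
        if len(set(seen.values())) > 1:
            drift[pkg] = seen
    return drift


# Hypothetical servers with made-up version strings
SERVERS = {
    "web01": parse_dpkg_query("openssl\t3.0.2-0ubuntu1.10\nnginx\t1.18.0-6ubuntu14.4\n"),
    "web02": parse_dpkg_query("openssl\t3.0.2-0ubuntu1.12\nnginx\t1.18.0-6ubuntu14.4\n"),
    "db01": parse_dpkg_query("openssl\t3.0.2-0ubuntu1.12\n"),
}

if __name__ == "__main__":
    for pkg, versions in version_drift(SERVERS).items():
        print(pkg, versions)
```

Not every difference is a problem (a database server need not run nginx), so treat the output as a list of things to explain rather than things to fix.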
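monit is the right tool for item 37, since it also handles scheduling, alerting and restart limits for you. Purely to illustrate what such a watchdog does, here is a minimal Python sketch: it probes a TCP port and, if the service stops answering, runs a restart command, giving up after a few consecutive failed recoveries. The function names and defaults are assumptions for illustration.

```python
import socket
import subprocess
import time


def is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host: str, port: int, restart_cmd: list[str],
          interval: float = 30.0, max_restarts: int = 3) -> None:
    """Probe the service every `interval` seconds and restart it when it stops answering.

    Raises RuntimeError once the service is still down after `max_restarts`
    consecutive restarts, so a broken service is not restarted forever.
    """
    failures = 0
    while True:
        if is_listening(host, port):
            failures = 0
        else:
            failures += 1
            if failures > max_restarts:
                raise RuntimeError(f"{host}:{port} still down after {max_restarts} restarts")
            subprocess.run(restart_cmd, check=False)
        time.sleep(interval)
```

For example, `watch("127.0.0.1", 80, ["systemctl", "restart", "nginx"])` would poll a local web server every 30 seconds; the service, port and restart command there are placeholders for your own.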
