It seems there are an infinite number of ways for a web site to break, making it difficult to systematically check all of them. The longer I work on web sites, the more surprised I am at the creativity of complex systems in finding ways to fail.
I cannot make an exhaustive list here, so I’ll concentrate on the most common problems capable of crashing a web site. If you guard against these routine problems, the problems that eventually get you will be worthy opponents. If you find patterns of failure not mentioned here, please write firstname.lastname@example.org. I would be interested to hear about them.
The most likely cause of a system failure is a full disk. A good system administrator will watch disk usage closely and offload old data to backup storage, such as tape, at regular intervals.
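As a minimal sketch of what watching disk usage can look like in practice, the Python snippet below reports how full a filesystem is and flags it when it crosses a threshold. The function names and the 90 percent default are assumptions chosen for illustration, not values taken from the text, and the figure may differ slightly from what `df` reports because reserved blocks are counted differently.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """Return True when usage is at or above `threshold` percent.

    The 90 percent default is an illustrative assumption; pick a value
    that leaves enough time to react before the disk actually fills.
    """
    return disk_usage_percent(path) >= threshold


if __name__ == "__main__":
    print(f"/ is {disk_usage_percent('/'):.1f}% full")
    if is_nearly_full("/"):
        print("WARNING: disk is nearly full")
```

Run from a scheduler at regular intervals, a check like this gives some warning before the disk fills rather than after.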
Log files use up disk space quickly. Web server log files, SQL*Net log files, JDBC log files, and application server log files are all the disk equivalents of memory leaks: they grow steadily until something prunes them or the disk fills. One good preventative measure is to keep log files on a different filesystem from the operating system. The web server may still hang when the log filesystem is full, but the machine itself is less likely to hang.
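One way to check this setup is to compare the device IDs of the log directory and the root directory, and to total the space the logs currently occupy. The following Python sketch does both. The function names and the `/var/log` path in the usage example are hypothetical, chosen only for illustration.

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if both paths live on the same filesystem (same device ID)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def directory_size_bytes(directory):
    """Return the total size in bytes of all files under `directory`."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                # lstat so that symbolic links are not followed.
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # A log file may be rotated away while we walk; skip it.
                pass
    return total


if __name__ == "__main__":
    log_dir = "/var/log"  # hypothetical log location; adjust for your system
    if same_filesystem(log_dir, "/"):
        print(f"{log_dir} shares a filesystem with the operating system")
    print(f"{log_dir} holds {directory_size_bytes(log_dir)} bytes")
```

If the first check reports a shared filesystem, runaway logs can fill the disk that the operating system itself depends on.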
If a web server or other critical process needs more file descriptors than are allotted to it, it will hang or give an error until it gets what it needs. File descriptors are used to keep track of open files and open sockets, both of which are critical ...