From: Erik Naggum
Subject: Re: Politeness and language growth
Date: 1999/01/01
Message-ID: <3124203114550548@naggum.no>#1/1
X-Deja-AN: 427766947
References: <3124184876945775@naggum.no> <368CFA20.7C6A4855@flash.net>
mail-copies-to: never
Organization: Naggum Software; +47 8800 8879; http://www.naggum.no
Newsgroups: comp.lang.lisp

* rusty craine
| btw Erick I am interested in your 99.95% uptime and how your run your
| test application in parallel with production. Does your application not
| have a "dayend" process for backup, archive, billing etc or do you count
| this as uptime also?

the uptime statistics are for the hours of regular operation, that is
from 04:00 through 23:00 on weekdays and from 18:00 through 23:00 on
Sundays. outside those hours, the system can be taken down for (major)
upgrades. we have a start of day reset at 00:00 which basically unloads
all the heavy objects (they are loaded back from disk if needed) and runs
a global GC. there's also a start of week reset at Sunday 00:00. (both
are run manually if the system needs downtime in the off hours.) backup
is not disruptive. other administrative functions run concurrently with
normal system operation.

by the way, 99.95% uptime is for the entire period of operation. we had
99.996% uptime until the backup operator failed to call me while I was at
the Lisp Conference, mistakenly thinking that my sleep was more important
than getting the system back on track -- unfortunately, I don't know what
caused this problem. it could have been that Emacs got problems with the
way it stupidly fails to close X connections after the last frame on a
display is closed -- Emacs gets seriously confused when a remote X server
goes down.

other than this fairly major incident, the downtime has been the time it
has taken to correct a problem when some process invoked the Allegro CL
debugger due to some simple coding mistakes or some unexpected problems
with the input from one of the many clients we talk to.
even with outright bugs that would have terminated any other application,
recovery has taken a few minutes at worst. thus, we have extremely
fine-grained control over the failure modes, and the excellent
multiprocessing debugging environment in Allegro CL makes it possible to
survive almost all kinds of problems.

the source of several small downtime incidents has actually been isolated
to a problem in Linux: select(2) spuriously returns, lying that a socket
in listen(2) mode would not block on an accept(2) call. (it lies in other
situations, too, but they are much less of a "permanent block".) figuring
out which socket to connect to to unblock it falls in the "life is too
long to know this stuff well" category. the solution was a second call to
select(2) with that socket, only, and ignoring failed ACCEPT-CONNECTION
calls.

what's really interesting here is that I get _way_ more control over
low-level issues like this in Allegro CL than I would have done in C, as
I would have to _build_ an abstraction/protection layer on top of the
"standard" socket interface in C to get the ability to make small changes
to it -- and Allegro CL already has that, so it was merely a matter of
tweaking the existing "abstractive layer".

#:Erik
-- 
if people came with documentation, could men get the womanual?
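
The select(2) workaround described in the post can be sketched roughly as follows. This is Python rather than Allegro CL, and it is only an illustration of the idea: the names `accept_if_really_ready` and `make_listener` are hypothetical, and putting the listening socket in non-blocking mode is an assumption, since the post does not say how the failed ACCEPT-CONNECTION calls were detected or ignored.

```python
import select
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (hypothetical helper)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(5)
    # Assumption: non-blocking mode, so a bogus readiness report makes
    # accept() raise instead of blocking the whole process.
    listener.setblocking(False)
    return listener


def accept_if_really_ready(listener, timeout=0.0):
    """Re-check one listener after the main select loop flagged it.

    Returns a (connection, address) pair, or None if the readiness
    report turned out to be spurious.
    """
    # Second select(2) call, with this socket only.
    readable, _, _ = select.select([listener], [], [], timeout)
    if not readable:
        return None
    try:
        return listener.accept()
    except (BlockingIOError, InterruptedError, ConnectionAbortedError):
        # Ignore the failed accept; the caller returns to its main loop.
        return None
```

A caller would invoke `accept_if_really_ready(sock)` for each listening socket that the main select loop reports as readable, and simply continue when it returns None.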