From ...
From: Erik Naggum <erik@naggum.no>
Subject: Re: Politeness and language growth
Date: 1999/01/01
Message-ID: <3124203114550548@naggum.no>#1/1
X-Deja-AN: 427766947
References: <wlewis-ya023580003112981750460001@news.binc.net> <3124184876945775@naggum.no> <368CFA20.7C6A4855@flash.net>
mail-copies-to: never
Organization: Naggum Software; +47 8800 8879; http://www.naggum.no
Newsgroups: comp.lang.lisp

* rusty craine <rccraine@flash.net>
| btw Erik, I am interested in your 99.95% uptime and how you run your
| test application in parallel with production.  Does your application not
| have a "dayend" process for backup, archive, billing etc or do you count
| this as uptime also?

  the uptime statistics are for the hours of regular operation, that is
  from 04:00 through 23:00 on weekdays and from 18:00 through 23:00 on
  Sundays.  outside those hours, the system can be taken down for (major)
  upgrades.  we have start of day reset at 00:00 which basically unloads
  all the heavy objects (they are loaded back from disk if needed) and runs
  a global GC.  there's also a start of week reset at Sunday 00:00.  (both
  are run manually if the system needs downtime in the off hours.)  backup
  is not disruptive.  other administrative functions run concurrently with
  normal system operation.

  by the way, 99.95% uptime is for the entire period of operation.  we had
  99.996% uptime until the backup operator failed to call me while I was at
  the Lisp Conference, mistakenly thinking that my sleep was more important
  than getting the system back on track -- unfortunately, I don't know what
  caused this problem.  it could have been that Emacs got problems with the
  way it stupidly fails to close X connections after the last frame on a
  display is closed -- Emacs gets seriously confused when a remote X server
  goes down.  other than this fairly major incident, the downtime has been
  the time it has taken to correct a problem when some process invoked the
  Allegro CL debugger due to some simple coding mistakes or some unexpected
  problems with the input from one of the many clients we talk to.  even
  with outright bugs that would have terminated any other application,
  recovery has taken a few minutes at worst.  thus, we have extremely fine-
  grained control over the failure modes, and the excellent multiprocessing
  debugging environment in Allegro CL makes it possible to survive almost
  all kinds of problems.
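
  to put the percentages in perspective, here is a small worked calculation
  of what they mean in minutes per week.  it is only a sketch: it assumes
  "weekdays" means Monday through Friday and that Saturday has no scheduled
  operation, neither of which is spelled out above, and the names are made
  up for the illustration.

```python
# Scheduled operating hours per week, taken from the hours quoted above.
# ASSUMPTION: "weekdays" is Monday-Friday and Saturday is not scheduled.
WEEKDAY_HOURS = 23 - 4    # 04:00 through 23:00
SUNDAY_HOURS = 23 - 18    # 18:00 through 23:00
HOURS_PER_WEEK = 5 * WEEKDAY_HOURS + SUNDAY_HOURS


def downtime_minutes_per_week(uptime_percent):
    """Average unscheduled downtime per week implied by an uptime figure."""
    scheduled_minutes = HOURS_PER_WEEK * 60
    return scheduled_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for pct in (99.95, 99.996):
        minutes = downtime_minutes_per_week(pct)
        print(f"{pct}% uptime: about {minutes:.2f} minutes down per week")
```

  under those assumptions the schedule is 100 hours a week, so 99.95%
  allows roughly three minutes of downtime per week on average, and
  99.996% roughly a quarter of a minute.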

  the source of several small downtime incidents has actually been isolated
  to a problem in Linux: select(2) spuriously returns, lying that a socket
  in listen(2) mode would not block on an accept(2) call.  (it lies in
  other situations, too, but they are much less of a "permanent block".)
  figuring out which socket to connect to to unblock it falls in the "life
  is too long to know this stuff well" category.  the solution was a second
  call to select(2) with that socket only, and ignoring failed
  ACCEPT-CONNECTION calls.  what's really interesting here is that I get _way_
  more control over low-level issues like this in Allegro CL than I would
  have done in C, as I would have to _build_ an abstraction/protection
  layer on top of the "standard" socket interface in C to get the ability
  to make small changes to it -- and Allegro CL already has that, so it was
  merely a matter of tweaking the existing "abstractive layer".
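
  the workaround described above can be sketched roughly as follows.  this
  is Python rather than Allegro CL, and it is not the code that runs in the
  system discussed here: make_listener and accept_ready are hypothetical
  names, and making the listening sockets non-blocking is an added
  assumption so that a failed accept returns an error instead of hanging.

```python
import select
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening TCP socket (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    # ASSUMPTION: non-blocking mode, so accept() fails instead of blocking.
    sock.setblocking(False)
    return sock


def accept_ready(listeners, timeout=1.0):
    """Accept from listeners that select() reports ready, distrusting it."""
    accepted = []
    ready, _, _ = select.select(listeners, [], [], timeout)
    for sock in ready:
        # Second select() on this socket only, with a zero timeout, to
        # re-check the readiness claim made by the first call.
        confirmed, _, _ = select.select([sock], [], [], 0)
        if not confirmed:
            continue
        try:
            accepted.append(sock.accept())
        except (BlockingIOError, InterruptedError, ConnectionAbortedError):
            # Readiness was spurious or the peer went away: ignore it.
            continue
    return accepted
```

  a caller would loop over accept_ready with its set of listening sockets
  and hand each accepted connection to a worker.  the point of the design
  is that a false "ready" costs one extra system call and never a
  permanent block.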

#:Erik
-- 
  if people came with documentation, could men get the womanual?