From: Erik Naggum
Subject: Re: Politeness and language growth
Date: 1999/01/08
Message-ID: <3124822115465877@naggum.no>#1/1
X-Deja-AN: 430340911
References: <3124184876945775@naggum.no> <368CFA20.7C6A4855@flash.net> <3124203114550548@naggum.no>
mail-copies-to: never
Organization: Naggum Software; +47 8800 8879; http://www.naggum.no
Newsgroups: comp.lang.lisp

* Andi Kleen
| This happens when a connection attempt arrives, but it gets removed
| again before the program accepts it - this happens e.g. when an ICMP
| packet arrives exactly in the right window.

  this doesn't make sense as you explain it.  could you be more specific?
  (assume that I know all the protocol specs by heart, actually helped
  write the Internet Host Requirements RFCs, and sometimes do network
  debugging for a living.)  I don't see how you are supposed to be able to
  "remove" a connection attempt (by which I assume you mean a TCP SYN) and
  presumably _before_ the SYN ACK can see a RST from the other end.  which
  type of ICMP packet do you mean could interfere here?

| if an application cannot accept any blocking it is supposed to set the
| listening socket to non blocking mode first, then the accept'ed sockets
| will be non blocking too and return EAGAIN when the connection
| disappeared. The application has to handle this inherent race by
| retrying.

  this scenario seems extremely unlikely to cause blocking calls.  yet it
  happens.  with both TCP logs and ICMP logs on the system in question,
  select returns an apparently random fd from either the input or output
  set without evidence of activity.  in the test case where this has
  happened to me, the machine does _nothing_ apart from waiting for
  connections on 64 sockets, input on 64 sockets, and output on 64 sockets.
  there is _no_ activity that should trigger and make select return, yet it
  does, and there is, invariably, nothing there.  the network only has the
  usual background noises, to which it does not seem to relate in any way.
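
The quoted advice (non-blocking listening socket, treat EAGAIN from accept as "nothing there", retry) can be sketched roughly as follows. This is only an illustration in Python, not code from either poster; the names make_listener, accept_if_pending, and serve_once are hypothetical. Whether an accepted socket inherits the non-blocking flag varies by platform, so the sketch sets it explicitly rather than relying on inheritance.

```python
import select
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening TCP socket (port 0 picks a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(16)
    listener.setblocking(False)  # accept() must never block
    return listener


def accept_if_pending(listener):
    """Try accept() once; return (conn, addr), or None if nothing is there."""
    try:
        conn, addr = listener.accept()
    except (BlockingIOError, InterruptedError):
        # EAGAIN/EWOULDBLOCK or EINTR: select was wrong, or the connection
        # went away between select and accept.  The caller simply retries.
        return None
    except ConnectionAbortedError:
        # ECONNABORTED: the peer gave up before we accepted.
        return None
    # Do not rely on flag inheritance; set non-blocking mode explicitly.
    conn.setblocking(False)
    return conn, addr


def serve_once(listener, timeout=1.0):
    """Wait up to `timeout` seconds, then attempt a single accept."""
    readable, _, _ = select.select([listener], [], [], timeout)
    if not readable:
        return None
    return accept_if_pending(listener)
```

With this shape a false positive from select costs one failed accept call instead of a blocked process.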

| In what other situations do you think does select "lie" too?

  it has lied to me about write not blocking on sockets.  just today, I had
  to shut down the whole application and restart it because we couldn't
  wait for my finding the bug during production hours.  uptime requirements
  are extremely tight.  we were down for 3 minutes, and during that time,
  all of nine people had been yanked out of whatever concentration they
  have left and asked to participate in fixing the problem.

  the problem is actually two-fold: select is used on a number of fds, but
  when a false positive is returned, due to system internals, it keeps
  retrying on only the one that failed for a while before giving up on it,
  and returning to normal operation.  sometimes, it doesn't give up.  I
  have looked over the Allegro CL code and I can find no fault with it.

  I have coded around this by asking select if it lied about the return
  value, and if it doesn't confirm the return value, I ignore it and retry.
  (I have replaced the functions FILESYS-WRITE-BYTES and -STRING, and
  FILESYS-READ-BYTES and -BUFFER.)

  I have found _nothing_ to support any theory of external influences of
  any kind, although I'm sure there's _something_, like a clock ticking or
  some other apparently-innocuous element.  I wish I understood what's
  going on, but I have stopped my research at select, and have coded around
  select, and that appears to remove the problem, for whatever reason.  if
  and when I have more time, I'd like to dig even deeper, but this has
  already cost us some delays and I have some catching up and some code
  cleanup to do.

#:Erik
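
The workaround described above (ask select a second time and ignore the first answer unless it is confirmed) might look roughly like this. It is a minimal sketch in Python, not the Allegro CL replacement functions, which the post does not show; confirm_ready and select_confirmed are hypothetical names, and using a zero timeout for the second poll is an assumption. A second poll narrows the window for a false positive but cannot close it entirely, so it fits best alongside non-blocking I/O.

```python
import select
import socket


def confirm_ready(fd, for_write=False):
    """Re-poll one descriptor with a zero timeout; True if still reported ready."""
    rlist, wlist = ([], [fd]) if for_write else ([fd], [])
    readable, writable, _ = select.select(rlist, wlist, [], 0)
    return bool(writable if for_write else readable)


def select_confirmed(read_fds, write_fds, timeout=None):
    """Call select(), then keep only descriptors that a second poll confirms."""
    readable, writable, _ = select.select(read_fds, write_fds, [], timeout)
    readable = [fd for fd in readable if confirm_ready(fd)]
    writable = [fd for fd in writable if confirm_ready(fd, for_write=True)]
    return readable, writable


if __name__ == "__main__":
    # Small demonstration on a connected socket pair.
    a, b = socket.socketpair()
    print("readable before any data:", confirm_ready(a))
    b.sendall(b"x")
    print("readable after one byte: ", confirm_ready(a))
    a.close()
    b.close()
```

A caller would loop on select_confirmed and retry whenever both returned lists are empty, which matches the "ignore it and retry" behaviour described in the post.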