From: Erik Naggum
Subject: Re: strings and characters
Date: 2000/03/16
Message-ID: <3162184639382952@naggum.no>
Organization: Naggum Software; vox: +47 8800 8879; fax: +47 8800 8601; http://www.naggum.no
Newsgroups: comp.lang.lisp

* Tim Bradshaw
| The particular thing I don't understand is what type a literal string
| has. It looks at first sight as if it should be something capable of
| holding any CHARACTER, but I'm not really sure if that's right. It looks
| to me as if it might be possible read things such that it's OK to return
| something that can only hold a subtype of CHARACTER in some cases.

strings _always_ contain a subtype of character. e.g., an implementation
that supports bits will have to discard them from strings. the only
array type that can contain all character objects has element-type t.

| I'm actually more concerned with the flip side of this -- if almost all
| the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?)
| but sometimes I get some ginormous multibyte unicode thing or something,
| because I need to be able I have to deal with some C code which is
| blithely assuming that unsigned chars are just small integers and strings
| are arrays of small integers and so on in the usual C way, and I'm not
| sure that I can trust my strings to be the same as its strings.

this is not a string issue, it's an FFI issue. if you tell your FFI that
you want to ship a string to a C function, it should do the conversion
for you if it needs to be performed. if you can't trust your FFI to do
the necessary conversions, you need a better FFI.
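
a minimal sketch of the element-type point above, assuming nothing beyond
a conforming Common Lisp. the assertions state only what the standard
guarantees; which subtype of CHARACTER a literal string actually gets is
implementation-dependent, so the code prints it instead of assuming it.

```lisp
;; A literal string is a STRING, and its element type is some
;; subtype of CHARACTER (possibly CHARACTER itself).
(let ((s "hello"))
  (assert (stringp s))
  (assert (subtypep (array-element-type s) 'character))
  ;; Implementation-dependent: print the type rather than guess it.
  (format t "literal string element type: ~S~%" (array-element-type s)))

;; A general vector (element-type T) can hold any object, hence any
;; character, but it is not a string.
(let ((v (make-array 3 :initial-element #\a)))
  (assert (eq (array-element-type v) 't))
  (assert (not (stringp v))))
```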

| I realise that people who care about character issues are probably
| laughing at me at this point, but my main aim is to keep everything as
| simple as I can, and especially I don't want to have to keep copying my
| strings into arrays of small integers (which I was doing at one point,
| but it's too hairy).

if you worry about these things, your life is already _way_ more complex
than it needs to be. a string is a string. each element of the string
is a character. stop worrying beyond this point. C and Common Lisp
agree on this fundamental belief, believe it or not. your _quality_
Common Lisp implementation will ensure that whatever invariants are
maintained in _each_ environment.

| The practical question I guess is -- are there any implementations which
| do currently have really big characters in strings?

yes, and not only that -- it's vitally important that strings take up no
more space than they need. a system that doesn't support both
base-string (of base-char) and string (of extended-char) when it
attempts to support Unicode will fail in the market -- Europe and the
U.S. simply can't tolerate the huge growth in memory consumption from
wantonly using twice as much as you need. Unicode even comes with a very
intelligent compression technique because people realize that it's a
waste of space to use 16 bits and more for characters in a given
character set group.

| I know there's an international Allegro, so those might have horrors in
| them.

sure, but in the same vein, it might also have responsible, intelligent
people behind it, not neurotics who fail to realize that customers have
requirements that _must_ be resolved. Allegro CL's international version
deals very well with conversion between the native system strings and its
internal strings. I know -- not only do I run the International version
in a test environment that needs wide characters _internally_, the test
environment can't handle Unicode or anything else wide at all, and it's
never been a problem.
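
a short sketch of narrow and general strings living side by side, again
assuming only standard Common Lisp. whether BASE-CHAR is really smaller
than CHARACTER is up to the implementation (in some they coincide), so
the element types are printed, not asserted.

```lisp
(let ((narrow (make-string 5 :element-type 'base-char :initial-element #\a))
      (wide   (make-string 5 :element-type 'character :initial-element #\a)))
  ;; Both are strings; only the storage they may use differs.
  (assert (typep narrow 'base-string))
  (assert (typep wide 'string))
  ;; String operations compare characters, not representations.
  (assert (string= narrow wide))
  (format t "narrow: ~S  wide: ~S~%"
          (array-element-type narrow)
          (array-element-type wide)))
```

code that sticks to string operations such as STRING= never needs to
know which of the two representations it was handed.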

incidentally, I don't see this as any different from whether you have a
simple-base-string, a simple-string, a base-string, or a string. if you
_have_ to worry, you should be the vendor or implementor of strings, not
the user. if you are the user and worry, you either have a problem that
you need to take up with your friendly programmer-savvy shrink, or you
call your vendor and ask for support.

I don't see this as any different from whether an array has a
fill-pointer or not, either. if you hand it to your friendly FFI and you
worry about the length of the array with or without fill-pointer, you're
simply worrying too much, or you have a bug that needs to be fixed.

"might have horrors"! what's next? monster strings under your bed?

#:Erik