From: Erik Naggum <erik@naggum.no>
Subject: Re: strings and characters
Date: 2000/03/16
Message-ID: <3162184639382952@naggum.no>#1/1
X-Deja-AN: 598183029
References: <ey3hfe73nm4.fsf@cley.com>
mail-copies-to: never
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: newsmaster@eunet.no
X-Trace: oslo-nntp.eunet.no 953195853 20654 195.0.192.66 (16 Mar 2000 08:37:33 GMT)
Organization: Naggum Software; vox: +47 8800 8879; fax: +47 8800 8601; http://www.naggum.no
User-Agent: Gnus/5.0803 (Gnus v5.8.3) Emacs/20.5
Mime-Version: 1.0
NNTP-Posting-Date: 16 Mar 2000 08:37:33 GMT
Newsgroups: comp.lang.lisp

* Tim Bradshaw <tfb@cley.com>
| The particular thing I don't understand is what type a literal string
| has.  It looks at first sight as if it should be something capable of
| holding any CHARACTER, but I'm not really sure if that's right.  It looks
| to me as if it might be possible to read things such that it's OK to return
| something that can only hold a subtype of CHARACTER in some cases.

  strings _always_ contain a subtype of character.  e.g., an implementation
  that supports bits will have to discard them from strings.  the only
  array type that can contain all character objects has element-type t.
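
  a minimal sketch of the point above, using only standard operators.  the
  helper names are hypothetical, and the element type reported for a
  literal string is implementation-dependent.

```lisp
;; sketch only; the helper names are hypothetical, the operators are standard.

(defun string-element-type (s)
  "Return the element type of string S and whether it is a subtype of CHARACTER."
  (let ((element-type (array-element-type s)))
    ;; SUBTYPEP returns two values; only the first is kept here.
    (values element-type (subtypep element-type 'character))))

(defun general-char-vector (&rest chars)
  "Return a vector with element type T holding CHARS.  It is not a string."
  (make-array (length chars) :element-type t :initial-contents chars))

;; the element type of a literal is implementation-dependent,
;; but it is always a subtype of CHARACTER.
(string-element-type "abc")

;; a general vector can hold any character object, yet STRINGP rejects it.
(stringp (general-char-vector #\a #\b))   ; => NIL
```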

| I'm actually more concerned with the flip side of this -- if almost all
| the time I get some `good' subtype of CHARACTER (probably BASE-CHAR?)
| but sometimes I get some ginormous multibyte unicode thing or something,
| because I have to deal with some C code which is
| blithely assuming that unsigned chars are just small integers and strings
| are arrays of small integers and so on in the usual C way, and I'm not
| sure that I can trust my strings to be the same as its strings.

  this is not a string issue, it's an FFI issue.  if you tell your FFI that
  you want to ship a string to a C function, it should do the conversion
  for you if it needs to be performed.  if you can't trust your FFI to do
  the necessary conversions, you need a better FFI.

| I realise that people who care about character issues are probably
| laughing at me at this point, but my main aim is to keep everything as
| simple as I can, and especially I don't want to have to keep copying my
| strings into arrays of small integers (which I was doing at one point,
| but it's too hairy).

  if you worry about these things, your life is already _way_ more complex
  than it needs to be.  a string is a string.  each element of the string
  is a character.  stop worrying beyond this point.  C and Common Lisp
  agree on this fundamental belief, believe it or not.  your _quality_
  Common Lisp implementation will ensure that whatever invariants are
  required are maintained in _each_ environment.

| The practical question I guess is -- are there any implementations which
| do currently have really big characters in strings?

  yes, and not only that -- it's vitally important that strings take up no
  more space than they need.  a system that doesn't support both
  base-string (of base-char) and string (of character, which includes
  extended-char) when it attempts to support Unicode will fail in the
  market -- Europe and the U.S. simply
  can't tolerate the huge growth in memory consumption from wantonly using
  twice as much as you need.  Unicode even comes with a very intelligent
  compression technique because people realize that it's a waste of space
  to use 16 bits and more for characters in a given character set group.
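
  a rough sketch of how the two representations look from the user's side.
  COMPACT-STRING is a hypothetical helper, not part of any implementation,
  and how much space (if any) a base-string saves is up to the
  implementation.

```lisp
;; sketch only; COMPACT-STRING is a hypothetical name.

(defun compact-string (s)
  "Return a BASE-STRING copy of S if every character is a BASE-CHAR, else S itself."
  (if (every (lambda (c) (typep c 'base-char)) s)
      ;; a fresh vector specialized to hold only BASE-CHAR elements
      (make-array (length s) :element-type 'base-char :initial-contents s)
      s))

;; a string created with the general element type ...
(defparameter *wide*
  (make-string 5 :element-type 'character :initial-element #\a))

;; ... can be narrowed when its contents allow it.
(typep (compact-string *wide*) 'base-string)   ; => T
```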

| I know there's an international Allegro, so those might have horrors in
| them.

  sure, but in the same vein, it might also have responsible, intelligent
  people behind it, not neurotics who fail to realize that customers have
  requirements that _must_ be resolved.  Allegro CL's international version
  deals very well with conversion between the native system strings and its
  internal strings.  I know -- not only do I run the International version
  in a test environment that needs wide characters _internally_, but the
  test environment can't handle Unicode or anything else wide at all, and
  it's never been a problem.

  incidentally, I don't see this as any different from whether you have a
  simple-base-string, a simple-string, a base-string, or a string.  if you
  _have_ to worry, you should be the vendor or implementor of strings, not
  the user.  if you are the user and worry, you either have a problem that
  you need to take up with your friendly programmer-savvy shrink, or you
  call your vendor and ask for support.  I don't see this as any different
  from whether an array has a fill-pointer or not, either.  if you hand it
  to your friendly FFI and you worry about the length of the array with or
  without fill-pointer, you're simply worrying too much, or you have a bug
  that needs to be fixed.
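
  to illustrate the fill-pointer comparison, a small sketch with standard
  operators only.  *BUFFER* is a hypothetical variable.  the standard
  sequence and string functions already respect the fill pointer.

```lisp
;; sketch only; *BUFFER* is a hypothetical name.

;; ten elements of storage, of which three are active.
(defparameter *buffer*
  (make-array 10 :element-type 'character
                 :fill-pointer 3
                 :initial-element #\x))

(length *buffer*)              ; => 3, the active length
(array-dimension *buffer* 0)   ; => 10, the allocated size
(string= *buffer* "xxx")       ; => T, only the active part is compared
```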

  "might have horrors"!  what's next?  monster strings under your bed?

#:Erik