Subject: Re: search engine
From: rpw3@rigden.engr.sgi.com (Rob Warnock)
Date: 1997/11/07
Newsgroups: comp.lang.scheme
Message-ID: <63uunr$18g7r@fido.asd.sgi.com>

Jean-Philippe Theberge  <yesod@cam.org> wrote:
+---------------
| I'm trying to write a small search engine for a web site in scheme (siod)
| Everything goes well except that the search takes too much cpu time and
| resources. I'd be happy if somebody could help me point out where the
| problem is...
+---------------

Your main problem is that you're doing basically what amounts to

	% grep {user_pattern} `cat {path}/.webpages`

but doing it all in SIOD (which is a fast-starting but fairly slow-running
interpreter) instead of "grep" (tightly-coded C). I mean, things like the
ones I've marked are real killers:

    (define liste-des-pages
      (file->list "/home/alanroy/public_html/yesod/test/.webpages"))
    ...
     (let*
         (liste-des-match
	    ...
            (mapcar
             (lambda (x)
               (let* ((nom-unix (string-append unix-prefix x))
                      ...
[*]==>                      (contenu (file->string nom-unix))
[*]==>                      (titre (if (string-search "<title>"
[*]==>                                               (string-downcase contenu))
		      ...
[*]==>                       (strbreakup (string-downcase contenu) "<title>"))
                 (if (string-search (string-downcase requete)
[*]==>                              (string-downcase contenu))
		...
             liste-des-pages))))
	     ...

Looks to me like you're sucking all of the web pages in your site
(or at least all of the ones named in the file ".webpages") into
SIOD's memory -- that "(let* (.. (contenu (file->string nom-unix))))" --
and then down-casing each file at *least* three times(!!!) without caching
the results, and then passing over the contents three *more* times(!!)
with "strbreakup" just to extract the HTML title. [No, one of those
is the actual search. But still...]

Need I go on?
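
Even before restructuring anything, the cheapest band-aid is simply to cache
what you compute: downcase the query once, above the mapcar, and each page's
contents once, and reuse those strings for both the title pass and the search.
A sketch only, not your exact code ("requete-dc", "contenu-dc", and
"titre-brut" are names I'm inventing here):

	(let* ((requete-dc (string-downcase requete))	; query downcased ONCE
	       (liste-des-match
		(mapcar
		 (lambda (x)
		   (let* ((nom-unix (string-append unix-prefix x))
			  (contenu (file->string nom-unix))
			  (contenu-dc (string-downcase contenu))	  ; page downcased ONCE
			  (titre-brut (strbreakup contenu-dc "<title>"))) ; ONE title pass
		     (if (string-search requete-dc contenu-dc)		  ; search reuses both strings
			 ... )))
		 liste-des-pages)))
	  ...)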

There's nothing wrong with using SIOD as you've done to prototype your
app, but now that you know it works, you need to re-think your search
and string-matching algorithms completely, preferably writing the most
time-consuming ones as small C library routines linked together with SIOD.

[I assume you will be sticking with "SIOD" rather than mentioning any of a
number of Scheme implementations that run a lot faster, because SIOD *is* very
small and fast-starting, and that's a valuable quality in a CGI-BIN program.]

For example, there's no need whatsoever (even when written in SIOD Scheme)
to "string-downcase" all of every file twice and "strbreakup" all of every
file three times just to extract the HTML title.  SIOD already *has* regular
expression searching [in the "regex" extension library -- SIOD's "regexec"
returns a list of (start . end) pairs of indices to the matching substrings],
and a simple search something like this would go a long way toward speeding
things up, I suspect:

	(define title-pattern (regcomp "<title>(.*)</title>" REG_ICASE))
	...
	...other stuff...
	   ...(let* (...
		     (match (regexec title-pattern contenu))
		     (titre (if (pair? match)
				(substring contenu
					   (car (nth 1 match))
					   (cdr (nth 1 match)))
				"Sans titre"))
		... )
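
[Note that the "regcomp" is done just once, at top level, so the per-page cost
inside your mapcar is only the "regexec" call -- don't recompile the pattern
for every page.]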

And for the searches of the body, modifying SIOD to add a case-independent
"string-search-icase" [by copying and making small mods to the existing C
source for the "string-search" function] would let you change this:

                 (if (string-search (string-downcase requete)
                                    (string-downcase contenu))
		     ... )
to this:
                 (if (string-search-icase requete contenu)
		     ... )

and save yet another copy (and garbage collection) of all of every file.
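
Or, if you'd rather not touch SIOD's C source at all, the same "regex"
extension can handle the case-folded body search too. A sketch -- this
assumes the user's query contains no regex metacharacters (if it might,
you'd need to escape them first):

	(define requete-pattern (regcomp requete REG_ICASE))	; compile ONCE, above the mapcar
	...
		 (if (pair? (regexec requete-pattern contenu))	; no downcased copies at all
		     ... )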

Better still would be to not suck every file into Scheme objects, but
to write a hand-coded "grepper" C routine that you could call from your
SIOD driver. The "grepper" could search for both the user pattern and
the HTML title in one pass, using one (small) fixed-size I/O buffer
and C library regular expression routines. That would take the load off
of SIOD's garbage collector *and* keep its memory footprint small, both
important for performance in a web server.
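
Just to show the shape of it ("grep-page" is purely hypothetical here -- a C
subr you'd have to write and register with SIOD yourself, *not* an existing
SIOD function), the Scheme side of the driver might then collapse to
something like:

	(mapcar
	 (lambda (x)
	   ;; grep-page scans the file once in C and returns (titre . matched?),
	   ;; or () if the file can't be read -- no file->string, no big copies
	   (let* ((nom-unix (string-append unix-prefix x))
		  (result (grep-page requete nom-unix)))
	     (if (and result (cdr result))
		 ... )))
	 liste-des-pages)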

Finally, web site searching goes a *LOT* faster if, say, you prepare an
inverted index of all the words in all your files ahead of time (which only
needs to be redone once on any day some web page changes), and have your
search engine first look for a match in the index before looking in the
full pages. But that's a whole research (and commercial) topic all its own...
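
Just to give the flavor of the lookup half of it: assuming you had already
written the index out as a file of entries, each a list of a lower-cased
word followed by the pages containing it (".webindex" is a name I'm making
up, and this assumes your SIOD has "assoc", as most Schemes do), the
per-query lookup becomes tiny:

	(define index
	  (file->list "/home/alanroy/public_html/yesod/test/.webindex"))

	;; return the list of pages containing the word, or () if none
	(define (pages-matching word)
	  (let ((entry (assoc (string-downcase word) index)))
	    (if entry (cdr entry) ())))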


-Rob

-----
Rob Warnock, 7L-551		rpw3@sgi.com   http://reality.sgi.com/rpw3/
Silicon Graphics, Inc.		Phone: 650-933-1673 [New area code!]
2011 N. Shoreline Blvd.		FAX: 650-933-4392
Mountain View, CA  94043	PP-ASEL-IA