SWISH 1.2.1 FAQ

SWISH 1.2.1

FAQ (Frequently Asked Questions)

Questions ...

Swish crashes and burns on a certain file. What can I do?
How do I allow users on the Web to search my indexes?
I want to make my own gateway program.
How can I index all my compressed files?
Can I index 8-bit text?
How can I index phrases?
How can I implement keywords in my documents?
I want to generate a list of files to be indexed and pass it to swish.
I run out of memory trying to index my files.
How can I speed up indexing and/or shrink my index file size?
When should I consider merging indexes?
What other features are planned?

... and Answers
Swish crashes and burns on a certain file. What can I do?
You can use a FileRules operation to exclude the particular file name, or pathname, or its title. If there are serious problems in indexing certain types of files, they may not have valid text in them (they may be binary files, for instance). You can use NoContents to exclude that type of file.

How do I allow users on the Web to search my indexes?
Good question. You will need a gateway CGI program that presents users with a search form and options, calls swish with these options, and returns the results to them in a nice HTML format. Swish itself is not meant to do this. One swish-compatible gateway you can currently use is W4AIS, available at http://www.rru.com/~meo/useful/www.html#w4ais.
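As a rough illustration of the middle step of such a gateway, the sketch below turns a user's query string into an argument list for swish. It is only a sketch under stated assumptions: the helper name and the word-filtering rule are hypothetical, and it uses only the -w and -t options that appear in this FAQ.

```python
import re

def build_swish_argv(query, tags=None):
    """Build an argument list for a swish search (hypothetical helper)."""
    # Keep only plain alphanumeric words so that nothing typed by the user
    # can be mistaken for a command-line option. This rule is an assumption;
    # a real gateway may want something looser.
    words = re.findall(r"[A-Za-z0-9]+", query)
    if not words:
        raise ValueError("no searchable words in query")
    argv = ["swish"]
    if tags:
        argv += ["-t", tags]
    argv += ["-w"] + words
    return argv

print(build_swish_argv("keywords computer", tags="c"))
```

The resulting list can be handed to subprocess.run without going through a shell, after which the gateway would format swish's output as HTML for the user.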

I want to make my own gateway program.
Great! Good gateways can be made that take advantage of swish's features. If you do make one, even a simple one, please let me know and I can include it in the distribution.

How can I index all my compressed files?
Swish doesn't currently have the capability to do on-the-fly filtering of files. In the meantime, index the uncompressed data first, then compress it, and use a ReplaceRules operation to change the suffix of the indexed files to .Z or whatever is appropriate. That way users can retrieve the compressed information.

Can I index 8-bit text?
Yes, if the text uses the HTML equivalents for the ISO-Latin-1 (ISO8859-1) character set. Upon indexing, swish will convert all numbered entities it finds (such as &#169;) to named entities (such as &copy;). To search for words including these codes, type the named entity (if it exists) in place of the 8-bit character. Swish will also convert entities to ASCII equivalents, so words that might look like this in HTML: resum&eacute; can be searched as this: resume. Please read the README file included with the distribution for information on changing these options.
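The two conversions described above can be imitated in a few lines. This is a sketch of the idea only, not swish's own code: the function names are hypothetical, and the accent stripping relies on Unicode decomposition, which is an assumption about how one might do it today.

```python
import re
import unicodedata
from html import unescape
from html.entities import codepoint2name

def numbered_to_named(text):
    """Rewrite numbered entities as named ones when a name exists."""
    def repl(match):
        name = codepoint2name.get(int(match.group(1)))
        return f"&{name};" if name else match.group(0)
    return re.sub(r"&#(\d+);", repl, text)

def to_ascii(text):
    """Decode entities, then drop accents to get an ASCII search form."""
    decoded = unescape(text)
    decomposed = unicodedata.normalize("NFKD", decoded)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(numbered_to_named("&#169; 1998"))  # &copy; 1998
print(to_ascii("resum&eacute;"))         # resume
```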

How can I index phrases?
Currently the only way to do this is to use the HTML entity &nbsp; or &#160; (non-breaking space) to represent a space in your HTML. It will then be indexed with a space. To search for the phrase, you'd have to enter &nbsp; to represent a space also.
How can I implement keywords in my documents?
In your HTML files you can put keywords in comments, such as:
  <!-- keywords computer -->
...then when you search, swish should be called with the -t c option, such as:
  swish -t c -w keywords computer
All documents that contain the words keywords and computer in their comments will then be returned. Swish has an option in the source code that you can define to give more relevance to the words inside comments; if you're doing keywords in this fashion, you may want to use that option.
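To make the idea of a comment-only search concrete, here is a small sketch that checks whether all query words occur inside a page's HTML comments. It only models the behaviour described above; the sample page and function names are hypothetical, and swish's real parsing may differ.

```python
import re

def comment_words(html_text):
    """Return the lowercase words found inside HTML comments."""
    words = set()
    for body in re.findall(r"<!--(.*?)-->", html_text, flags=re.DOTALL):
        words.update(w.lower() for w in re.findall(r"[A-Za-z0-9]+", body))
    return words

def matches_in_comments(html_text, query_words):
    """True if every query word appears in some comment of the page."""
    found = comment_words(html_text)
    return all(w.lower() in found for w in query_words)

# Hypothetical sample page with a keyword comment.
page = "<html><!-- keywords computer science --><body>Hello</body></html>"
print(matches_in_comments(page, ["keywords", "computer"]))  # True
print(matches_in_comments(page, ["keywords", "hello"]))     # False
```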
I want to generate a list of files to be indexed and pass it to swish.
One thing you can do is make a simple script to generate a configuration file full of IndexDir directives. For instance, make a separate file called files.conf and put something like this in it:
  IndexDir /this_is_file_1/file.html
  IndexDir /usr/local/www
  IndexDir file2.html /some/directory/
  ...
Then call swish like this (assuming you're using a main swish.conf file):
  swish -c swish.conf files.conf
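The "simple script" mentioned above could look something like the following sketch. It assumes you already have the list of paths from somewhere (a directory walk, a database, and so on); the function names are hypothetical, and only the IndexDir directive shown above is used.

```python
def render_index_conf(paths):
    """Return configuration text with one IndexDir directive per path."""
    return "".join(f"IndexDir {p}\n" for p in paths)

def write_index_conf(paths, conf_path="files.conf"):
    """Write the generated directives to conf_path."""
    with open(conf_path, "w") as conf:
        conf.write(render_index_conf(paths))

# Example input; in practice this list would be generated.
paths = ["/this_is_file_1/file.html", "/usr/local/www"]
print(render_index_conf(paths), end="")
```

Calling write_index_conf(paths) produces a files.conf that can be passed to swish alongside swish.conf as shown above.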
I run out of memory trying to index my files.
It's true that indexing can take up a lot of memory! One thing you can do is make many indices of smaller content instead of trying to do everything at once. You can then merge all the smaller pieces together.
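One way to prepare such smaller indexing runs is to split the list of paths into batches and generate one configuration per batch. The sketch below does only that; the batch size and the part1.conf style file names are hypothetical, and naming each batch's output index and merging the results are left to your own setup.

```python
def split_into_batches(paths, batch_size):
    """Split paths into consecutive lists of at most batch_size items."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

def batch_configs(paths, batch_size):
    """Map a hypothetical config file name to its IndexDir lines."""
    configs = {}
    batches = split_into_batches(paths, batch_size)
    for n, batch in enumerate(batches, start=1):
        configs[f"part{n}.conf"] = "".join(f"IndexDir {p}\n" for p in batch)
    return configs

for name, text in batch_configs(["/a", "/b", "/c"], 2).items():
    print(name)
    print(text, end="")
```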

How can I speed up indexing and/or shrink my index file size?
Go through your installation and configuration with a fine-toothed comb. Start with your runtime configuration file (these settings may also be compiled in as defaults by modifying config.h):

Are you indexing file types you don't really care about? (IndexOnly)
Are you indexing only the names of files whose contents you don't care about, such as binary files, images, etc? (NoContents)
Are you skipping files and/or directories which you might prefer to ignore? (FileRules)
Are your limits for words to ignore because they are too frequent low enough? (IgnoreLimit) For instance, if your site involves heavy-duty science or any topic where you are primarily interested in items which appear on only a few pages, you might set this very low.
Are you ignoring words you know should be ignored? (IgnoreWords)
Are you unnecessarily following symbolic links? (FollowSymLinks)
Are you searching too deep for TITLE tags? (TitleTopLines) For instance, if you know that TITLE tags are never more than 4 lines deep, set this value to 4.
Can you eliminate smaller words? Larger words? (MinWordLimit, MaxWordLimit) The minimum word size is usually most helpful. For instance, do you really need to index three letter words? Four letter words? But consider both.
Can you ignore things that aren't really words, such as those which are all vowels (may not apply in Hawaii), all consonants (may not apply in Wales), or all digits? What about things with long strings of vowels, consonants or digits? (IgnoreAllV, IgnoreAllC, IgnoreAllN, IgnoreRowV, IgnoreRowC, IgnoreRowN) What about the number of times a single character can repeat? (IgnoreSame)
Are you unnecessarily indexing HTML tags? (IndexTags)
Are you unnecessarily indexing ASCII entities such as &amp;? (AsciiEntities)
Finally, a few parameters may currently be configured only by modifying config.h.

Have you defined the characters that comprise a "word" in your index narrowly enough? (WORDCHARS, BEGINCHARS, ENDCHARS) For instance, if you don't care about strings with special characters such as é in them (written in HTML as an entity like &#233;), you can eliminate the characters "#;0123456789" (and possibly "&") from WORDCHARS, and probably "&" from BEGINCHARS and ";" from ENDCHARS. If you don't care about variable names in code or other names that end in digits, you could remove "0123456789" from ENDCHARS as well.

Change these to minimize the words indexed, recompile, and test by creating a new index and searching against it. (You may wish to create this as a separate index from your production index until you settle on final parameters.) Do not install a newly configured swish binary until you are satisfied that it works correctly; test in the build directory.
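To get a feel for how much the word-level settings above can prune, the following sketch applies a simplified model of them to a word list: a length window plus the all-vowels, all-consonants and all-digits rules. It is not swish's implementation; the default limits are hypothetical, and treating "y" as a consonant is an assumption.

```python
VOWELS = set("aeiou")

def keep_word(word, min_len=3, max_len=30):
    """Simplified model: decide whether a word is worth indexing."""
    w = word.lower()
    if not (min_len <= len(w) <= max_len):
        return False          # outside the length window
    if w.isdigit():
        return False          # all digits
    if w.isalpha():
        if all(c in VOWELS for c in w):
            return False      # all vowels
        if all(c not in VOWELS for c in w):
            return False      # all consonants
    return True

words = ["a", "aeiou", "rhythm", "1998", "swish", "index"]
print([w for w in words if keep_word(w)])  # ['swish', 'index']
```

Running a filter like this over a sample of your own documents before changing the real settings can show which words would disappear from the index.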

When should I consider merging indexes?
The most obvious time is when you need sub-indexes. For instance, each user might want to have their own index, but you might also want an index spanning all users.
Merging can also help reduce indexing time.
Merging is necessary when you have sets of pages with differing indexing requirements. For instance, a global index spanning user pages (words of all sizes), introductory material (primarily small words) and sophisticated technical material (longer words) would best be implemented by creating an index for each area, then merging them.

What other features are planned?
These are things I have been thinking about (some of these came from Kevin). They are highly dependent on my schedule.

Parse improvements
More optimization
A proximity search feature
Regular expressions in searches
Stemming and soundex matches
File filtering
A server implementation
Distributed server implementation, including swish: scheme, or an implementation to the wais: scheme, or perhaps work on a global index: scheme
Search META tags
Interact with other indexing and meta-indexing systems
Easier building, installation and configuration
Last update: 18/Aug/1998