Newsreaders and e-mail programs find themselves with the unenviable task of managing a great many messages. There are three organizational schemes in wide use:
Unfortunately, the big downside of the one-message-one-file approach is that it is far slower than the other two options. When it’s time to display the contents of a folder that contains thousands of messages, it can take an eternity to open each file, read the headers, close it, and go on to the next file. I learned that the hard way while writing the BeOS version of Pineapple News. When I started working on the Mac OS X version, I knew I had to bite the bullet and use cache databases.
There are several types of folders that can contain message files: newsgroup folders, special-purpose folders such as Outbox and Drafts, and storage folders you create yourself to save message files that interest you. Every folder that can contain message files will have exactly one cache database. The cache database is stored in two files on disk, Cache.didx and Cache.ddat. The cache database will contain exactly one record for each message in the folder. Each record stores a small amount of information about the associated message file. Instead of having to open each message file in turn and read all the headers, the program can open only two files and blast through them in a fraction of the time.
There is nothing in the cache databases that cannot be easily replaced. They just duplicate some of the information found in the message files themselves, so that the data can be accessed quickly. If the program is called upon to read the contents of a message directory and it discovers that there’s no cache database, it will create one. It will be slow, because the program will have to read every message file, but it only has to be done once. After that, it can consult the database instead.
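The build-it-once behavior described above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the single cache.json file stands in for the real Cache.didx and Cache.ddat pair (whose format is not described here), and the choice of cached fields, the function names, and the assumption that every message file ends in .pmsg are all hypothetical.

```python
import json
import os
from email.parser import BytesHeaderParser

# Hypothetical stand-in for the real Cache.didx / Cache.ddat pair.
CACHE_NAME = "cache.json"


def build_cache(folder):
    """Slow path: open every message file and read its headers."""
    parser = BytesHeaderParser()
    records = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".pmsg"):  # assumed message-file extension
            continue
        with open(os.path.join(folder, name), "rb") as f:
            headers = parser.parse(f)
        # One small record per message; the fields chosen here are a guess.
        records[name] = {
            "subject": str(headers.get("Subject", "")),
            "from": str(headers.get("From", "")),
            "date": str(headers.get("Date", "")),
        }
    with open(os.path.join(folder, CACHE_NAME), "w", encoding="utf-8") as f:
        json.dump(records, f)
    return records


def load_folder(folder):
    """Fast path: read the cache if it exists, otherwise build it once."""
    path = os.path.join(folder, CACHE_NAME)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return build_cache(folder)
```

The first call to load_folder pays the cost of reading every message file. Every later call reads a single file, no matter how many messages the folder holds.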
Since a cache database duplicates data found in the associated message files, it is possible for the two to get out of sync. For example, let’s say that a folder contains three messages, and that the cache database contains up-to-date records for all three. Then you close the program, navigate to that folder with Finder, and delete one of the three messages. Now there are only two messages, but there are still three records in the cache database. If you were to view this folder in Pineapple News, it would still think there are three messages, and that’s how many it would show you in the headers view. If you clicked on the missing message, the program would not be able to display it for you, of course.
I wrote the program such that it does its darnedest to detect problems on its own, but there’s only so much it can do. With too many checks, it would be easy to lose all the speed gains that made cache databases attractive in the first place. So you have to work with the program and not cause too much trouble. In the scenario above, the correct thing to do would be to delete the message file and the cache database. That way, when the program is called upon to display that folder in the future, it will have no choice but to recreate the cache database. An even better approach would have been to delete the message from inside Pineapple News, so it can update the cache database as well.
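To make the trade-off concrete, here is a sketch of one inexpensive consistency check: comparing the names recorded in the cache against a plain directory listing, without opening any message file. Whether Pineapple News performs a check like this is an assumption; the function name find_drift and the .pmsg filter are hypothetical.

```python
import os


def find_drift(folder, cached_names):
    """Compare cached record names with the files actually on disk.

    Returns (missing, unindexed): records whose file is gone, and
    files that have no record. Only the directory listing is read.
    """
    on_disk = {n for n in os.listdir(folder) if n.endswith(".pmsg")}
    cached = set(cached_names)
    missing = sorted(cached - on_disk)
    unindexed = sorted(on_disk - cached)
    return missing, unindexed
```

A check like this would catch the deleted-file scenario above, but not a message file that was edited in place. Catching that would mean opening files again, which is exactly the cost the cache exists to avoid.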
If you suspect that a folder’s cache database has gotten out of sync with its messages, you can force the program to recreate it. From the File menu, pick “Folder,” then “Reindex.” It might take a while, but that doesn’t matter much, because it happens in the background. You are free to go look at some other newsgroup or folder while the troublesome message directory is being reindexed.
Pineapple News will only reindex one message directory at a time. If you select a second message directory and reindex it before the first one has finished, the request will go into a queue, and it will be dealt with when the first operation is finished. There is no limit on the number of reindex requests that can be queued up at once. Keep in mind, however, that reindexing can take a long time. If you quit the program before it has finished, all queued reindex requests will be discarded.
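The queueing behavior described above follows a common pattern: an unbounded queue drained by a single background worker. The sketch below shows that pattern under stated assumptions. The class name ReindexQueue and its methods are hypothetical, and the program's real threading model is not documented here.

```python
import queue
import threading


class ReindexQueue:
    """Run reindex requests one at a time on a background thread."""

    def __init__(self, reindex):
        self._reindex = reindex        # callable that reindexes one folder
        self._pending = queue.Queue()  # no maxsize: unlimited requests
        # A daemon thread dies with the program, so anything still
        # queued at quit time is simply discarded.
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def request(self, folder):
        """Queue a folder; returns immediately."""
        self._pending.put(folder)

    def wait(self):
        """Block until every queued request has been processed."""
        self._pending.join()

    def _run(self):
        while True:
            folder = self._pending.get()
            try:
                self._reindex(folder)
            finally:
                self._pending.task_done()
```

Because there is only one worker, requests are handled strictly in the order they were made, and a second request never starts until the first has finished.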
The slowest, most annoying thing about the BeOS version of Pineapple News was how long it took to read folders with thousands of messages. Cache databases have mostly solved that problem. The second-most annoying thing about that earlier program was how long it took to read certain types of message files, specifically messages that contain binary attachment data or MIME alternative sections. It’s not at all uncommon for a message with a binary attachment to have 10,000 lines. The program would have to read every single line before it would have enough information to put up the attachment button. (See the help topic Message Attachments for more information.) Once again, the solution is caching.
Once a message file has been completely downloaded, the program “parses” it, looking for MIME text and HTML sections, MIME attachments, uuencode attachments, or yEnc attachments. If the message has sections or attachments, that information will be recorded in proprietary X-Pineapple-Section headers. There will be one header for each MIME section or attachment found in the message. The header records what type of section was found, along with the exact byte offset in the file where the section starts. That way, when the program needs to access that section later on, it can jump right to it, rather than reading the entire file.
Unfortunately, whenever you’ve got two copies of identical data, it’s possible for them to get out of sync. For example, you could invalidate a message file’s cached section data by opening it in a text editor and adding a few new lines at the top. That would change the byte offsets of everything that comes after the spot where you added new text. If the file has attachments, the program would lose its ability to decode them. If it has MIME text and HTML sections, the program would not even be able to display the message properly.
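Both the benefit of recorded byte offsets and their fragility can be shown with a short sketch. It only looks for the "begin" line that opens a uuencode attachment and works on an in-memory byte string. The function names are hypothetical, and the actual contents of an X-Pineapple-Section header are not reproduced here.

```python
def find_uuencode_offsets(data: bytes):
    """Return the byte offset of every line that opens a uuencode block."""
    offsets = []
    pos = 0
    for line in data.splitlines(keepends=True):
        if line.startswith(b"begin "):
            offsets.append(pos)
        pos += len(line)  # keepends=True, so lengths add up to file offsets
    return offsets


def line_at(data: bytes, offset: int) -> bytes:
    """Jump straight to a recorded offset and return the line found there."""
    end = data.find(b"\n", offset)
    return data[offset:] if end == -1 else data[offset:end]


message = b"Subject: photo\n\nSee attached.\nbegin 644 cat.jpg\nM...\nend\n"
offsets = find_uuencode_offsets(message)      # the one slow, full scan
fast = line_at(message, offsets[0])           # later: no scanning needed

# Adding a line at the top shifts everything after it, so the recorded
# offset now lands in the middle of a header line instead.
edited = b"X-Note: edited by hand\n" + message
stale = line_at(edited, offsets[0])
```

Here fast is the "begin" line, while stale is a fragment of the Subject header. That mismatch is the kind of damage a hand edit can do to cached section data.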
I added the reindex feature to fix stale cache databases. The equivalent feature for fixing a message file’s section data is the reparse command. It adds or repairs the necessary Pineapple state headers, and sets the type and creator codes if a message file doesn’t already have them.
Reparsing is also useful for importing messages from other programs into the Pineapple News message store. If you’ve got messages to import, save them in text files that have the .pmsg extension. Pineapple messages normally have LF line endings, but the program is very tolerant on this point. Imported messages can have any type of line ending: CR, LF, or CRLF. You can even have more than one type of line ending in a single message file and the program will still read it properly. Move the messages into saved folders, reparse them, and they’ll be indistinguishable from ones that were downloaded natively.
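Tolerating mixed line endings is mostly a matter of treating CRLF as a single break before considering a lone CR or LF. A minimal sketch of that idea follows. It is not the program's own reader, and split_lines is a hypothetical name.

```python
import re

# CRLF must come first so it counts as one line ending, not two.
_LINE_END = re.compile(rb"\r\n|\r|\n")


def split_lines(raw: bytes):
    """Split a message into lines, accepting CR, LF, and CRLF in any mix."""
    lines = _LINE_END.split(raw)
    if lines and lines[-1] == b"":
        lines.pop()  # a final line ending does not start another line
    return lines
```

For example, split_lines(b"From: a\r\nSubject: b\rbody\n") yields three lines even though each one ends differently.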
It’s possible to reparse just one or two messages at a time. First, select the messages you want to reparse. From the Message menu, pick “Reparse,” then “Selected.” This is not recommended, though. Reparsing a message file may change its state data in a way that would require its cache database entry to be rewritten, but reparsing in ones and twos will not update the cache database. Orchestrating the data exchange necessary to make such a feature work is just too much trouble for something that almost nobody will ever use.
Instead, you are encouraged to reparse an entire folder-full of messages at once. From the File menu, pick “Folder,” then “Reparse.” The program will reparse every message in the current folder, then reindex the entire folder as well, which ensures that every message will have its cache database entry rewritten. Just as with reindexing, you can queue up as many folders as you like for reparsing, which will take place in the background.
Fair warning: reparsing takes a long, long time. Let’s consider a worst-case example. Say you’ve got a folder with 4,000 messages, each one 10,000 lines long. I wouldn’t be surprised if such a folder takes eight hours to reparse. This is something that should almost never have to be done in a system that’s operating properly, so there is little incentive for me to optimize it.
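For a sense of scale, the worst-case guess above can be turned into rates with a little arithmetic. The inputs are the figures from that example; the eight hours is an estimate, not a measurement.

```python
messages = 4_000
lines_per_message = 10_000
hours = 8  # the estimate above, not a benchmark

total_lines = messages * lines_per_message   # 40,000,000 lines
seconds = hours * 3600                       # 28,800 seconds
lines_per_second = total_lines / seconds     # roughly 1,389 lines per second
seconds_per_message = seconds / messages     # 7.2 seconds per message
```

In other words, the estimate assumes about seven seconds of work for each 10,000-line message.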