Wednesday, August 18, 2010

Processing OpenOffice.org dictionary files using Lazarus

In this article we demonstrate how to process OpenOffice.org dictionary files using Lazarus, the Free Pascal (FPC) IDE. To use an OpenOffice.org dictionary, we first need to extract the AFF and DIC files from the OXT (OpenOffice.org extension) file. This is easily done with 7-Zip or any other common archiving utility: an OXT file is a ZIP archive, so all we need to do is change the file extension from OXT to ZIP and extract the contents.
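The extraction step can also be scripted instead of renaming the file by hand. The following is a minimal sketch in Python (not the Lazarus code of the sample application); the function name `extract_dictionary_files` and the example file names are hypothetical, and it assumes the OXT file is an ordinary ZIP archive as described above.

```python
import zipfile
from pathlib import Path

def extract_dictionary_files(oxt_path, out_dir):
    """Copy every .aff and .dic member of an OXT (ZIP) archive into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(oxt_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".aff", ".dic")):
                # Flatten any sub-folders inside the archive.
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                extracted.append(target.name)
    return sorted(extracted)

# Hypothetical usage; the archive name is only an illustration:
# extract_dictionary_files("dict-en.oxt", "dictionaries")
```

No renaming is needed here, because `zipfile` looks at the file contents and not at the extension.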

To process the dictionary (DIC) file we need the affix table defined in the AFF file. The sample code implements a complete AFF and DIC file processor for the English (United States) dictionary of OpenOffice.org.

Our processing of the affix file in this sample application is based on the following rules.

An AFF file generally consists of conditional modules such as the following:

SFX T N 4
SFX T 0 st e
SFX T y iest [^aeiou]y
SFX T 0 est [aeiou]y
SFX T 0 est [^ey]

In the first line, "SFX" means suffix. In the en_US dictionary this field is either SFX (suffix) or PFX (prefix).

T is the name of the module; it is what links the entries in the DIC file to the rules in the AFF file.

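The number 4 is the number of rules that follow for this module. The N before it is the cross-product flag (Y or N), which states whether the suffix may be combined with prefixes.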

Once the conditional header has been read, each rule in its rule set has to be applied to the given word.
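Reading a header line and the rules that follow it could be sketched as below. This is an illustrative Python sketch, not the Lazarus implementation; the names `AffixRule` and `parse_affix_block` are hypothetical, and it assumes the third header field is the Y/N cross-product flag and that a strip or affix field of 0 means "nothing".

```python
from dataclasses import dataclass

@dataclass
class AffixRule:
    kind: str       # "SFX" or "PFX"
    flag: str       # module name, e.g. "T"
    strip: str      # characters to remove ("" when the file says 0)
    affix: str      # characters to add
    condition: str  # pattern the word must match, e.g. "[^aeiou]y"

def parse_affix_block(lines):
    """Parse one header line plus its rule lines into (flag, cross_product, rules)."""
    kind, flag, cross, count = lines[0].split()[:4]
    rules = []
    # The header's count field says how many rule lines belong to this module.
    for line in lines[1:1 + int(count)]:
        fields = line.split()
        strip = "" if fields[2] == "0" else fields[2]
        affix = "" if fields[3] == "0" else fields[3]
        condition = fields[4] if len(fields) > 4 else "."
        rules.append(AffixRule(fields[0], fields[1], strip, affix, condition))
    return flag, cross == "Y", rules

block = [
    "SFX T N 4",
    "SFX T 0 st e",
    "SFX T y iest [^aeiou]y",
    "SFX T 0 est [aeiou]y",
    "SFX T 0 est [^ey]",
]
flag, cross_product, rules = parse_affix_block(block)
print(flag, cross_product, len(rules))  # T False 4
```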

Each rule in the set is decoded as follows, taking the first rule (SFX T 0 st e) as the example:

  • SFX : as described above, SFX marks a suffix rule.
  • 0 : the characters to strip from the end of the word before the suffix is added. Here 0 means nothing is stripped.
  • st : the suffix to add to the given word.
  • e : the condition of the rule. Here "e" means the rule applies only to words that end with the character "e".

An example of this rule: late > latest
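Applying a single suffix rule to a word can be sketched like this. It is a minimal Python illustration, not the article's Lazarus code; the function name `apply_suffix` is hypothetical, and it assumes the condition field can be treated as a regular expression anchored at the end of the word.

```python
import re

def apply_suffix(word, strip, suffix, condition):
    """Return the derived word, or None when the rule does not apply."""
    # The condition describes the end of the root word, so anchor it with "$".
    if not re.search(condition + "$", word):
        return None
    if strip != "0":
        if not word.endswith(strip):
            return None
        word = word[:-len(strip)]  # remove the strip characters
    return word + suffix

print(apply_suffix("late", "0", "st", "e"))             # latest
print(apply_suffix("happy", "y", "iest", "[^aeiou]y"))  # happiest
print(apply_suffix("happy", "0", "st", "e"))            # None
```

The last call returns None because "happy" does not end with "e", so the first rule of module T does not apply to it.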

Likewise, every applicable rule has to be applied to the root word to generate all the other possible forms of the word.

For example, the root word "happy" has four derived forms: happier, happiest, happiness and unhappy.
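Generating the forms of a root word from its DIC entry might look like the sketch below. It is written in Python for illustration and is not the sample application's code. It assumes that a DIC line has the form word/FLAGS with one character per module name, and it contains only the T module quoted earlier, so it produces only the "-est" forms; the other forms of "happy" come from modules that are not shown in this article. The names `AFFIX_TABLE` and `expand_entry` are hypothetical.

```python
import re

# The T module quoted in the article: (strip, suffix, condition) per rule.
AFFIX_TABLE = {
    "T": [
        ("0", "st", "e"),
        ("y", "iest", "[^aeiou]y"),
        ("0", "est", "[aeiou]y"),
        ("0", "est", "[^ey]"),
    ],
}

def expand_entry(entry, table=AFFIX_TABLE):
    """Return the root word followed by every form its flags produce."""
    root, _, flags = entry.partition("/")
    forms = [root]
    for flag in flags:  # one character per module name
        for strip, suffix, condition in table.get(flag, []):
            if not re.search(condition + "$", root):
                continue  # condition on the word ending is not met
            if strip != "0" and not root.endswith(strip):
                continue  # nothing to strip, so the rule cannot apply
            stem = root if strip == "0" else root[:-len(strip)]
            forms.append(stem + suffix)
    return forms

for entry in ["late/T", "happy/T", "gray/T", "small/T", "hello"]:
    print(expand_entry(entry))
# ['late', 'latest']
# ['happy', 'happiest']
# ['gray', 'grayest']
# ['small', 'smallest']
# ['hello']
```

Each of the four T rules matches a different word ending, so every word in the example picks up exactly one derived form.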

This sample application was developed using Lazarus with a minimum of system dependencies to demonstrate the decoding process described above. With some minor adjustments it can also be deployed on Linux and Mac OS X.

All the source code and binaries of this sample application are available for download here. The sample application is released under the terms and conditions of the GNU GPL version 3.0.

You can obtain more details about AFF and DIC files from the OpenOffice.org Lingucomponent Project.