My Tech Tip: Lucene IndexWriter optimize() behavior change since version 3.0.3

Lucene is a powerful full-text indexing and search library written in Java. Over more than ten years, Lucene has evolved to version 3 (the stable version).

Recently, I upgraded the Lucene library (the core jar file) in my project from version 3.0.2 to version 3.0.3, which was released in December 2010. The only purpose of the upgrade was to keep up with a release that has more bugs fixed.

However, after upgrading, I noticed that the optimized index folder contained more index files than it did with version 3.0.2. It looked as if the index merge during IndexWriter.optimize() stopped at some point. I am not sure whether the extra index files cause any performance degradation during search, but I am not happy with so many files staying in the index folder (although there are not that many).

After reading the change log for the Lucene 3.0.3 release, I realized that a change had been made to avoid high disk usage during optimize. The original change log item from Lucene 3.0.3 reads as follows:

LUCENE-2773: LogMergePolicy accepts a double noCFSRatio (default = 0.1), which means any time a merged segment is greater than 10% of the index size, it will be left in non-compound format even if compound format is on. This change was made to reduce peak transient disk usage during optimize which increased due to LUCENE-2762.
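To make the rule in that entry concrete, here is a minimal sketch of the decision as I read it from the change log. It is a simplified model, not Lucene's source code: the class name, the method name and the byte sizes are all made up for illustration, and it assumes the comparison is simply "merged segment size versus ratio times total index size".

```java
// Simplified model of the noCFSRatio rule described in the LUCENE-2773 entry.
// NOT Lucene's code: class, method and parameter names are hypothetical.
final class NoCfsRatioSketch {

    /**
     * Returns true if a merged segment would be written in compound format.
     *
     * @param compoundEnabled    whether compound format is switched on at all
     * @param noCFSRatio         fraction of the index size above which a merged
     *                           segment is left in non-compound format
     * @param mergedSegmentBytes size of the segment produced by the merge
     * @param totalIndexBytes    size of the whole index
     */
    static boolean usesCompoundFormat(boolean compoundEnabled, double noCFSRatio,
                                      long mergedSegmentBytes, long totalIndexBytes) {
        if (!compoundEnabled) {
            return false;
        }
        // A segment larger than noCFSRatio of the index stays non-compound.
        return mergedSegmentBytes <= noCFSRatio * totalIndexBytes;
    }

    public static void main(String[] args) {
        long total = 1000L; // assumed index size in bytes, for illustration only

        // After optimize() the merged segment is the whole index (100% > 10%).
        System.out.println(usesCompoundFormat(true, 0.1, 1000L, total)); // false

        // A small merged segment (5% of the index) still gets compound format.
        System.out.println(usesCompoundFormat(true, 0.1, 50L, total));   // true

        // With the ratio raised to 1.0 the fully merged segment qualifies again.
        System.out.println(usesCompoundFormat(true, 1.0, 1000L, total)); // true
    }
}
```

The first case is the one that matters here: an optimized index is a single merged segment, so with the default ratio of 0.1 it always ends up in non-compound format, which would explain the extra files.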

Since my index is not very big and I don't care about the "peak transient disk usage", I still want the index to be created and optimized in a cleaner way. This means the merged segment should still be written in compound format after optimize().

So I now need to override the default Lucene indexing and optimizing behavior by adding extra code to my project. The following is my tweak:

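    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.index.MergePolicy;

    // an IndexWriter instance created through my own helper method getWriter
    IndexWriter writer = getWriter(dir);
    // The tweak starts here; apply it before calling writer.optimize()
    MergePolicy mp = writer.getMergePolicy();
    if (mp instanceof LogByteSizeMergePolicy) {
        LogByteSizeMergePolicy lbsmp = (LogByteSizeMergePolicy) mp;
        // A merged segment can never exceed 100% of the index size,
        // so a ratio of 1.0 keeps compound format for every merge.
        lbsmp.setNoCFSRatio(1.0);
    }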

Compiled, deployed, run. Hooray! The cleaner index folder is back!
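If you want to check the result without eyeballing the folder, a small helper that counts the files per extension is enough. This is only a sketch using the plain JDK (Java 8 or later), not anything from Lucene; the class name IndexFolderSummary and the command-line argument are my own assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: summarizes an index folder by file extension.
final class IndexFolderSummary {

    /** Counts the regular files in dir, grouped by the text after the last dot. */
    static Map<String, Integer> countByExtension(Path dir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip sub-directories
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without a dot (such as segments_N) go under "(none)".
                String ext = dot >= 0 ? name.substring(dot + 1) : "(none)";
                counts.merge(ext, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index folder as the first argument; defaults to the current folder.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(countByExtension(dir));
    }
}
```

Run it against the index folder before and after the tweak. In compound format each segment is stored as a single .cfs file, so the optimized index should show far fewer extensions.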

Thursday, February 17, 2011
