Lucene first impressions


Recently I had to implement a search for XML tags and attributes. To achieve that, I needed a search engine that can index content and execute queries over it. After narrowing down the available options, I found that Lucene is pretty good. It is an open source Apache project that has been in development for a long time and has a great community.

A few words about how Lucene works or how I use it 🙂
Lucene is composed of two major modules: Indexer and Searcher. Both of them work over a directory called the index. As an abstraction, the indexer is a writer of the index and the searcher is a reader.
The search mechanism is pretty simple. First the indexer must index (write to the index) some content; after that the searcher can run queries and return the results found.

Lucene works with objects called Documents. They are the units of indexing and search. An index consists of one or more Documents. Indexing involves adding documents to an IndexWriter, and searching involves retrieving documents via an IndexSearcher.

Indexing

Documents in the Lucene context are an abstraction. For example, if I want to create an index of searchable users, every user has to be added to the index as a document. But how will Lucene know how to find the correct user? The answer is fields. A document consists of one or more fields, and each field is a name-value pair. To continue the example: to be able to search for a user by username and email, the indexed document must have username and email fields with their corresponding values.
In summary, indexing involves creating Documents containing one or more fields and adding these documents to an IndexWriter.

Searching

Searching requires an already built index. It involves creating a Query and passing it to an IndexSearcher, which returns the hits (the results found).

Code example

The example is very simple: it creates an IndexWriter for indexing and an IndexSearcher for searching. It is written in Kotlin because I’m interested in the language and wanted to try it, but maybe I’ll write another post about that 🙂

First, the project skeleton. I’m using Gradle as the build tool. To compile Kotlin, the Kotlin dependencies are needed along with the Lucene dependencies (resolved from Maven Central).
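The original build file is not reproduced here, so the following is only a minimal sketch of what such a build could look like. It assumes the Gradle Kotlin DSL (build.gradle.kts), Lucene 6.4.1 (the version the documentation link at the end of this post points to) and a placeholder Kotlin plugin version; the versions and the test setup are assumptions and may differ from the real project.

```kotlin
// build.gradle.kts (hypothetical sketch, versions are assumptions)
plugins {
    // Kotlin JVM plugin; pick the version that matches your toolchain
    kotlin("jvm") version "1.9.24"
}

repositories {
    // Both the Kotlin and the Lucene artifacts are resolved from Maven Central
    mavenCentral()
}

dependencies {
    // Core Lucene module: Document, IndexWriter, IndexSearcher, basic queries
    implementation("org.apache.lucene:lucene-core:6.4.1")
    // Test support for the test described below
    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}
```

Only lucene-core is listed because the snippets in this post use nothing outside of it; other Lucene modules can be added the same way if needed.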

The example has a few packages:
1. …document – the classes under this package create Lucene Document objects
2. …indexer – creates the IndexWriter
3. …searcher – creates the IndexSearcher
4. A test which adds documents to the index with the IndexWriter and then performs a search query with the IndexSearcher

Documentator
Creates a Document object from the fields passed in as a list of DocumentField (implemented as a Kotlin data class; it can be replaced with a simple Map).

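import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.document.TextField

// Builds a Lucene Document with one stored, analyzed TextField per DocumentField
fun createDocument(fields: List<DocumentField>): Document {
    val document = Document()
    for ((fieldName, fieldValue) in fields) {
        document.add(TextField(fieldName, fieldValue, Field.Store.YES))
    }
    return document
}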
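The DocumentField type is used above but not shown. A minimal sketch of how it might look follows. The property names name and value are assumptions; only the fact that it is a data class with two String components is implied by the destructuring in createDocument.

```kotlin
// Hypothetical shape of DocumentField; the property names are assumptions.
data class DocumentField(val name: String, val value: String)

fun main() {
    val fields = listOf(
        DocumentField("username", "jdoe"),
        DocumentField("email", "jdoe@example.com")
    )
    // A data class generates component1() and component2(), which is what
    // allows each entry to be destructured as (fieldName, fieldValue).
    for ((fieldName, fieldValue) in fields) {
        println("$fieldName=$fieldValue")
    }
}
```

A Map<String, String> would work with the same loop, because map entries can also be destructured into key and value.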

Indexer
The Indexer is a wrapper around an IndexWriter. It provides the few operations needed: adding to the index, committing the documents to the index, deleting the index contents, and opening/closing the index. Only a location/path is needed to create the index.

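override fun openIndex() {
    // Open (or create) the on-disk index at the given location
    index = FSDirectory.open(indexLocation)
    indexWriter = IndexWriter(index, IndexWriterConfig())
}

override fun add(documents: List<Document>) {
    indexWriter.addDocuments(documents)
}

override fun commit() {
    // Make the pending changes durable and visible to newly opened readers
    indexWriter.commit()
}

override fun delete() {
    // Removes all documents from the index
    indexWriter.deleteAll()
}

override fun closeIndex() {
    // Close the writer first so it releases its write lock, then the directory
    indexWriter.close()
    index.close()
}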

Searcher
The Searcher is a wrapper around an IndexSearcher. It provides a method that performs queries. Creating the searcher requires the location of an already existing index.

override fun execute(query: Query): MutableList<Document> {
    // use {} closes the directory and the reader once the results are collected
    return FSDirectory.open(indexLocation).use { directory ->
        DirectoryReader.open(directory).use { reader ->
            val indexSearcher = IndexSearcher(reader)
            // Load the stored fields of the top 100 hits
            indexSearcher.search(query, 100).scoreDocs
                    .mapTo(mutableListOf<Document>()) { indexSearcher.doc(it.doc) }
        }
    }
}

The test
1. Creates a few documents with fields

val documentFields1 = listOf(
        DocumentField(SearchConstants.FIELD_NAME, "name1"),
        DocumentField(SearchConstants.FIELD_TYPE, "type1"),
        DocumentField(SearchConstants.FIELD_VERSION, "version1"),
        DocumentField(SearchConstants.FIELD_NAMESPACE, "namespace1")
)
val document1: Document = Documentator.createDocument(documentFields1)

val documentFields2 = listOf(
        DocumentField(SearchConstants.FIELD_NAME, "name2"),
        DocumentField(SearchConstants.FIELD_TYPE, "type2"),
        DocumentField(SearchConstants.FIELD_VERSION, "version2"),
        DocumentField(SearchConstants.FIELD_NAMESPACE, "namespace2")
)
val document2: Document = Documentator.createDocument(documentFields2)

val documentFields3 = listOf(
        DocumentField(SearchConstants.FIELD_NAME, "name3"),
        DocumentField(SearchConstants.FIELD_TYPE, "type3"),
        DocumentField(SearchConstants.FIELD_VERSION, "version3"),
        DocumentField(SearchConstants.FIELD_NAMESPACE, "namespace3")
)
val document3: Document = Documentator.createDocument(documentFields3)

2. Creates an index

val indexer: Indexer = Indexer.create(indexLocation)
indexer.openIndex()
indexer.delete() // Clean previous index

3. Commits the documents into the index

indexer.add(listOf(document1, document2, document3))
indexer.commit()
indexer.closeIndex()

4. Creates a searcher and performs a query – the search is performed over the name field

val searcher: Searcher = Searcher.create(indexLocation)
for(document in searcher.execute(WildcardQuery(Term(SearchConstants.FIELD_NAME, "name*")))) {
    println(document)
}

The result of the wildcard query “name*” is:

Document<stored,indexed,tokenized<name:name1> stored,indexed,tokenized<type:type1> stored,indexed,tokenized<version:version1> stored,indexed,tokenized<namespace:namespace1>>
Document<stored,indexed,tokenized<name:name2> stored,indexed,tokenized<type:type2> stored,indexed,tokenized<version:version2> stored,indexed,tokenized<namespace:namespace2>>
Document<stored,indexed,tokenized<name:name3> stored,indexed,tokenized<type:type3> stored,indexed,tokenized<version:version3> stored,indexed,tokenized<namespace:namespace3>>

If I change the query to “name1” the result is:

Document<stored,indexed,tokenized<name:name1> stored,indexed,tokenized<type:type1> stored,indexed,tokenized<version:version1> stored,indexed,tokenized<namespace:namespace1>>

If I change the field to namespace and the value to namespace3:

Document<stored,indexed,tokenized<name:name3> stored,indexed,tokenized<type:type3> stored,indexed,tokenized<version:version3> stored,indexed,tokenized<namespace:namespace3>>

For more information about the Lucene query types and how to build them: http://lucene.apache.org/core/6_4_1/queries/index.html
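To give an idea of what other query types look like, here is a small sketch that builds a few of them over the same fields. The field names "name" and "type" are taken from the output above. The helper function names are hypothetical, and the snippet assumes lucene-core is on the classpath. Note that these queries are not run through an analyzer, so the values have to match the terms as they were indexed.

```kotlin
import org.apache.lucene.index.Term
import org.apache.lucene.search.BooleanClause
import org.apache.lucene.search.BooleanQuery
import org.apache.lucene.search.PrefixQuery
import org.apache.lucene.search.Query
import org.apache.lucene.search.TermQuery

// Exact match on a single indexed term of the name field
fun exactName(name: String): Query = TermQuery(Term("name", name))

// Matches every term in the name field that starts with the given prefix
fun nameStartsWith(prefix: String): Query = PrefixQuery(Term("name", prefix))

// Combines two clauses; MUST means both have to match (logical AND)
fun nameAndType(namePrefix: String, type: String): Query =
    BooleanQuery.Builder()
        .add(nameStartsWith(namePrefix), BooleanClause.Occur.MUST)
        .add(TermQuery(Term("type", type)), BooleanClause.Occur.MUST)
        .build()

fun main() {
    // Printing a query shows its textual form, which is handy for debugging
    println(exactName("name1"))
    println(nameStartsWith("name"))
    println(nameAndType("name", "type1"))
}
```

Any of these can be passed to searcher.execute(...) in place of the WildcardQuery used in the test.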

In conclusion, I can say that Lucene is pretty simple and intuitive to use. Search functionality can be implemented really fast. The engine is powerful and performs queries quickly. I’ll be glad to hear your opinion.

The full example can be found on GitHub 🙂
