Apache Lucene is a high-performance, full-featured text search engine library from the Apache Software Foundation, written entirely in Java. It is suitable for nearly any application that requires full-text search, especially in a cross-platform environment. In this article, we will look at some of the most interesting features of Apache Lucene and walk through a step-by-step example of document indexing and searching.
Apache Lucene Features
Lucene offers powerful features such as scalable, high-performance document indexing and search through a simple API. It uses powerful, accurate, and efficient search algorithms written in Java and, most importantly, it is a cross-platform solution. It is therefore popular in both academic and commercial settings thanks to its performance, reconfigurability, and generous licensing terms. The Lucene home page is http://lucene.apache.org.
Lucene provides search over documents, where a document is essentially a collection of fields. A field consists of a field name, which is a string, and one or more field values. Lucene does not constrain document structures in any way, but each field is constrained to store only one kind of data: binary, numeric, or text. There are two ways to store text data: string fields store the entire item as one string, while text fields store the data as a series of tokens. Lucene provides many ways to break a piece of text into tokens, as well as hooks that allow you to write custom tokenizers. Lucene has a highly expressive search API that takes a search query and returns a set of documents ranked by relevancy, with the documents most similar to the query receiving the highest scores.
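To illustrate the string-field/text-field distinction, here is a minimal, purely illustrative Java sketch (not Lucene's API): a string field keeps its value as a single term, while a text field is broken into tokens by a naive lowercase-and-split tokenizer.

```java
import java.util.Arrays;
import java.util.List;

public class FieldDemo {
    // A string field stores the entire value as one term.
    static List<String> stringField(String value) {
        return List.of(value);
    }

    // A text field is broken into a series of tokens; this naive
    // tokenizer lowercases and splits on non-word characters.
    static List<String> textField(String value) {
        return Arrays.asList(value.toLowerCase().split("\\W+"));
    }

    public static void main(String[] args) {
        System.out.println(stringField("Apache Lucene in Action")); // one term
        System.out.println(textField("Apache Lucene in Action"));   // four tokens
    }
}
```

A real analyzer also handles punctuation, stemming, and stop words; the hook Lucene exposes for that is precisely the tokenizer customization mentioned above.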
The Lucene API consists of a core library and many contributed libraries. The top-level package is org.apache.lucene. As of Lucene 6, the distribution contains approximately two dozen package-specific JARs; this cuts down on the size of an application at a small cost in build-file complexity. In a nutshell, the features of Lucene can be described as follows:
Scalable and High-Performance Indexing
- Small RAM requirements — only 1MB heap.
- Incremental indexing as fast as batch indexing.
- Index size roughly 20-30% the size of text indexed.
Powerful, Accurate, and Efficient Search Algorithms
- Provides ranked searching — i.e. best results returned first.
- Supports many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more.
- Provides fielded searching (e.g. title, author, contents).
- Supports sorting by any field.
- Supports multiple-index searching with merged results.
- Allows simultaneous update and searching.
- Has flexible faceting, highlighting, joins, and result grouping.
- Offers fast, memory-efficient, and typo-tolerant suggesters.
- Provides pluggable ranking models, including the Vector Space Model and Okapi BM25.
- Provides configurable storage engine (codecs).
Cross-platform solution
- Available as open-source software under the Apache License, which lets you use Lucene in both commercial and open-source programs.
- 100%-pure Java.
- Index-compatible implementations in other programming languages are available.
How Does Apache Lucene Work?
In this section, we will see how Apache Lucene approaches document indexing and searching.
A Lucene Index Is an Inverted Index
Lucene manages an index over a dynamic collection of documents and provides very rapid updates to the index as documents are added to and deleted from the collection. An index may store a heterogeneous set of documents, with any number of different fields that may vary from document to document in arbitrary ways. Lucene indexes terms, which means that a Lucene search is a search over terms. A term combines a field name with a token. The terms created from non-text fields are pairs consisting of the field name and the field value; the terms created from text fields are pairs of the field name and a token.
The Lucene index provides a mapping from terms to documents. This is called an inverted index because it reverses the usual mapping of a document to the terms it contains. The inverted index provides the mechanism for scoring search results: if a number of search terms all map to the same document, then that document is likely to be relevant.
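The idea can be sketched in plain Java. The toy index below (purely illustrative, not Lucene's implementation) maps each term, a field name paired with a token, to the set of documents containing it, and scores a document by how many query terms map to it:

```java
import java.util.*;

public class TinyInvertedIndex {
    // term ("field:token") -> IDs of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String field, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(field + ":" + token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Score each document by how many of the query's terms it matches.
    Map<Integer, Integer> search(String field, String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String token : query.toLowerCase().split("\\W+")) {
            for (int docId : postings.getOrDefault(field + ":" + token, Set.of())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "contents", "Lucene is a search library");
        idx.add(2, "contents", "Search engines use inverted indexes");
        // Doc 1 matches both query terms, doc 2 matches one.
        System.out.println(idx.search("contents", "search library"));
    }
}
```

Because the postings map goes from term to documents, answering a query never requires scanning every document, which is exactly what makes the inverted layout fast.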
Lucene Index Fields
Conceptually, Lucene provides indexing and search over documents, but implementation-wise, all indexing and search are carried out over fields. A document is a collection of fields. Each field has three parts: name, type, and value. At search time, the supplied field name restricts the search to particular fields. For example, a MEDLINE citation can be represented as a series of fields: one field for the title of the article, another for the name of the journal in which it was published, another for the authors of the article, a pub-date field for the date of publication, a field for the text of the article's abstract, and another for the list of topic keywords drawn from Medical Subject Headings (MeSH). Each of these fields is given a different name, and at search time, the client can specify that it is searching for authors or titles or both, potentially restricting the search to a date range and a set of journals by constructing search terms for the appropriate fields and values.
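As a sketch of what fielded search buys you, the hypothetical snippet below models a citation as a list of (name, value) fields and restricts matching to a single named field. The field names and matching logic are illustrative only, not Lucene's API:

```java
import java.util.List;

public class FieldedSearchDemo {
    record Field(String name, String value) {}

    // Match a token only against fields carrying the requested name.
    static boolean matches(List<Field> doc, String fieldName, String token) {
        return doc.stream()
                  .filter(f -> f.name().equals(fieldName))
                  .anyMatch(f -> f.value().toLowerCase().contains(token.toLowerCase()));
    }

    public static void main(String[] args) {
        List<Field> citation = List.of(
            new Field("title", "Aspirin and heart disease"),
            new Field("journal", "The Lancet"),
            new Field("author", "J. Smith"));
        System.out.println(matches(citation, "author", "smith"));  // hit in the author field
        System.out.println(matches(citation, "journal", "smith")); // no hit: wrong field
    }
}
```

The same token succeeds or fails depending on which field the client searches, which is the essence of fielded search.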
Indexing Documents
Document indexing consists of first constructing a document that contains the fields to be indexed or stored, and then adding that document to the index. The key classes involved in indexing are oal.index.IndexWriter, which is responsible for adding documents to an index, and oal.store.Directory, which is the storage abstraction used for the index itself. Directories provide an interface similar to an operating system's file system. A Directory contains any number of sub-indexes called segments. Maintaining the index as a set of segments allows Lucene to rapidly update and delete documents from the index.
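To see why segments make updates cheap, consider the toy sketch below (purely illustrative, not Lucene's actual segment format): adding documents only appends a new write-once segment, deletion just records a tombstone, and a search merges results from every segment.

```java
import java.util.*;

public class SegmentedIndexSketch {
    // Each segment is a small, write-once inverted index.
    static class Segment {
        final Map<String, Set<Integer>> postings = new HashMap<>();
    }

    private final List<Segment> segments = new ArrayList<>();
    private final Set<Integer> deleted = new HashSet<>(); // deletion tombstones

    // Adding documents creates a new segment; existing segments
    // are never rewritten, which keeps updates cheap.
    void addSegment(Map<Integer, String> docs) {
        Segment seg = new Segment();
        for (var e : docs.entrySet())
            for (String tok : e.getValue().toLowerCase().split("\\W+"))
                seg.postings.computeIfAbsent(tok, t -> new TreeSet<>()).add(e.getKey());
        segments.add(seg);
    }

    // Deletion marks the document; the segments stay untouched.
    void delete(int docId) { deleted.add(docId); }

    // A search consults every segment and merges the results.
    Set<Integer> search(String token) {
        Set<Integer> result = new TreeSet<>();
        for (Segment seg : segments)
            result.addAll(seg.postings.getOrDefault(token, Set.of()));
        result.removeAll(deleted);
        return result;
    }

    public static void main(String[] args) {
        SegmentedIndexSketch idx = new SegmentedIndexSketch();
        idx.addSegment(Map.of(1, "lucene search", 2, "inverted index"));
        idx.addSegment(Map.of(3, "lucene segments"));
        idx.delete(1);
        System.out.println(idx.search("lucene")); // doc 1 is masked by its tombstone
    }
}
```

Real Lucene periodically merges small segments into larger ones and physically drops deleted documents at that point; the sketch omits merging for brevity.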
Document Search and Search Ranking
The Lucene search API takes a search query and returns a set of documents ranked by relevancy, with the documents most similar to the query receiving the highest scores. Lucene provides a highly configurable hybrid form of search that combines exact Boolean searches with softer, relevance-oriented vector-space search methods. All searches are field-specific, because Lucene indexes terms and a term is composed of a field name and a token.
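As a crude stand-in for this ranking behavior, the hypothetical sketch below scores each document by the frequency of the query terms it contains and returns results best-first; real Lucene scoring (e.g. BM25) is far more sophisticated:

```java
import java.util.*;

public class RankedSearchSketch {
    // Score a document by summing how often each query term occurs in it,
    // a crude stand-in for Lucene's relevance formula.
    static double score(String document, String query) {
        List<String> docTokens = Arrays.asList(document.toLowerCase().split("\\W+"));
        double s = 0;
        for (String term : query.toLowerCase().split("\\W+"))
            s += Collections.frequency(docTokens, term);
        return s;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "Lucene is a search library; Lucene searches terms",
            "A library of books",
            "Search engines rank results");
        String query = "lucene library";
        // Print documents best-first, mirroring ranked retrieval.
        docs.stream()
            .sorted(Comparator.comparingDouble((String d) -> -score(d, query)))
            .forEach(d -> System.out.printf("%.0f  %s%n", score(d, query), d));
    }
}
```

A document matching more (or more frequent) query terms floats to the top, while documents matching none score zero, echoing the Boolean-filter-plus-vector-space hybrid described above.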
An Example of Document Indexing and Searching
In this section, we will see a step-by-step example that shows document indexing and searching with Apache Lucene.
Step 1: Loading Required APIs and Packages
package com.example.lucene;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
Step 2: File Indexing
First, select the index directory where the index will be saved, and then select the data directory, as follows:
File indexDir = new File("C:/Exp/Index/");
File dataDir = new File("C:/Users/rezkar/Downloads/lucene-6.3.0/lucene-6.3.0/");
Now select the suffix of the files that you intend to index and search:
String suffix = "jar";
Since we will be indexing files with the extension .jar, call the Lucene file indexer as follows:
SimpleFileIndexer indexer = new SimpleFileIndexer();
Now create the index and see how many files were indexed:
int numIndex = indexer.index(indexDir, dataDir, suffix);
System.out.println("Number of files indexed: " + numIndex);
Here, the index() method goes as follows:
private int index(File indexDir, File dataDir, String suffix) throws Exception {
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    indexDirectory(indexWriter, dataDir, suffix);
    int numIndexed = indexWriter.maxDoc();
    indexWriter.optimize();
    indexWriter.close();
    return numIndexed;
}
The above code creates the index and writes it to the index directory selected earlier, after applying simple analysis using the SimpleAnalyzer class. Finally, the method returns the number of files that have been indexed. Note that the indexDirectory() method takes three parameters: the index writer, the data directory, and the suffix (.jar) of the files to be indexed. The indexDirectory() method goes as follows:
private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
    File[] files = dataDir.listFiles();
    for (File f : files) {
        if (f.isDirectory()) {
            indexDirectory(indexWriter, f, suffix); // recurse into sub-directories
        } else {
            indexFileWithIndexWriter(indexWriter, f, suffix);
        }
    }
}
According to the above code segment, the indexer walks the data directory recursively: it descends into each sub-directory and indexes every file it finds using the indexFileWithIndexWriter() method, which goes as follows:
private void indexFileWithIndexWriter(IndexWriter indexWriter, File f, String suffix) throws IOException {
    if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
        return;
    }
    if (suffix != null && !f.getName().endsWith(suffix)) {
        return;
    }
    System.out.println("Indexing file:... " + f.getCanonicalPath());
    Document doc = new Document();
    doc.add(new Field("contents", new FileReader(f)));
    doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
}
After successful indexing, you should observe the following output:
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\common\lucene-analyzers-common-6.3.0.jar
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\icu\lib\icu4j-56.1.jar
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lucene-analyzers-morfologik-6.3.0.jar
...
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\test-framework\lucene-test-framework-6.3.0.jar
Number of files indexed: 60
Step 3: Search the Files
In this step, we will search for the names of the files that we indexed in the previous step. The workflow for this step goes as follows:
1. Specify the index directory.
2. Input the search query, e.g., "lucene".
3. Use the SimpleSearcher API of Lucene.
4. Perform the search operation.
5. Print the result.
Technically, these five steps can be performed using the following code segment:
public static void main(String[] args) throws Exception {
    File indexDir = new File("C:/Exp/Index/");
    String query = "lucene";
    int hits = 100;
    SimpleSearcher searcher = new SimpleSearcher();
    searcher.searchIndex(indexDir, query, hits);
}
Here, searchIndex() is a user-defined method that actually performs the file search; it goes as follows:
private void searchIndex(File indexDir, String queryStr, int maxHits) throws Exception {
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory);
    QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", new SimpleAnalyzer());
    Query query = parser.parse(queryStr);
    TopDocs topDocs = searcher.search(query, maxHits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (ScoreDoc hit : hits) {
        Document d = searcher.doc(hit.doc);
        System.out.println(d.get("filename"));
    }
    System.out.println("Found " + hits.length);
}
This method searches the index and prints the names of the matching files. For the sample data directory, you can download the Apache Lucene distribution version 6.3.0 from here. On successful execution of the above method, you should observe output like the following:
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\uima\lib\WhitespaceTokenizer-2.3.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\benchmark\lib\xercesImpl-2.9.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\replicator\lib\jetty-continuation-9.3.8.v20160314.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lib\morfologik-fsa-2.1.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\queryparser\lucene-queryparser-6.3.0.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\uima\lucene-analyzers-uima-6.3.0.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\spatial-extras\lib\slf4j-api-1.7.7.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lib\morfologik-polish-2.1.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lib\morfologik-stemming-2.1.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\benchmark\lib\spatial4j-0.6.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\demo\lucene-demo-6.3.0.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\replicator\lib\commons-logging-1.1.3.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\spatial-extras\lib\spatial4j-0.6-tests.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\spatial-extras\lib\spatial4j-0.6.jar
Found 14
Conclusion
In this article, I tried to cover some essential features of Lucene. Putting the above code fragments together into a full application is left as an exercise to the reader.
Nevertheless, if this does not work, readers can download the source code, a sample data folder, and the Maven-friendly pom.xml file from my GitHub repository here.
Any kind of feedback is welcome. Happy reading!
Opinions expressed by DZone contributors are their own.