Apache Lucene® is a widely used Java full-text search engine. This section describes how VMware GemFire integrates with Apache Lucene. We assume that the reader is familiar with Apache Lucene’s indexing and search functionalities.
The Apache Lucene integration:
- Enables users to create Lucene indexes on data stored in GemFire
- Provides high availability of indexes using GemFire’s HA capabilities to store the indexes in memory
- Colocates indexes with data
- For persistent regions, persists Lucene indexes to disk
- Updates the indexes asynchronously to minimize impacting write latency
- Provides scalability by partitioning index data
For more details, see the Javadocs for the classes and interfaces that implement Apache Lucene indexes and searches, including LuceneService, LuceneSerializer, LuceneIndexFactory, LuceneQuery, LuceneQueryFactory, LuceneQueryProvider, and LuceneResultStruct.
You can interact with Apache Lucene indexes through a Java API, through the gfsh command-line utility, or by means of the cache.xml configuration file.
Key Points
- Apache Lucene indexes are supported only on partitioned regions. Replicated region types are not supported.
- Lucene indexes reside on servers. You cannot create a Lucene index on a client.
- A Lucene index applies to only one region. Multiple indexes can be defined for a single region.
- Heterogeneous objects in a single region are supported.
Creating a Lucene Index
Note: Create the Lucene index before creating the region.
When you create a Lucene index, you must provide three pieces of information:
- The name of the index you wish to create
- The name of the region to be indexed and searched
- The names of the fields you wish to index
You must specify at least one field to be indexed.
If the object value for the entries in the region comprises a primitive type value without a field name, then use __REGION_VALUE_FIELD to specify the field to be indexed. __REGION_VALUE_FIELD serves as the field name for entry values of all primitive types, including String, Long, Integer, Float, and Double.
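For example, a minimal sketch, assuming a region whose values are plain String instances (the index and region names here are illustrative):

```
// Values in "stringRegion" are primitive Strings with no field names,
// so index them under the reserved name __REGION_VALUE_FIELD.
luceneService.createIndexFactory()
    .addField(LuceneService.REGION_VALUE_FIELD)
    .create("stringIndex", "stringRegion");
```

A query against this index can then supply __REGION_VALUE_FIELD as the default field name.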
Each field has a corresponding analyzer to extract terms from text. When no analyzer is specified, the org.apache.lucene.analysis.standard.StandardAnalyzer is used.
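For example, a minimal sketch of assigning a per-field analyzer through the Java API; the choice of KeywordAnalyzer for the zipcode field is illustrative:

```
import org.apache.lucene.analysis.core.KeywordAnalyzer;

// "name" falls back to the default StandardAnalyzer; "zipcode" is
// indexed as a single token by the KeywordAnalyzer.
luceneService.createIndexFactory()
    .addField("name")
    .addField("zipcode", new KeywordAnalyzer())
    .create(indexName, regionName);
```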
The index has an associated serializer that renders the indexed object as a Lucene document comprised of searchable fields. The default serializer is a simple one that handles top-level fields, but does not render collections or nested objects.
GemFire supplies a built-in serializer, FlatFormatSerializer(), that handles collections and nested objects. See Using FlatFormatSerializer to Index Fields within Nested Objects for more information regarding Lucene indexes for nested objects.
As a third alternative, you can create your own serializer, which must implement the LuceneSerializer interface.
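A minimal sketch of a custom serializer follows; the Customer class and its getName() accessor are assumptions for illustration:

```
import java.util.Collection;
import java.util.Collections;
import org.apache.geode.cache.lucene.LuceneIndex;
import org.apache.geode.cache.lucene.LuceneSerializer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

// Renders only the "name" field of a hypothetical Customer object.
public class CustomerNameSerializer implements LuceneSerializer<Customer> {
  @Override
  public Collection<Document> toDocuments(LuceneIndex index, Customer value) {
    Document doc = new Document();
    doc.add(new TextField("name", value.getName(), Field.Store.NO));
    return Collections.singleton(doc);
  }
}
```

Register such a serializer with setLuceneSerializer() on the index factory, as shown with FlatFormatSerializer later in this section.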
Creating a Lucene Index: Java API Example
The following example uses the Java API to create a Lucene index with two fields. No analyzers are specified, so the default analyzer handles both fields. No serializer is specified, so the default serializer is used.
```
// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

// Create the index on fields with the default analyzer,
// prior to creating the region
luceneService.createIndexFactory()
    .addField("name")
    .addField("zipcode")
    .create(indexName, regionName);

Region region = cache.createRegionFactory(RegionShortcut.PARTITION)
    .create(regionName);
```
Creating a Lucene Index: Gfsh Example
In gfsh, use the create lucene index command to create Lucene indexes.
The following example creates an index with two fields. The default analyzer handles both fields, and the default serializer is used.
gfsh>create lucene index --name=indexName --region=/orders --field=customer,tags
The next example creates an index, specifying a custom analyzer for the second field. “DEFAULT” in the first analyzer position specifies that the default analyzer will be used for the first field.
gfsh>create lucene index --name=indexName --region=/orders --field=customer,tags --analyzer=DEFAULT,org.apache.lucene.analysis.bg.BulgarianAnalyzer
Creating a Lucene Index: XML Example
This XML configuration file specifies a Lucene index with four fields. The first three fields specify custom analyzers, and the fourth field uses the default analyzer:
```
<cache xmlns="http://geode.apache.org/schema/cache"
       xmlns:lucene="http://geode.apache.org/schema/lucene"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://geode.apache.org/schema/cache
           http://geode.apache.org/schema/cache/cache-1.0.xsd
           http://geode.apache.org/schema/lucene
           http://geode.apache.org/schema/lucene/lucene-1.0.xsd"
       version="1.0">

  <region name="region" refid="PARTITION">
    <lucene:index name="myIndex">
      <lucene:field name="a" analyzer="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
      <lucene:field name="b" analyzer="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
      <lucene:field name="c" analyzer="org.apache.lucene.analysis.standard.ClassicAnalyzer"/>
      <lucene:field name="d"/>
    </lucene:index>
  </region>
</cache>
```
Using FlatFormatSerializer to Index Fields within Nested Objects
GemFire supplies a built-in serializer, org.apache.geode.cache.lucene.FlatFormatSerializer, that renders collections and nested objects as searchable fields, which you can access using the syntax fieldnameAtLevel1.fieldnameAtLevel2 for both indexing and querying.
For example, in the following data model, the Customer object contains both a Person object and a collection of Page objects. The Person object also contains a Page object.
```
public class Customer implements Serializable {
  private String name;
  private Collection<String> phoneNumbers;
  private Collection<Person> contacts;
  private Page[] myHomePages;
  // ...
}

public class Person implements Serializable {
  private String name;
  private String email;
  private int revenue;
  private String address;
  private String[] phoneNumbers;
  private Page homepage;
  // ...
}

public class Page implements Serializable {
  private int id; // search integer in int format
  private String title;
  private String content;
  // ...
}
```
The FlatFormatSerializer creates one document for each parent object, adding an indexed field for each data field in a nested object, identified by its qualified name. Similarly, collections are flattened and treated as tokens in a single field. For example, the FlatFormatSerializer could convert a Customer object, with the structure described above, into a document containing fields such as name, contacts.name, and contacts.homepage.title, based on the indexed fields specified at index creation. Each segment is a field name, not a field type, because a class (such as Customer) could have more than one field of the same type (such as Person).
The serializer creates and indexes the fields you specify when you request index creation. The example below demonstrates how to index the name field and the nested fields contacts.name, contacts.email, contacts.address, and contacts.homepage.title.
```
// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

// Create the index on fields, some of which are fields in nested objects:
luceneService.createIndexFactory()
    .setLuceneSerializer(new FlatFormatSerializer())
    .addField("name")
    .addField("contacts.name")
    .addField("contacts.email")
    .addField("contacts.address")
    .addField("contacts.homepage.title")
    .create("customerIndex", "Customer");

// Create the region
Region customerRegion = ((Cache) cache).createRegionFactory(shortcut).create("Customer");
```
The gfsh equivalent of the above Java code uses the create lucene index command, with options specifying the index name, region name, field names, and the serializer, specified by its fully qualified name, org.apache.geode.cache.lucene.FlatFormatSerializer:
gfsh>create lucene index --name=customerIndex --region=Customer --field=name,contacts.name,contacts.email,contacts.address,contacts.homepage.title --serializer=org.apache.geode.cache.lucene.FlatFormatSerializer
The syntax for querying a nested field is the same as for a top-level field, but with the additional qualifying parent field name, such as contacts.name:Jones77*. This distinguishes which "name" field is intended when there can be more than one "name" field at different hierarchical levels in the object.
Java query:
```
LuceneQuery query = luceneService.createLuceneQueryFactory()
    .create("customerIndex", "Customer", "contacts.name:Jones77*", "name");

PageableLuceneQueryResults<K, Object> results = query.findPages();
```
gfsh query:
gfsh>search lucene --name=customerIndex --region=Customer --queryString="contacts.name:Jones77*" --defaultField=name
Queries
Querying a Lucene Index: Gfsh Example
For details, see the gfsh search lucene command reference page.
gfsh>search lucene --name=indexName --region=/orders --queryString="Jones*" --defaultField=customer
Querying a Lucene Index: Java API Example
```
LuceneQuery<String, Person> query = luceneService.createLuceneQueryFactory()
    .create(indexName, regionName, "name:John AND zipcode:97006", defaultField);

Collection<Person> results = query.findValues();
```
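The query factory can also cap and page results. A minimal sketch, reusing the index, region, and default field from the example above; the limit and page size values are illustrative:

```
// Limit the result set and page through it ten entries at a time
LuceneQuery<String, Person> pagedQuery = luceneService.createLuceneQueryFactory()
    .setLimit(100)     // maximum number of results returned
    .setPageSize(10)   // entries per page returned by findPages()
    .create(indexName, regionName, "name:John AND zipcode:97006", defaultField);

PageableLuceneQueryResults<String, Person> pages = pagedQuery.findPages();
while (pages.hasNext()) {
  for (LuceneResultStruct<String, Person> row : pages.next()) {
    System.out.println(row.getKey() + " -> " + row.getValue());
  }
}
```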
Destroying an Index
Since a region-destroy operation does not cause the destruction of any Lucene indexes, destroy any Lucene indexes prior to destroying the associated region.
Destroying a Lucene Index: Java API Example
luceneService.destroyIndex(indexName, regionName);
An attempt to destroy a region with a Lucene index will result in an IllegalStateException, issuing an error message similar to:
java.lang.IllegalStateException: The parent region [/orders] in colocation chain cannot be destroyed, unless all its children [[/indexName#_orders.files]] are destroyed...
Destroying a Lucene Index: Gfsh Example
For details, see the gfsh destroy lucene index command reference page.
The error message that results from an attempt to destroy a region prior to destroying its associated Lucene index will be similar to:
Region /orders cannot be destroyed because it defines Lucene index(es) [/ordersIndex]. Destroy all Lucene indexes before destroying the region.
Changing an Index
Changing an index requires rebuilding it. Follow these steps to change an index:
- Export all region data.
- Destroy the Lucene index.
- Destroy the region.
- Create a new index.
- Create a new region without the user-defined business logic callbacks.
- Import the region data with the option to turn on callbacks. The callbacks will invoke a Lucene async event listener to index the data. The gfsh import data command will be of the form:

gfsh>import data --region=myReg --member=M3 --file=myReg.gfd --invoke-callbacks=true
If the API is used to import data, the code to set the option to invoke callbacks will be similar to this code fragment:
```
Region region = ...;
File snapshotFile = ...;
RegionSnapshotService service = region.getSnapshotService();
SnapshotOptions options = service.createOptions();
options.invokeCallbacks(true);
service.load(snapshotFile, SnapshotFormat.GEMFIRE, options);
```
- Alter the region to add the user-defined business logic callbacks.
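A hedged gfsh sketch of this final step, assuming the business logic callback is a cache listener; the listener class name is hypothetical:

gfsh>alter region --name=myReg --cache-listener=com.example.MyBusinessLogicListener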
Additional Gfsh Commands
See the gfsh describe lucene index command reference page for the command that prints details about a specific index.
See the gfsh list lucene index command reference page for the command that prints details about the Lucene indexes created for all members.
Limitations
- Join queries between regions are not supported.
- Lucene indexes are stored in on-heap memory only.
- Lucene queries from within transactions are not supported. On an attempt to query from within a transaction, a LuceneQueryException is thrown, issuing an error message on the client (accessor) similar to:
Exception in thread "main" org.apache.geode.cache.lucene.LuceneQueryException: Lucene Query cannot be executed within a transaction...
- Lucene indexes must be created prior to creating the region. If an attempt is made to create a Lucene index after creating the region, the error message is similar to:
```
Member                        | Status
----------------------------- | ------------------------------------------------------
192.0.2.0(s2:97639)<v2>:1026  | Failed: The lucene index must be created before region
192.0.2.0(s3:97652)<v3>:1027  | Failed: The lucene index must be created before region
192.0.2.0(s1:97626)<v1>:1025  | Failed: The lucene index must be created before region
```
- The order of server creation with respect to index and region creation is important. The cluster configuration service cannot work if servers are created after index creation but before region creation, because Lucene indexes are propagated to the cluster configuration only after region creation. To start servers at multiple points within the start-up process, use this ordering (illustrated in the gfsh sketch after this list):
- start server(s)
- create Lucene index
- create region
- start additional server(s)
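A hedged gfsh sketch of that ordering; the server names, index definition, and region definition are illustrative:

gfsh>start server --name=server1
gfsh>create lucene index --name=indexName --region=/orders --field=customer
gfsh>create region --name=orders --type=PARTITION
gfsh>start server --name=server2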
- An invalidate operation on a region entry does not invalidate a corresponding Lucene index entry. A query on a Lucene index that contains values that have been invalidated can return results that no longer exist. Therefore, do not combine entry invalidation with queries on Lucene indexes.
- Lucene indexes are not supported for regions that have eviction configured with the local destroy action. Eviction can be configured with overflow to disk, but only the region data is overflowed to disk, not the Lucene index. On an attempt to create a region that has both a Lucene index and eviction configured to do local destroy, an UnsupportedOperationException is thrown, issuing an error message similar to:

[error 2017/05/02 16:12:32.461 PDT <main> tid=0x1] java.lang.UnsupportedOperationException: Lucene indexes on regions with eviction and action local destroy are not supported...
- Be aware that using the same field name in different objects, where the field has different data types, may have unexpected consequences. For example, suppose an index on the field SSN has the following entries:

  - object_1 has String SSN = "1111"
  - object_2 has Integer SSN = 1111
  - object_3 has Float SSN = 1111.0

  Integers and floats are not converted into strings; they remain as IntPoint and FloatPoint within Lucene. The standard analyzer will not try to tokenize these values; it only breaks up string values. So, a string search for "SSN: 1111" will return object_1. An IntRangeQuery with upper limit 1112 and lower limit 1110 will return object_2, and a FloatRangeQuery with upper limit 1111.5 and lower limit 1111.0 will return object_3. (A query sketch follows this list.)

- Backups should only be made for regions with Lucene indexes when there are no puts, updates, or deletes in progress. A backup might cause an inconsistency between region data and a Lucene index. Both the region operation and the associated index operation cause disk writes, yet those disk writes are not performed atomically. Therefore, if a backup is taken between the persisted write to a region and the resulting persisted write to the Lucene index, the backup represents inconsistent data in the region and the Lucene index.
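Range queries on such numeric point fields generally require a native Lucene query object rather than a query string. A hedged sketch using a LuceneQueryProvider; the index, region, and field names are illustrative:

```
import org.apache.lucene.document.IntPoint;

// Execute an int range query against a hypothetical "ssnIndex";
// per the example above, this would match object_2.
LuceneQuery<String, Object> rangeQuery = luceneService.createLuceneQueryFactory()
    .create("ssnIndex", "exampleRegion",
        index -> IntPoint.newRangeQuery("SSN", 1110, 1112));
Collection<Object> matches = rangeQuery.findValues();
```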