
Apache Lucene

Master Apache Lucene: full-text search engine library, inverted indexes, scoring, and search optimization.


What is Apache Lucene?

Apache Lucene is a powerful, full-featured text search engine library written in Java. Originally created by Doug Cutting, Lucene provides the foundation for many popular search applications including Elasticsearch, Apache Solr, and Amazon CloudSearch. It excels at indexing and searching large volumes of text data with sophisticated relevance scoring and query capabilities.

Lucene achieves fast search through inverted indexes, which map each term to the documents that contain it. It supports complex queries, faceted search, highlighting, and near-real-time indexing. Its flexible architecture and extensive API make it a common choice for building custom search solutions that need fine-grained control over indexing and search behavior.
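
To make the inverted-index idea concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation or API: the class `InvertedIndexSketch` and its methods are hypothetical names chosen for illustration, and real Lucene stores postings in compressed on-disk segments rather than in-memory maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document IDs that contain it.
// Illustrative only; names and structure are assumptions, not Lucene's API.
class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Split on non-letter/non-digit characters, lowercase, and record docId under each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND search: intersect the posting lists of all query terms.
    List<Integer> searchAll(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> docs =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Apache Lucene is a search library");
        index.add(2, "Elasticsearch builds on Lucene");
        index.add(3, "A search engine ranks documents");
        System.out.println(index.searchAll("lucene"));           // prints [1, 2]
        System.out.println(index.searchAll("search", "lucene")); // prints [1]
    }
}
```

A lookup touches only the posting lists of the query terms, never the documents themselves, which is why search time depends far more on the number of matches than on total corpus size.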


Lucene Core Components

Document & Fields

Core data model for representing searchable content.

• Document container
• Field types (text, keyword, numeric)
• Index/store/analyze options
• Multi-valued fields

Analyzer

Text processing pipeline for indexing and searching.

• Tokenization
• Lowercase filtering
• Stop word removal
• Stemming/lemmatization
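
The four stages above can be sketched as a small plain-Java pipeline. This is an illustration under stated assumptions, not Lucene's `Analyzer` API: `AnalyzerSketch`, its stop-word list, and its deliberately crude suffix-stripping "stemmer" are all hypothetical stand-ins for the real tokenizers and filters Lucene ships.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: tokenize -> lowercase -> drop stop words -> stem.
// All names and rules here are illustrative assumptions, not Lucene classes.
class AnalyzerSketch {
    private static final Set<String> STOP_WORDS =
        Set.of("a", "an", "and", "the", "of", "is", "to");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenization: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT); // lowercase filtering
            if (STOP_WORDS.contains(token)) {            // stop word removal
                continue;
            }
            tokens.add(stem(token));                     // stemming
        }
        return tokens;
    }

    // Crude stemmer: strips a trailing "ing" or plural "s". Real stemmers are far more careful.
    static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Indexing of Documents")); // prints [index, document]
    }
}
```

The same chain has to run at index time and at query time; otherwise a query for "Documents" would never meet the indexed token "document".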

IndexWriter

Manages document indexing and index updates.

• Document addition/update/delete
• Segment management
• Commit strategies
• Near-real-time (NRT) indexing

IndexSearcher

Executes queries and returns ranked results.

• Query execution
• Relevance scoring
• Result ranking
• Field retrieval
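
Relevance scoring is easier to reason about with a formula in hand. Recent Lucene versions default to BM25-style scoring; the sketch below implements the textbook BM25 formula for a single term in plain Java. The class `Bm25Sketch` and its method names are hypothetical, and Lucene's own implementation differs in details such as how document length is encoded, so treat this as an approximation of the idea rather than a reproduction of Lucene's numbers.

```java
// Textbook BM25 for one query term. Illustrative sketch; not Lucene's BM25Similarity code.
class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (commonly used default)
    static final double B = 0.75;  // strength of document-length normalization

    // Inverse document frequency: rarer terms get a larger weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score contribution of one term in one document.
    static double score(double termFreq, long docFreq, long docCount,
                        double docLength, double avgDocLength) {
        double lengthNorm = K1 * (1.0 - B + B * docLength / avgDocLength);
        return idf(docCount, docFreq) * termFreq * (K1 + 1.0) / (termFreq + lengthNorm);
    }

    public static void main(String[] args) {
        // Same term frequency, but the term is rare in the first case and common in the second.
        double rare = score(2, 10, 1_000, 100, 100);
        double common = score(2, 500, 1_000, 100, 100);
        System.out.println(rare > common); // prints true
    }
}
```

Two properties matter in practice: a term that appears in few documents contributes more than a common one, and repeated occurrences of a term yield diminishing returns instead of growing the score linearly.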

Lucene Query Types

Term and Boolean Queries

Basic building blocks for precise matching and logical combinations.

Basic Query Examples
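import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Term query: exact match on one indexed term (the term is not analyzed)
TermQuery termQuery = new TermQuery(new Term("title", "lucene"));

// Boolean query: MUST clauses are required; SHOULD clauses are optional and raise the score
BooleanQuery.Builder boolQuery = new BooleanQuery.Builder();
boolQuery.add(new TermQuery(new Term("title", "search")), BooleanClause.Occur.MUST);
boolQuery.add(new TermQuery(new Term("content", "engine")), BooleanClause.Occur.SHOULD);
// The builder is not a Query; call build() to get one you can pass to IndexSearcher
Query combinedQuery = boolQuery.build();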

Phrase and Proximity Queries

Search for exact phrases or terms within specified distances.

Phrase Query Examples
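import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Phrase query: "apache" followed by "lucene" in the "content" field
PhraseQuery.Builder phraseQuery = new PhraseQuery.Builder();
phraseQuery.add(new Term("content", "apache"));
phraseQuery.add(new Term("content", "lucene"));

// Proximity: a slop of 5 lets the terms be up to 5 position moves apart
// (the default slop of 0 requires the exact phrase)
phraseQuery.setSlop(5);
PhraseQuery proximityQuery = phraseQuery.build();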
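
What slop means is easiest to see on token positions. The following plain-Java sketch checks whether one term follows another with at most `slop` intervening positions. It is a simplification and an assumption-laden one: `ProximitySketch` is a hypothetical class, it handles only two terms in order, and Lucene's actual slop semantics also allow reordered terms at a higher cost.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// In-order, two-term proximity check over token positions.
// Simplified illustration; not how Lucene evaluates sloppy phrases internally.
class ProximitySketch {
    // Record every position at which each whitespace-separated token occurs.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            result.computeIfAbsent(tokens[i], t -> new ArrayList<>()).add(i);
        }
        return result;
    }

    // True if `second` occurs after `first` with at most `slop` positions in between.
    static boolean matches(String text, String first, String second, int slop) {
        Map<String, List<Integer>> pos = positions(text);
        for (int p1 : pos.getOrDefault(first, List.of())) {
            for (int p2 : pos.getOrDefault(second, List.of())) {
                int gap = p2 - p1 - 1;
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "apache software foundation lucene";
        System.out.println(matches(text, "apache", "lucene", 0)); // prints false
        System.out.println(matches(text, "apache", "lucene", 2)); // prints true
    }
}
```

This only works because the index records term positions, not just which documents contain a term; fields indexed without positions cannot serve phrase or proximity queries.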

Range and Wildcard Queries

Search within numeric ranges, match terms by pattern, or tolerate small spelling differences.

Advanced Query Examples
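import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Numeric range query: 100 <= price <= 1000 (both bounds inclusive);
// the "price" field must have been indexed as an IntPoint
Query rangeQuery = IntPoint.newRangeQuery("price", 100, 1000);

// Wildcard query: '*' matches any sequence of characters, '?' matches exactly one
WildcardQuery wildcardQuery = new WildcardQuery(new Term("title", "search*"));

// Fuzzy query: matches terms within 2 edits of "lucene" (2 is the maximum supported)
FuzzyQuery fuzzyQuery = new FuzzyQuery(new Term("content", "lucene"), 2);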
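
The "edit distance" behind fuzzy matching can be illustrated with the classic Levenshtein algorithm. The sketch below is plain Java with hypothetical names (`EditDistanceSketch`, `fuzzyMatches`); it counts insertions, deletions, and substitutions only. Lucene's `FuzzyQuery` by default also treats a swap of two adjacent characters as a single edit and evaluates matches with automata rather than a per-term loop, so this shows the concept, not the mechanism.

```java
// Levenshtein edit distance with two rolling rows.
// Illustrative sketch of the concept behind fuzzy matching; not Lucene's implementation.
class EditDistanceSketch {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // A term matches if it is within maxEdits of the query term.
    static boolean fuzzyMatches(String indexedTerm, String queryTerm, int maxEdits) {
        return levenshtein(indexedTerm, queryTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucine"));     // prints 1
        System.out.println(fuzzyMatches("lusine", "lucene", 2)); // prints true
        System.out.println(fuzzyMatches("solr", "lucene", 2));   // prints false
    }
}
```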

Real-World Lucene Implementations

LinkedIn

Uses Lucene-based search across profiles, jobs, and content with 800M+ members.

• People search with fuzzy matching
• Job recommendation engine
• Content discovery and news feed
• Real-time indexing of profile updates

Twitter

Powers real-time search across billions of tweets with custom Lucene optimizations.

• Real-time tweet indexing
• Trending topic detection
• @mention and hashtag search
• Distributed search architecture

Stack Overflow

Elasticsearch (Lucene-based) powers search across 50M+ programming questions.

• Code search with syntax highlighting
• Tag-based filtering and faceting
• Similar question recommendations
• Full-text search across Q&A content

Wikipedia

Uses Lucene through Elasticsearch for searching across 6M+ articles in multiple languages.

• Multi-language search support
• Auto-complete suggestions
• Category and infobox search
• Cross-language search capabilities

Lucene Best Practices

✅ Do

• Use appropriate analyzers for your content
• Optimize field types (stored vs indexed)
• Implement proper segment merging strategies
• Use NRT (Near Real-Time) search when needed
• Cache frequently used queries and filters
• Monitor index size and performance metrics

❌ Don't

• Create unnecessarily large documents
• Use wildcard queries with leading wildcards
• Ignore index optimization and cleanup
• Store large binary data in index fields
• Create too many small segments
• Use string-based queries without proper escaping
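
The last point deserves a concrete illustration. When user input is fed to a query-string parser, characters such as `+`, `:`, or `*` are interpreted as operators unless escaped. Lucene's classic query parser provides `QueryParser.escape` for this; the plain-Java sketch below shows the idea with a hypothetical class `QueryEscapeSketch`. Its list of special characters is an assumption based on the classic query syntax, so prefer the library method in real code.

```java
// Backslash-escapes characters that the classic query syntax treats as operators.
// Illustrative sketch; the SPECIAL list is an assumption, not copied from Lucene.
class QueryEscapeSketch {
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("C++ (beta)")); // prints C\+\+ \(beta\)
        System.out.println(escape("title:foo*")); // prints title\:foo\*
    }
}
```

Escaping keeps a search for "C++" from being parsed as a term followed by two stray operators, and it prevents users from injecting field names or wildcards you did not intend to expose.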