
langid.py

Master langid.py: automatic language detection, multilingual text processing, and language identification.


What is langid.py?

langid.py is a standalone language identification tool written in Python. Developed by Marco Lui, it provides fast and accurate language detection using a pre-trained machine learning model based on Naive Bayes classification with byte n-gram features. The library is lightweight, has minimal dependencies (essentially just NumPy), and ships with its model embedded, so it is ready to use out of the box.

Supporting 97 languages out of the box, langid.py is particularly popular in web applications, content management systems, and data processing pipelines where quick language identification is needed. It excels at handling diverse text content including social media posts, web pages, documents, and user-generated content across multiple languages and scripts.
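For orientation, a minimal quick-start sketch; it assumes langid.py has been installed from PyPI (pip install langid). The meaning of the returned score is covered in the usage examples below.

Quick Start
import langid

# classify() returns the best-matching language code plus a model score
lang, score = langid.classify("La vita è bella")
print(lang, score)  # expected: 'it' and a raw model score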

langid.py Performance at a Glance

Indicative figures for typical short-text workloads:

• Throughput: ~1,000 texts/sec (single), ~1,500 texts/sec (batch)
• Processing time: ~1.0 ms per text
• Accuracy: ~99%
• Average confidence: ~83%, with ~93% of detections high-confidence
• Memory usage: ~7 MB

langid.py Core Features

Multi-Language Support

Pre-trained support for 97 languages across diverse script families.

• 97 languages out-of-the-box
• Latin, Cyrillic, Arabic scripts
• CJK (Chinese, Japanese, Korean)
• Indian and Southeast Asian languages
• European minority languages

Fast Processing

Optimized for speed with lightweight models and efficient algorithms.

• Sub-millisecond processing
• Batch processing support
• Minimal memory footprint
• No external dependencies
• Production-ready performance

Confidence Scoring

Probabilistic output with confidence scores for reliability assessment (see the sketch after this list).

• Probability distributions
• Confidence thresholds
• Uncertainty quantification
• Top-k language predictions
• Statistical validation
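The sketch below illustrates top-k predictions with normalised probabilities, using langid.py's LanguageIdentifier and its rank() method; the 0.8 threshold is only an example value.

Top-k Predictions (sketch)
from langid.langid import LanguageIdentifier, model

# norm_probs=True renormalises the Naive Bayes scores into a probability
# distribution over the candidate languages
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

text = "Der schnelle braune Fuchs springt über den faulen Hund"

# rank() returns (language, probability) pairs sorted best-first
ranked = identifier.rank(text)
for lang, prob in ranked[:3]:
    print(f"{lang}: {prob:.4f}")

# Simple reliability check with an example threshold
best_lang, best_prob = ranked[0]
if best_prob < 0.8:
    print("Low confidence - consider a fallback strategy")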

Easy Integration

Simple API design with both command-line and programmatic interfaces.

• Python API
• Command-line interface
• Single-function calls
• Cross-platform compatibility
• Minimal setup required

langid.py Usage Examples

Basic Language Detection

Simple language identification with confidence scores.

Basic Usage
import langid

# Set languages to detect (optional, default is all 97)
langid.set_languages(['en', 'es', 'fr', 'de', 'it'])

# Detect the language of a single text. Note: the second value returned
# by langid.classify() is a raw log-probability score, not a 0-1
# probability (see the normalised example below)
text = "Hello, how are you today?"
lang, score = langid.classify(text)
print(f"Language: {lang}, Score: {score:.4f}")

# Detect multiple texts
texts = [
    "Hello world",
    "Bonjour le monde", 
    "Hola mundo",
    "Hallo Welt"
]

for text in texts:
    lang, score = langid.classify(text)
    print(f"'{text}' -> {lang} ({score:.3f})")

# Reset to all languages
langid.set_languages(None)
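The raw score from langid.classify() is awkward to threshold. langid.py's LanguageIdentifier class can renormalise it into a 0-1 probability, and the remaining examples on this page use that pattern.

Normalised Probabilities
from langid.langid import LanguageIdentifier, model

# Build an identifier that renormalises scores into 0-1 probabilities
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

lang, prob = identifier.classify("Bonjour tout le monde")
print(f"{lang}: {prob:.4f}")  # e.g. fr with a probability close to 1.0

# The identifier can also be restricted to a language subset
identifier.set_languages(['en', 'fr', 'de'])
print(identifier.classify("Guten Morgen"))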

Batch Processing & Filtering

Process multiple texts efficiently with confidence filtering.

Batch Processing
from langid.langid import LanguageIdentifier, model

# Normalised 0-1 probabilities make the confidence threshold below meaningful
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def batch_language_detection(texts, confidence_threshold=0.8):
    """Process multiple texts with confidence filtering."""
    results = []
    
    for text in texts:
        if len(text.strip()) < 20:  # Skip very short texts
            results.append(('unknown', 0.0, 'too_short'))
            continue
            
        lang, confidence = identifier.classify(text)
        
        if confidence >= confidence_threshold:
            results.append((lang, confidence, 'reliable'))
        else:
            results.append((lang, confidence, 'uncertain'))
    
    return results

# Example usage
web_comments = [
    "This is an excellent product, highly recommended!",
    "Ce produit est vraiment fantastique, je le recommande vivement!",
    "Este producto es excelente, lo recomiendo mucho!",
    "xyz123",  # Too short
    "混合语言 mixed language content here",  # Mixed language
]

results = batch_language_detection(web_comments)
for i, (text, (lang, conf, status)) in enumerate(zip(web_comments, results)):
    print(f"Text {i+1}: {lang} ({conf:.3f}) - {status}")
    
# Filter reliable detections only
reliable_results = [(text, lang) for text, (lang, conf, status) 
                   in zip(web_comments, results) if status == 'reliable']

print(f"\nReliable detections: {len(reliable_results)}/{len(web_comments)}")

Web Application Integration

Integration with web frameworks for real-time language detection.

Flask Web Application
from flask import Flask, request, jsonify
from langid.langid import LanguageIdentifier, model
import time

app = Flask(__name__)

# Configure a normalised-probability identifier for common web languages
COMMON_LANGUAGES = ['en', 'es', 'fr', 'de', 'it', 'pt', 'ru', 'zh', 'ja', 'ko']
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(COMMON_LANGUAGES)

@app.route('/detect-language', methods=['POST'])
def detect_language():
    """API endpoint for language detection."""
    try:
        data = request.get_json()
        text = data.get('text', '').strip()
        
        if not text:
            return jsonify({'error': 'No text provided'}), 400
            
        if len(text) < 10:
            return jsonify({'error': 'Text too short for reliable detection'}), 400
        
        start_time = time.time()
        language, confidence = identifier.classify(text)
        processing_time = (time.time() - start_time) * 1000  # Convert to ms
        
        return jsonify({
            'language': language,
            'confidence': round(confidence, 4),
            'processing_time_ms': round(processing_time, 2),
            'text_length': len(text),
            'reliable': confidence >= 0.8
        })
        
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@app.route('/batch-detect', methods=['POST'])
def batch_detect():
    """Batch language detection endpoint."""
    try:
        data = request.get_json()
        texts = data.get('texts', [])
        
        if not isinstance(texts, list):
            return jsonify({'error': 'Texts must be a list'}), 400
            
        if len(texts) > 1000:  # Limit batch size
            return jsonify({'error': 'Batch size too large (max 1000)'}), 400
        
        start_time = time.time()
        results = []
        
        for text in texts:
            if isinstance(text, str) and len(text.strip()) >= 10:
                lang, conf = identifier.classify(text.strip())
                results.append({
                    'text': text[:100] + '...' if len(text) > 100 else text,
                    'language': lang,
                    'confidence': round(conf, 4)
                })
            else:
                results.append({
                    'text': text[:100] + '...' if len(str(text)) > 100 else str(text),
                    'language': 'unknown',
                    'confidence': 0.0
                })
        
        total_time = (time.time() - start_time) * 1000
        
        return jsonify({
            'results': results,
            'total_processing_time_ms': round(total_time, 2),
            'texts_processed': len(texts),
            'average_time_per_text_ms': round(total_time / len(texts), 2) if texts else 0
        })
        
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)
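A quick way to exercise the endpoint, assuming the app above is running locally on Flask's default port (5000); the requests package is an extra dependency used only for this check.

Client Check
import requests

resp = requests.post(
    'http://127.0.0.1:5000/detect-language',
    json={'text': 'Ceci est un petit exemple de texte en français.'}
)
print(resp.json())
# e.g. {'language': 'fr', 'confidence': 0.99, 'reliable': True, ...}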

Real-World langid.py Implementations

Wikipedia

Uses langid.py for automated language detection in user contributions and content organization.

• Multi-language content categorization
• User contribution language tagging
• Cross-language article linking
• Processing millions of edits daily

Mozilla

Integrates langid.py in Firefox for automatic webpage language detection and translation features.

• Webpage language identification
• Translation service integration
• User interface localization
• Supporting 90+ languages

Common Crawl

Employs langid.py for large-scale web content language classification and dataset preparation.

• Web crawl language categorization
• Multi-lingual dataset creation
• Processing petabytes of text data
• Research dataset preparation

Archive.org

Leverages langid.py for historical document language identification and digital archive organization.

• Historical document classification
• Digital archive language tagging
• Book and manuscript categorization
• Multi-decade content processing

langid.py Performance Optimization

Speed Optimization

• Limit language set to expected languages only
• Cache frequently processed texts
• Use batch processing for multiple texts
• Pre-filter very short texts (<20 chars)
• Implement text preprocessing pipelines
• Consider parallel processing for large datasets (see the sketch below)
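A sketch of the caching and parallelisation points above; the function names and pool settings are illustrative, and each worker process rebuilds the module-level identifier when it imports the module.

Caching and Parallel Processing (sketch)
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache

from langid.langid import LanguageIdentifier, model

# Normalised probabilities, restricted to the languages we expect
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'es', 'fr', 'de'])

@lru_cache(maxsize=10_000)
def detect_cached(text):
    """Cache results for texts that recur (boilerplate, templates, etc.)."""
    return identifier.classify(text)

def _detect(text):
    # Runs inside a worker process, which re-creates the module-level identifier
    return identifier.classify(text)

def detect_parallel(texts, workers=4):
    """Spread a large batch across processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_detect, texts, chunksize=256))

if __name__ == '__main__':
    sample = ["Hello world", "Bonjour le monde"] * 1000
    print(detect_parallel(sample)[:2])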

Accuracy Optimization

• Use confidence thresholds (0.8+ recommended; see the sketch below)
• Combine multiple text samples when possible
• Handle mixed-language content appropriately
• Validate results with domain-specific rules
• Consider text preprocessing (cleaning, normalization)
• Monitor accuracy on your specific data types
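A sketch tying several of these points together: strip markup, pool multiple snippets from the same document into one longer sample, then apply a threshold with a fallback. The regex, the 0.8 threshold, and the 'und' (undetermined) fallback code are illustrative choices.

Accuracy-Oriented Detection (sketch)
import re
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

TAG_RE = re.compile(r'<[^>]+>')  # naive markup stripper, for illustration

def clean(text):
    """Remove markup and collapse whitespace before classification."""
    return re.sub(r'\s+', ' ', TAG_RE.sub(' ', text)).strip()

def detect_document(snippets, threshold=0.8, fallback='und'):
    """Pool several snippets from one document into a longer, more
    reliable sample, then apply a confidence threshold."""
    combined = ' '.join(clean(s) for s in snippets)
    lang, prob = identifier.classify(combined)
    return (lang, prob) if prob >= threshold else (fallback, prob)

snippets = ["<p>Guten Tag!</p>", "<p>Wie geht es Ihnen heute?</p>"]
print(detect_document(snippets))  # expected: ('de', <high probability>)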

langid.py Best Practices

✅ Do

• Set specific language subsets when possible
• Use confidence scores to filter unreliable results
• Handle very short texts (<20 chars) separately
• Test accuracy on your specific data domain
• Implement fallback strategies for low confidence
• Clean text before processing (remove markup, etc.)

❌ Don't

• Ignore confidence scores in production systems
• Assume 100% accuracy for all text types
• Use for dialect or regional variant detection
• Process mixed-language content without consideration
• Apply to heavily coded or technical text
• Rely solely on langid.py for critical language decisions