What is langid.py?
langid.py is a standalone language identification tool written in Python. Developed by Marco Lui, it provides fast, accurate language detection using a pre-trained Naive Bayes classifier over byte n-gram features. The library is designed to be lightweight, to carry minimal dependencies (only NumPy), and to work out of the box.
With 97 languages covered by the bundled model, langid.py is particularly popular in web applications, content management systems, and data processing pipelines where quick language identification is needed. It handles diverse text, including social media posts, web pages, documents, and other user-generated content, across many languages and scripts.
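For orientation, here is a minimal quick-start. The package is published on PyPI as langid, and the score returned by the default classifier is an unnormalized log-probability rather than a 0-1 confidence; the examples below show how to obtain normalized probabilities.
# pip install langid
import langid

lang, score = langid.classify("Questo testo è scritto in italiano.")
print(lang, score)  # 'it' plus an unnormalized log-probability score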
langid.py Core Features
Multi-Language Support
Pre-trained support for 97 languages across diverse script families.
• Latin, Cyrillic, Arabic scripts
• CJK (Chinese, Japanese, Korean)
• Indic and Southeast Asian languages
• European minority languages
Fast Processing
Optimized for speed with lightweight models and efficient algorithms.
• Batch processing support
• Minimal memory footprint
• Minimal external dependencies (only NumPy)
• Production-ready performance
Confidence Scoring
Probabilistic output with confidence scores for reliability assessment; scores can be normalized to true probabilities.
• Confidence thresholds
• Uncertainty quantification
• Top-k language predictions (see the ranking sketch below)
• Statistical validation
Easy Integration
Simple API design with both command-line and programmatic interfaces.
• Command-line interface
• Single-function calls
• Cross-platform compatibility
• Minimal setup required
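To illustrate the confidence scoring and top-k prediction features listed above, the sketch below uses langid.rank(), which returns every candidate language with its score, and a LanguageIdentifier configured with norm_probs=True so the scores become probabilities between 0 and 1.
import langid
from langid.langid import LanguageIdentifier, model

# Top-3 candidates from the default (unnormalized) model
print(langid.rank("Der schnelle braune Fuchs")[:3])

# Normalized identifier: scores become probabilities in the 0-1 range
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
for lang, prob in identifier.rank("Der schnelle braune Fuchs")[:3]:
    print(f"{lang}: {prob:.4f}")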
langid.py Usage Examples
Basic Language Detection
Simple language identification with confidence scores.
import langid

# Restrict detection to an expected subset (optional; the default is all 97 languages)
langid.set_languages(['en', 'es', 'fr', 'de', 'it'])

# Detect the language of a single text.
# Note: the module-level classify() returns an unnormalized log-probability
# score, not a confidence between 0 and 1.
text = "Hello, how are you today?"
lang, score = langid.classify(text)
print(f"Language: {lang}, Score: {score:.4f}")

# For confidence values in the 0-1 range, use a normalized identifier
from langid.langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'es', 'fr', 'de', 'it'])

# Detect multiple texts
texts = [
    "Hello world",
    "Bonjour le monde",
    "Hola mundo",
    "Hallo Welt",
]
for text in texts:
    lang, conf = identifier.classify(text)
    print(f"'{text}' -> {lang} ({conf:.3f})")

# Reset to all languages
langid.set_languages(None)
Batch Processing & Filtering
Process multiple texts efficiently with confidence filtering.
from langid.langid import LanguageIdentifier, model

# Normalized identifier so confidence values fall in the 0-1 range
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def batch_language_detection(texts, confidence_threshold=0.8):
    """Process multiple texts with confidence filtering."""
    results = []
    for text in texts:
        if len(text.strip()) < 20:  # Skip very short texts
            results.append(('unknown', 0.0, 'too_short'))
            continue
        lang, confidence = identifier.classify(text)
        if confidence >= confidence_threshold:
            results.append((lang, confidence, 'reliable'))
        else:
            results.append((lang, confidence, 'uncertain'))
    return results

# Example usage
web_comments = [
    "This is an excellent product, highly recommended!",
    "Ce produit est vraiment fantastique, je le recommande vivement!",
    "Este producto es excelente, lo recomiendo mucho!",
    "xyz123",  # Too short
    "混合语言 mixed language content here",  # Mixed-language text
]
results = batch_language_detection(web_comments)
for i, (text, (lang, conf, status)) in enumerate(zip(web_comments, results)):
    print(f"Text {i+1}: {lang} ({conf:.3f}) - {status}")

# Filter reliable detections only
reliable_results = [(text, lang) for text, (lang, conf, status)
                    in zip(web_comments, results) if status == 'reliable']
print(f"\nReliable detections: {len(reliable_results)}/{len(web_comments)}")
Web Application Integration
Integration with web frameworks for real-time language detection.
from flask import Flask, request, jsonify
from langid.langid import LanguageIdentifier, model
import time

app = Flask(__name__)

# Normalized identifier restricted to common web languages
COMMON_LANGUAGES = ['en', 'es', 'fr', 'de', 'it', 'pt', 'ru', 'zh', 'ja', 'ko']
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(COMMON_LANGUAGES)

@app.route('/detect-language', methods=['POST'])
def detect_language():
    """API endpoint for language detection."""
    try:
        data = request.get_json(silent=True) or {}
        text = data.get('text', '').strip()
        if not text:
            return jsonify({'error': 'No text provided'}), 400
        if len(text) < 10:
            return jsonify({'error': 'Text too short for reliable detection'}), 400

        start_time = time.time()
        language, confidence = identifier.classify(text)
        processing_time = (time.time() - start_time) * 1000  # Convert to ms

        return jsonify({
            'language': language,
            'confidence': round(confidence, 4),
            'processing_time_ms': round(processing_time, 2),
            'text_length': len(text),
            'reliable': confidence >= 0.8
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@app.route('/batch-detect', methods=['POST'])
def batch_detect():
    """Batch language detection endpoint."""
    try:
        data = request.get_json(silent=True) or {}
        texts = data.get('texts', [])
        if not isinstance(texts, list):
            return jsonify({'error': 'Texts must be a list'}), 400
        if len(texts) > 1000:  # Limit batch size
            return jsonify({'error': 'Batch size too large (max 1000)'}), 400

        start_time = time.time()
        results = []
        for text in texts:
            if isinstance(text, str) and len(text.strip()) >= 10:
                lang, conf = identifier.classify(text.strip())
                results.append({
                    'text': text[:100] + '...' if len(text) > 100 else text,
                    'language': lang,
                    'confidence': round(conf, 4)
                })
            else:
                text = str(text)
                results.append({
                    'text': text[:100] + '...' if len(text) > 100 else text,
                    'language': 'unknown',
                    'confidence': 0.0
                })

        total_time = (time.time() - start_time) * 1000
        return jsonify({
            'results': results,
            'total_processing_time_ms': round(total_time, 2),
            'texts_processed': len(texts),
            'average_time_per_text_ms': round(total_time / len(texts), 2) if texts else 0
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)
Real-World langid.py Implementations
Wikipedia
Uses langid.py for automated language detection in user contributions and content organization.
• Multi-language content categorization
• User contribution language tagging
• Cross-language article linking
• Processing millions of edits daily
Mozilla
Integrates langid.py in Firefox for automatic webpage language detection and translation features.
• Webpage language identification
• Translation service integration
• User interface localization
• Supporting 90+ languages
Common Crawl
Employs langid.py for large-scale web content language classification and dataset preparation.
• Web crawl language categorization
• Multilingual dataset creation
• Processing petabytes of text data
• Research dataset preparation
Archive.org
Leverages langid.py for historical document language identification and digital archive organization.
• Historical document classification
• Digital archive language tagging
• Book and manuscript categorization
• Multi-decade content processing
langid.py Performance Optimization
Speed Optimization
• Limit the language set to the expected languages only
• Cache frequently processed texts (see the caching sketch below)
• Use batch processing for multiple texts
• Pre-filter very short texts (<20 chars)
• Implement text preprocessing pipelines
• Consider parallel processing for large datasets
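A small sketch of the first two speed tips, assuming repeated inputs are common enough for a cache to pay off; the language subset and cache size are illustrative choices.
from functools import lru_cache
import langid

# Restrict the model to the languages you actually expect
langid.set_languages(['en', 'de', 'fr'])

@lru_cache(maxsize=10000)  # cache results for frequently repeated texts
def cached_classify(text):
    return langid.classify(text)

print(cached_classify("Guten Morgen"))  # computed once
print(cached_classify("Guten Morgen"))  # served from the cache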
Accuracy Optimization
• Use confidence thresholds (0.8+ recommended, with normalized probabilities)
• Combine multiple text samples when possible (see the sketch below)
• Handle mixed-language content appropriately
• Validate results with domain-specific rules
• Consider text preprocessing (cleaning, normalization)
• Monitor accuracy on your specific data types
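One way to apply the "combine multiple text samples" tip: concatenate several short snippets from the same source before classifying, so the model sees more evidence. The snippet list below is hypothetical.
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

# Short snippets from the same document or user (hypothetical data)
snippets = ["Danke!", "Bis später", "Wie geht's dir heute?"]

# Classifying the combined text is usually more reliable than
# classifying each short snippet on its own.
lang, prob = identifier.classify(" ".join(snippets))
print(lang, round(prob, 3))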
langid.py Best Practices
✅ Do
• Set specific language subsets when possible
• Use confidence scores to filter unreliable results
• Handle very short texts (<20 chars) separately
• Test accuracy on your specific data domain
• Implement fallback strategies for low confidence (see the sketch after this list)
• Clean text before processing (remove markup, etc.)
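A sketch combining two of the "Do" items above: strip markup before classifying and fall back to a default label when confidence is low. The threshold and the 'und' (undetermined) default are illustrative choices.
import re
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def detect_with_fallback(html_text, threshold=0.8, default='und'):
    """Classify after removing markup; fall back when confidence is low."""
    cleaned = re.sub(r'<[^>]+>', ' ', html_text)    # drop HTML tags
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()  # collapse whitespace
    if len(cleaned) < 20:                           # too short to trust
        return default, 0.0
    lang, prob = identifier.classify(cleaned)
    return (lang, prob) if prob >= threshold else (default, prob)

print(detect_with_fallback("<p>Ceci est un paragraphe rédigé en français.</p>"))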
❌ Don't
• Ignore confidence scores in production systems
• Assume 100% accuracy for all text types
• Use for dialect or regional variant detection
• Process mixed-language content without consideration
• Apply to heavily coded or technical text
• Rely solely on langid.py for critical language decisions