# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a multilingual translation API service built with FastAPI and Hugging Face Transformers. It provides any-to-any translation across up to 204 languages using Facebook's M2M100 and NLLB-200 models.
**Dual Model System:**

- **M2M100** (default): 105 languages, Apache 2.0 license, commercial use allowed
- **NLLB-200** (optional): 204 languages, CC-BY-NC 4.0 license, non-commercial use only
## Development Commands

### Local Development

```bash
# Set up a virtual environment and install dependencies
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Run the development server (with auto-reload)
python run.py

# Or run with uvicorn directly
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
### Docker Development

```bash
# Build and run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

# Rebuild after code changes
docker-compose up -d --build
```
### Testing the API

```bash
# Health check
curl http://localhost:8001/health

# Translate Malay to English (M2M100, default)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Selamat pagi", "source_lang": "ms", "target_lang": "en"}'

# Translate English to Korean (M2M100)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ko", "model": "m2m100"}'

# Translate English to Bemba (an NLLB-200-exclusive language)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Welcome", "source_lang": "en", "target_lang": "bem", "model": "nllb200"}'

# Get supported languages for M2M100 (quote the URL so the shell
# does not interpret the "?")
curl "http://localhost:8001/api/supported-languages?model=m2m100"

# Get supported languages for NLLB-200
curl "http://localhost:8001/api/supported-languages?model=nllb200"
```
## Architecture

### Core Components
- **app/main.py** - FastAPI application with endpoint definitions
  - Lifespan events handle model preloading on startup
  - CORS middleware configured for cross-origin requests
  - Main endpoints: root (`/`), health (`/health`), translate (`/api/translate`), supported-languages (`/api/supported-languages`)
  - Includes the `lang_names` dictionary with display names for all 204+ language codes
- **app/translator.py** - Translation service singleton
  - Manages loading and caching of both the M2M100 and NLLB-200 models
  - Automatically detects and uses a GPU (CUDA) if available
  - Supports lazy loading - models are loaded on first use or preloaded at startup
  - Model support:
    - M2M100: `facebook/m2m100_418M` (105 languages)
    - NLLB-200: `facebook/nllb-200-distilled-600M` (204 languages, FLORES-200 format)
  - Language code mappings for both models (`m2m100_lang_codes`, `nllb200_lang_codes`)
- **app/models.py** - Pydantic schemas for request/response validation
  - `TranslationRequest`: validates input (text, source_lang, target_lang, model)
  - `TranslationResponse`: structured output with metadata
  - `HealthResponse`: health check response
  - The `model` parameter accepts `"m2m100"` (default) or `"nllb200"`
- **app/config.py** - Configuration management using pydantic-settings
  - Loads settings from environment variables or a `.env` file
  - Default values provided for all settings
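The request schema described above can be sketched roughly as follows. The field names (`text`, `source_lang`, `target_lang`, `model`) and the default come from the description; the response fields shown are illustrative assumptions, not the exact contents of `app/models.py`:

```python
from pydantic import BaseModel

class TranslationRequest(BaseModel):
    """Input schema for /api/translate (sketch; see app/models.py)."""
    text: str
    source_lang: str
    target_lang: str
    model: str = "m2m100"  # "m2m100" (default) or "nllb200"

class TranslationResponse(BaseModel):
    """Structured output with metadata (field names here are illustrative)."""
    original_text: str
    translated_text: str
    source_lang: str
    target_lang: str
    model: str
```

With this schema, a request that omits `model` falls back to M2M100 automatically, which keeps the default path on the commercially usable model.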
### Translation Flow

1. Request received at the `/api/translate` endpoint
2. Pydantic validates the request schema (including the optional `model` parameter)
3. TranslationService selects a model based on the `model` parameter (`m2m100` or `nllb200`)
4. Language codes are validated against the selected model's supported languages
5. Model is loaded if not already cached in memory
6. Text is tokenized with model-specific language codes:
   - M2M100: simple codes (e.g., `"en"`, `"ko"`)
   - NLLB-200: FLORES-200 format (e.g., `"eng_Latn"`, `"kor_Hang"`)
7. Translation is generated using the model
8. Response includes the original text, the translation, and model metadata
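Steps 3-4 and 6 above amount to a per-model code lookup. A minimal stand-alone sketch (the dictionary entries below are a tiny illustrative subset of the real mappings in `app/translator.py`):

```python
# Illustrative subset of the real mappings in app/translator.py.
m2m100_lang_codes = {"en": "en", "ko": "ko", "ms": "ms"}
nllb200_lang_codes = {"en": "eng_Latn", "ko": "kor_Hang", "bem": "bem_Latn"}

def resolve_lang_code(model: str, user_code: str) -> str:
    """Map a user-facing code to the code the selected model expects."""
    mapping = m2m100_lang_codes if model == "m2m100" else nllb200_lang_codes
    if user_code not in mapping:
        raise ValueError(f"Language '{user_code}' is not supported by {model}")
    return mapping[user_code]

print(resolve_lang_code("nllb200", "ko"))  # kor_Hang
```

Validating against the mapping before loading a model means unsupported-language errors come back fast, without paying the model-load cost.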
### Model Caching

- Models are downloaded to `MODEL_CACHE_DIR` (default: `./models/`)
- Once downloaded, models persist across restarts
- In Docker, use a volume mount to persist models
- The first translation request may be slow due to model download:
  - M2M100: ~1.6 GB
  - NLLB-200: ~2.5 GB
- Both models can be cached simultaneously
### Device Selection

The translator automatically detects GPU availability:

- CUDA GPU: used automatically if available, for faster inference
- CPU: fallback option; slower, but works everywhere
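With PyTorch, this detection is typically a one-liner; a sketch of what `app/translator.py` likely does (the exact code may differ):

```python
import torch

# Prefer a CUDA GPU when one is visible to PyTorch; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
```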
## Configuration

Environment variables (see `.env.example`):

- `API_HOST` / `API_PORT`: server binding
- `MODEL_CACHE_DIR`: where to store downloaded models
- `MAX_LENGTH`: maximum token length for translation (default 512)
- `ALLOWED_ORIGINS`: CORS configuration
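A simplified, stdlib-only sketch of how these variables could be read. The real `app/config.py` uses pydantic-settings' `BaseSettings`; the `MAX_LENGTH` and `MODEL_CACHE_DIR` defaults below come from this document, while the host/port defaults are assumptions:

```python
import os

class Settings:
    """Stdlib stand-in for the pydantic-settings class in app/config.py."""

    def __init__(self) -> None:
        self.api_host = os.getenv("API_HOST", "0.0.0.0")        # assumed default
        self.api_port = int(os.getenv("API_PORT", "8000"))      # assumed default
        self.model_cache_dir = os.getenv("MODEL_CACHE_DIR", "./models/")
        self.max_length = int(os.getenv("MAX_LENGTH", "512"))   # default from docs
```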
## Common Tasks

### Adding New Language Codes

The system currently supports all 105 M2M100 languages and all 204 NLLB-200 languages. To add new language code mappings:

1. **For M2M100**: update the `m2m100_lang_codes` dictionary in `app/translator.py`
   - Format: `"user_code": "m2m100_code"` (e.g., `"en": "en"`)
2. **For NLLB-200**: update the `nllb200_lang_codes` dictionary in `app/translator.py`
   - Format: `"user_code": "flores_code"` (e.g., `"en": "eng_Latn"`)
   - Reference: https://github.com/facebookresearch/flores/blob/main/flores200/README.md
3. **Display names**: add entries to the `lang_names` dictionary in `app/main.py`
   - Format: `"code": {"name": "English Name", "native": "Native Name"}`
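Putting the three formats together, adding one language touches entries shaped like these (the dictionaries below are illustrative stand-ins, not the full mappings):

```python
# Illustrative stand-ins for the dictionaries in app/translator.py / app/main.py.
m2m100_lang_codes = {"en": "en", "ko": "ko"}
nllb200_lang_codes = {"en": "eng_Latn", "ko": "kor_Hang"}
lang_names = {
    "en": {"name": "English", "native": "English"},
    "ko": {"name": "Korean", "native": "한국어"},
}

# Adding Bemba (supported by NLLB-200 only; its FLORES-200 code is bem_Latn):
nllb200_lang_codes["bem"] = "bem_Latn"
lang_names["bem"] = {"name": "Bemba", "native": "Bemba"}
```

A code added only to `nllb200_lang_codes` will validate for `model=nllb200` requests and be rejected for M2M100, which is the desired behavior for NLLB-exclusive languages.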
### Modifying Translation Behavior

Translation parameters live in the `translate()` method in `app/translator.py`:

- Adjust `max_length` in the tokenizer call to handle longer texts
- Modify generation parameters passed to `model.generate()` for different translation strategies
- Model-specific behavior:
  - M2M100: uses `tokenizer.get_lang_id()` for the target language
  - NLLB-200: uses `tokenizer.convert_tokens_to_ids()` for the target language
## Production Deployment

For production use:

- Set `reload=False` in `run.py`, or use a production-ready uvicorn command
- Configure proper `ALLOWED_ORIGINS` instead of `"*"`
- Add authentication middleware if needed
- Consider using multiple workers: `uvicorn app.main:app --workers 4`
- Mount a persistent volume for the `models/` directory in Docker
## API Documentation
When the server is running, interactive API documentation is available at:
- Swagger UI: http://localhost:8001/docs (Docker) or http://localhost:8000/docs (local)
- ReDoc: http://localhost:8001/redoc (Docker) or http://localhost:8000/redoc (local)
## Model Licenses

**IMPORTANT**: Be aware of licensing when deploying:

- **M2M100**: Apache 2.0 License - commercial use allowed ✅
- **NLLB-200**: CC-BY-NC 4.0 License - non-commercial use only ⚠️
Always use M2M100 for commercial applications. Only use NLLB-200 for research, education, or personal non-commercial projects.