Add complete NLLB-200 support with all 204 FLORES-200 languages

Updated dual model system to fully support both M2M100 and NLLB-200:

**NLLB-200 Model (204 languages)**
- Added all 204 FLORES-200 language codes to nllb200_lang_codes dictionary
- Updated language code mappings with FLORES-200 format (xxx_Yyyy)
- Added 24+ NLLB-exclusive languages including:
  - Southeast Asian: Acehnese, Balinese, Banjar, Buginese, Minangkabau
  - South Asian: Assamese, Awadhi, Bhojpuri, Chhattisgarhi, Magahi, Maithili, Meitei, Odia, Santali
  - African: Akan, Bambara, Bemba, Chokwe, Dyula, Fon, Kikuyu, Kimbundu, Kongo, Luba-Kasai, Luo, Mossi, Nuer
  - Arabic dialects: Mesopotamian, Najdi, Moroccan, Egyptian, Tunisian, South/North Levantine
  - European regional: Asturian, Friulian, Latgalian, Ligurian, Limburgish, Lombard, Norwegian Nynorsk/Bokmål, Occitan, Sardinian, Sicilian, Silesian, Venetian
  - Other: Dzongkha, Fijian, Guarani, Kabyle, Kabuverdianu, Papiamento, Quechua, Samoan, Sango, Shan, Tamasheq, Tibetan, Tok Pisin
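The `xxx_Yyyy` FLORES-200 format mentioned above encodes language plus script. A few illustrative entries for the `nllb200_lang_codes` dictionary (the codes shown are standard FLORES-200 codes; the real dictionary covers all 204 languages):

```python
# Illustrative excerpt of the nllb200_lang_codes mapping:
# user-facing code -> FLORES-200 "xxx_Yyyy" code (language + script).
nllb200_lang_codes = {
    "en": "eng_Latn",   # English, Latin script
    "ko": "kor_Hang",   # Korean, Hangul script
    "bem": "bem_Latn",  # Bemba (NLLB-exclusive)
    "ace": "ace_Latn",  # Acehnese (FLORES-200 also defines ace_Arab)
}
```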

**Updated Files**
- app/translator.py: Complete NLLB-200 language mappings (204 languages)
- app/main.py: Added display names for all 204+ language codes
- README.md: Updated with dual model system, NLLB-200 details, license info
- CLAUDE.md: Updated developer documentation with model architecture

**Testing**
- Verified M2M100: 105 languages working 
- Verified NLLB-200: 204 languages working 
- Tested NLLB-exclusive languages (Bemba, Fon, etc.) 

**License Information**
- M2M100: Apache 2.0 - Commercial use allowed
- NLLB-200: CC-BY-NC 4.0 - Non-commercial only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: jungwoo choi
Date:   2025-11-11 16:19:50 +09:00
parent 5a99d081ab
commit 578be1fd55
4 changed files with 387 additions and 250 deletions


@@ -4,7 +4,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Project Overview
 
-This is a Malaysian language translation API service built with FastAPI and Hugging Face Transformers. It provides bidirectional translation between Malay (Bahasa Melayu) and English using Helsinki-NLP's OPUS-MT neural machine translation models.
+This is a multilingual translation API service built with FastAPI and Hugging Face Transformers. It provides any-to-any translation between up to 204 languages using Facebook's M2M100 and NLLB-200 models.
+
+**Dual Model System:**
+- **M2M100 (default)**: 105 languages, Apache 2.0 License, commercial use allowed
+- **NLLB-200 (optional)**: 204 languages, CC-BY-NC 4.0 License, non-commercial only
 
 ## Development Commands
@@ -43,17 +47,28 @@ docker-compose up -d --build
 ```bash
 # Health check
-curl http://localhost:8000/health
+curl http://localhost:8001/health
 
-# Translate Malay to English
-curl -X POST "http://localhost:8000/api/translate" \
+# Translate Malay to English (M2M100, default)
+curl -X POST "http://localhost:8001/api/translate" \
   -H "Content-Type: application/json" \
   -d '{"text": "Selamat pagi", "source_lang": "ms", "target_lang": "en"}'
 
-# Translate English to Malay
-curl -X POST "http://localhost:8000/api/translate" \
+# Translate English to Korean (M2M100)
+curl -X POST "http://localhost:8001/api/translate" \
   -H "Content-Type: application/json" \
-  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ms"}'
+  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ko", "model": "m2m100"}'
+
+# Translate English to Bemba (NLLB-200 exclusive language)
+curl -X POST "http://localhost:8001/api/translate" \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Welcome", "source_lang": "en", "target_lang": "bem", "model": "nllb200"}'
+
+# Get supported languages for M2M100
+curl http://localhost:8001/api/supported-languages?model=m2m100
+
+# Get supported languages for NLLB-200
+curl http://localhost:8001/api/supported-languages?model=nllb200
 ```
## Architecture
@@ -63,18 +78,23 @@ curl -X POST "http://localhost:8000/api/translate" \
 1. **app/main.py** - FastAPI application with endpoint definitions
    - Lifespan events handle model preloading on startup
    - CORS middleware configured for cross-origin requests
-   - Three main endpoints: root (`/`), health (`/health`), translate (`/api/translate`)
+   - Main endpoints: root (`/`), health (`/health`), translate (`/api/translate`), supported-languages (`/api/supported-languages`)
+   - Includes lang_names dictionary with display names for all 204+ language codes
 
 2. **app/translator.py** - Translation service singleton
-   - Manages loading and caching of translation models
+   - Manages loading and caching of both M2M100 and NLLB-200 models
    - Automatically detects and uses GPU if available (CUDA)
    - Supports lazy loading - models are loaded on first use or preloaded at startup
-   - Model naming convention: `Helsinki-NLP/opus-mt-{source}-{target}`
+   - Model support:
+     - M2M100: `facebook/m2m100_418M` (105 languages)
+     - NLLB-200: `facebook/nllb-200-distilled-600M` (204 languages, FLORES-200 format)
+   - Language code mapping for both models (m2m100_lang_codes, nllb200_lang_codes)
 
 3. **app/models.py** - Pydantic schemas for request/response validation
-   - `TranslationRequest`: Validates input (text, source_lang, target_lang)
+   - `TranslationRequest`: Validates input (text, source_lang, target_lang, model)
    - `TranslationResponse`: Structured output with metadata
-   - `LanguageCode` enum: Only "ms" and "en" are supported
    - `HealthResponse`: Health check response
+   - Model parameter accepts "m2m100" (default) or "nllb200"
 
 4. **app/config.py** - Configuration management using pydantic-settings
    - Loads settings from environment variables or `.env` file
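The request schema described for `app/models.py` above can be sketched as follows. This is a minimal illustration, assuming Pydantic v2; the exact field constraints in the real `app/models.py` may differ:

```python
from typing import Literal

from pydantic import BaseModel, Field


class TranslationRequest(BaseModel):
    """Sketch of the request body accepted by /api/translate."""
    text: str = Field(min_length=1)
    source_lang: str
    target_lang: str
    # Optional selector; defaults to the commercially usable M2M100.
    model: Literal["m2m100", "nllb200"] = "m2m100"


req = TranslationRequest(text="Good morning", source_lang="en", target_lang="ko")
print(req.model)  # -> m2m100
```

An unknown `model` value (anything other than "m2m100" or "nllb200") is rejected at validation time, before any model loading happens.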
@@ -83,18 +103,25 @@ curl -X POST "http://localhost:8000/api/translate" \
 ### Translation Flow
 
 1. Request received at `/api/translate` endpoint
-2. Pydantic validates request schema
-3. TranslationService determines appropriate model based on language pair
-4. Model is loaded if not already cached in memory
-5. Text is tokenized, translated, and decoded
-6. Response includes original text, translation, and model metadata
+2. Pydantic validates request schema (including optional model parameter)
+3. TranslationService selects model based on `model` parameter (m2m100 or nllb200)
+4. Language codes are validated against the selected model's supported languages
+5. Model is loaded if not already cached in memory
+6. Text is tokenized with model-specific language codes:
+   - M2M100: Uses simple codes (e.g., "en", "ko")
+   - NLLB-200: Uses FLORES-200 format (e.g., "eng_Latn", "kor_Hang")
+7. Translation generated using the model
+8. Response includes original text, translation, and model metadata
 
 ### Model Caching
 
 - Models are downloaded to `MODEL_CACHE_DIR` (default: `./models/`)
 - Once downloaded, models persist across restarts
 - In Docker, use volume mount to persist models
-- First translation request may be slow due to model download (~300MB per model)
+- First translation request may be slow due to model download:
+  - M2M100: ~1.6GB
+  - NLLB-200: ~2.5GB
+- Both models can be cached simultaneously
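Steps 2-4 of the translation flow above can be sketched without loading any model weights. The dictionaries here are tiny illustrative stand-ins for the full mapping tables in `app/translator.py` (105 and 204 entries respectively); the FLORES-200 codes shown are real:

```python
# Sketch of model selection + language-code validation (no weights loaded).
M2M100_CODES = {"en": "en", "ko": "ko", "ms": "ms"}
NLLB200_CODES = {"en": "eng_Latn", "ko": "kor_Hang", "ms": "zsm_Latn", "bem": "bem_Latn"}

MODELS = {
    "m2m100": ("facebook/m2m100_418M", M2M100_CODES),
    "nllb200": ("facebook/nllb-200-distilled-600M", NLLB200_CODES),
}


def resolve(source_lang: str, target_lang: str, model: str = "m2m100") -> dict:
    """Return the checkpoint and model-specific language codes for a request."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    checkpoint, codes = MODELS[model]
    for lang in (source_lang, target_lang):
        if lang not in codes:
            raise ValueError(f"{lang!r} is not supported by {model}")
    return {"checkpoint": checkpoint, "src": codes[source_lang], "tgt": codes[target_lang]}


print(resolve("en", "bem", "nllb200")["tgt"])  # -> bem_Latn
```

Note that Bemba (`bem`) resolves only under `nllb200`; requesting it with the default M2M100 raises a validation error, which is the behavior step 4 describes.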
### Device Selection
@@ -112,20 +139,28 @@ Environment variables (see `.env.example`):
 ## Common Tasks
 
-### Adding New Language Pairs
+### Adding New Language Codes
 
-To add support for additional languages:
+The system currently supports all 105 M2M100 languages and all 204 NLLB-200 languages. To add new language code mappings:
 
-1. Check if Helsinki-NLP has an OPUS-MT model for the language pair at https://huggingface.co/Helsinki-NLP
-2. Update `app/models.py` - Add new language code to `LanguageCode` enum
-3. Update `app/translator.py` - Add model mapping in `_get_model_name()` method
-4. Update `app/main.py` - Add language info to `/api/supported-languages` endpoint
+1. **For M2M100**: Update `m2m100_lang_codes` dictionary in `app/translator.py`
+   - Format: `"user_code": "m2m100_code"` (e.g., `"en": "en"`)
+2. **For NLLB-200**: Update `nllb200_lang_codes` dictionary in `app/translator.py`
+   - Format: `"user_code": "flores_code"` (e.g., `"en": "eng_Latn"`)
+   - Reference: https://github.com/facebookresearch/flores/blob/main/flores200/README.md
+3. **Display Names**: Add entries to `lang_names` dictionary in `app/main.py`
+   - Format: `"code": {"name": "English Name", "native": "Native Name"}`
 
 ### Modifying Translation Behavior
 
 Translation parameters are in `app/translator.py` in the `translate()` method:
 - Adjust `max_length` in tokenizer call to handle longer texts
 - Modify generation parameters passed to `model.generate()` for different translation strategies
+- Model-specific behavior:
+  - M2M100: Uses `tokenizer.get_lang_id()` for target language
+  - NLLB-200: Uses `tokenizer.convert_tokens_to_ids()` for target language
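The model-specific target-language lookup described above can be isolated into one helper. This is a sketch: the two tokenizer methods (`get_lang_id` for M2M100, `convert_tokens_to_ids` for NLLB-200) are the ones the document cites, but the surrounding structure is illustrative, not the actual `app/translator.py` code:

```python
def target_token_id(tokenizer, model_name: str, target_code: str) -> int:
    """Resolve the forced-BOS token id for the target language."""
    if model_name == "m2m100":
        # M2M100's tokenizer exposes a dedicated language-id lookup;
        # target_code is a plain code like "ko".
        return tokenizer.get_lang_id(target_code)
    # NLLB-200 treats the FLORES-200 code (e.g. "kor_Hang") as an
    # ordinary vocabulary token.
    return tokenizer.convert_tokens_to_ids(target_code)
```

The returned id would typically be passed as `forced_bos_token_id` to `model.generate()` so the decoder starts in the target language.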
### Production Deployment
@@ -139,5 +174,14 @@ For production use:
 ## API Documentation
 
 When the server is running, interactive API documentation is available at:
-- Swagger UI: http://localhost:8000/docs
-- ReDoc: http://localhost:8000/redoc
+- Swagger UI: http://localhost:8001/docs (Docker) or http://localhost:8000/docs (local)
+- ReDoc: http://localhost:8001/redoc (Docker) or http://localhost:8000/redoc (local)
+
+## Model Licenses
+
+**IMPORTANT**: Be aware of licensing when deploying:
+- **M2M100**: Apache 2.0 License - Commercial use allowed ✅
+- **NLLB-200**: CC-BY-NC 4.0 License - Non-commercial use only ⚠️
+
+Always use M2M100 for commercial applications. Only use NLLB-200 for research, education, or personal non-commercial projects.