Add complete NLLB-200 support with all 204 FLORES-200 languages

Updated dual model system to fully support both M2M100 and NLLB-200:

**NLLB-200 Model (204 languages)**
- Added all 204 FLORES-200 language codes to nllb200_lang_codes dictionary
- Updated language code mappings with FLORES-200 format (xxx_Yyyy)
- Added 24+ NLLB-exclusive languages including:
  - Southeast Asian: Acehnese, Balinese, Banjar, Buginese, Minangkabau
  - South Asian: Assamese, Awadhi, Bhojpuri, Chhattisgarhi, Magahi, Maithili, Meitei, Odia, Santali
  - African: Akan, Bambara, Bemba, Chokwe, Dyula, Fon, Kikuyu, Kimbundu, Kongo, Luba-Kasai, Luo, Mossi, Nuer
  - Arabic dialects: Mesopotamian, Najdi, Moroccan, Egyptian, Tunisian, South/North Levantine
  - European regional: Asturian, Friulian, Latgalian, Ligurian, Limburgish, Lombard, Norwegian Nynorsk/Bokmål, Occitan, Sardinian, Sicilian, Silesian, Venetian
  - Other: Dzongkha, Fijian, Guarani, Kabyle, Kabuverdianu, Papiamento, Quechua, Samoan, Sango, Shan, Tamasheq, Tibetan, Tok Pisin
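The `xxx_Yyyy` FLORES-200 format mentioned above encodes language plus script. A few illustrative entries for the `nllb200_lang_codes` dictionary (the codes shown are standard FLORES-200 codes; the real dictionary covers all 204 languages):

```python
# Illustrative excerpt of the nllb200_lang_codes mapping:
# user-facing code -> FLORES-200 "xxx_Yyyy" code (language + script).
nllb200_lang_codes = {
    "en": "eng_Latn",   # English, Latin script
    "ko": "kor_Hang",   # Korean, Hangul script
    "bem": "bem_Latn",  # Bemba (NLLB-exclusive)
    "ace": "ace_Latn",  # Acehnese (FLORES-200 also defines ace_Arab)
}
```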

**Updated Files**
- app/translator.py: Complete NLLB-200 language mappings (204 languages)
- app/main.py: Added display names for all 204+ language codes
- README.md: Updated with dual model system, NLLB-200 details, license info
- CLAUDE.md: Updated developer documentation with model architecture

**Testing**
- Verified M2M100: 105 languages working 
- Verified NLLB-200: 204 languages working 
- Tested NLLB-exclusive languages (Bemba, Fon, etc.) 

**License Information**
- M2M100: Apache 2.0 - Commercial use allowed
- NLLB-200: CC-BY-NC 4.0 - Non-commercial only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: jungwoo choi
Date:   2025-11-11 16:19:50 +09:00
parent 5a99d081ab
commit 578be1fd55
4 changed files with 387 additions and 250 deletions


@@ -4,7 +4,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Project Overview
 
-This is a Malaysian language translation API service built with FastAPI and Hugging Face Transformers. It provides bidirectional translation between Malay (Bahasa Melayu) and English using Helsinki-NLP's OPUS-MT neural machine translation models.
+This is a multilingual translation API service built with FastAPI and Hugging Face Transformers. It provides any-to-any translation between up to 204 languages using Facebook's M2M100 and NLLB-200 models.
+
+**Dual Model System:**
+- **M2M100 (default)**: 105 languages, Apache 2.0 License, commercial use allowed
+- **NLLB-200 (optional)**: 204 languages, CC-BY-NC 4.0 License, non-commercial only
 
 ## Development Commands
@@ -43,17 +47,28 @@ docker-compose up -d --build
 ```bash
 # Health check
-curl http://localhost:8000/health
+curl http://localhost:8001/health
 
-# Translate Malay to English
-curl -X POST "http://localhost:8000/api/translate" \
+# Translate Malay to English (M2M100, default)
+curl -X POST "http://localhost:8001/api/translate" \
   -H "Content-Type: application/json" \
   -d '{"text": "Selamat pagi", "source_lang": "ms", "target_lang": "en"}'
 
-# Translate English to Malay
-curl -X POST "http://localhost:8000/api/translate" \
+# Translate English to Korean (M2M100)
+curl -X POST "http://localhost:8001/api/translate" \
   -H "Content-Type: application/json" \
-  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ms"}'
+  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ko", "model": "m2m100"}'
+
+# Translate English to Bemba (NLLB-200 exclusive language)
+curl -X POST "http://localhost:8001/api/translate" \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Welcome", "source_lang": "en", "target_lang": "bem", "model": "nllb200"}'
+
+# Get supported languages for M2M100
+curl http://localhost:8001/api/supported-languages?model=m2m100
+
+# Get supported languages for NLLB-200
+curl http://localhost:8001/api/supported-languages?model=nllb200
 ```
## Architecture
@@ -63,18 +78,23 @@ curl -X POST "http://localhost:8000/api/translate" \
 1. **app/main.py** - FastAPI application with endpoint definitions
    - Lifespan events handle model preloading on startup
    - CORS middleware configured for cross-origin requests
-   - Three main endpoints: root (`/`), health (`/health`), translate (`/api/translate`)
+   - Main endpoints: root (`/`), health (`/health`), translate (`/api/translate`), supported-languages (`/api/supported-languages`)
+   - Includes lang_names dictionary with display names for all 204+ language codes
 
 2. **app/translator.py** - Translation service singleton
-   - Manages loading and caching of translation models
+   - Manages loading and caching of both M2M100 and NLLB-200 models
    - Automatically detects and uses GPU if available (CUDA)
    - Supports lazy loading - models are loaded on first use or preloaded at startup
-   - Model naming convention: `Helsinki-NLP/opus-mt-{source}-{target}`
+   - Model support:
+     - M2M100: `facebook/m2m100_418M` (105 languages)
+     - NLLB-200: `facebook/nllb-200-distilled-600M` (204 languages, FLORES-200 format)
+   - Language code mapping for both models (m2m100_lang_codes, nllb200_lang_codes)
 
 3. **app/models.py** - Pydantic schemas for request/response validation
-   - `TranslationRequest`: Validates input (text, source_lang, target_lang)
+   - `TranslationRequest`: Validates input (text, source_lang, target_lang, model)
    - `TranslationResponse`: Structured output with metadata
-   - `LanguageCode` enum: Only "ms" and "en" are supported
    - `HealthResponse`: Health check response
+   - Model parameter accepts "m2m100" (default) or "nllb200"
 
 4. **app/config.py** - Configuration management using pydantic-settings
    - Loads settings from environment variables or `.env` file
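The request schema described for `app/models.py` above can be sketched as follows. This is a minimal illustration, assuming Pydantic v2; the exact field constraints in the real `app/models.py` may differ:

```python
from typing import Literal

from pydantic import BaseModel, Field


class TranslationRequest(BaseModel):
    """Sketch of the request body accepted by /api/translate."""
    text: str = Field(min_length=1)
    source_lang: str
    target_lang: str
    # Optional selector; defaults to the commercially usable M2M100.
    model: Literal["m2m100", "nllb200"] = "m2m100"


req = TranslationRequest(text="Good morning", source_lang="en", target_lang="ko")
print(req.model)  # -> m2m100
```

An unknown `model` value (anything other than "m2m100" or "nllb200") is rejected at validation time, before any model loading happens.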
@@ -83,18 +103,25 @@ curl -X POST "http://localhost:8000/api/translate" \
 ### Translation Flow
 
 1. Request received at `/api/translate` endpoint
-2. Pydantic validates request schema
-3. TranslationService determines appropriate model based on language pair
-4. Model is loaded if not already cached in memory
-5. Text is tokenized, translated, and decoded
-6. Response includes original text, translation, and model metadata
+2. Pydantic validates request schema (including optional model parameter)
+3. TranslationService selects model based on `model` parameter (m2m100 or nllb200)
+4. Language codes are validated against the selected model's supported languages
+5. Model is loaded if not already cached in memory
+6. Text is tokenized with model-specific language codes:
+   - M2M100: Uses simple codes (e.g., "en", "ko")
+   - NLLB-200: Uses FLORES-200 format (e.g., "eng_Latn", "kor_Hang")
+7. Translation generated using the model
+8. Response includes original text, translation, and model metadata
 
 ### Model Caching
 
 - Models are downloaded to `MODEL_CACHE_DIR` (default: `./models/`)
 - Once downloaded, models persist across restarts
 - In Docker, use volume mount to persist models
-- First translation request may be slow due to model download (~300MB per model)
+- First translation request may be slow due to model download:
+  - M2M100: ~1.6GB
+  - NLLB-200: ~2.5GB
+- Both models can be cached simultaneously
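Steps 2-4 of the translation flow above can be sketched without loading any model weights. The dictionaries here are tiny illustrative stand-ins for the full mapping tables in `app/translator.py` (105 and 204 entries respectively); the FLORES-200 codes shown are real:

```python
# Sketch of model selection + language-code validation (no weights loaded).
M2M100_CODES = {"en": "en", "ko": "ko", "ms": "ms"}
NLLB200_CODES = {"en": "eng_Latn", "ko": "kor_Hang", "ms": "zsm_Latn", "bem": "bem_Latn"}

MODELS = {
    "m2m100": ("facebook/m2m100_418M", M2M100_CODES),
    "nllb200": ("facebook/nllb-200-distilled-600M", NLLB200_CODES),
}


def resolve(source_lang: str, target_lang: str, model: str = "m2m100") -> dict:
    """Return the checkpoint and model-specific language codes for a request."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    checkpoint, codes = MODELS[model]
    for lang in (source_lang, target_lang):
        if lang not in codes:
            raise ValueError(f"{lang!r} is not supported by {model}")
    return {"checkpoint": checkpoint, "src": codes[source_lang], "tgt": codes[target_lang]}


print(resolve("en", "bem", "nllb200")["tgt"])  # -> bem_Latn
```

Note that Bemba (`bem`) resolves only under `nllb200`; requesting it with the default M2M100 raises a validation error, which is the behavior step 4 describes.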
### Device Selection
@@ -112,20 +139,28 @@ Environment variables (see `.env.example`):
 ## Common Tasks
 
-### Adding New Language Pairs
+### Adding New Language Codes
 
-To add support for additional languages:
+The system currently supports all 105 M2M100 languages and all 204 NLLB-200 languages. To add new language code mappings:
 
-1. Check if Helsinki-NLP has an OPUS-MT model for the language pair at https://huggingface.co/Helsinki-NLP
-2. Update `app/models.py` - Add new language code to `LanguageCode` enum
-3. Update `app/translator.py` - Add model mapping in `_get_model_name()` method
-4. Update `app/main.py` - Add language info to `/api/supported-languages` endpoint
+1. **For M2M100**: Update `m2m100_lang_codes` dictionary in `app/translator.py`
+   - Format: `"user_code": "m2m100_code"` (e.g., `"en": "en"`)
+2. **For NLLB-200**: Update `nllb200_lang_codes` dictionary in `app/translator.py`
+   - Format: `"user_code": "flores_code"` (e.g., `"en": "eng_Latn"`)
+   - Reference: https://github.com/facebookresearch/flores/blob/main/flores200/README.md
+3. **Display Names**: Add entries to `lang_names` dictionary in `app/main.py`
+   - Format: `"code": {"name": "English Name", "native": "Native Name"}`
 
 ### Modifying Translation Behavior
 
 Translation parameters are in `app/translator.py` in the `translate()` method:
 - Adjust `max_length` in tokenizer call to handle longer texts
 - Modify generation parameters passed to `model.generate()` for different translation strategies
+- Model-specific behavior:
+  - M2M100: Uses `tokenizer.get_lang_id()` for target language
+  - NLLB-200: Uses `tokenizer.convert_tokens_to_ids()` for target language
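The model-specific target-language lookup described above can be isolated into one helper. This is a sketch: the two tokenizer methods (`get_lang_id` for M2M100, `convert_tokens_to_ids` for NLLB-200) are the ones the document cites, but the surrounding structure is illustrative, not the actual `app/translator.py` code:

```python
def target_token_id(tokenizer, model_name: str, target_code: str) -> int:
    """Resolve the forced-BOS token id for the target language."""
    if model_name == "m2m100":
        # M2M100's tokenizer exposes a dedicated language-id lookup;
        # target_code is a plain code like "ko".
        return tokenizer.get_lang_id(target_code)
    # NLLB-200 treats the FLORES-200 code (e.g. "kor_Hang") as an
    # ordinary vocabulary token.
    return tokenizer.convert_tokens_to_ids(target_code)
```

The returned id would typically be passed as `forced_bos_token_id` to `model.generate()` so the decoder starts in the target language.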
### Production Deployment
@@ -139,5 +174,14 @@ For production use:
 ## API Documentation
 
 When the server is running, interactive API documentation is available at:
-- Swagger UI: http://localhost:8000/docs
-- ReDoc: http://localhost:8000/redoc
+- Swagger UI: http://localhost:8001/docs (Docker) or http://localhost:8000/docs (local)
+- ReDoc: http://localhost:8001/redoc (Docker) or http://localhost:8000/redoc (local)
+
+## Model Licenses
+
+**IMPORTANT**: Be aware of licensing when deploying:
+- **M2M100**: Apache 2.0 License - Commercial use allowed ✅
+- **NLLB-200**: CC-BY-NC 4.0 License - Non-commercial use only ⚠️
+
+Always use M2M100 for commercial applications. Only use NLLB-200 for research, education, or personal non-commercial projects.