Add complete NLLB-200 support with all 204 FLORES-200 languages
Updated dual model system to fully support both M2M100 and NLLB-200:

**NLLB-200 Model (204 languages)**
- Added all 204 FLORES-200 language codes to nllb200_lang_codes dictionary
- Updated language code mappings with FLORES-200 format (xxx_Yyyy)
- Added 24+ NLLB-exclusive languages including:
  - Southeast Asian: Acehnese, Balinese, Banjar, Buginese, Minangkabau
  - South Asian: Assamese, Awadhi, Bhojpuri, Chhattisgarhi, Magahi, Maithili, Meitei, Odia, Santali
  - African: Akan, Bambara, Bemba, Chokwe, Dyula, Fon, Kikuyu, Kimbundu, Kongo, Luba-Kasai, Luo, Mossi, Nuer
  - Arabic dialects: Mesopotamian, Najdi, Moroccan, Egyptian, Tunisian, South/North Levantine
  - European regional: Asturian, Friulian, Latgalian, Ligurian, Limburgish, Lombard, Norwegian Nynorsk/Bokmål, Occitan, Sardinian, Sicilian, Silesian, Venetian
  - Other: Dzongkha, Fijian, Guarani, Kabyle, Kabuverdianu, Papiamento, Quechua, Samoan, Sango, Shan, Tamasheq, Tibetan, Tok Pisin

**Updated Files**
- app/translator.py: Complete NLLB-200 language mappings (204 languages)
- app/main.py: Added display names for all 204+ language codes
- README.md: Updated with dual model system, NLLB-200 details, license info
- CLAUDE.md: Updated developer documentation with model architecture

**Testing**
- Verified M2M100: 105 languages working ✅
- Verified NLLB-200: 204 languages working ✅
- Tested NLLB-exclusive languages (Bemba, Fon, etc.) ✅

**License Information**
- M2M100: Apache 2.0 - Commercial use allowed
- NLLB-200: CC-BY-NC 4.0 - Non-commercial only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
96
CLAUDE.md
@@ -4,7 +4,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 ## Project Overview

-This is a Malaysian language translation API service built with FastAPI and Hugging Face Transformers. It provides bidirectional translation between Malay (Bahasa Melayu) and English using Helsinki-NLP's OPUS-MT neural machine translation models.
+This is a multilingual translation API service built with FastAPI and Hugging Face Transformers. It provides any-to-any translation between up to 204 languages using Facebook's M2M100 and NLLB-200 models.
+
+**Dual Model System:**
+- **M2M100 (default)**: 105 languages, Apache 2.0 License, commercial use allowed
+- **NLLB-200 (optional)**: 204 languages, CC-BY-NC 4.0 License, non-commercial only

 ## Development Commands

@@ -43,17 +47,28 @@ docker-compose up -d --build

 ```bash
 # Health check
-curl http://localhost:8000/health
+curl http://localhost:8001/health

-# Translate Malay to English
-curl -X POST "http://localhost:8000/api/translate" \
+# Translate Malay to English (M2M100, default)
+curl -X POST "http://localhost:8001/api/translate" \
   -H "Content-Type: application/json" \
   -d '{"text": "Selamat pagi", "source_lang": "ms", "target_lang": "en"}'

-# Translate English to Malay
-curl -X POST "http://localhost:8000/api/translate" \
+# Translate English to Korean (M2M100)
+curl -X POST "http://localhost:8001/api/translate" \
   -H "Content-Type: application/json" \
-  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ms"}'
+  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ko", "model": "m2m100"}'
+
+# Translate English to Bemba (NLLB-200 exclusive language)
+curl -X POST "http://localhost:8001/api/translate" \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Welcome", "source_lang": "en", "target_lang": "bem", "model": "nllb200"}'
+
+# Get supported languages for M2M100
+curl http://localhost:8001/api/supported-languages?model=m2m100
+
+# Get supported languages for NLLB-200
+curl http://localhost:8001/api/supported-languages?model=nllb200
 ```

 ## Architecture
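For scripted checks, the `curl` calls in the hunk above could be mirrored from Python. The sketch below only builds the request with the standard library and does not send it. The helper name `build_translate_request` and the default base URL are illustrative assumptions; the endpoint path and JSON field names are taken from the examples above.

```python
import json
import urllib.request


def build_translate_request(text, source_lang, target_lang, model="m2m100",
                            base_url="http://localhost:8001"):
    """Build (but do not send) a POST request for /api/translate.

    Hypothetical helper: the field names mirror the curl examples,
    everything else is an assumption made for illustration.
    """
    payload = {
        "text": text,
        "source_lang": source_lang,
        "target_lang": target_lang,
        "model": model,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# English to Bemba via NLLB-200, matching the third curl example.
req = build_translate_request("Welcome", "en", "bem", model="nllb200")
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) requires the service to be running, so that step is left out here.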
@@ -63,18 +78,23 @@ curl -X POST "http://localhost:8000/api/translate" \
 1. **app/main.py** - FastAPI application with endpoint definitions
    - Lifespan events handle model preloading on startup
    - CORS middleware configured for cross-origin requests
-   - Three main endpoints: root (`/`), health (`/health`), translate (`/api/translate`)
+   - Main endpoints: root (`/`), health (`/health`), translate (`/api/translate`), supported-languages (`/api/supported-languages`)
+   - Includes lang_names dictionary with display names for all 204+ language codes

 2. **app/translator.py** - Translation service singleton
-   - Manages loading and caching of translation models
+   - Manages loading and caching of both M2M100 and NLLB-200 models
    - Automatically detects and uses GPU if available (CUDA)
    - Supports lazy loading - models are loaded on first use or preloaded at startup
-   - Model naming convention: `Helsinki-NLP/opus-mt-{source}-{target}`
+   - Model support:
+     - M2M100: `facebook/m2m100_418M` (105 languages)
+     - NLLB-200: `facebook/nllb-200-distilled-600M` (204 languages, FLORES-200 format)
+   - Language code mapping for both models (m2m100_lang_codes, nllb200_lang_codes)

 3. **app/models.py** - Pydantic schemas for request/response validation
-   - `TranslationRequest`: Validates input (text, source_lang, target_lang)
+   - `TranslationRequest`: Validates input (text, source_lang, target_lang, model)
    - `TranslationResponse`: Structured output with metadata
-   - `LanguageCode` enum: Only "ms" and "en" are supported
+   - `HealthResponse`: Health check response
+   - Model parameter accepts "m2m100" (default) or "nllb200"

 4. **app/config.py** - Configuration management using pydantic-settings
    - Loads settings from environment variables or `.env` file
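The request schema described in the hunk above might look roughly like the following. This is a minimal sketch, not the project's code: it uses a standard-library dataclass as a stand-in for the Pydantic model in `app/models.py` so that it runs without dependencies. The four field names and the `"m2m100"` default come from the diff; the empty-text check and the `SUPPORTED_MODELS` constant are assumptions.

```python
from dataclasses import dataclass

# Model names accepted by the `model` parameter, per the diff above.
SUPPORTED_MODELS = ("m2m100", "nllb200")


@dataclass(frozen=True)
class TranslationRequest:
    """Dependency-free stand-in for the Pydantic schema (illustrative only)."""
    text: str
    source_lang: str
    target_lang: str
    model: str = "m2m100"  # default named in the diff

    def __post_init__(self):
        # Assumed rule: reject blank input before any model work happens.
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.model not in SUPPORTED_MODELS:
            raise ValueError(
                f"model must be one of {SUPPORTED_MODELS}, got {self.model!r}"
            )


request = TranslationRequest(text="Good morning", source_lang="en", target_lang="ko")
print(request.model)
```

Language codes are deliberately not checked here, since the set of valid codes depends on which model is selected.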
@@ -83,18 +103,25 @@ curl -X POST "http://localhost:8000/api/translate" \
 ### Translation Flow

 1. Request received at `/api/translate` endpoint
-2. Pydantic validates request schema
-3. TranslationService determines appropriate model based on language pair
-4. Model is loaded if not already cached in memory
-5. Text is tokenized, translated, and decoded
-6. Response includes original text, translation, and model metadata
+2. Pydantic validates request schema (including optional model parameter)
+3. TranslationService selects model based on `model` parameter (m2m100 or nllb200)
+4. Language codes are validated against the selected model's supported languages
+5. Model is loaded if not already cached in memory
+6. Text is tokenized with model-specific language codes:
+   - M2M100: Uses simple codes (e.g., "en", "ko")
+   - NLLB-200: Uses FLORES-200 format (e.g., "eng_Latn", "kor_Hang")
+7. Translation generated using the model
+8. Response includes original text, translation, and model metadata

 ### Model Caching

 - Models are downloaded to `MODEL_CACHE_DIR` (default: `./models/`)
 - Once downloaded, models persist across restarts
 - In Docker, use volume mount to persist models
-- First translation request may be slow due to model download (~300MB per model)
+- First translation request may be slow due to model download:
+  - M2M100: ~1.6GB
+  - NLLB-200: ~2.5GB
+- Both models can be cached simultaneously

 ### Device Selection

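Steps 3 to 6 of the new translation flow, together with the in-memory caching behaviour, can be sketched as follows. This is an illustration under stated assumptions, not the code in `app/translator.py`: the mapping tables are tiny excerpts, and `resolve_codes`, `get_model`, and the `loader` callback are hypothetical names. A real loader would call into Hugging Face Transformers; a counting stub is used here so the sketch runs offline.

```python
# Small excerpts only; the real dictionaries cover 105 and 204 languages.
M2M100_LANG_CODES = {"en": "en", "ko": "ko", "ms": "ms"}
NLLB200_LANG_CODES = {"en": "eng_Latn", "ko": "kor_Hang", "bem": "bem_Latn"}
LANG_CODES = {"m2m100": M2M100_LANG_CODES, "nllb200": NLLB200_LANG_CODES}

_model_cache = {}  # model name -> loaded model object


def resolve_codes(model, source_lang, target_lang):
    """Steps 3, 4 and 6: pick the table, validate, map to model-specific codes."""
    if model not in LANG_CODES:
        raise ValueError(f"unknown model: {model!r}")
    table = LANG_CODES[model]
    for code in (source_lang, target_lang):
        if code not in table:
            raise ValueError(f"{code!r} is not supported by {model}")
    return table[source_lang], table[target_lang]


def get_model(model, loader):
    """Step 5: load on first use, then serve the cached instance."""
    if model not in _model_cache:
        _model_cache[model] = loader(model)
    return _model_cache[model]


load_calls = []


def fake_loader(name):
    # Stand-in for an expensive download and load.
    load_calls.append(name)
    return object()


print(resolve_codes("nllb200", "en", "ko"))
get_model("nllb200", fake_loader)
get_model("nllb200", fake_loader)  # second call hits the cache
print(len(load_calls))
```

Keying the cache by model name is what allows both models to stay resident at once, at the cost of holding both in memory.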
@@ -112,20 +139,28 @@ Environment variables (see `.env.example`):

 ## Common Tasks

-### Adding New Language Pairs
+### Adding New Language Codes

-To add support for additional languages:
+The system currently supports all 105 M2M100 languages and all 204 NLLB-200 languages. To add new language code mappings:

-1. Check if Helsinki-NLP has an OPUS-MT model for the language pair at https://huggingface.co/Helsinki-NLP
-2. Update `app/models.py` - Add new language code to `LanguageCode` enum
-3. Update `app/translator.py` - Add model mapping in `_get_model_name()` method
-4. Update `app/main.py` - Add language info to `/api/supported-languages` endpoint
+1. **For M2M100**: Update `m2m100_lang_codes` dictionary in `app/translator.py`
+   - Format: `"user_code": "m2m100_code"` (e.g., `"en": "en"`)
+
+2. **For NLLB-200**: Update `nllb200_lang_codes` dictionary in `app/translator.py`
+   - Format: `"user_code": "flores_code"` (e.g., `"en": "eng_Latn"`)
+   - Reference: https://github.com/facebookresearch/flores/blob/main/flores200/README.md
+
+3. **Display Names**: Add entries to `lang_names` dictionary in `app/main.py`
+   - Format: `"code": {"name": "English Name", "native": "Native Name"}`

 ### Modifying Translation Behavior

 Translation parameters are in `app/translator.py` in the `translate()` method:
 - Adjust `max_length` in tokenizer call to handle longer texts
 - Modify generation parameters passed to `model.generate()` for different translation strategies
+- Model-specific behavior:
+  - M2M100: Uses `tokenizer.get_lang_id()` for target language
+  - NLLB-200: Uses `tokenizer.convert_tokens_to_ids()` for target language

 ### Production Deployment

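The model-specific behavior listed in the hunk above comes down to how the target-language token ID is looked up. In Hugging Face Transformers that ID is typically passed to `model.generate()` as `forced_bos_token_id`. The sketch below shows only the dispatch; `target_token_id` is a hypothetical helper, and the two stub tokenizer classes with their made-up IDs exist solely so the example runs without downloading a model.

```python
def target_token_id(tokenizer, model, target_code):
    """Return the token ID that marks the target language for `model`.

    `target_code` must already be in the model's own format,
    e.g. "ko" for M2M100 or "kor_Hang" for NLLB-200.
    """
    if model == "m2m100":
        return tokenizer.get_lang_id(target_code)
    if model == "nllb200":
        return tokenizer.convert_tokens_to_ids(target_code)
    raise ValueError(f"unknown model: {model!r}")


class StubM2M100Tokenizer:
    """Stub exposing only get_lang_id(); the ID value is invented."""
    def get_lang_id(self, lang):
        return {"ko": 1001}[lang]


class StubNllbTokenizer:
    """Stub exposing only convert_tokens_to_ids(); the ID value is invented."""
    def convert_tokens_to_ids(self, token):
        return {"kor_Hang": 2002}[token]


print(target_token_id(StubM2M100Tokenizer(), "m2m100", "ko"))
print(target_token_id(StubNllbTokenizer(), "nllb200", "kor_Hang"))
```

With real tokenizers, the returned value would be used along the lines of `model.generate(**inputs, forced_bos_token_id=token_id)`.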
@@ -139,5 +174,14 @@ For production use:
 ## API Documentation

 When the server is running, interactive API documentation is available at:
-- Swagger UI: http://localhost:8000/docs
-- ReDoc: http://localhost:8000/redoc
+- Swagger UI: http://localhost:8001/docs (Docker) or http://localhost:8000/docs (local)
+- ReDoc: http://localhost:8001/redoc (Docker) or http://localhost:8000/redoc (local)
+
+## Model Licenses
+
+**IMPORTANT**: Be aware of licensing when deploying:
+
+- **M2M100**: Apache 2.0 License - Commercial use allowed ✅
+- **NLLB-200**: CC-BY-NC 4.0 License - Non-commercial use only ⚠️
+
+Always use M2M100 for commercial applications. Only use NLLB-200 for research, education, or personal non-commercial projects.
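One way a deployment could enforce the licensing note above is a small guard that runs before a model is selected. This is a sketch only: `MODEL_LICENSES`, `check_model_allowed`, and the `commercial` flag are hypothetical names, and the license values are copied from the diff as stated there, so they should be checked against the upstream model cards before being relied on.

```python
# License data as stated in the diff above; verify against the model cards.
MODEL_LICENSES = {
    "m2m100": {"license": "Apache 2.0", "commercial_use": True},
    "nllb200": {"license": "CC-BY-NC 4.0", "commercial_use": False},
}


def check_model_allowed(model, commercial):
    """Raise if `model` may not be used in a commercial deployment."""
    if model not in MODEL_LICENSES:
        raise ValueError(f"unknown model: {model!r}")
    info = MODEL_LICENSES[model]
    if commercial and not info["commercial_use"]:
        raise PermissionError(
            f"{model} is licensed under {info['license']} "
            "and is not cleared for commercial use"
        )
    return info["license"]


print(check_model_allowed("m2m100", commercial=True))
print(check_model_allowed("nllb200", commercial=False))
```

The `commercial` flag would most naturally come from configuration, so that a single setting decides whether NLLB-200 requests are rejected.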