# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a multilingual translation API service built with FastAPI and Hugging Face Transformers. It provides any-to-any translation among up to 204 languages using Facebook's M2M100 and NLLB-200 models.

**Dual Model System:**

- **M2M100 (default)**: 105 languages, Apache 2.0 license, commercial use allowed
- **NLLB-200 (optional)**: 204 languages, CC-BY-NC 4.0 license, non-commercial use only

## Development Commands

### Local Development

```bash
# Set up a virtual environment and install dependencies
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Run the development server (with auto-reload)
python run.py

# Or run uvicorn directly
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

### Docker Development

```bash
# Build and run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

# Rebuild after code changes
docker-compose up -d --build
```

### Testing the API

The examples below use port 8001 (the Docker mapping); substitute 8000 when running locally.

```bash
# Health check
curl http://localhost:8001/health

# Translate Malay to English (M2M100, default)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Selamat pagi", "source_lang": "ms", "target_lang": "en"}'

# Translate English to Korean (M2M100)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ko", "model": "m2m100"}'

# Translate English to Bemba (an NLLB-200-exclusive language)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Welcome", "source_lang": "en", "target_lang": "bem", "model": "nllb200"}'

# List supported languages for M2M100
curl "http://localhost:8001/api/supported-languages?model=m2m100"

# List supported languages for NLLB-200
curl "http://localhost:8001/api/supported-languages?model=nllb200"
```

## Architecture

### Core Components

1. **app/main.py** - FastAPI application with endpoint definitions
   - Lifespan events handle model preloading on startup
   - CORS middleware configured for cross-origin requests
   - Main endpoints: root (`/`), health (`/health`), translate (`/api/translate`), supported-languages (`/api/supported-languages`)
   - Includes a `lang_names` dictionary with display names for all 204+ language codes

2. **app/translator.py** - Translation service singleton
   - Manages loading and caching of both the M2M100 and NLLB-200 models
   - Automatically detects and uses a GPU if available (CUDA)
   - Supports lazy loading: models are loaded on first use or preloaded at startup
   - Model support:
     - M2M100: `facebook/m2m100_418M` (105 languages)
     - NLLB-200: `facebook/nllb-200-distilled-600M` (204 languages, FLORES-200 codes)
   - Language code mappings for both models (`m2m100_lang_codes`, `nllb200_lang_codes`)

3. **app/models.py** - Pydantic schemas for request/response validation
   - `TranslationRequest`: validates input (text, source_lang, target_lang, model)
   - `TranslationResponse`: structured output with metadata
   - `HealthResponse`: health check response
   - The `model` parameter accepts `"m2m100"` (default) or `"nllb200"`

4. **app/config.py** - Configuration management using pydantic-settings
   - Loads settings from environment variables or a `.env` file
   - Default values provided for all settings
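
The request validation described for `app/models.py` can be sketched as follows. This is an illustrative stand-in only: plain dataclasses are used so the snippet stays dependency-free, while the real schemas use Pydantic; the field names follow the documented schema.

```python
# Illustrative stand-in for TranslationRequest in app/models.py.
# The real code uses Pydantic; dataclasses keep this sketch self-contained.
from dataclasses import dataclass

@dataclass
class TranslationRequest:
    text: str
    source_lang: str
    target_lang: str
    model: str = "m2m100"  # accepts "m2m100" (default) or "nllb200"

    def __post_init__(self) -> None:
        # Mirror the documented constraint on the model parameter
        if self.model not in ("m2m100", "nllb200"):
            raise ValueError(f"unsupported model: {self.model!r}")

req = TranslationRequest(text="Selamat pagi", source_lang="ms", target_lang="en")
print(req.model)  # model defaults to "m2m100" when omitted
```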

### Translation Flow

1. Request received at the `/api/translate` endpoint
2. Pydantic validates the request schema (including the optional `model` parameter)
3. TranslationService selects a model based on the `model` parameter (`m2m100` or `nllb200`)
4. Language codes are validated against the selected model's supported languages
5. The model is loaded if not already cached in memory
6. Text is tokenized with model-specific language codes:
   - M2M100: simple codes (e.g., `"en"`, `"ko"`)
   - NLLB-200: FLORES-200 format (e.g., `"eng_Latn"`, `"kor_Hang"`)
7. The translation is generated by the model
8. The response includes the original text, the translation, and model metadata
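
Steps 3-4 of the flow above (model selection and language-code validation) can be sketched as follows. The dictionary contents are abbreviated examples; the full tables live in `app/translator.py`.

```python
# Abbreviated language-code tables; the real dictionaries in
# app/translator.py cover 105 (M2M100) and 204 (NLLB-200) languages.
m2m100_lang_codes = {"en": "en", "ko": "ko", "ms": "ms"}
nllb200_lang_codes = {"en": "eng_Latn", "ko": "kor_Hang", "bem": "bem_Latn"}

def resolve_lang(model: str, user_code: str) -> str:
    """Map a user-facing code to the selected model's internal code."""
    table = m2m100_lang_codes if model == "m2m100" else nllb200_lang_codes
    if user_code not in table:
        raise ValueError(f"{user_code!r} is not supported by {model}")
    return table[user_code]

print(resolve_lang("nllb200", "ko"))  # → kor_Hang
```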

### Model Caching

- Models are downloaded to `MODEL_CACHE_DIR` (default: `./models/`)
- Once downloaded, models persist across restarts
- In Docker, use a volume mount to persist models
- The first translation request may be slow because the model is downloaded on demand:
  - M2M100: ~1.6 GB
  - NLLB-200: ~2.5 GB
- Both models can be cached simultaneously

### Device Selection

The translator automatically detects GPU availability:

- CUDA GPU: used automatically when available, for faster inference
- CPU: fallback option; slower, but works everywhere
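
The detection logic amounts to a single PyTorch check; a minimal sketch (guarded with a fallback so the snippet also runs where PyTorch is not installed):

```python
# Device selection: prefer CUDA when PyTorch reports an available GPU,
# otherwise fall back to CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # PyTorch absent: CPU is the only option

print(device)
```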

## Configuration

Environment variables (see `.env.example`):

- `API_HOST` / `API_PORT`: server binding
- `MODEL_CACHE_DIR`: where downloaded models are stored
- `MAX_LENGTH`: maximum token length for translation (default: 512)
- `ALLOWED_ORIGINS`: CORS configuration
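
A sample `.env` covering the variables above (values are illustrative; the actual defaults come from `app/config.py`):

```bash
# Example .env — adjust values for your deployment
API_HOST=0.0.0.0
API_PORT=8000
MODEL_CACHE_DIR=./models/
MAX_LENGTH=512
ALLOWED_ORIGINS=*
```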

## Common Tasks

### Adding New Language Codes

The system currently supports all 105 M2M100 languages and all 204 NLLB-200 languages. To add new language code mappings:

1. **For M2M100**: update the `m2m100_lang_codes` dictionary in `app/translator.py`
   - Format: `"user_code": "m2m100_code"` (e.g., `"en": "en"`)

2. **For NLLB-200**: update the `nllb200_lang_codes` dictionary in `app/translator.py`
   - Format: `"user_code": "flores_code"` (e.g., `"en": "eng_Latn"`)
   - Reference: https://github.com/facebookresearch/flores/blob/main/flores200/README.md

3. **Display names**: add entries to the `lang_names` dictionary in `app/main.py`
   - Format: `"code": {"name": "English Name", "native": "Native Name"}`
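
As a worked example of the steps above, here is a hypothetical alias exposing `"fil"` (Filipino) via the existing FLORES-200 Tagalog code. The dictionary names match those referenced in the steps; the alias itself is illustrative, not part of the current mappings.

```python
# Hypothetical additions illustrating the steps above.

# app/translator.py — step 2: map the new user code to a FLORES-200 code
nllb200_lang_codes = {
    # ... existing entries ...
    "fil": "tgl_Latn",  # alias pointing at the FLORES-200 Tagalog code
}

# app/main.py — step 3: add a matching display-name entry
lang_names = {
    # ... existing entries ...
    "fil": {"name": "Filipino", "native": "Filipino"},
}
```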

### Modifying Translation Behavior

Translation parameters live in the `translate()` method in `app/translator.py`:

- Adjust `max_length` in the tokenizer call to handle longer texts
- Modify the generation parameters passed to `model.generate()` for different translation strategies
- Model-specific behavior:
  - M2M100: uses `tokenizer.get_lang_id()` for the target language
  - NLLB-200: uses `tokenizer.convert_tokens_to_ids()` for the target language
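
The model-specific branch in the last bullet can be sketched as follows. Stub classes stand in for the real Hugging Face tokenizers (only the method names mirror the actual APIs; the token ids are made up for illustration).

```python
# Stubs standing in for the real tokenizers; ids are illustrative only.
class M2M100TokenizerStub:
    def get_lang_id(self, lang: str) -> int:
        return {"en": 128022, "ko": 128068}[lang]

class NLLBTokenizerStub:
    def convert_tokens_to_ids(self, token: str) -> int:
        return {"eng_Latn": 256047, "kor_Hang": 256098}[token]

def target_token_id(model_name: str, tokenizer, target_lang: str) -> int:
    """Resolve the target-language token id passed to model.generate()."""
    if model_name == "m2m100":
        return tokenizer.get_lang_id(target_lang)        # simple code, e.g. "ko"
    return tokenizer.convert_tokens_to_ids(target_lang)  # FLORES code, e.g. "kor_Hang"

print(target_token_id("m2m100", M2M100TokenizerStub(), "ko"))
```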

### Production Deployment

For production use:

1. Set `reload=False` in `run.py`, or use a production-ready uvicorn command
2. Configure a proper `ALLOWED_ORIGINS` list instead of `"*"`
3. Add authentication middleware if needed
4. Consider running multiple workers: `uvicorn app.main:app --workers 4`
5. Mount a persistent volume for the `models/` directory in Docker

## API Documentation

While the server is running, interactive API documentation is available at:

- Swagger UI: http://localhost:8001/docs (Docker) or http://localhost:8000/docs (local)
- ReDoc: http://localhost:8001/redoc (Docker) or http://localhost:8000/redoc (local)

## Model Licenses

**IMPORTANT**: Be aware of licensing when deploying:

- **M2M100**: Apache 2.0 License - commercial use allowed ✅
- **NLLB-200**: CC-BY-NC 4.0 License - non-commercial use only ⚠️

Always use M2M100 for commercial applications. Use NLLB-200 only for research, education, or personal non-commercial projects.