CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a multilingual translation API service built with FastAPI and Hugging Face Transformers. It provides any-to-any translation between up to 204 languages using Facebook's M2M100 and NLLB-200 models.

Dual Model System:

  • M2M100 (default): 105 languages, Apache 2.0 License, commercial use allowed
  • NLLB-200 (optional): 204 languages, CC-BY-NC 4.0 License, non-commercial only

Development Commands

Local Development

# Setup virtual environment and install dependencies
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Run the development server (with auto-reload)
python run.py

# Or run with uvicorn directly
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Docker Development

# Build and run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

# Rebuild after code changes
docker-compose up -d --build

Testing the API

# Health check (these examples assume the service is reachable on port 8001;
# adjust the port if you run uvicorn on 8000 as shown above)
curl http://localhost:8001/health

# Translate Malay to English (M2M100, default)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Selamat pagi", "source_lang": "ms", "target_lang": "en"}'

# Translate English to Korean (M2M100)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Good morning", "source_lang": "en", "target_lang": "ko", "model": "m2m100"}'

# Translate English to Bemba (NLLB-200 exclusive language)
curl -X POST "http://localhost:8001/api/translate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Welcome", "source_lang": "en", "target_lang": "bem", "model": "nllb200"}'

# Get supported languages for M2M100
curl "http://localhost:8001/api/supported-languages?model=m2m100"

# Get supported languages for NLLB-200
curl "http://localhost:8001/api/supported-languages?model=nllb200"

Architecture

Core Components

  1. app/main.py - FastAPI application with endpoint definitions

    • Lifespan events handle model preloading on startup
    • CORS middleware configured for cross-origin requests
    • Main endpoints: root (/), health (/health), translate (/api/translate), supported-languages (/api/supported-languages)
    • Includes lang_names dictionary with display names for all 204+ language codes
  2. app/translator.py - Translation service singleton

    • Manages loading and caching of both M2M100 and NLLB-200 models
    • Automatically detects and uses GPU if available (CUDA)
    • Supports lazy loading - models are loaded on first use or preloaded at startup
    • Model support:
      • M2M100: facebook/m2m100_418M (105 languages)
      • NLLB-200: facebook/nllb-200-distilled-600M (204 languages, FLORES-200 format)
    • Language code mapping for both models (m2m100_lang_codes, nllb200_lang_codes)
  3. app/models.py - Pydantic schemas for request/response validation

    • TranslationRequest: Validates input (text, source_lang, target_lang, model)
    • TranslationResponse: Structured output with metadata
    • HealthResponse: Health check response
    • Model parameter accepts "m2m100" (default) or "nllb200"
  4. app/config.py - Configuration management using pydantic-settings

    • Loads settings from environment variables or .env file
    • Default values provided for all settings
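The request/response shapes in app/models.py can be sketched with stdlib dataclasses (the real code uses Pydantic, and the exact field names here are illustrative):

```python
from dataclasses import dataclass

VALID_MODELS = {"m2m100", "nllb200"}

@dataclass
class TranslationRequest:
    """Mirrors the Pydantic schema: text, source_lang, target_lang, model."""
    text: str
    source_lang: str
    target_lang: str
    model: str = "m2m100"  # "m2m100" (default) or "nllb200"

    def __post_init__(self):
        if not self.text.strip():
            raise ValueError("text must be non-empty")
        if self.model not in VALID_MODELS:
            raise ValueError(f"model must be one of {sorted(VALID_MODELS)}")

@dataclass
class TranslationResponse:
    """Structured output with metadata (field names are illustrative)."""
    original_text: str
    translated_text: str
    source_lang: str
    target_lang: str
    model: str
```

Omitting model falls back to "m2m100", matching the default described above; an unknown model name is rejected at validation time rather than at model-loading time.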

Translation Flow

  1. Request received at /api/translate endpoint
  2. Pydantic validates request schema (including optional model parameter)
  3. TranslationService selects model based on model parameter (m2m100 or nllb200)
  4. Language codes are validated against the selected model's supported languages
  5. Model is loaded if not already cached in memory
  6. Text is tokenized with model-specific language codes:
    • M2M100: Uses simple codes (e.g., "en", "ko")
    • NLLB-200: Uses FLORES-200 format (e.g., "eng_Latn", "kor_Hang")
  7. Translation generated using the model
  8. Response includes original text, translation, and model metadata
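The steps above can be sketched end-to-end in pure Python. This is a stand-in, not the real implementation: dictionaries replace the tokenizer and model, and the language-code tables show only a few entries.

```python
# Hypothetical stand-in for the flow in app/translator.py.
m2m100_lang_codes = {"en": "en", "ko": "ko", "ms": "ms"}                      # simple codes
nllb200_lang_codes = {"en": "eng_Latn", "ko": "kor_Hang", "bem": "bem_Latn"}  # FLORES-200

_model_cache = {}  # step 5: models stay in memory once loaded

def _load_model(name):
    if name not in _model_cache:
        _model_cache[name] = f"<{name} weights>"  # real code calls from_pretrained()
    return _model_cache[name]

def translate(text, source_lang, target_lang, model="m2m100"):
    codes = m2m100_lang_codes if model == "m2m100" else nllb200_lang_codes
    if source_lang not in codes or target_lang not in codes:   # step 4: validation
        raise ValueError(f"{model} does not support this language pair")
    _load_model(model)                                          # step 5: lazy load
    src, tgt = codes[source_lang], codes[target_lang]           # step 6: code mapping
    translated = f"[{src}->{tgt}] {text}"                       # step 7: placeholder
    return {"original": text, "translated": translated, "model": model}  # step 8
```

For example, translating en→bem succeeds with model="nllb200" (mapping "bem" to "bem_Latn") but raises a validation error with the default M2M100 model, since Bemba is NLLB-exclusive.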

Model Caching

  • Models are downloaded to MODEL_CACHE_DIR (default: ./models/)
  • Once downloaded, models persist across restarts
  • In Docker, use volume mount to persist models
  • First translation request may be slow due to model download:
    • M2M100: ~1.6GB
    • NLLB-200: ~2.5GB
  • Both models can be cached simultaneously
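A quick way to check whether a model is already cached, assuming the Hugging Face hub cache layout (folders named models--org--name under the cache dir); the helper name is hypothetical:

```python
from pathlib import Path

def model_is_cached(cache_dir: str, repo_id: str) -> bool:
    # Hypothetical helper: the HF hub cache names folders after the repo id,
    # e.g. models--facebook--m2m100_418M inside MODEL_CACHE_DIR.
    folder = "models--" + repo_id.replace("/", "--")
    return (Path(cache_dir) / folder).exists()
```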

Device Selection

The translator automatically detects GPU availability:

  • CUDA GPU: Used automatically if available for faster inference
  • CPU: Fallback option, slower but works everywhere
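A minimal sketch of that detection logic (it also degrades gracefully when torch is not installed):

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, else "cpu"."""
    try:
        import torch  # imported lazily so this sketch runs even without torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"
```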

Configuration

Environment variables (see .env.example):

  • API_HOST / API_PORT: Server binding
  • MODEL_CACHE_DIR: Where to store downloaded models
  • MAX_LENGTH: Maximum token length for translation (default 512)
  • ALLOWED_ORIGINS: CORS configuration
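The real code uses pydantic-settings; the equivalent behavior with stdlib os.environ looks roughly like this (only the MAX_LENGTH default of 512 is stated above, the other defaults are illustrative):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    # Each setting reads from the environment, falling back to a default.
    api_host: str = os.environ.get("API_HOST", "0.0.0.0")
    api_port: int = int(os.environ.get("API_PORT", "8000"))
    model_cache_dir: str = os.environ.get("MODEL_CACHE_DIR", "./models/")
    max_length: int = int(os.environ.get("MAX_LENGTH", "512"))
    allowed_origins: str = os.environ.get("ALLOWED_ORIGINS", "*")
```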

Common Tasks

Adding New Language Codes

The system currently supports all 105 M2M100 languages and all 204 NLLB-200 languages. To add new language code mappings:

  1. For M2M100: Update m2m100_lang_codes dictionary in app/translator.py

    • Format: "user_code": "m2m100_code" (e.g., "en": "en")
  2. For NLLB-200: Update nllb200_lang_codes dictionary in app/translator.py

  3. Display Names: Add entries to lang_names dictionary in app/main.py

    • Format: "code": {"name": "English Name", "native": "Native Name"}
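Concretely, the three edits look like the entries below (the entries shown are examples of the format, not new additions the repo needs):

```python
# app/translator.py -- steps 1 and 2: user code -> model-specific code
m2m100_lang_codes = {
    "en": "en",          # M2M100 uses simple codes
    "ko": "ko",
}
nllb200_lang_codes = {
    "en": "eng_Latn",    # NLLB-200 uses FLORES-200 codes
    "ko": "kor_Hang",
}

# app/main.py -- step 3: display names
lang_names = {
    "en": {"name": "English", "native": "English"},
    "ko": {"name": "Korean", "native": "한국어"},
}
```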

Modifying Translation Behavior

Translation parameters are in app/translator.py in the translate() method:

  • Adjust max_length in tokenizer call to handle longer texts
  • Modify generation parameters passed to model.generate() for different translation strategies
  • Model-specific behavior:
    • M2M100: Uses tokenizer.get_lang_id() for target language
    • NLLB-200: Uses tokenizer.convert_tokens_to_ids() for target language

Production Deployment

For production use:

  1. Set reload=False in run.py, or use a production-ready uvicorn command
  2. Configure proper ALLOWED_ORIGINS instead of "*"
  3. Add authentication middleware if needed
  4. Consider using multiple workers: uvicorn app.main:app --workers 4
  5. Mount persistent volume for models/ directory in Docker

API Documentation

When the server is running, interactive API documentation is available at the standard FastAPI paths:

  • Swagger UI: http://localhost:8001/docs
  • ReDoc: http://localhost:8001/redoc

Model Licenses

IMPORTANT: Be aware of licensing when deploying:

  • M2M100: Apache 2.0 License - Commercial use allowed
  • NLLB-200: CC-BY-NC 4.0 License - Non-commercial use only ⚠️

Always use M2M100 for commercial applications. Only use NLLB-200 for research, education, or personal non-commercial projects.