6 Commits

Author SHA1 Message Date
578be1fd55 Add complete NLLB-200 support with all 204 FLORES-200 languages
Updated dual model system to fully support both M2M100 and NLLB-200:

**NLLB-200 Model (204 languages)**
- Added all 204 FLORES-200 language codes to nllb200_lang_codes dictionary
- Updated language code mappings with FLORES-200 format (xxx_Yyyy)
- Added 24+ NLLB-exclusive languages including:
  - Southeast Asian: Acehnese, Balinese, Banjar, Buginese, Minangkabau
  - South Asian: Assamese, Awadhi, Bhojpuri, Chhattisgarhi, Magahi, Maithili, Meitei, Odia, Santali
  - African: Akan, Bambara, Bemba, Chokwe, Dyula, Fon, Kikuyu, Kimbundu, Kongo, Luba-Kasai, Luo, Mossi, Nuer
  - Arabic dialects: Mesopotamian, Najdi, Moroccan, Egyptian, Tunisian, South/North Levantine
  - European regional: Asturian, Friulian, Latgalian, Ligurian, Limburgish, Lombard, Norwegian Nynorsk/Bokmål, Occitan, Sardinian, Sicilian, Silesian, Venetian
  - Other: Dzongkha, Fijian, Guarani, Kabyle, Kabuverdianu, Papiamento, Quechua, Samoan, Sango, Shan, Tamasheq, Tibetan, Tok Pisin
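
As a rough illustration of the FLORES-200 format mentioned above (`xxx_Yyyy`: a three-letter lowercase language code, an underscore, and a four-letter script tag), the sketch below resolves a short code to its FLORES-200 form. Only the name `nllb200_lang_codes` comes from this commit. The short-code keys, the `FLORES_CODE` pattern, and the `to_flores` helper are assumptions made for illustration, and the four entries shown are a tiny excerpt, not the real 204-entry dictionary.

```python
import re

# Hypothetical excerpt; the dictionary described in this commit has 204 entries.
nllb200_lang_codes = {
    "en": "eng_Latn",   # English, Latin script
    "ko": "kor_Hang",   # Korean, Hangul script
    "bem": "bem_Latn",  # Bemba, Latin script
    "fon": "fon_Latn",  # Fon, Latin script
}

# Three lowercase letters, underscore, capitalised four-letter script tag.
FLORES_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def to_flores(code: str) -> str:
    """Map a short code to a FLORES-200 code.

    A code that already has the xxx_Yyyy shape is passed through unchanged
    (only its format is checked, not whether the model knows it).
    """
    if FLORES_CODE.fullmatch(code):
        return code
    try:
        return nllb200_lang_codes[code]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None
```

With this excerpt, `to_flores("ko")` yields `kor_Hang`, and an unknown short code raises `ValueError`.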

**Updated Files**
- app/translator.py: Complete NLLB-200 language mappings (204 languages)
- app/main.py: Added display names for all 204+ language codes
- README.md: Updated with dual model system, NLLB-200 details, license info
- CLAUDE.md: Updated developer documentation with model architecture

**Testing**
- Verified M2M100: 105 languages working
- Verified NLLB-200: 204 languages working
- Tested NLLB-exclusive languages (Bemba, Fon, etc.)

**License Information**
- M2M100: Apache 2.0 - Commercial use allowed
- NLLB-200: CC-BY-NC 4.0 - Non-commercial only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 16:19:50 +09:00
5a99d081ab Fix NLLB-200 tokenizer and add .dockerignore
- Fixed NLLB-200 tokenizer forced_bos_token_id issue
  - Changed from lang_code_to_id to convert_tokens_to_ids
- Added .dockerignore to exclude models directory from Docker build
  - Prevents disk space issues during build
  - Models are loaded at runtime via volume mount
- Both M2M100 and NLLB-200 models tested and working
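
The tokenizer change above can be sketched as follows. NLLB language codes are ordinary vocabulary tokens, so `convert_tokens_to_ids` can look them up without relying on the tokenizer's `lang_code_to_id` mapping. The helper name `forced_bos_token_id` and its error handling are assumptions for illustration, not the code in `app/translator.py`.

```python
def forced_bos_token_id(tokenizer, target_lang: str) -> int:
    """Return the vocabulary id of a FLORES-200 code such as 'kor_Hang'.

    `tokenizer` is any object with the Hugging Face tokenizer methods
    `convert_tokens_to_ids` and `unk_token_id`.
    """
    token_id = tokenizer.convert_tokens_to_ids(target_lang)
    # Unknown tokens map to the unknown-token id, which would silently
    # produce output in the wrong language, so fail loudly instead.
    if token_id is None or token_id == tokenizer.unk_token_id:
        raise ValueError(f"language code not in vocabulary: {target_lang!r}")
    return token_id


# Typical use with Transformers (not executed here; requires the model files):
#   inputs = tokenizer(text, return_tensors="pt")
#   output = model.generate(
#       **inputs,
#       forced_bos_token_id=forced_bos_token_id(tokenizer, "kor_Hang"),
#   )
```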

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 16:02:32 +09:00
28e26d19b6 Add dual model support: M2M100 and NLLB-200
- Added optional 'model' parameter to translation request (default: m2m100)
- M2M100: 105 languages, Apache 2.0 License (commercial OK)
- NLLB-200: 200 languages, CC-BY-NC 4.0 License (non-commercial only)
- Updated /api/translate endpoint to accept model selection
- Updated /api/supported-languages to show languages per model
- Added comprehensive language name mappings for all NLLB-200 languages
- Both models can be used independently with automatic model loading
- Model information includes license and commercial use status
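
One way the per-model information listed above might be organised is a small registry keyed by the `model` request value. This is a sketch under assumptions: `ModelInfo`, `MODELS`, `DEFAULT_MODEL`, and `resolve_model` are hypothetical names, and only the language counts, licenses, and the `m2m100` default are taken from this commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelInfo:
    name: str
    languages: int
    license: str
    commercial_use: bool


# Values copied from the commit message; the structure itself is hypothetical.
MODELS = {
    "m2m100": ModelInfo("M2M100", 105, "Apache 2.0", True),
    "nllb200": ModelInfo("NLLB-200", 200, "CC-BY-NC 4.0", False),
}
DEFAULT_MODEL = "m2m100"


def resolve_model(name: Optional[str] = None) -> ModelInfo:
    """Return the info for the requested model, falling back to the default."""
    key = name or DEFAULT_MODEL
    if key not in MODELS:
        raise ValueError(f"unknown model: {key!r}")
    return MODELS[key]
```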

Example usage:
- Default (M2M100): {"text": "Hello", "source_lang": "en", "target_lang": "ko"}
- NLLB-200: {"text": "Hello", "source_lang": "en", "target_lang": "ko", "model": "nllb200"}
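
A client could build the two request bodies above roughly like this, using only the standard library. The host and port (`localhost:8001`) follow the earlier README and Swagger notes in this log. The helper name `build_translate_request` is an assumption, and the response format is not shown because the commit does not describe it.

```python
import json
import urllib.request

# Assumed base URL; adjust to wherever the API is actually served.
API_URL = "http://localhost:8001/api/translate"


def build_translate_request(text, source_lang, target_lang, model=None):
    """Build a POST request for /api/translate without sending it.

    Omitting `model` leaves the choice to the server, which defaults to m2m100.
    """
    payload = {"text": text, "source_lang": source_lang, "target_lang": target_lang}
    if model is not None:
        payload["model"] = model
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send the request against a running server:
#   with urllib.request.urlopen(build_translate_request("Hello", "en", "ko")) as resp:
#       print(json.load(resp))
```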

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 15:57:00 +09:00
228f6c38e5 Update API metadata: Change to Multilingual Translation API
- Updated API title from "Malaysian Language Translation API" to "Multilingual Translation API"
- Updated API description to mention 105+ languages and M2M100 model
- Updated /api/translate endpoint docstring to reflect multilingual support
- Updated startup/shutdown log messages
- Added commercial license note (Apache 2.0) in API description

This ensures the Swagger UI (http://localhost:8001/docs) shows correct information.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 15:38:19 +09:00
6b869ec152 Update README: Expand to 105 languages with M2M100 model
- Updated title from "Malaysian Translation API" to "Multilingual Translation API"
- Added comprehensive list of 105 supported languages organized by region
- Updated model information: M2M100 (Apache 2.0 License) instead of Helsinki-NLP
- Emphasized commercial use permission (Apache 2.0 License)
- Updated port to 8001 in all examples
- Added multiple language pair examples (English↔Korean, English↔Bengali, Japanese↔English)
- Added technical stack section with detailed components
- Included model size (~1.5GB) and Python version requirements
- Added troubleshooting section
- Updated all API response examples to show M2M100 model

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-10 14:20:05 +09:00
f586f930b6 Initial commit: Multilingual Translation API
- Implemented REST API for 105+ language translation
- Used Facebook M2M100 model (Apache 2.0 License - Commercial use allowed)
- Supports any-to-any translation between 105 languages
- Major languages: English, Chinese, Spanish, Arabic, Russian, Japanese, Korean, etc.
- Southeast Asian: Malay, Indonesian, Thai, Vietnamese, Tagalog, Burmese, Khmer, Lao
- South Asian: Bengali, Hindi, Urdu, Tamil, Telugu, Marathi, Gujarati, etc.
- European: German, French, Italian, Spanish, Portuguese, Russian, etc.
- African: Swahili, Amharic, Hausa, Igbo, Yoruba, Zulu, Xhosa
- And many more languages

Tech Stack:
- FastAPI for REST API
- Transformers (Hugging Face) for ML model
- PyTorch for inference
- Docker for containerization
- M2M100 418M parameter model

Features:
- Health check endpoint
- Supported languages listing
- Dynamic language validation
- Model caching for performance
- GPU support (auto-detection)
- CORS enabled for web clients
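
The model caching and GPU auto-detection features might look something like the sketch below. It is a minimal illustration under assumptions: `pick_device`, `get_model`, and `_load_model` are hypothetical names, and `_load_model` is a stand-in that returns a plain dictionary so the example runs without downloading model weights.

```python
from functools import lru_cache


def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def _load_model(name: str, device: str):
    # Placeholder loader. A real service would do something like:
    #   M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M").to(device)
    return {"name": name, "device": device}


@lru_cache(maxsize=None)
def get_model(name: str):
    """Load a model once per name; later calls return the same cached object."""
    return _load_model(name, pick_device())
```

Because `get_model` is wrapped in `lru_cache`, the expensive load happens on the first request only, and every later request for the same name reuses the object already in memory.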

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-10 14:11:20 +09:00